Deep Reinforcement Learning
with Applications in Transportation
A tutorial at KDD 2019, 8:00 AM – 12:00 PM, August 4, 2019,
Tubughnenq 4, Level 2, Dena’ina, Anchorage, Alaska, USA


Transportation, particularly the mobile ride-sharing domain, has a number of traditionally challenging dynamic decision problems that have long threads of research literature and stand to benefit tremendously from artificial intelligence (AI). Core examples include online ride order dispatching, which matches available drivers to trip-requesting passengers on a ride-sharing platform in real time; route planning, which plans the best route between the origin and destination of a trip; and traffic signal control, which dynamically and adaptively adjusts the traffic signals within a region to achieve low delays. All of these problems share a common characteristic: a sequence of decisions must be made while we care about some cumulative objective over a certain horizon. Reinforcement learning (RL) is a machine learning paradigm that trains an agent to take optimal actions (as measured by the total cumulative reward achieved) in an environment by interacting with it and receiving feedback signals. It is thus a class of optimization methods for solving sequential decision-making problems. Thanks to rapid advances in deep learning research and computing capabilities, the integration of deep neural networks and RL has generated explosive progress in the latter for solving complex large-scale learning problems, attracting a huge amount of renewed interest in recent years. The combination of deep learning and RL has even been considered a path to true AI. It presents tremendous potential to solve some hard problems in transportation in an unprecedented way.


Part I: Basics and value-based methods

  • Machine learning paradigms: supervised, unsupervised, RL
  • RL basics
    • Markov decision process
    • Optimization problem, objective
    • Value function, policy
    • DP methods: value iteration, policy iteration
    • TD learning [14]
    • Q-learning [21], SARSA [15]
    • Example: TD(0) policy improvement for order dispatching [22]
  • Function approximation [18]
    • Linear/non-linear approximation, neural networks
    • DQN [10], double DQN [19], Deep SARSA [3,24]
    • Experience replay: prioritized experience replay [12]
    • Example: Deep value networks for dispatching [17, 20]
    • Example: DQN for dispatching and repositioning [4]
    • Example: DQN for driver repositioning and fleet management
    • Example: DQN for carpool decision-making [5]
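
As a minimal sketch of the tabular Q-learning update [21] listed above, the following Python snippet trains an epsilon-greedy agent on a hypothetical five-state chain. The environment, the function name q_learning, and all hyperparameter values are illustrative assumptions and are not taken from the tutorial material.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain (illustration only).

    States are 0..n_states-1. Action 0 moves left (bounded at state 0),
    action 1 moves right. Entering the last state pays reward 1 and ends
    the episode; every other transition pays 0.
    """
    rng = random.Random(seed)
    terminal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]

    for _ in range(episodes):
        s = 0
        while s != terminal:
            # Epsilon-greedy behaviour policy; ties are broken at random.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1

            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == terminal else 0.0

            # Q-learning target: bootstrap from the greedy value of the
            # next state; the terminal state has value 0.
            if s_next == terminal:
                target = r
            else:
                target = r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q


q_table = q_learning()
# Greedy action in each non-terminal state (1 means "move right").
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
```

In this toy problem the greedy policy read off the learned table moves right in every non-terminal state. SARSA [15] would differ only in the target, which bootstraps from the action actually taken in the next state rather than from the maximum.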

Part II: Policy-based methods and advanced topics

  • Policy optimization
    • Policy gradient: REINFORCE, DDPG [8], PPO [13]
    • Actor-critic: A2C, A3C [9]
    • Example: Autonomous driving control
    • Example: Route planning/navigation with and without maps
  • Advanced topics
    • Transfer learning, Examples: Transfer among cities for dispatching [20]; City navigations
    • Multi-agent RL: mean-field [23], Example: mean-field MARL for dispatching [6]
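
To make the policy-gradient item above concrete, here is a minimal sketch of the REINFORCE update with a softmax policy, applied to one-step episodes on a hypothetical two-armed bandit and using no baseline. The function names, the reward values, and the step size are assumptions chosen for illustration only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards=(0.0, 1.0), steps=500, lr=0.5, seed=0):
    """REINFORCE on a hypothetical bandit with deterministic rewards.

    The policy is pi(a) = softmax(theta)[a]. Each step is a one-step
    episode, so the return G equals the immediate reward.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)

    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        g = arm_rewards[a]
        # Gradient of log pi(a) w.r.t. theta[i] is 1{i == a} - pi(i).
        for i in range(len(theta)):
            grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += lr * g * grad_log_pi
    return softmax(theta)


action_probs = reinforce_bandit()
```

After training, the returned probabilities concentrate on the rewarding arm. Actor-critic methods such as A2C and A3C [9] replace the raw return g with an estimate from a learned critic to reduce the variance of this update.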

Part III: Practice

  • RL development frameworks
  • Application-specific tools
    • Traffic Lights Control: SUMO, Flow
    • Autonomous Driving: TORCS, CARLA
  • Open data sets

In addition to the papers referenced above, the following textbooks [1, 2, 11, 15, 16] and survey paper [7] form the basic general references for this tutorial.





  • Zhiwei (Tony) Qin, DiDi AI Labs, DiDi Labs
  • Jian Tang, DiDi AI Labs, Didi Chuxing & Syracuse University
  • Jieping Ye, DiDi AI Labs, Didi Chuxing & University of Michigan, Ann Arbor
  • Lulu Zhang, Research Outreach, Didi Chuxing


Dr. Zhiwei (Tony) Qin is a researcher at DiDi AI Labs, where he leads the reinforcement learning research. He received his Ph.D. in Operations Research from Columbia University and his B.Sc. in Computer Science and Statistics from the University of British Columbia, Vancouver. Tony is broadly interested in research topics at the intersection of optimization and machine learning, and most recently in reinforcement learning and its applications in operational optimization, digital marketing, traffic signal control, and education. He has published in top-tier conferences and journals in machine learning and optimization, including ICML, IEEE ICDM, JMLR, and MPC. He has served as a Senior PC/PC member for AAAI, SDM, JMLR, TPAMI, TKDE, and other operations research journals.

Dr. Jian Tang is the chief scientist of intelligent control at DiDi AI Labs. He is also a professor at Syracuse University and a fellow of IEEE. His research interests lie in the areas of machine learning, big data, cloud computing, and wireless networking. He has published over 130 papers in premier journals and conferences. He received an NSF CAREER award in 2009. He has served as an editor for a few IEEE journals. In addition, he served as a TPC co-chair for the 2018 International Conference on Mobile and Ubiquitous Systems: Computing; as the TPC vice chair for the 2019 IEEE International Conference on Computer Communications (INFOCOM); and as an area TPC chair for INFOCOM 2017-2018. He is currently the vice chair of the Communications Switching and Routing Committee of the IEEE Communications Society.

Dr. Jieping Ye is head of DiDi AI Labs, a VP of Didi Chuxing, and a DiDi Fellow. He is also a professor at the University of Michigan, Ann Arbor. His research interests include big data, machine learning, and data mining with applications in transportation and biomedicine. He has served as a Senior Program Committee member, Area Chair, or Program Committee Vice Chair of many conferences, including NIPS, ICML, KDD, IJCAI, ICDM, and SDM. He serves as an Associate Editor of Data Mining and Knowledge Discovery and of IEEE Transactions on Knowledge and Data Engineering. He won the NSF CAREER Award in 2010. His papers were selected for the outstanding student paper award at ICML in 2004, the KDD best research paper runner-up in 2013, and the KDD best student paper award in 2014.


[1] Dimitri P Bertsekas. 2005. Dynamic programming and optimal control. Vol. 1. Athena Scientific, Belmont, MA.

[2] Dimitri P Bertsekas and John N Tsitsiklis. 1995. Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, Vol. 1. IEEE Publ. Piscataway, NJ, 560–564.

[3] Michael Ganger, Ethan Duryea, and Wei Hu. 2016. Double Sarsa and Double Expected Sarsa with Shallow and Deep Learning. Journal of Data Analysis and Information Processing 4, 04 (2016), 159.

[4] J. Holler, Z. Qin, X. Tang, Y. Jiao, T. Jin, S. Singh, C. Wang, and J. Ye. 2018. Deep Q-Learning Approaches to Dynamic Multi-Driver Dispatching and Repositioning. In NeurIPS 2018 Deep Reinforcement Learning Workshop.

[5] Ishan Jindal, Zhiwei Tony Qin, Xuewen Chen, Matthew Nokleby, and Jieping Ye. 2018. Optimizing Taxi Carpool Policies via Reinforcement Learning and Spatio-Temporal Mining. In 2018 IEEE International Conference on Big Data (Big Data). IEEE, 1417–1426.

[6] Minne Li, Zhiwei Qin, Yan Jiao, Yaodong Yang, Zhichen Gong, Jun Wang, Chenxi Wang, Guobin Wu, and Jieping Ye. 2019. Efficient Ridesharing Order Dispatching with Mean Field Multi-Agent Reinforcement Learning. To appear in Proceedings of the 2019 World Wide Web Conference. International World Wide Web Conferences Steering Committee.

[7] Yuxi Li. 2017. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274 (2017).

[8] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).

[9] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning. 1928–1937.

[10] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.

[11] Warren B Powell. 2007. Approximate Dynamic Programming: Solving the curses of dimensionality. Vol. 703. John Wiley & Sons.

[12] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2015. Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015).

[13] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).

[14] Richard S Sutton. 1988. Learning to predict by the methods of temporal differences. Machine learning 3, 1 (1988), 9–44.

[15] Richard S Sutton, Andrew G Barto, et al. 1998. Reinforcement learning: An introduction. MIT press.

[16] Csaba Szepesvári. 2010. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning 4, 1 (2010), 1–103.

[17] Xiaocheng Tang, Zhiwei (Tony) Qin, Fan Zhang, Zhaodong Wang, Zhe Xu, Yintai Ma, Hongtu Zhu, and Jieping Ye. 2019. A Deep Value-network Based Approach for Multi-Driver Order Dispatching. To appear in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

[18] John N Tsitsiklis and Benjamin Van Roy. 1997. Analysis of temporal-difference learning with function approximation. In Advances in neural information processing systems. 1075–1081.

[19] Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep Reinforcement Learning with Double Q-Learning. In AAAI. 2094–2100.

[20] Zhaodong Wang, Zhiwei Qin, Xiaocheng Tang, Jieping Ye, and Hongtu Zhu. 2018. Deep Reinforcement Learning with Knowledge Transfer for Online Rides Order Dispatching. In International Conference on Data Mining. IEEE.

[21] Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.

[22] Zhe Xu, Zhixin Li, Qingwen Guan, Dingshui Zhang, Qiang Li, Junxiao Nan, Chunyang Liu, Wei Bian, and Jieping Ye. 2018. Large-Scale Order Dispatch in On-Demand Ride-Hailing Platforms: A Learning and Planning Approach. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 905–913.

[23] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. 2018. Mean Field Multi-Agent Reinforcement Learning. CoRR abs/1802.05438 (2018). arXiv:1802.05438

[24] Dongbin Zhao, Haitao Wang, Kun Shao, and Yuanheng Zhu. 2016. Deep reinforcement learning with experience replay based on SARSA. In Computational Intelligence (SSCI), 2016 IEEE Symposium Series on. IEEE, 1–6.