Deep Reinforcement Learning with Applications in Transportation
A tutorial at AAAI 2019, 8:30 AM – 12:30 PM, January 27, 2019,
Coral 2, Main level, Hilton Hawaiian Village, Honolulu, Hawaii, USA

Goal

Transportation, particularly the mobile ride-sharing domain, poses a number of traditionally challenging dynamic decision problems that have long threads of research literature and stand to benefit tremendously from artificial intelligence (AI). Core examples include online ride order dispatching, which matches available drivers to trip-requesting passengers on a ride-sharing platform in real time; route planning, which plans the best route between the origin and destination of a trip; and traffic signal control, which dynamically and adaptively adjusts the traffic signals within a region to achieve low delays. All of these problems share a common characteristic: a sequence of decisions has to be made, and what matters is a cumulative objective over a certain horizon. Reinforcement learning (RL) is a machine learning paradigm that trains an agent to take optimal actions (as measured by the total cumulative reward achieved) in an environment by interacting with it and receiving feedback signals. It is thus a class of optimization methods for solving sequential decision-making problems. Thanks to rapid advances in deep learning research and computing capabilities, the integration of deep neural networks with RL has generated explosive progress on complex, large-scale learning problems and attracted a huge amount of renewed interest in recent years. The combination of deep learning and RL has even been considered a path toward true AI. It presents tremendous potential to solve some hard problems in transportation in an unprecedented way.
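
To make the sequential decision-making setting above concrete, here is a minimal sketch (in Python) of the agent-environment interaction loop that accumulates the discounted return an RL agent seeks to maximize. The toy corridor environment, the random policy, and the discount factor are illustrative assumptions and not part of the tutorial material.

```python
import random

# A minimal sketch of the agent-environment loop described above. The toy
# corridor environment and the random policy are illustrative placeholders.

class Corridor:
    """The agent walks on positions 0..10; reaching position 10 ends the episode."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):                  # action: -1 (left) or +1 (right)
        self.pos = max(0, min(10, self.pos + action))
        done = self.pos == 10
        reward = 1.0 if done else -0.01      # small step cost rewards short paths
        return self.pos, reward, done

env = Corridor()
gamma = 0.99                                 # discount factor
state, ret, discount, done = env.reset(), 0.0, 1.0, False
for _ in range(10_000):                      # step cap keeps the sketch finite
    action = random.choice([-1, 1])          # a (poor) random policy; RL learns a better one
    state, reward, done = env.step(action)
    ret += discount * reward                 # the cumulative discounted objective
    discount *= gamma
    if done:
        break
print("episode return:", ret)
```

Replacing the random action selection with a learned policy that maximizes this return, for problems such as dispatching, routing, and signal control, is precisely what the methods covered in this tutorial aim to do.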

This tutorial is targeted at researchers and practitioners who have a general machine learning background and are interested in working on applications of deep RL (DRL) in transportation. The goal of this tutorial is to provide the audience with a guided introduction to this exciting area of AI through specially curated application case studies in transportation. The tutorial covers both theory and practice, with more emphasis on the practical aspects of DRL that are pertinent to tackling transportation challenges. After the half-day of lectures, the audience will have an overview of the core DRL methods and their applications, particularly in the transportation and ride-sharing domains. They will have a better understanding of the major challenges in transportation and how DRL can help solve them. They will also be introduced to several popular open-source DRL development and benchmarking frameworks to get a head start on experimentation.

Outline

Part I: Basics: Values

  • Machine learning paradigms: supervised, unsupervised, RL
  • RL basics
    • Markov decision process (Puterman 2014)
    • Optimization problem, objective
    • Value function, policy
    • DP methods: value iterations, policy iterations
    • TD learning (Sutton 1988)
    • Q-learning (Watkins and Dayan 1992), SARSA (Sutton, Barto, and others 1998); a tabular Q-learning sketch follows this outline
    • Example: Route planning
    • Example: TD(0) policy improvement for order dispatching (Xu et al. 2018)
  • Function approximation (Tsitsiklis and Van Roy 1997)
    • Linear/non-linear approximation, neural networks
    • DQN (Mnih et al. 2015), double DQN (Van Hasselt, Guez, and Silver 2016), Deep SARSA (Zhao et al. 2016; Ganger, Duryea, and Hu 2016); a minimal DQN training-step sketch follows this outline
    • Experience replay: prioritized experience replay (Schaul et al. 2015)
    • Example: DQN with action search for dispatching (Wang et al. 2018)
    • Example: Global-view DQN for dispatching and repositioning
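
As a concrete illustration of the tabular value-based methods listed above, the following minimal Q-learning sketch (Watkins and Dayan 1992) learns shortest routes on a toy grid, loosely in the spirit of the route-planning example. The grid world, the reward of -1 per step, and the hyperparameters are illustrative assumptions, not the tutorial's actual setup.

```python
import random
from collections import defaultdict

# Tabular Q-learning on a toy 5x5 grid: the agent starts at (0, 0) and learns
# the shortest route to the goal at (4, 4). Environment and hyperparameters
# are illustrative assumptions.

ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]      # right, left, down, up
GOAL, SIZE = (4, 4), 5

def step(state, action):
    r = max(0, min(SIZE - 1, state[0] + action[0]))
    c = max(0, min(SIZE - 1, state[1] + action[1]))
    next_state = (r, c)
    done = next_state == GOAL
    reward = 0.0 if done else -1.0                # -1 per step: shortest route maximizes return
    return next_state, reward, done

Q = defaultdict(float)                            # Q[(state, action)] -> estimated return
alpha, gamma, eps = 0.1, 0.95, 0.1                # step size, discount, exploration rate

for episode in range(500):
    state = (0, 0)
    for t in range(200):                          # cap episode length for the sketch
        # epsilon-greedy behavior policy
        if random.random() < eps:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning (off-policy TD) update toward the bootstrapped target
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
        if done:
            break
```

SARSA differs from this update only in that the bootstrap target uses the action actually taken by the behavior policy at the next state rather than the greedy maximum.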
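
The DQN-related items above replace the table with a neural Q-network, add a periodically synchronized target network, and sample minibatches from an experience replay buffer. The sketch below shows one possible training step with these ingredients, written with PyTorch; the state dimension, network width, and placeholder random transitions are assumptions for illustration only.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Minimal DQN training-step sketch (Mnih et al. 2015): an online Q-network,
# a periodically synced target network, and uniform experience replay.

STATE_DIM, N_ACTIONS = 4, 2
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                     # buffer of (s, a, r, s', done) transitions
gamma, batch_size = 0.99, 32

# Placeholder transitions; in practice they come from the agent acting
# (e.g., epsilon-greedily) in the environment.
for _ in range(1000):
    s, s2 = torch.randn(STATE_DIM), torch.randn(STATE_DIM)
    replay.append((s, random.randrange(N_ACTIONS), random.random(), s2, random.random() < 0.05))

def train_step():
    batch = random.sample(list(replay), batch_size)
    s, a, r, s2, done = zip(*batch)
    s, s2 = torch.stack(s), torch.stack(s2)
    a = torch.tensor(a).unsqueeze(1)
    r = torch.tensor(r)
    done = torch.tensor(done, dtype=torch.float32)
    with torch.no_grad():                          # TD target from the frozen target network
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a).squeeze(1)           # Q(s, a) from the online network
    loss = nn.functional.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

for step in range(200):
    train_step()
    if step % 50 == 0:                             # periodically sync the target network
        target_net.load_state_dict(q_net.state_dict())
```

Double DQN (Van Hasselt, Guez, and Silver 2016) changes only the target: the online network selects the maximizing action and the target network evaluates it; prioritized experience replay (Schaul et al. 2015) changes how minibatches are sampled from the buffer.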

Part II: Advances: Policies

  • Policy optimization
    • Policy gradient: REINFORCE, DDPG (Lillicrap et al. 2015), PPO (Schulman et al. 2017); a REINFORCE sketch follows this outline
    • Actor-critic: A2C, A3C (Mnih et al. 2016)
    • Example: PG methods for traffic signal control (Casas 2017; Richter 2007)
  • Advanced topics
    • Transfer learning, Example: Transfer among cities for dispatching (Wang et al. 2018)
    • Multi-agent RL: mean-field (Yang et al. 2018), Example: mean-field MARL for dispatching
    • Semi-MDP, Options (Sutton, Precup, and Singh 1999)
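
To complement the value-based sketches in Part I, here is a minimal REINFORCE (Monte Carlo policy gradient) sketch illustrating the basic idea behind the policy-gradient methods listed above: sample an episode from a stochastic policy, compute returns-to-go, and ascend the gradient of the return-weighted log-likelihood. The corridor environment, the network, and the hyperparameters are illustrative assumptions, not the traffic-signal or dispatching settings of the cited papers.

```python
import torch
import torch.nn as nn

# Minimal REINFORCE sketch: maximize E[sum_t log pi(a_t|s_t) * G_t] by
# gradient ascent. The toy corridor task is an illustrative placeholder.

N_STATES, N_ACTIONS, GOAL = 6, 2, 5               # walk right along a corridor to the goal
policy = nn.Sequential(nn.Linear(N_STATES, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

def one_hot(s):
    x = torch.zeros(N_STATES)
    x[s] = 1.0
    return x

for episode in range(300):
    s, done, log_probs, rewards = 0, False, [], []
    while not done and len(rewards) < 200:        # cap episode length for the sketch
        dist = torch.distributions.Categorical(logits=policy(one_hot(s)))
        a = dist.sample()                         # sample from the stochastic policy
        log_probs.append(dist.log_prob(a))
        s = max(0, min(N_STATES - 1, s + (1 if a.item() == 1 else -1)))
        done = s == GOAL
        rewards.append(1.0 if done else -0.05)    # step cost, bonus at the goal
    # discounted return-to-go G_t for every step of the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()   # negative PG objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Subtracting a learned baseline (a critic) from the returns reduces variance and leads to actor-critic methods such as A2C/A3C; DDPG and PPO refine the policy update for continuous actions and for stable, clipped improvements, respectively.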

Part III: Practice

Material

In PDF

Slide Decks of Related Talks

Presenters/Contributors

  • Zhiwei (Tony) Qin, DiDi AI Labs, DiDi Research America
  • Jian Tang, DiDi AI Labs, Didi Chuxing & Syracuse University
  • Jieping Ye, DiDi AI Labs, Didi Chuxing & University of Michigan, Ann Arbor
  • Lulu Zhang, Research Outreach, Didi Chuxing

Presenters' Bios

Dr. Zhiwei (Tony) Qin is a researcher at DiDi AI Labs, where he leads the reinforcement learning research. He received his Ph.D. in Operations Research from Columbia University and his B.Sc. in Computer Science and Statistics from the University of British Columbia, Vancouver. Tony is broadly interested in research topics at the intersection of optimization and machine learning, and most recently in reinforcement learning and its applications in operational optimization, digital marketing, traffic signal control, and education. He has published in top-tier conferences and journals in machine learning and optimization, including ICML, IEEE ICDM, JMLR, and MPC. He has served as a Senior PC/PC member for AAAI and SDM, and as a reviewer for JMLR, TPAMI, TKDE, and operations research journals.

Dr. Jian Tang is the chief scientist of intelligent control at DiDi AI Labs. He is also a professor at Syracuse University. His research interests lie in the areas of machine learning, big data, cloud computing, and wireless networking. He has published over 130 papers in premier journals and conferences. He received an NSF CAREER Award in 2009. He has served as an editor for several IEEE journals. In addition, he served as a TPC co-chair for the 2018 International Conference on Mobile and Ubiquitous Systems: Computing; as the TPC vice chair for the 2019 IEEE International Conference on Computer Communications (INFOCOM); and as an area TPC chair for INFOCOM 2017-2018. He is currently the vice chair of the Communications Switching and Routing Committee of the IEEE Communications Society.

Dr. Jieping Ye is the head of DiDi AI Labs and a VP of Didi Chuxing. He is also an associate professor at the University of Michigan, Ann Arbor. His research interests include big data, machine learning, and data mining, with applications in transportation and biomedicine. He has served as a Senior Program Committee member/Area Chair/Program Committee Vice Chair of many conferences, including NIPS, ICML, KDD, IJCAI, ICDM, and SDM. He serves as an Associate Editor of Data Mining and Knowledge Discovery, IEEE Transactions on Knowledge and Data Engineering, and IEEE Transactions on Pattern Analysis and Machine Intelligence. He won the NSF CAREER Award in 2010. His papers have been selected for the Outstanding Student Paper Award at ICML 2004, the KDD Best Research Paper runner-up in 2013, and the KDD Best Student Paper Award in 2014.

References

[Bertsekas and Tsitsiklis 1995] Bertsekas, D. P., and Tsitsiklis, J. N. 1995. Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, 560–564. IEEE Publ. Piscataway, NJ.

[Bertsekas 2005] Bertsekas, D. P. 2005. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA.

[Casas 2017] Casas, N. 2017. Deep deterministic policy gradient for urban traffic light control. arXiv preprint arXiv:1703.09035.

[Ganger, Duryea, and Hu 2016] Ganger, M.; Duryea, E.; and Hu, W. 2016. Double sarsa and double expected sarsa with shallow and deep learning. Journal of Data Analysis and Information Processing 4(04):159.

[Li 2017] Li, Y. 2017. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274.

[Lillicrap et al. 2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

[Mnih et al. 2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.

[Mnih et al. 2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, 1928–1937.

[Powell 2007] Powell, W. B. 2007. Approximate Dynamic Programming: Solving the curses of dimensionality, volume 703. John Wiley & Sons.

[Puterman 2014] Puterman, M. L. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.

[Richter 2007] Richter, S. 2007. Traffic light scheduling using policy-gradient reinforcement learning. In International Conference on Automated Planning and Scheduling (ICAPS).

[Schaul et al. 2015] Schaul, T.; Quan, J.; Antonoglou, I.; and Silver, D. 2015. Prioritized experience replay. arXiv preprint arXiv:1511.05952.

[Schulman et al. 2017] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[Sutton, Barto, and others 1998] Sutton, R. S.; Barto, A. G.; et al. 1998. Reinforcement learning: An introduction. MIT press.

[Sutton, Precup, and Singh 1999] Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence 112(1-2):181–211.

[Sutton 1988] Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine learning 3(1):9–44.

[Szepesvári 2010] Szepesvári, C. 2010. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning 4(1):1–103.

[Tsitsiklis and Van Roy 1997] Tsitsiklis, J. N., and Van Roy, B. 1997. Analysis of temporal-difference learning with function approximation. In Advances in neural information processing systems, 1075–1081.

[Van Hasselt, Guez, and Silver 2016] Van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double q-learning. In AAAI, 2094–2100.

[Wang et al. 2018] Wang, Z.; Qin, Z.; Tang, X.; Ye, J.; and Zhu, H. 2018. Deep reinforcement learning with knowledge transfer for online rides order dispatching. In International Conference on Data Mining. IEEE.

[Watkins and Dayan 1992] Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine learning 8(3-4):279–292.

[Xu et al. 2018] Xu, Z.; Li, Z.; Guan, Q.; Zhang, D.; Li, Q.; Nan, J.; Liu, C.; Bian, W.; and Ye, J. 2018. Large-scale order dispatch in on-demand ride-hailing platforms: A learning and planning approach. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 905–913. ACM.

[Yang et al. 2018] Yang, Y.; Luo, R.; Li, M.; Zhou, M.; Zhang, W.; and Wang, J. 2018. Mean field multi-agent reinforcement learning. CoRR abs/1802.05438.

[Zhao et al. 2016] Zhao, D.; Wang, H.; Shao, K.; and Zhu, Y. 2016. Deep reinforcement learning with experience replay based on SARSA. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), 1–6. IEEE.