In the last decade, Artificial Intelligence (AI) has changed the world significantly. Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. Spoken language processing is a wide area that includes speech recognition, conversational understanding, speech synthesis, and speech emotion recognition. Both NLP and speech processing bridge the gap between human communication and computer understanding.

NLP has seen a number of breakthroughs in recent years, including the release of pre-trained word vectors that boost accuracy in many NLP tasks. These vectors encode a wealth of information about the meaning, grammar, sentiment, and topic of both words and sentences. We describe vector-training methods and provide many practical examples of processing unstructured text data with this new technology. We also describe cross-lingual representations and issues with social bias.
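As a rough illustration of how word vectors expose meaning, the sketch below compares words by cosine similarity. The three-dimensional vectors and the helper names (`TOY_VECTORS`, `cosine_similarity`, `most_similar`) are invented for this example and are not taken from the tutorial materials; real pre-trained vectors typically have hundreds of dimensions and are loaded from a trained model.

```python
import math

# Toy vectors, made up for illustration only (an assumption, not real
# pre-trained embeddings).
TOY_VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word` by cosine."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )


if __name__ == "__main__":
    print(most_similar("king", TOY_VECTORS))
```

With these toy values, "king" lies much closer to "queen" than to "apple", which is the kind of semantic neighborhood structure that trained word vectors are expected to show.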

Although speech processing has been studied for decades, many breakthroughs have come in recent years with the emergence of deep learning. We cover the methods used in key areas of spoken language processing and provide practical applications of this new technology. We also discuss how to combine speech and text information for multimodal modeling.


The tutorial will consist of two lectures, delivered as a full-day event.


Part I: Natural Language Processing

• Word representations.

• Sentence representations.

• NLP Benchmarks.

• Multilingual representations. Social bias.

• Text embedding applications.

• Graph embedding techniques and applications.

• Text/Graph embedding applications in customer service scenarios.

Part II: Speech

• Speech recognition: basic concepts and classic methods.

• Speech recognition: deep learning approaches, end-to-end approaches, and applications.

• Conversational understanding: dialogue intent and topic mining.

• Multimodal approach: speech and text for emotion recognition.


In addition, we plan to distribute the following materials:

– Lecture slides

– Demo

– Survey paper for details on the topic




  • Kun Han, DiDi AI Labs, DiDi Research America
  • Xiangang Li, DiDi AI Labs, Didi Chuxing
  • Zang Li, DiDi AI Labs, Didi Chuxing
  • Kevin Knight, DiDi AI Labs, Didi Chuxing & University of Southern California, Los Angeles
  • Jieping Ye, DiDi AI Labs, Didi Chuxing & University of Michigan, Ann Arbor
  • Cheng (Angel) Gong, Research Outreach, Didi Chuxing

Tutors' Bios

Dr. Kun Han is a Senior Staff Researcher at DiDi Chuxing. He joined DiDi in 2018 and currently leads a team focusing on natural language and conversational understanding in Mountain View and Beijing. He received a PhD in computer science from The Ohio State University in 2014. Prior to DiDi, he was a research scientist at Facebook. He has co-authored over 20 academic papers in premier journals and conferences. His research interests include natural language processing, dialogue systems, speech recognition and processing, and recommendation systems.


Dr. Xiangang Li is a Principal Engineer and currently leads the Speech team at DiDi Chuxing. He received his PhD in machine intelligence from Peking University. Prior to DiDi, he was a senior Research Scientist at Baidu. He has co-authored over 30 academic papers on speech and language processing for several conferences and journals. His research interests include speech recognition, speech synthesis, and natural language processing.


Dr. Zang Li is a Distinguished Engineer at Didi Chuxing. He currently leads the Data Mining group and the NLP-Beijing team. He received his PhD from the College of Information Sciences and Technology at Pennsylvania State University. He has worked at Cisco, LinkedIn, and Zenefits. Dr. Li's main research interests include recommendation systems, big data, machine learning, and natural language processing. He has co-authored over 20 academic papers on data mining and has also served as a reviewer for several conferences and journals. He joined DiDi in 2015, working on data mining, big data platforms, natural language processing, knowledge graphs, and growth strategies.


Prof. Kevin Knight is Chief Scientist for Natural Language Processing (NLP) at Didi Chuxing. He leads a DiDi lab in Los Angeles devoted to NLP research. He was previously Dean's Professor of Computer Science at the University of Southern California (USC) and a Research Director and Fellow at USC's Information Sciences Institute (ISI). He received a PhD in computer science from Carnegie Mellon University and a bachelor's degree from Harvard University. Dr. Knight's research interests include human-machine communication, machine translation, language generation, automata theory, and decipherment. Dr. Knight co-authored the widely adopted textbook "Artificial Intelligence" (McGraw-Hill). In 2001, he co-founded the machine translation company Language Weaver, Inc. Dr. Knight served as President of the Association for Computational Linguistics (ACL) in 2011, and he is currently a Fellow of ACL, ISI, and AAAI (Association for the Advancement of Artificial Intelligence).


Prof. Jieping Ye is Head of Didi AI Labs, a VP of Didi Chuxing, and a Didi Fellow. He is also an associate professor at the University of Michigan, Ann Arbor. His research interests include big data, machine learning, and data mining with applications in transportation and biomedicine. He has served as a Senior Program Committee member, Area Chair, or Program Committee Vice Chair for many conferences, including NIPS, ICML, KDD, IJCAI, ICDM, and SDM. He serves as an Associate Editor of Data Mining and Knowledge Discovery, IEEE Transactions on Knowledge and Data Engineering, and IEEE Transactions on Pattern Analysis and Machine Intelligence. He won the NSF CAREER Award in 2010. His papers received the outstanding student paper award at ICML in 2004, the KDD best research paper runner-up award in 2013, and the KDD best student paper award in 2014.

Related Materials


  • P. D. Turney and P. Pantel. From frequency to meaning: Vector space models of semantics. JAIR, 2010.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. Proc. ICLR, 2013.
  • Y. Meng, W. Wu, F. Wang, X. Li, P. Nie, F. Yin, M. Li, Q. Han, X. Sun, and J. Li. Glyce: Glyph-vectors for Chinese Character Representations.
  • Q. Le and T. Mikolov. Distributed representations of sentences and documents. Proc. ICML, 2014.
  • R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thought vectors. Proc. NIPS, 2015.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. Proc. NAACL, 2018.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. Proc. NAACL, 2019.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. GLUE: A Multi-Task Benchmark and Analysis Platform. Proc. EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018.
  • S. Ruder, I. Vulic, and A. Sogaard. A Survey of Cross-lingual Word Embedding Models.
  • J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang. Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. Proc. EMNLP, 2017.
  • L. R. Rabiner. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," in Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989.
  • Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 2012.
  • Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.
  • J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577–585, 2015.
  • W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964, 2016.
  • Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of machine learning research, 12(Aug): 2493–2537, 2011.
  • Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.
  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.
  • K. Han, D. Yu, and I. Tashev. Speech emotion recognition using deep neural network and extreme learning machine. Interspeech, pp. 223–227, 2014.
  • J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696, 2011.