Deep Learning for AI: from Machine Perception to Machine Cognition

Li Deng
Chief Scientist of AI, Microsoft Applications/Services Group (ASG) & MSR Deep Learning Technology Center (DLTC)

Thanks go to many colleagues at DLTC & MSR, collaborating universities, and at Microsoft's engineering groups (ASG+).

A Plenary Presentation at IEEE-ICASSP, March 24, 2016

Definition
Deep learning is a class of machine learning algorithms that [1] (pp. 199-200):
- use a cascade of many layers of nonlinear processing (sketched below);
- are part of the broader machine learning field of learning representations of data, facilitating end-to-end optimization;
- learn multiple levels of representations that correspond to hierarchies of concept abstraction. [2, 3]
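To make the first bullet concrete, here is a minimal sketch (Python/NumPy; the sizes, depth, and choice of tanh are illustrative, not taken from any system in this talk) of a cascade of nonlinear processing layers, each re-representing the output of the previous one:

```python
# A minimal sketch of the definition above: a "deep" model is a cascade of
# many layers of nonlinear processing, each producing a new representation.
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 64)) * 0.1 for _ in range(8)]  # 8-layer cascade

def forward(x, layers):
    for W in layers:            # each layer: linear map followed by a nonlinearity
        x = np.tanh(W @ x)      # successive levels of representation
    return x

y = forward(rng.standard_normal(64), layers)
```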
Artificial intelligence (AI) is the intelligence exhibited by machines or software. It is also the name of the academic field of study on how to create computers and computer software that are capable of intelligent behavior.
Artificial general intelligence (AGI) is the intelligence of a (hypothetical) machine that could successfully perform any intellectual task that a human being can. It is a primary goal of artificial intelligence research and an important topic for science fiction writers and futurists. Artificial general intelligence is also referred to as "strong AI".

AI/(A)GI & Deep Learning: the main thesis
AI/GI = machine perception (speech, image, video, gesture, touch, ...) + machine cognition (natural language, reasoning, attention, memory/learning, knowledge, decision making, action, interaction/conversation, ...)
GI: AI that is flexible, general, adaptive, learning from first principles
Deep Learning + Reinforcement/Unsupervised Learning → AI/GI

AI/GI & Deep Learning: how AlphaGo fits
AI/GI = machine perception (speech, image, video, gesture, touch, ...) + machine cognition (natural language, reasoning, attention, memory/learning, knowledge, decision making, action, interaction/conversation, ...)
AGI: AI that is flexible, general, adaptive, learning from first principles
Deep Learning + Reinforcement/Unsupervised Learning → AI/AGI

Outline
- Deep learning for machine perception: speech; image
- Deep learning for machine cognition: semantic modeling; natural language; multimodality; reasoning, attention, memory (RAM); knowledge representation/management/exploitation; optimal decision making (by deep reinforcement learning)
- Three hot areas/challenges of deep learning & AI research
Deep Learning Research: centered at NIPS (Neural Information Processing Systems), Dec 7-12, 2015
[Timeline: Hinton & MSR, 2009; Hinton & ImageNet & "bidding", 2012; Zuckerberg & LeCun, 2013; Musk & RAM & OpenAI, 2015]

Deep Learning Tutorial

The Universal Translator comes true!
"Scientists See Promise in Deep-Learning Programs", John Markoff, New York Times, November 23, 2012.
Tianjin, China, October 25, 2012: deep learning technology enabled speech-to-speech translation. A voice recognition program translated a speech given by Richard F. Rashid, Microsoft's top scientist, into Mandarin Chinese.
Microsoft Research: DNN speech papers, 2009-2012 (CD-DNN-HMM invented, 2010)
- Deep belief networks for phone recognition, NIPS, December 2009
- Investigation of full-sequence training of DBNs for speech recognition, Interspeech, Sept 2010
- Binary coding of speech spectrograms using a deep auto-encoder, Interspeech, Sept 2010
- Roles of pre-training & fine-tuning in CD-DBN-HMMs for real-world ASR, NIPS, Dec 2010
- Large vocabulary continuous speech recognition with CD-DNN-HMMs, ICASSP, April 2011
- Conversational speech transcription using context-dependent DNN, Interspeech, Aug 2011
- Making deep belief networks effective for LVCSR, ASRU, Dec 2011
- Application of pretrained DNNs to large vocabulary speech recognition, ICASSP, 2012
- [Hu Yu] How was iFLYTEK Super Brain 2.0 built? (2011, 2015)

Across-the-Board Deployment of DNN in Speech Industry (+ in university labs & DARPA programs) (2012-2014)

In the academic world:
"This joint paper (2012), from the major speech recognition laboratories, details the first major industrial application of deep learning."

State-of-the-Art Speech Recognition Today (& tomorrow: roles of unsupervised learning)

ASR: Neural Network Architectures at Google
Single channel:
- LSTM acoustic model trained with connectionist temporal classification (CTC), sketched below
- Results on a 2,000-hr English Voice Search task show an 11% relative improvement
- Papers: H. Sak et al., ICASSP 2015, Interspeech 2015; A. Senior et al., ASRU 2015

Model                           WER (%)
LSTM w/ conventional modeling   14.0
LSTM w/ CTC                     12.9

Multi-channel:
- Multi-channel raw-waveform input for each channel
- Initial network layers factored to do spatial and spectral filtering
- Output passed to a CLDNN acoustic model; entire network trained jointly
- Results on a 2,000-hr English Voice Search task show more than 10% relative improvement
- Papers: T. N. Sainath et al., ASRU 2015, ICASSP 2016

Model                           WER (%)
raw-waveform, 1ch               19.2
delay+sum, 8 channel            18.7
MVDR, 8 channel                 18.8
factored raw-waveform, 2ch      17.1

(Sainath, Senior, Sak, Vinyals; slide credit: Tara Sainath & Andrew Senior)
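As a concrete reference for the CTC-trained acoustic model above, here is a minimal sketch in PyTorch (assumed available). All dimensions, layer counts, and the 42-label inventory are illustrative, not the published configuration:

```python
# Minimal sketch: an LSTM acoustic model trained with the CTC objective.
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    def __init__(self, n_feats=40, n_hidden=320, n_labels=42):  # sizes are ours
        super().__init__()
        self.lstm = nn.LSTM(n_feats, n_hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(n_hidden, n_labels)

    def forward(self, x):                    # x: (batch, time, features)
        h, _ = self.lstm(x)
        return self.proj(h).log_softmax(-1)  # per-frame label log-probabilities

model = LSTMAcousticModel()
ctc = nn.CTCLoss(blank=0)                    # label 0 reserved as the CTC blank
feats = torch.randn(8, 200, 40)              # 8 utterances x 200 frames x 40 features
log_probs = model(feats).transpose(0, 1)     # CTCLoss expects (time, batch, labels)
targets = torch.randint(1, 42, (8, 30))      # dummy label sequences, 30 labels each
loss = ctc(log_probs, targets,
           torch.full((8,), 200, dtype=torch.long),
           torch.full((8,), 30, dtype=torch.long))
loss.backward()
```

CTC removes the need for frame-level alignments: the loss marginalizes over all frame-to-label alignments compatible with the target sequence.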
Baidu's Deep Speech 2: End-to-End DL System for Mandarin and English
Paper: bit.ly/deepspeech2
Human-level Mandarin recognition on short queries:
- DeepSpeech: 3.7%-5.7% CER; humans: 4%-9.7% CER
- Trained on 12,000 hours of conversational, read, and mixed speech
- 9-layer RNN with CTC cost: 2D invariant convolution; 7 recurrent layers; fully connected output
- Trained with SGD on a heavily optimized HPC system; "SortaGrad" curriculum learning (sketched below)
- "Batch Dispatch" framework for low-latency production deployment
- Real-time: reduction of 16%; WER: reduction of 10%
(Slide credit: Andrew Ng & Adam Coates)
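A minimal sketch of the "SortaGrad" idea as described above, under our reading of the paper: the first epoch visits utterances shortest-first (which keeps early CTC gradients well behaved), and later epochs shuffle as usual. `dataset` is an assumed list of (features, transcript) pairs:

```python
# Length-based curriculum in the spirit of Deep Speech 2's "SortaGrad".
import random

def epoch_order(dataset, epoch):
    if epoch == 0:
        return sorted(dataset, key=lambda ex: len(ex[0]))  # shortest utterances first
    shuffled = list(dataset)
    random.shuffle(shuffled)                                # normal shuffling afterwards
    return shuffled
```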
Learning transition probabilities in DNN-HMM ASR
DNN outputs include not only HMM state posteriors but also HMM transition probabilities (trained on Siri data; see the sketch below).
Matthias Paulik, "Improvements to the Pruning Behavior of DNN Acoustic Models", Interspeech 2015. (Slide credit: Alex Acero)
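A hedged sketch of the idea as summarized on the slide: a shared DNN body with two output heads, one producing state posteriors and one producing transition probabilities. The layer sizes, head dimensions, and names below are ours, not the paper's:

```python
# Two-headed DNN: state posteriors plus HMM transition probabilities.
import torch
import torch.nn as nn

class DualHeadDNN(nn.Module):
    def __init__(self, n_feats=440, n_states=9000, n_trans=2):  # sizes assumed
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_feats, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.state_head = nn.Linear(1024, n_states)  # HMM state posteriors
        self.trans_head = nn.Linear(1024, n_trans)   # e.g., self-loop vs. forward

    def forward(self, x):
        h = self.body(x)
        return (self.state_head(h).log_softmax(-1),
                self.trans_head(h).log_softmax(-1))

states, trans = DualHeadDNN()(torch.randn(8, 440))
```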
FSMN-based LVCSR System
- Feedforward Sequential Memory Network (FSMN), with a memory block spanning -/+ 15 frames (sketched below)
- 8 hidden layers; CTC training criterion
- Results on a 10,000-hour Mandarin short-message dictation task
- Comparable results to DBLSTM with smaller model size; training costs only 1 day using 16 GPUs and the ASGD algorithm

Model      #Param (M)   CER (%)
ReLU DNN   40           6.40
LSTM       27.5         5.25
BLSTM      45           4.67
FSMN       19.8         4.61

Shiliang Zhang, Cong Liu, Hui Jiang, Si Wei, Lirong Dai, Yu Hu. "Feedforward Sequential Memory Networks: A New Structure to Learn Long-term Dependency". arXiv:1512.08031, 2015. (Slide credit: Cong Liu & Yu Hu)
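A minimal sketch of an FSMN-style memory block, under the assumption of the vector-coefficient variant: each frame's memory vector is a learned, elementwise-weighted sum of a hidden layer's activations over a window of past and future frames (here +/- 15 frames, as on the slide). Shapes and names are ours:

```python
# FSMN-style memory block: windowed weighted sum of hidden activations.
import numpy as np

def fsmn_memory(h, a_past, a_future):
    """h: (T, d) hidden activations; a_past: (N1+1, d) taps for the current and
    past frames; a_future: (N2, d) taps for future frames."""
    T, _ = h.shape
    n_past, n_future = a_past.shape[0] - 1, a_future.shape[0]
    m = np.zeros_like(h)
    for t in range(T):
        for i in range(n_past + 1):          # current frame plus past context
            if t - i >= 0:
                m[t] += a_past[i] * h[t - i]
        for j in range(1, n_future + 1):     # future context
            if t + j < T:
                m[t] += a_future[j - 1] * h[t + j]
    return m                                 # passed to the next layer alongside h

h = np.random.randn(100, 512)                # 100 frames of a 512-dim hidden layer
m = fsmn_memory(h, np.random.randn(16, 512), np.random.randn(15, 512))
```

Because the memory is a feedforward filter rather than a recurrence, training parallelizes like a DNN, which is consistent with the slide's claim of LSTM-like accuracy at lower training cost.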
English Conversational Telephone Speech Recognition*
Key ingredients:
- Joint RNN/CNN acoustic model trained on 2,000 hours of publicly available audio
- Maxout activations
- Exponential and NN language models
[Architecture diagram: CNN features (two conv. layers) and RNN features (a recurrent layer) pass through bottleneck and hidden layers into a joint output layer.]

WER results on Switchboard Hub5-2000:
Model             WER SWB   WER CH
CNN               10.4      17.9
RNN               9.9       16.3
Joint RNN/CNN     9.3       15.6
+ LM rescoring    8.0       14.1

*Saon et al., "The IBM 2015 English Conversational Telephone Speech Recognition System", Interspeech 2015. (Slide credit: G. Saon & B. Kingsbury)
Recent Research at MS (ICASSP-2016):
- SP-P14.5: "Scalable Training of Deep Learning Machines by Incremental Block Training with Intra-Block Parallel Optimization and Blockwise Model-Update Filtering", by Kai Chen and Qiang Huo (sketched below)
- "Highway LSTM RNNs for Distant Speech Recognition"
- "Self-Stabilized Deep Neural Networks"
CNTK/Philly
*Google recently announced that TensorFlow can now scale to support multiple machines; comparisons have not been made yet.
(Slide credit: Xuedong Huang)
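A hedged sketch of the blockwise model-update filtering (BMUF) step, based on the paper's title and the published BMUF literature rather than on this slide: workers optimize copies of the model on one data block in parallel, the averaged result is treated as a raw model update, and that update is low-pass filtered with a block-level momentum before being applied. Variable names and defaults are ours:

```python
# Blockwise model-update filtering (BMUF), in the spirit of Chen & Huo (ICASSP 2016).
import numpy as np

def bmuf_step(w_prev, worker_models, delta_prev, block_momentum=0.9, block_lr=1.0):
    w_avg = np.mean(worker_models, axis=0)               # intra-block parallel result
    g = w_avg - w_prev                                   # raw model update for this block
    delta = block_momentum * delta_prev + block_lr * g   # filtered (momentum-smoothed) update
    return w_prev + delta, delta                         # new global model, filter state
```

The filtering is what distinguishes this from plain model averaging: historical block updates are carried forward, which helps preserve accuracy as the number of parallel workers grows.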
Deep Learning also Shattered Image Recognition (since 2012)

[Chart: ILSVRC error falling year over year; 3.567% / 3.581% with super-deep 152-layer networks in the 4th year.]

[Figure: layer-by-layer architecture diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); and GoogLeNet, 22 layers (ILSVRC 2014), the latter built from stacked Inception modules of parallel 1x1/3x3/5x5 convolutions with auxiliary softmax heads. Depth is of crucial importance.]
ILSVRC (Large Scale Visual Recognition Challenge). (Slide credit: Jian Sun, MSR)
[Figure: layer-by-layer comparison of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); and ResNet, 152 layers (ILSVRC 2015). ResNet repeats 1x1/3x3/1x1 conv bottleneck blocks with stride-2 downsampling, ending in average pooling and a 1000-way fc layer (residual block sketched below). Depth is of crucial importance.]
ILSVRC (Large Scale Visual Recognition Challenge). (Slide credit: Jian Sun, MSR)
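A minimal sketch of the residual bottleneck block that makes 152-layer networks trainable: the stacked layers learn F(x) and the block outputs F(x) + x, so added depth can default to the identity mapping. The channel sizes follow the 256 → 64 → 64 → 256 bottleneck pattern visible in the diagram; other details (BatchNorm placement, etc.) are illustrative:

```python
# Residual bottleneck block in the style of ResNet.
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1), nn.BatchNorm2d(bottleneck), nn.ReLU(),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1), nn.BatchNorm2d(bottleneck), nn.ReLU(),
            nn.Conv2d(bottleneck, channels, 1), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)   # skip connection: output is F(x) + x

y = Bottleneck()(torch.randn(1, 256, 56, 56))
```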
Outline
- Deep learning for machine perception: speech; image
- Deep learning for machine cognition: semantic modeling; natural language; multimodality; reasoning, attention, memory (RAM); knowledge representation/management/exploitation; optimal decision making (by deep reinforcement learning)
- Three hot areas/challenges of deep learning & AI research

Deep Semantic Model for Symbol Embedding
[Figure: source string Ms "racing car" and target strings Mt1 "formula one" (similar) and Mt2 "racing to me" (apart). Each string maps from a bag-of-words input (dim = 100M) through a fixed letter-trigram encoding matrix (dim = 50K), a letter-trigram embedding matrix (d = 500), and further layers (d = 500, d = 300), with weights W_{s,1..4} / W_{t,1..4}, into a semantic vector.]
Huang, P., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. "Learning deep structured semantic models for web search using clickthrough data." In ACM-CIKM, 2013.
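A minimal sketch of the model pictured above: strings are hashed into letter-trigram counts (the fixed 50K-dim encoding), pushed through per-side MLP towers (the W_s / W_t stacks; layer sizes follow the figure) into a shared semantic space, and compared by cosine similarity. The toy hashing function is ours; in the paper the trigram vocabulary is fixed and the towers are trained on clickthrough pairs:

```python
# Deep structured semantic model (DSSM) sketch: trigram hashing + twin towers + cosine.
import torch
import torch.nn as nn
import torch.nn.functional as F

TRIGRAM_DIM = 50_000                  # fixed letter-trigram encoding dim, per the figure

def letter_trigrams(text):
    """Fixed (untrained) letter-trigram bag via toy hashing."""
    v = torch.zeros(TRIGRAM_DIM)
    padded = f"#{text}#"
    for i in range(len(padded) - 2):
        v[hash(padded[i:i + 3]) % TRIGRAM_DIM] += 1
    return v

def make_tower():                     # 50K -> 500 -> 500 -> 300, per the figure
    return nn.Sequential(
        nn.Linear(TRIGRAM_DIM, 500), nn.Tanh(),
        nn.Linear(500, 500), nn.Tanh(),
        nn.Linear(500, 300), nn.Tanh(),
    )

src_tower, tgt_tower = make_tower(), make_tower()   # separate W_s / W_t weights

def semantic_sim(source, target):
    s = src_tower(letter_trigrams(source))
    t = tgt_tower(letter_trigrams(target))
    return F.cosine_similarity(s, t, dim=0)

# After training on clickthrough data, the model should score
# ("racing car", "formula one") high and ("racing car", "racing to me") low.
print(semantic_sim("racing car", "formula one"))
print(semantic_sim("racing car", "racing to me"))
```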
Many applications of Deep Semantic Modeling: learning the semantic relationship between "Source" and "Target"

Tasks                               Source
Word semantic embedding             context
Web search                          search query
Query intent detection              search query
Question answering                  pattern / mention (in NL)
Machine translation                 sentence in language a
Query auto-suggestion               search query
Query auto-completion               partial search query
Apps recommendation                 user profile
Distillation of survey feedbacks    feedback in text
Automatic image captioning          image
Image retrieval                     text query
Natural user interface              command (text / speech / gesture)
Ads selection                       search query
Ads click prediction                search query
Email analysis: people prediction   email content
Email search                        search query
Email decluttering                  email contents
Knowledge-base construction         entity from source
Contextual entity search            key phrase / context