Deep Learning Tutorial, 李宏毅 Hung-yi Lee. Deep learning attracts lots of attention. I believe you have seen lots of exciting results before. This talk focuses on the basic techniques. [Figure: deep learning trends at Google. Source: SIGMOD / Jeff Dean]

Outline. Lecture I: Introduction of Deep Learning.

Machine Learning ≈ Looking for a Function. Speech Recognition: f(audio) = "How are you"; Image Recognition: f(image) = "Cat"; Playing Go: f(board position) = "5-5" (next move); Dialogue System: f("Hi", what the user said) = "Hello" (system response).

Framework, Image Recognition: a Model is a set of functions; different candidate functions map images to outputs such as "cat", "dog", "money", "snake".
Framework, Image Recognition: on the Training Data (images labelled "monkey", "cat", "dog"; function input: image, function output: label), the goodness of a function f can be evaluated, so some functions are judged better than others. This is Supervised Learning.

Framework, Image Recognition: with the Model (a set of functions) and the Training Data, measure the goodness of each function f, pick the "best" function, and then use it on new images ("cat"): Training, then Testing.

Step 1, Step 2, Step 3: the Three Steps for Deep Learning.

Neural Network. A Neuron is a simple function: the inputs are weighted, a bias is added, and the result z is passed through an activation function; with the sigmoid activation, for example, z = 4 gives an output of 0.98. Different connections lead to different network structures.
Weights and biases are the network parameters; different neurons have different values of weights and biases.

Fully Connected Feedforward Network, worked example: with input (1, -1), the first-layer neurons (weights (1, -2) with bias 1, and weights (-1, 1) with bias 0) produce the sums 4 and -2, which the sigmoid maps to 0.98 and 0.12; the following layers (outputs 0.86 and 0.11, then 0.62 and 0.83) give the final output. Feeding the input (0, 0) through the same network gives (0.51, 0.85).

Given the parameters, the network defines a function: f([1, -1]) = [0.62, 0.83] and f([0, 0]) = [0.51, 0.85]. This is a function whose input is a vector and whose output is a vector. Given only the network structure, we define a function set. Input Layer → Hidden Layers → Output Layer.
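To make the worked example concrete, here is a minimal NumPy sketch of the first hidden layer of the slide's network (weights (1, -2) and (-1, 1), biases 1 and 0); the weights of the later layers are not reproduced, so the code only checks the 0.98 / 0.12 values.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# first hidden layer of the slide's example network
W1 = np.array([[ 1.0, -2.0],    # weights of the first neuron
               [-1.0,  1.0]])   # weights of the second neuron
b1 = np.array([1.0, 0.0])       # biases

x = np.array([1.0, -1.0])       # input vector
a1 = sigmoid(W1 @ x + b1)       # neuron outputs
print(a1)                       # -> approximately [0.98 0.12]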
Fully Connected Feedforward Network: input x1 … xN, output y1 … yM. "Deep" means many hidden layers of neurons.

Why Deep? Universality Theorem: any continuous function f can be realized by a network with one hidden layer (given enough hidden neurons); reference for the reason: http://… . So why a "deep" neural network rather than a "fat" one? Analogy: logic circuits consist of gates, and two layers of logic gates can represent any Boolean function, yet building some functions from multiple layers of gates is much simpler; likewise, a neural network consists of neurons, and a network with one hidden layer can represent any continuous function, yet representing some functions with multiple layers of neurons is much simpler, and may need less data. More reasons: https://… .

Why Deep? Analogy. Deep = many hidden layers: AlexNet (2012), 8 layers, 16.4% error; VGG (2014), 19 layers, 7.3%; GoogleNet (2014), 22 layers, 6.7%; Residual Net (2015), a special structure with 152 layers, 3.57%. (http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf)

Output Layer: with an ordinary layer as the output layer, the outputs of the network can be any value, which may not be easy to interpret; instead, a softmax layer is used as the output layer. Softmax Layer: for example, the inputs 3, 1, -3 give e^3 ≈ 20, e^1 ≈ 2.7, e^-3 ≈ 0.05, which are normalized to 0.88, 0.12 and ≈0; the outputs form a probability distribution, 1 ≥ y_i ≥ 0 and Σ_i y_i = 1.
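A minimal NumPy sketch of the softmax computation on the slide's inputs 3, 1 and -3:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtract the max for numerical stability
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])  # layer inputs from the slide
print(softmax(z))               # -> approximately [0.88 0.12 0.00]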
Example Application: Handwriting Digit Recognition. The input image is 16 x 16 = 256 pixels (ink → 1, no ink → 0), given to the machine as a 256-dim vector; the output is a 10-dim vector in which each dimension represents the confidence of a digit, e.g. "is 1": 0.1, "is 2": 0.7, …, "is 0": 0.2, so the image is "2". What is needed is a function with a 256-dim input vector and a 10-dim output vector; the neural network (Input Layer, Hidden Layers, Output Layer) is a function set containing the candidates for handwriting digit recognition. You need to decide the network structure to let a good function be in your function set.

FAQ. Q: How many layers? How many neurons for each layer? Trial and error plus intuition. Q: Can we design the network structure? Yes, e.g. the Convolutional Neural Network (CNN) in the next lecture. Q: Can the structure be automatically determined? Yes, but not widely studied yet.

Highway Network and Residual Network (Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385; Training Very Deep Networks, https://arxiv.org/pdf/1507.06228v2.pdf). [Figure: a residual block adds a copy of the block input to the transformed output; a highway block uses a gate controller to mix the copied input with the transformed input between the input layer and the output layer.] A Highway Network automatically determines the number of layers needed.
Three Steps for Deep Learning.

Training Data: prepare training data, images and their labels ("5", "0", "4", "1", "3", "1", "2", "9"). The learning target is defined on the training data.

Learning Target: for the 16 x 16 = 256 input (ink → 1, no ink → 0) and the softmax outputs y1, y2, …, y10 ("is 1", "is 2", …, "is 0"), the target is that y1 has the maximum value when the input is a "1", that y2 has the maximum value when the input is a "2", and so on.

Loss: the loss l can be the square error or the cross entropy between the network output and the target; for a "1" the target is (1, 0, 0, …, 0), and the output should be as close to it as possible. A good function should make the loss of all examples as small as possible.

Total Loss: given a set of parameters, run the network on all R training examples x^1, x^2, x^3, … with targets ŷ^1, ŷ^2, ŷ^3, …, obtaining the per-example losses l^1, l^2, l^3, …; the total loss L = Σ_{r=1}^{R} l^r should be as small as possible. Find the network parameters θ* that minimize the total loss L, that is, find the function in the function set that minimizes L.
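A small NumPy sketch of the total loss, assuming cross entropy as the per-example loss l^r and using three made-up examples:

import numpy as np

def cross_entropy(y, y_hat):
    # per-example loss l^r = -sum_i y_hat_i * ln(y_i)
    return -np.sum(y_hat * np.log(y + 1e-12))

# three made-up examples: network outputs y^r and one-hot targets y_hat^r
Y     = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
Y_hat = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]])

L = sum(cross_entropy(y, y_hat) for y, y_hat in zip(Y, Y_hat))  # L = sum_r l^r
print(L)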
Three Steps for Deep Learning, how to pick the best function: find the network parameters θ* that minimize the total loss L.

The network parameters are θ = {w1, w2, w3, …, b1, b2, b3, …}. Enumerating all possible values is hopeless: e.g. in speech recognition, a network with 8 layers and 1000 neurons per layer has about 10^6 weights between two adjacent 1000-neuron layers.

Gradient Descent (illustrated for a single parameter w against the total loss L): pick an initial value for w, by random initialization or RBM pre-training (random is usually good enough). Compute ∂L/∂w: if it is negative, increase w; if it is positive, decrease w. Update w ← w - η ∂L/∂w, where η is called the "learning rate". Repeat until ∂L/∂w is approximately zero (when the update is little).
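A minimal sketch of this update rule on a toy one-parameter loss; the quadratic loss and the stopping threshold are assumptions for illustration.

import numpy as np

def L(w):                    # toy total loss (assumed for illustration)
    return (w - 3.0) ** 2

def dL_dw(w):                # its derivative
    return 2.0 * (w - 3.0)

eta = 0.1                    # learning rate
w = np.random.randn()        # pick an initial value for w
for _ in range(1000):        # repeat ...
    grad = dL_dw(w)          # compute dL/dw
    if abs(grad) < 1e-6:     # ... until dL/dw is approximately zero
        break
    w = w - eta * grad       # w <- w - eta * dL/dw
print(w, L(w))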
In two dimensions (w1, w2), with the color showing the value of the total loss L: randomly pick a starting point, compute the gradient (∂L/∂w1, ∂L/∂w2), and move against it; hopefully we reach a minimum.

Local Minima: plotting the total loss against the value of a network parameter w, gradient descent is very slow at a plateau (∂L/∂w ≈ 0), gets stuck at a saddle point (∂L/∂w = 0), and gets stuck at a local minimum (∂L/∂w = 0). Gradient descent never guarantees the global minimum: in the (w1, w2) landscape, different initial points reach different minima and therefore different results.

Gradient Descent: this is the "learning" of machines in deep learning, and even AlphaGo uses this approach. I hope you are not too disappointed :p (what people imagine vs. what it actually is).

Backpropagation: an efficient way to compute ∂L/∂w in a neural network. Ref: https://… .
Three Steps for Deep Learning: Deep Learning is so simple. Now, if you want to find a function and you have lots of function input/output pairs as training data, you can use deep learning. For example, you can do image recognition: Network(image) = "cat" / "dog" / "monkey", trained on labelled images ("cat", "dog", …). For example, you can do spam filtering (http://spam-filter-…): inputs such as "free" in e-mail or "Talk" in e-mail, output 1 (Yes) or 0 (No). For example, you can do document classification (http://top-breaking-…): inputs such as "stock" in document, outputs such as sports (體育), politics (政治), finance (財經).

Outline: Keras. If you want to learn Theano: http://speech.ee.ntu.edu.tw/tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html and http://speech.ee.ntu.edu.tw/tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html. TensorFlow and Theano are very flexible but need some effort to learn; Keras is an interface of TensorFlow or Theano, easy to learn and use while still having some flexibility, and you can modify it if you can write TensorFlow or Theano.

Keras: François Chollet is the author of Keras. He currently works for Google as a deep learning engineer and researcher. Keras means "horn" in Greek. Documentation: http://keras.io/ . Example: https://… ("Keras 心得", Keras notes). Thanks to 沈昇勳 for providing the figures.

Example Application: Handwriting Digit Recognition. The machine reads a 28 x 28 image and outputs "1"; this is the "Hello world" of deep learning. MNIST data: http://… ; Keras provides a data set loading function: http://keras.io/datasets/ .

Keras: build a network with a 28 x 28 = 784 input, two hidden layers of 500 neurons, and a softmax output y1, …, y10. Step 3.1: Configuration (e.g. a learning rate of 0.1). Step 3.2: find the optimal network parameters from the training data (images) and labels (digits); the images form a numpy array of shape (number of training examples, 28 x 28 = 784) and the labels a numpy array of shape (number of training examples, 10). (See https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html.)
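A sketch of the whole pipeline in current Keras 2 syntax (the original slides use older Keras 1 calls); apart from the 784-500-500-10 structure and the learning rate 0.1 shown on the slide, the sigmoid activations, the cross-entropy loss and the SGD optimizer are assumptions.

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.datasets import mnist
from keras.utils import to_categorical

# Step 1: define the function set, 784 - 500 - 500 - 10 with softmax output
model = Sequential()
model.add(Dense(500, activation='sigmoid', input_dim=784))
model.add(Dense(500, activation='sigmoid'))
model.add(Dense(10, activation='softmax'))

# Steps 2 and 3.1: goodness of function (loss) and configuration
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.1),
              metrics=['accuracy'])

# Load MNIST with the Keras data set loading function, then flatten / rescale
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255
x_test  = x_test.reshape(-1, 784).astype('float32') / 255
y_train = to_categorical(y_train, 10)   # 10-dim one-hot labels
y_test  = to_categorical(y_test, 10)

# Step 3.2: find the optimal network parameters
model.fit(x_train, y_train, batch_size=100, epochs=20)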
Keras, how to use the neural network (testing): case 1, evaluate the model on a labelled testing set; case 2, predict the outputs for new inputs. Save and load models: http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model .
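A sketch of the two testing cases and of saving/loading, reusing model, x_test and y_test from the sketch above; the file name my_model.h5 is made up.

from keras.models import load_model

# case 1: evaluate on a labelled testing set
score = model.evaluate(x_test, y_test)
print('Total loss on testing set:', score[0])
print('Accuracy of testing set:',  score[1])

# case 2: predict outputs for new inputs
result = model.predict(x_test)

# save and load models (see the Keras FAQ linked above); needs h5py
model.save('my_model.h5')
model = load_model('my_model.h5')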
Keras, using a GPU to speed up training. Way 1: THEANO_FLAGS=device=gpu0 python YourCode.py. Way 2 (in your code): import os; os.environ["THEANO_FLAGS"] = "device=gpu0". Demo.

Three Steps for Deep Learning: Deep Learning is so simple.

Outline, Recipe of Deep Learning: after training the neural network, first ask "good results on training data?"; if NO, go back and improve training; if YES, ask "good results on testing data?"; if NO, that is overfitting; if YES, done. Do not always blame overfitting (Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385): a deeper network that looks worse on the testing data may also be worse on the training data, in which case it is not well trained, not overfitting.
Recipe of Deep Learning: different approaches for different problems, e.g. dropout is for good results on the testing data.

Good results on training data? Choosing Proper Loss. With softmax outputs y1, …, y10 and, for a "1", the target ŷ = (1, 0, 0, …, 0), the loss can be the square error Σ_{i=1}^{10} (y_i - ŷ_i)² or the cross entropy -Σ_{i=1}^{10} ŷ_i ln y_i. Which one is better? (Several alternatives: https://keras.io/objectives/.) Demo. Plotting the total loss over two weights (w1, w2), the square-error surface is much flatter far from the minimum than the cross-entropy surface, so gradient descent makes little progress there: when using a softmax output layer, choose cross entropy. (http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf)
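In Keras the choice is simply the loss argument of compile; a sketch, with the network and optimizer as in the earlier MNIST example:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

model = Sequential([Dense(500, activation='sigmoid', input_dim=784),
                    Dense(500, activation='sigmoid'),
                    Dense(10, activation='softmax')])

# square error between the network output and the target
model.compile(loss='mse', optimizer=SGD(lr=0.1), metrics=['accuracy'])

# cross entropy, the better choice with a softmax output layer
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.1), metrics=['accuracy'])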
Good results on training data? Mini-batch. Randomly initialize the network parameters. Pick the 1st mini-batch (say x^1, x^31, …), compute its loss L′ = l^1 + l^31 + …, and update the parameters once; pick the 2nd mini-batch (x^2, x^16, …), compute L″ = l^2 + l^16 + …, and update the parameters once; continue until all mini-batches have been picked, which is one epoch, and then repeat the whole process. Note that we do not really minimize the total loss! E.g. 100 examples in a mini-batch, repeated for 20 epochs. Compared with original gradient descent, the mini-batch trajectory is unstable (the colors represent the total loss).

Mini-batch is faster: in one epoch, original gradient descent sees all examples and updates only once, while mini-batch updates after seeing each batch; with 20 batches, that is 20 updates in one epoch. (Not always true with parallel computing: for data sets that are not super large, the two can have the same speed per epoch.) And mini-batch has better performance! Demo.

Shuffle the training examples for each epoch, so that the mini-batches differ from epoch to epoch (e.g. epoch 1 groups x^1 with x^31 and x^2 with x^16, while epoch 2 groups x^1 with x^17 and x^2 with x^26). Don't worry, this is the default of Keras.
Good results on training data? It is hard to get the power of Deep: on the training data, deeper usually does not imply better. Demo.

Vanishing Gradient Problem: in a deep sigmoid network, the earlier layers have smaller gradients, learn very slowly, and remain almost random, while the later layers have larger gradients, learn very fast, and converge quickly, based on an almost random front end!? An intuitive way to see the derivatives: ∂l/∂w ≈ Δl/Δw, and a change Δw in an early-layer weight is attenuated by every sigmoid it passes through on the way to the output, so Δl is small and the gradient is small.

Hard to get the power of Deep: in 2006 people used RBM pre-training; in 2015 people use ReLU. Rectified Linear Unit (ReLU): a = z if z > 0, a = 0 otherwise. Reasons: 1. fast to compute; 2. biological reason; 3. it acts like an infinite number of sigmoids with different biases; 4. it mitigates the vanishing gradient problem. (Xavier Glorot, AISTATS'11; Andrew L. Maas, ICML'13; Kaiming He, arXiv'15.) With ReLU, neurons whose output is 0 can be removed, leaving a thinner linear network in which the gradients do not become smaller. Demo. There are also ReLU variants whose parameters are likewise learned by gradient descent.
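A sketch of the ReLU definition in NumPy and of swapping it in as the hidden-layer activation in Keras (layer sizes as in the MNIST example):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def relu(z):
    # a = z if z > 0, a = 0 otherwise
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 3.0])))   # -> [0. 0. 0. 3.]

# in Keras, just change the activation of the hidden layers
model = Sequential([Dense(500, activation='relu', input_dim=784),
                    Dense(500, activation='relu'),
                    Dense(10, activation='softmax')])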
Maxout: a learnable activation function (Ian J. Goodfellow, ICML'13). The pre-activations of a layer are collected into groups, and each max unit outputs the largest value in its group (the slide's example produces the values 7, 1, 2 and 4). ReLU is a special case of Maxout, and you can have more than 2 elements in a group. The activation function of a maxout network can be any piecewise linear convex function; how many pieces it has depends on how many elements are in a group (e.g. 2 elements in a group vs. 3 elements in a group).
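A NumPy sketch of a maxout unit; the input values and the grouping are made up for illustration.

import numpy as np

def maxout(z, group_size=2):
    # group the pre-activations and output the max of each group
    return z.reshape(-1, group_size).max(axis=1)

z = np.array([5.0, 7.0, -1.0, 1.0])   # two made-up groups of 2 elements
print(maxout(z))                      # -> [7. 1.]

# ReLU is the special case where each element is grouped with a constant 0
z = np.array([4.0, 0.0, -2.0, 0.0])
print(maxout(z))                      # -> [4. 0.]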
Good results on training data? Learning Rates. Set the learning rate η carefully: if the learning rate is too large, the total loss may not decrease after each update; if the learning rate is too small, training will be too slow.

A popular and simple idea is to reduce the learning rate by some factor every few epochs: at the beginning we are far from the destination, so we use a larger learning rate; after several epochs we are close to the destination, so we reduce it. E.g. 1/t decay: η^t = η / √(t + 1). But the learning rate cannot be one-size-fits-all: give different parameters different learning rates.

Adagrad: a parameter-dependent learning rate, w ← w - η_w ∂L/∂w with η_w = η / √(Σ_{i=0}^{t} (g^i)²), where η is a constant, g^i is the derivative ∂L/∂w obtained at the i-th update, and the denominator is the summation of the squares of the previous derivatives.
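A sketch of 1/t decay and of Adagrad applied to a single parameter, using a made-up quadratic loss for illustration:

import numpy as np

eta = 0.1                        # initial learning rate

def eta_decay(t):
    # 1/t decay: eta^t = eta / sqrt(t + 1)
    return eta / np.sqrt(t + 1)

print(eta_decay(0), eta_decay(1), eta_decay(2))   # 0.1, ~0.071, ~0.058

# Adagrad on a single parameter w with a toy loss L(w) = (w - 3)^2
def dL_dw(w):
    return 2.0 * (w - 3.0)

w = 0.0
sum_sq_grad = 0.0
for t in range(1000):
    g = dL_dw(w)
    sum_sq_grad += g ** 2                    # sum of squared past derivatives
    w -= eta / np.sqrt(sum_sq_grad) * g      # w <- w - eta / sqrt(sum) * g
print(w)                                     # converges toward w = 3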