一天搞懂深度学习 (Deep Learning in One Day), 台湾大学 (National Taiwan University), 李宏毅 (Hung-yi Lee): lecture slides (PPT).

Deep Learning Tutorial
李宏毅 (Hung-yi Lee)

Deep learning attracts a lot of attention, and you have probably seen many exciting results already. This talk focuses on the basic techniques. (Slide: deep learning trends at Google; source: SIGMOD / Jeff Dean.)

Outline
Lecture I: Introduction of Deep Learning

Machine Learning: Looking for a Function
Speech recognition: f(audio) = "How are you". Image recognition: f(image) = "Cat". Playing Go: f(board position) = "5-5" (next move). Dialogue system: f("Hi", what the user said) = "Hello" (system response).

Framework
Image recognition as the running example: a model is a set of functions f1, f2, ...; for the same input image, different functions produce different outputs such as "cat", "dog", "money", "snake". Using training data (function input: images of a monkey, a cat, a dog; function output: their labels "monkey", "cat", "dog"), we measure the goodness of a function f and pick the "best" function f*. This is supervised learning. The chosen f* is then used on new images at testing time.

Three Steps for Deep Learning
Step 1: define a set of functions (a network structure). Step 2: measure the goodness of a function on the training data. Step 3: pick the best function; the picked function is then used for testing.

Neural Network
A neuron is a simple function: it multiplies each input by a weight, adds a bias, and passes the sum z through an activation function. In the slide's example, inputs (1, -1) with weights (1, -2) and bias 1 give z = 1*1 + (-1)*(-2) + 1 = 4, and a sigmoid activation outputs about 0.98. Different connections lead to different network structures.
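A minimal numpy sketch of this toy neuron. The inputs, weights, and biases are the slide's numbers; the sigmoid choice matches the example's 0.98 and 0.12 outputs, but the code itself is illustrative rather than taken from the slides:

```python
import numpy as np

def sigmoid(z):
    # Logistic activation used in the slides' toy example.
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # A single neuron: weighted sum of the inputs plus a bias, then the activation.
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, -1.0])                       # inputs from the slide
print(neuron(x, np.array([1.0, -2.0]), 1.0))    # ~0.98  (z = 4)
print(neuron(x, np.array([-1.0, 1.0]), 0.0))    # ~0.12  (z = -2)
```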

Weights and biases are the network parameters θ; different neurons have different values of the weights and biases.

Fully Connected Feedforward Network
(Figure: a small network evaluated layer by layer. With the weights above and sigmoid activations, input (1, -1) gives first-layer outputs (0.98, 0.12) and, after the later layers, f(1, -1) = (0.62, 0.83); input (0, 0) gives f(0, 0) = (0.51, 0.85).) Given the parameters θ, the network defines a function: an input vector goes in, an output vector comes out. Given only the network structure, we define a function set.

The layers are organized as an input layer, some hidden layers, and an output layer producing y1, y2, ..., yM; every neuron in a layer is connected to all outputs of the previous layer. "Deep" means many hidden layers.
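The layer-by-layer computation is just a repeated matrix multiply, bias add, and activation. A generic numpy sketch (the layer sizes and random weights here are illustrative assumptions, not the slide's values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    # Fully connected feedforward pass: a <- sigmoid(W a + b), layer by layer.
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
sizes = [2, 3, 3, 2]   # 2 inputs, two hidden layers of 3 neurons, 2 outputs
layers = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]
print(forward(np.array([1.0, -1.0]), layers))   # an output vector of length 2
```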

Why Deep? Universality Theorem
Any continuous function f can be realized by a network with one hidden layer (given enough hidden neurons). So why a "deep" neural network rather than a "fat" one?

Why Deep? Analogy
A logic circuit consists of gates, and two layers of logic gates can represent any Boolean function, yet building some functions is much simpler with multiple layers of gates. Likewise, a neural network consists of neurons, and a network with one hidden layer can represent any continuous function, yet some functions are much simpler to represent with multiple layers of neurons, possibly needing less data.

Deep = Many Hidden Layers (ImageNet results; source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf)
AlexNet (2012): 8 layers, 16.4% error. VGG (2014): 19 layers, 7.3% error. GoogleNet (2014): 22 layers, 6.7% error.

Deep = Many Hidden Layers, with Special Structure
Residual Net (2015): 152 layers, 3.57% error, compared with AlexNet (2012) at 16.4%, VGG (2014) at 7.3%, and GoogleNet (2014) at 6.7%.

Output Layer
With an ordinary output layer, the output of the network can be any value, which may not be easy to interpret. Using a softmax layer as the output layer turns the outputs into a probability distribution: each value lies between 0 and 1 and they sum to 1. In the slide's example, inputs (3, -3, 1) are exponentiated to roughly (20, 0.05, 2.7) and normalized to (0.88, ~0, 0.12).
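A small numpy check of that softmax example (the max-subtraction trick is a standard numerical-stability detail, not something the slide discusses):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the outputs sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([3.0, -3.0, 1.0])))   # ~[0.88, 0.00, 0.12], as on the slide
```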

Example Application: Handwriting Digit Recognition
The input image is 16 x 16 = 256 pixels (ink -> 1, no ink -> 0), so the input is a 256-dimensional vector. The output is a 10-dimensional vector in which each dimension represents the confidence of a digit: for example, outputs of 0.1 for "is 1", 0.7 for "is 2", ..., 0.2 for "is 0" mean the image is "2". What is needed is a function with a 256-dim input vector and a 10-dim output vector, and the neural network (input layer, hidden layers, output layer) is a function set containing the candidates for handwriting digit recognition. You need to decide the network structure so that a good function is in your function set.

FAQ
Q: How many layers? How many neurons for each layer? Q: Can we design the network structure? (The Convolutional Neural Network (CNN) in the next lecture is one such designed structure.) Q: Can the structure be automatically determined? Yes, but this is not widely studied yet.

Highway Network and Residual Network
Two special structures: the Residual Network ("Deep Residual Learning for Image Recognition", http://arxiv.org/abs/1512.03385) and the Highway Network ("Training Very Deep Networks", https://arxiv.org/pdf/1507.06228v2.pdf). Both copy the input of a block and add it to the block's output; in the highway network a gate controller decides how much to copy, so the highway network automatically determines the layers needed.
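A rough numpy sketch of the two shortcut ideas, shown after the descriptions above. The shapes, the tanh and sigmoid choices, and the exact placement of the activations are illustrative assumptions, not the papers' precise recipes:

```python
import numpy as np

def residual_block(x, W1, W2):
    # Residual idea: the block's input is copied and added to its output.
    return np.tanh(W2 @ np.tanh(W1 @ x)) + x

def highway_block(x, W, Wg, bg):
    # Highway idea: a gate controller decides how much of the input to copy.
    h = np.tanh(W @ x)                          # candidate transformation
    g = 1.0 / (1.0 + np.exp(-(Wg @ x + bg)))    # gate values in (0, 1)
    return g * h + (1.0 - g) * x

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, W2, W, Wg = (rng.normal(size=(4, 4)) for _ in range(4))
print(residual_block(x, W1, W2))
print(highway_block(x, W, Wg, np.zeros(4)))
```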

Three Steps for Deep Learning: Step 2, Goodness of a Function

Training Data
Prepare training data: images and their labels ("5", "0", "4", "1", "3", "1", "2", "9", ...). The learning target is defined on the training data: for an image of "2" (again a 16 x 16 = 256 vector, ink -> 1, no ink -> 0), the target is that y2 has the maximum value among the outputs y1, ..., y10; for an image of "1", y1 should have the maximum value.

Loss
For a single example, the loss l measures how far the network's softmax output is from the target: for an image of "1" the target is the one-hot vector (1, 0, ..., 0), and the output should be as close to it as possible. The loss can be the square error or the cross entropy between the network output and the target. A good function should make the loss on all examples as small as possible.

Total Loss
Given a set of parameters, run the network on all training examples x1, x2, x3, ..., xN, compute the per-example losses l1, l2, l3, ..., lN, and sum them: L = Σ_{r=1}^{N} l_r. Finding the best function in the function set now means finding the network parameters θ* that minimize the total loss L, making it as small as possible.
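A small numpy illustration of the two per-example losses. The three-class output vector here is a made-up example (the slides use ten classes); the formulas follow the definitions above:

```python
import numpy as np

y_hat = np.array([1.0, 0.0, 0.0])   # one-hot target, e.g. the digit is "1"
y     = np.array([0.7, 0.2, 0.1])   # a hypothetical softmax output

square_error  = np.sum((y - y_hat) ** 2)     # sum_i (y_i - yhat_i)^2
cross_entropy = -np.sum(y_hat * np.log(y))   # -sum_i yhat_i * ln(y_i)
print(square_error, cross_entropy)           # 0.14  0.357
```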

Three Steps for Deep Learning: Step 3, Pick the Best Function

How do we find the network parameters θ = {w1, w2, w3, ..., b1, b2, b3, ...} that minimize the total loss L? Enumerating all possible values is hopeless: in speech recognition, for example, a network with 8 layers of 1000 neurons each has on the order of 10^6 weights between two adjacent layers.

Gradient Descent
Consider one parameter w (the others are handled in the same way). Pick an initial value for w, either randomly or from RBM pre-training; random initialization is usually good enough. Compute ∂L/∂w at the current value: if it is negative, increase w; if it is positive, decrease w. Update w <- w - η ∂L/∂w, where η is called the "learning rate". Repeat until ∂L/∂w is approximately zero, i.e. the update becomes tiny.

With two parameters (w1, w2): randomly pick a starting point, compute (∂L/∂w1, ∂L/∂w2), move by (-η ∂L/∂w1, -η ∂L/∂w2), and repeat. Hopefully we reach a minimum (in the slide's figure, the color is the value of the total loss L).

Local Minima
Gradient descent never guarantees the global minimum. Training can become very slow on a plateau (∂L/∂w ≈ 0), get stuck at a saddle point (∂L/∂w = 0), or get stuck at a local minimum (∂L/∂w = 0). Different initial points can reach different minima, and therefore give different results.
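A minimal 1-D gradient descent sketch. The toy loss and learning rate are invented for illustration; the point is the update rule w <- w - η ∂L/∂w and the fact that different starting points reach different local minima:

```python
def grad(w):
    # Derivative of the toy loss L(w) = w**4 - 3*w**2 + w, which has two local minima.
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w, lr=0.01, steps=1000):
    for _ in range(steps):
        w = w - lr * grad(w)        # w <- w - eta * dL/dw
    return w

print(gradient_descent(w=2.0))      # converges to one local minimum
print(gradient_descent(w=-2.0))     # a different start reaches a different minimum
```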

This gradient descent procedure is the "learning" of machines in deep learning; even AlphaGo uses this approach. People imagine something grander, but this is actually what happens. I hope you are not too disappointed :p

Backpropagation
Backpropagation is an efficient way to compute ∂L/∂w in a neural network.
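The slides do not show the backpropagation equations, so as a stand-in, here is a minimal chain-rule computation for a single sigmoid neuron with squared error, checked against a numerical gradient (all numbers here are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, t):
    return (sigmoid(w * x + b) - t) ** 2

w, b, x, t = 0.5, -0.2, 1.5, 1.0
y = sigmoid(w * x + b)
# Chain rule (what backpropagation organizes efficiently for whole networks):
# dl/dw = 2*(y - t) * y*(1 - y) * x
analytic = 2 * (y - t) * y * (1 - y) * x

eps = 1e-6
numeric = (loss(w + eps, b, x, t) - loss(w - eps, b, x, t)) / (2 * eps)
print(analytic, numeric)   # the two values agree closely
```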

Three Steps for Deep Learning: deep learning is so simple. If you want to find a function, and you have lots of function input/output pairs as training data, you can use deep learning.

For example, you can do image recognition: a network maps images to labels such as "monkey", "cat", "dog". You can do spam filtering (http://spam-filter-...): from features such as whether "free" or "Talk" appears in an e-mail, output 1 (yes, spam) or 0 (no). You can do document classification (http://top-breaking-...): from the occurrence of words such as "stock" in a document, decide whether it belongs to 體育 (sports), 政治 (politics), or 財經 (finance).

Outline: Keras
TensorFlow and Theano are very flexible but need some effort to learn. Keras is an interface to TensorFlow or Theano: easy to learn and use, while still having some flexibility, and you can modify it if you can write TensorFlow or Theano. If you want to learn Theano, see:
http://speech.ee.ntu.edu.tw/tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
http://speech.ee.ntu.edu.tw/tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html

Keras
François Chollet is the author of Keras. He currently works for Google as a deep learning engineer and researcher. "Keras" means horn in Greek. Documentation: http://keras.io/. (Keras 心得: notes on using Keras; thanks to 沈昇勳 for providing the figures.)

Example Application: Handwriting Digit Recognition
Recognizing handwritten digits such as "1" from MNIST is the "hello world" of deep learning. Each image is 28 x 28 pixels, and Keras provides a data set loading function: http://keras.io/datasets/.

The network in the lecture: a 28 x 28 = 784-dimensional input, two fully connected hidden layers with 500 neurons each, and a 10-way softmax output y1, ..., y10.

Step 3.1: configuration. Choose the loss, the optimizer, and the learning rate (the slide uses 0.1).
Step 3.2: find the optimal network parameters by fitting the model on the training data (images) and their labels (digits); see also https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html. The images form a numpy array of shape (number of training examples, 28 x 28 = 784) and the labels a numpy array of shape (number of training examples, 10).

To use the trained network (testing) there are two cases: case 1, evaluate the accuracy on a labelled test set; case 2, predict outputs for new inputs. Models can be saved and loaded: http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model.
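A sketch of this pipeline using the current tf.keras API. The lecture itself used the older Keras 1.x interface on Theano, so the exact calls below and the choice of the adam optimizer are assumptions; the architecture, batch size, and number of epochs follow the slides:

```python
from tensorflow import keras

# Load MNIST, flatten 28 x 28 images to 784 features, one-hot encode the 10 digits.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255
x_test  = x_test.reshape(-1, 784).astype("float32") / 255
y_train = keras.utils.to_categorical(y_train, 10)
y_test  = keras.utils.to_categorical(y_test, 10)

# 784 -> 500 -> 500 -> 10 softmax, as in the lecture.
model = keras.Sequential([
    keras.layers.Dense(500, activation="sigmoid", input_shape=(784,)),
    keras.layers.Dense(500, activation="sigmoid"),
    keras.layers.Dense(10, activation="softmax"),
])

# Step 3.1: configuration.
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Step 3.2: find the optimal network parameters (mini-batches of 100, 20 epochs).
model.fit(x_train, y_train, batch_size=100, epochs=20)

# Testing: case 1, overall accuracy; case 2, predictions for new inputs.
print(model.evaluate(x_test, y_test))
print(model.predict(x_test[:1]).argmax())

# Save and load the model.
model.save("mnist_dnn.h5")
model = keras.models.load_model("mnist_dnn.h5")
```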

Using a GPU to speed up training (with the Theano backend):
Way 1: THEANO_FLAGS=device=gpu0 python YourCode.py
Way 2 (in your code): import os; os.environ["THEANO_FLAGS"] = "device=gpu0"
(Demo.)

Three Steps for Deep Learning: deep learning is so simple.

Outline: Recipe of Deep Learning
After training a neural network, first ask: are the results on the training data good? If not, go back and improve the training itself. If yes, then ask: are the results on the testing data good? If not, that is overfitting. But do not always blame overfitting: in the figure from "Deep Residual Learning for Image Recognition" (http://arxiv.org/abs/1512.03385), the network that looks worse on the testing data is also worse on the training data, so it is not overfitting; it is simply not well trained.

Recipe of Deep Learning
Different approaches address different problems: dropout, for example, is for getting good results on the testing data. The following techniques aim at good results on the training data.

Choosing Proper Loss
With softmax outputs y1, ..., y10 and a one-hot target ŷ (e.g. (1, 0, ..., 0) for the digit "1"), the loss can be the square error Σ_{i=1}^{10} (y_i - ŷ_i)² or the cross entropy -Σ_{i=1}^{10} ŷ_i ln y_i. Which one is better? (Demo; several alternatives are listed at https://keras.io/objectives/.) Looking at the total loss as a function of two weights (w1, w2), the square error surface is flat far away from the target while cross entropy keeps a useful slope, so when using a softmax output layer, choose cross entropy (see http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf).
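A numpy illustration of why the square error surface is flat when a softmax network is confidently wrong. The logits and the numerical-gradient helper are invented for illustration; the comparison mirrors the point of the slides' demo:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def ce(z, t):   # cross entropy between softmax(z) and one-hot target t
    return -np.sum(t * np.log(softmax(z)))

def se(z, t):   # square error between softmax(z) and one-hot target t
    return np.sum((softmax(z) - t) ** 2)

def num_grad(f, z, t, eps=1e-6):
    # Central-difference gradient of f with respect to the logits z.
    g = np.zeros_like(z)
    for i in range(len(z)):
        d = np.zeros_like(z); d[i] = eps
        g[i] = (f(z + d, t) - f(z - d, t)) / (2 * eps)
    return g

t = np.array([1.0, 0.0, 0.0])      # the target class is the first one
z = np.array([-5.0, 5.0, 0.0])     # the network is confidently wrong
print(num_grad(ce, z, t))          # large gradient: cross entropy still pushes hard
print(num_grad(se, z, t))          # tiny gradient: square error is nearly flat here
```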

Mini-batch
Randomly initialize the network parameters. Pick the 1st mini-batch, say examples x1, x31, ..., compute its loss L' = l1 + l31 + ..., and update the parameters once. Pick the 2nd mini-batch, say x2, x16, ..., compute L'' = l2 + l16 + ..., and update again. Continue until all mini-batches have been picked: that is one epoch. Then repeat the whole process. Note that we do not really minimize the total loss: each update only reduces the loss of the current mini-batch. A typical setting from the slide: 100 examples per mini-batch, repeated for 20 epochs.

Compared with the original gradient descent (one update after seeing all examples), the mini-batch trajectory is unstable (in the figure, the colors represent the total loss), but mini-batch is faster: in one epoch, original gradient descent updates the parameters once, while with 20 mini-batches we update 20 times after seeing the same number of examples. With parallel computing this is not always true, and on data sets that are not super large the two can run at the same speed per epoch, yet mini-batch training still gives better performance. (Demo.)

The training examples are shuffled for each epoch, so the batches differ from epoch to epoch (x1 and x31 might share a batch in epoch 1, x1 and x17 in epoch 2). Don't worry: this is the default behaviour of Keras.
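A small numpy sketch of the mini-batch loop on a toy one-parameter regression problem. The data, the learning rate, and the model are invented; the shuffling, the per-batch loss, and the "update once per batch" structure follow the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: y = 3*x + noise; we fit the single weight w by mini-batch gradient descent.
x = rng.normal(size=1000)
y = 3.0 * x + 0.1 * rng.normal(size=1000)

w, lr, batch_size = 0.0, 0.1, 100
for epoch in range(20):                            # "repeat 20 times"
    order = rng.permutation(len(x))                # shuffle the examples each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]      # pick the next mini-batch
        xb, yb = x[idx], y[idx]
        grad = np.mean(2 * (w * xb - yb) * xb)     # gradient of this batch's loss only
        w -= lr * grad                             # update the parameters once per batch
print(w)                                           # close to 3.0
```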

Recipe of Deep Learning: good results on the training data
It is hard to get the power of "deep": deeper usually does not imply better results, even on the training data. (Demo.)

Vanishing Gradient Problem
In a deep network with sigmoid activations, the layers near the input have much smaller gradients, learn very slowly, and are still almost random while the layers near the output, which have larger gradients, learn very fast and have already converged based on those essentially random lower layers. An intuitive way to see the derivative ∂l/∂w: add a small Δw to a weight and observe the resulting Δl at the output; the effect of Δw is attenuated each time it passes through a sigmoid, so weights in earlier layers get smaller gradients. In 2006 people worked around this with RBM pre-training; in 2015 people use ReLU.

ReLU
The Rectified Linear Unit (ReLU) outputs z when z > 0 and 0 otherwise. Reasons for using it: 1. it is fast to compute; 2. there is a biological reason; 3. it behaves like an infinite number of sigmoids with different biases; 4. it addresses the vanishing gradient problem. (Xavier Glorot, AISTATS'11; Andrew L. Maas, ICML'13; Kaiming He, arXiv'15.)

With ReLU, the neurons whose output is 0 can be removed from the computation for the current input, leaving a thinner linear network along the active path, which does not suffer from the ever smaller gradients of stacked sigmoids. (Demo.) There are ReLU variants in which the slope of the negative part (α) is also learned by gradient descent.

Maxout
Maxout is a learnable activation function [Ian J. Goodfellow, ICML'13]: the inputs of a group of linear units are computed and the neuron outputs the maximum within the group (in the slide's figure, max(7, 1) and max(2, 4)). ReLU is a special case of Maxout, and a group can have more than 2 elements. The activation function of a maxout network can be any piecewise linear convex function, and the number of pieces depends on how many elements are in a group (for example, 2 elements versus 3 elements per group).
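A numpy sketch of the two activations. The group weights below are arbitrary; the point is that a maxout unit takes the maximum over several linear pieces, with ReLU as the 2-piece special case in which one piece is fixed at zero:

```python
import numpy as np

def relu(z):
    # Rectified Linear Unit: z if positive, otherwise 0.
    return np.maximum(0.0, z)

def maxout(x, W, b):
    # Maxout unit: max over a group of linear pieces; W has shape (pieces, inputs).
    return np.max(W @ x + b, axis=0)

print(relu(np.array([4.0, -2.0])))                    # [4., 0.]

x = np.array([1.0, -2.0])
W = np.array([[1.0, 0.5], [0.0, 0.0], [-1.0, 2.0]])   # a group with 3 pieces
b = np.array([0.0, 0.0, 1.0])
print(maxout(x, W, b))                                # the largest of the 3 pieces
```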

Recipe of Deep Learning: Learning Rates
If the learning rate is too large, the total loss may not decrease after each update; if it is too small, training is too slow. So set the learning rate carefully. A popular and simple idea is to reduce the learning rate by some factor every few epochs: at the beginning we are far from the destination, so we use a larger learning rate; after several epochs we are close to the destination, so we reduce it, e.g. 1/t decay: η_t = η / √(t + 1). Still, the learning rate cannot be one-size-fits-all; we want to give different parameters different learning rates.

Adagrad
Adagrad uses a parameter-dependent learning rate: w <- w - (η / σ) ∂L/∂w, where η is a constant and σ = √(Σ_{i=0}^{t} (g^i)²) is the square root of the summation of the squares of that parameter's previous derivatives (g^i is the derivative obtained at the i-th update).
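A numpy sketch of the Adagrad update on a toy two-parameter loss. The loss, the base learning rate, and the small epsilon for numerical safety are my own choices; the accumulation of squared derivatives follows the formula above:

```python
import numpy as np

def adagrad_update(w, g, accum, eta=0.1, eps=1e-8):
    # One Adagrad step: each parameter gets its own learning rate
    # eta / sqrt(sum of squared past derivatives).
    accum += g ** 2                           # accumulate squared derivatives
    w -= eta / (np.sqrt(accum) + eps) * g     # parameter-dependent step size
    return w, accum

w = np.array([2.0, -3.0])
accum = np.zeros_like(w)
for _ in range(100):
    g = 2 * w                                 # derivative of the toy loss sum(w**2)
    w, accum = adagrad_update(w, g, accum)
print(w)                                      # both parameters move toward 0
```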
