Lecture 3: CNN: Back-propagation
boris.

Agenda
- Introduction to gradient-based learning for Convolutional NN
- Backpropagation for basic layers:
  - Softmax
  - Fully Connected layer
  - Pooling
  - ReLU
  - Convolutional layer
- Implementation of back-propagation for the Convolutional layer
- CIFAR-10 training

Good Links
- http://www.iro.umontreal.ca/pift6266/H10/notes/gradient.html#flowgraph

Gradient based training
A Conv. NN is just a cascade of functions: f(x_0, w) -> y, where x_0 is the input image (28x28), w are the network parameters (weights, biases), and y is the softmax output: the probability that x belongs to one of the 10 classes 0..9.
Gradient based training
We want to find the parameters $w$ that minimize the error $E(f(x_0,w), y_0) = -\log p_{y_0}$, the negative log of the probability that the network assigns to the correct class $y_0$. For this we do iterative gradient descent:
$$w(t) = w(t-1) - \lambda(t)\,\frac{\partial E}{\partial w}$$
How do we compute the gradient of $E$ with respect to the weights? The loss $E$ is a cascade of functions, so we go layer by layer, from the last layer back, and use the chain rule for gradients of composite functions. For a layer $y_l = f_l(w_l, y_{l-1})$:
$$\frac{\partial E}{\partial y_{l-1}} = \frac{\partial E}{\partial y_l}\cdot\frac{\partial f_l(w_l, y_{l-1})}{\partial y_{l-1}}, \qquad \frac{\partial E}{\partial w_l} = \frac{\partial E}{\partial y_l}\cdot\frac{\partial f_l(w_l, y_{l-1})}{\partial w_l}$$

LeNet topology
[Figure: the LeNet layer stack, with FORWARD arrows running from the data layer to the loss and BACKWARD arrows running in the opposite direction.]
Data Layer -> Convolutional layer 5x5 -> Pooling 2x2, stride 2 -> Convolutional layer 5x5 -> Pooling 2x2, stride 2 -> Inner Product -> ReLU -> Inner Product -> SoftMax + LogLoss
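As a minimal illustration of the update rule above, here is a hedged C++ sketch of one gradient-descent step over a flat parameter vector; the names (sgd_step, weights, grad, learning_rate) are illustrative and not part of any particular framework.

#include <vector>

// One gradient-descent step: w(t) = w(t-1) - lambda(t) * dE/dw.
// "weights" and "grad" are flat views of all parameters of the network.
void sgd_step(std::vector<float>& weights,
              const std::vector<float>& grad,
              float learning_rate) {
  for (size_t i = 0; i < weights.size(); ++i)
    weights[i] -= learning_rate * grad[i];
}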
Layer: Backward()

class Layer {
  Setup(bottom, top);     // initialize layer
  Forward(bottom, top);   // compute y_l = f_l(w_l, y_{l-1})
  Backward(top, bottom);  // compute gradient
};

Backward: we start from the gradient $\frac{\partial E}{\partial y_l}$ coming from the last layer and
1) propagate the gradient back: $\frac{\partial E}{\partial y_l} \rightarrow \frac{\partial E}{\partial y_{l-1}}$
2) compute the gradient of $E$ w.r.t. the weights $w_l$: $\frac{\partial E}{\partial w_l}$
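To make the Forward/Backward contract concrete, here is a hedged C++ sketch of a minimal layer interface and the layer-by-layer backward sweep. It mimics the structure above but is not Caffe's actual class; names such as Blob and BackwardPass are illustrative, and blobs is assumed to hold one activation buffer per layer boundary (layers.size() + 1 entries).

#include <vector>

// Minimal stand-in for a data blob: activations plus their gradients.
struct Blob {
  std::vector<float> data;  // y_l
  std::vector<float> diff;  // dE/dy_l
};

// Each layer computes y_l = f_l(w_l, y_{l-1}) on Forward and, on Backward,
// turns dE/dy_l (top.diff) into dE/dy_{l-1} (bottom.diff) and dE/dw_l.
struct Layer {
  virtual void Forward(const Blob& bottom, Blob* top) = 0;
  virtual void Backward(const Blob& top, Blob* bottom) = 0;
  virtual ~Layer() {}
};

// Backward pass over the whole net: walk layers from the loss back to the data.
void BackwardPass(std::vector<Layer*>& layers, std::vector<Blob>& blobs) {
  for (int l = static_cast<int>(layers.size()) - 1; l >= 0; --l)
    layers[l]->Backward(blobs[l + 1], &blobs[l]);
}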
Softmax with LogLoss Layer
Consider the last layer (softmax with log-loss):
$$E = -\log p_{k_0} = -\log\left(\frac{e^{y_{k_0}}}{\sum_{k=0}^{9} e^{y_k}}\right) = -y_{k_0} + \log\sum_{k=0}^{9} e^{y_k}$$
For all $k = 0..9$ except $k_0$ (the right answer) we want to decrease $p_k$:
$$\frac{\partial E}{\partial y_k} = \frac{e^{y_k}}{\sum_{j=0}^{9} e^{y_j}} = p_k$$
For $k = k_0$ (the right answer) we want to increase $p_{k_0}$:
$$\frac{\partial E}{\partial y_{k_0}} = -1 + p_{k_0}$$
See http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression
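A minimal C++ sketch of these two formulas: the softmax probabilities followed by the log-loss gradient dE/dy_k = p_k - 1{k = k0}. It assumes a vector of raw scores y and a label index k0, and is not the Caffe layer itself.

#include <algorithm>
#include <cmath>
#include <vector>

// Softmax-with-log-loss backward: dE/dy_k = p_k - 1{k == k0}.
// Subtracting the max score keeps exp() from overflowing (standard trick).
std::vector<float> softmax_logloss_grad(const std::vector<float>& y, int k0) {
  float max_y = y[0];
  for (float v : y) max_y = std::max(max_y, v);
  float sum = 0.f;
  std::vector<float> p(y.size());
  for (size_t k = 0; k < y.size(); ++k) {
    p[k] = std::exp(y[k] - max_y);
    sum += p[k];
  }
  for (size_t k = 0; k < y.size(); ++k) p[k] /= sum;  // p_k
  std::vector<float> grad = p;   // dE/dy_k = p_k for k != k0
  grad[k0] -= 1.f;               // dE/dy_{k0} = p_{k0} - 1
  return grad;
}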
Inner product (Fully Connected) Layer
A fully connected layer is just a matrix-vector multiplication: $y_l = W \cdot y_{l-1}$. So
$$\frac{\partial E}{\partial y_{l-1}} = W^T \cdot \frac{\partial E}{\partial y_l} \qquad \text{and} \qquad \frac{\partial E}{\partial W} = \frac{\partial E}{\partial y_l} \cdot y_{l-1}^T$$
Notice that we need $y_{l-1}$, so we should keep these values from the forward pass.

ReLU Layer
Rectified Linear Unit: $y_l = \max(0, y_{l-1})$, so
$$\frac{\partial E}{\partial y_{l-1}} = \begin{cases} 0, & y_{l-1} \le 0 \\[2pt] \dfrac{\partial E}{\partial y_l}, & y_{l-1} > 0 \end{cases}$$
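A hedged C++ sketch of the backward passes for these two layers, written as plain loops rather than the gemm call Caffe uses. W is assumed to be stored row-major with shape (out x in); all names are illustrative.

#include <vector>

// Fully connected backward, y = W * x with W stored row-major as (out x in):
// dE/dx = W^T * dE/dy   and   dE/dW += dE/dy * x^T.
// dW must be sized out*in; both outputs accumulate from zeroed buffers.
void fc_backward(const std::vector<float>& W, const std::vector<float>& x,
                 const std::vector<float>& dy,
                 std::vector<float>* dx, std::vector<float>* dW,
                 int out, int in) {
  dx->assign(in, 0.f);
  for (int o = 0; o < out; ++o)
    for (int i = 0; i < in; ++i) {
      (*dx)[i]          += W[o * in + i] * dy[o];  // dE/dx_i
      (*dW)[o * in + i] += dy[o] * x[i];           // dE/dW_{o,i}
    }
}

// ReLU backward: pass the gradient through only where the input was positive.
void relu_backward(const std::vector<float>& x, const std::vector<float>& dy,
                   std::vector<float>* dx) {
  dx->resize(x.size());
  for (size_t i = 0; i < x.size(); ++i) (*dx)[i] = (x[i] > 0.f) ? dy[i] : 0.f;
}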
Max-Pooling Layer
Forward:

for (p = 0; p < k; p++)
  for (q = 0; q < k; q++)
    yn(x, y) = max( yn(x, y), yn-1(x + p, y + q) );

Backward: the gradient goes only to the input pixel that produced the maximum:
$$\frac{\partial E}{\partial y_{n-1}(x+p,\, y+q)} = \begin{cases} 0, & y_{n-1}(x+p,\, y+q) \ne y_n(x, y) \\[2pt] \dfrac{\partial E}{\partial y_n(x, y)}, & \text{otherwise} \end{cases}$$
Quiz: What will the gradient be for Sum-pooling? What will the gradient be if pooling areas overlap (e.g. stride = 1)?
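A hedged C++ sketch of max-pooling on a single feature map that records the argmax on the forward pass and routes the gradient back to it on the backward pass. The row-major (H x W) layout, kernel = stride = k, and all names are assumptions for illustration.

#include <vector>

// Max pooling (kernel k, stride k) over one H x W map, row-major.
// argmax[] remembers which input index won, so Backward can route dE there.
void maxpool_forward(const std::vector<float>& in, int H, int W, int k,
                     std::vector<float>* out, std::vector<int>* argmax) {
  int Ho = H / k, Wo = W / k;
  out->assign(Ho * Wo, -1e30f);
  argmax->assign(Ho * Wo, 0);
  for (int y = 0; y < Ho; ++y)
    for (int x = 0; x < Wo; ++x)
      for (int p = 0; p < k; ++p)
        for (int q = 0; q < k; ++q) {
          int idx = (y * k + p) * W + (x * k + q);
          if (in[idx] > (*out)[y * Wo + x]) {
            (*out)[y * Wo + x] = in[idx];
            (*argmax)[y * Wo + x] = idx;
          }
        }
}

void maxpool_backward(const std::vector<float>& top_diff,
                      const std::vector<int>& argmax, int H, int W,
                      std::vector<float>* bottom_diff) {
  bottom_diff->assign(H * W, 0.f);
  for (size_t i = 0; i < top_diff.size(); ++i)
    (*bottom_diff)[argmax[i]] += top_diff[i];  // only the max location gets gradient
}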
Convolutional Layer: Backward
Let's use the chain rule for the convolutional layer:
$$\frac{\partial E}{\partial y_{l-1}} = \frac{\partial E}{\partial y_l}\cdot\frac{\partial f_l(w_l, y_{l-1})}{\partial y_{l-1}}; \qquad \frac{\partial E}{\partial w_l} = \frac{\partial E}{\partial y_l}\cdot\frac{\partial f_l(w_l, y_{l-1})}{\partial w_l}$$
3D convolution (forward):

for (n = 0; n < N; n++)
  for (m = 0; m < M; m++)
    for (y = 0; y < Y; y++)
      for (x = 0; x < X; x++)
        for (p = 0; p < K; p++)
          for (q = 0; q < K; q++)
            yL(n; x, y) += yL-1(m, x+p, y+q) * w(n, m; p, q);
Convolutional Layer: Backward
Example: M = 1, N = 2, K = 2. Take one pixel in level (n-1). Which pixels in the next level are influenced by it?
[Figure: the chosen input pixel contributes to a 2x2 patch in each of the two output feature maps.]

Convolutional Layer: Backward
Let's use the chain rule for the convolutional layer. The gradient $\frac{\partial E}{\partial y_{l-1}}$ is a sum of convolutions with the gradients $\frac{\partial E}{\partial y_l}$ over all feature maps of the "upper" layer:
$$\frac{\partial E}{\partial y_{l-1}(m)} = \sum_{n=1}^{N} \widetilde{\mathrm{conv}}\!\left(\frac{\partial E}{\partial y_l(n)},\; w(n,m)\right)$$
The gradient of $E$ w.r.t. $w$ is a sum over all "pixels" $(x, y)$ in the input map:
$$\frac{\partial E}{\partial w(n,m;p,q)} = \sum_{x=0}^{X-1}\sum_{y=0}^{Y-1} \frac{\partial E}{\partial y_l(n; x, y)}\; y_{l-1}(m;\, x+p,\, y+q)$$
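These two sums can be written directly as loops. The following hedged C++ sketch mirrors the forward triple-nest shown earlier (a "valid" convolution, no padding or stride); the flat indexing, the function name conv_backward, and the assumption that the output buffers are zero-initialized by the caller are all illustrative choices, not the Caffe implementation.

// Convolution backward as plain loops, matching the forward loop above.
// top_diff = dE/dy_l (N x Yo x Xo), bottom = y_{l-1} (M x Y x X), w = (N x M x K x K).
// Fills bottom_diff = dE/dy_{l-1} and weight_diff = dE/dw (both pre-zeroed).
void conv_backward(const float* bottom, const float* top_diff, const float* w,
                   float* bottom_diff, float* weight_diff,
                   int N, int M, int Y, int X, int K) {
  int Yo = Y - K + 1, Xo = X - K + 1;  // "valid" output size
  for (int n = 0; n < N; ++n)
    for (int m = 0; m < M; ++m)
      for (int y = 0; y < Yo; ++y)
        for (int x = 0; x < Xo; ++x)
          for (int p = 0; p < K; ++p)
            for (int q = 0; q < K; ++q) {
              float g = top_diff[(n * Yo + y) * Xo + x];       // dE/dy_l(n; x, y)
              // dE/dy_{l-1}(m; x+p, y+q) += dE/dy_l(n; x, y) * w(n, m; p, q)
              bottom_diff[(m * Y + (y + q)) * X + (x + p)] +=
                  g * w[((n * M + m) * K + p) * K + q];
              // dE/dw(n, m; p, q) += dE/dy_l(n; x, y) * y_{l-1}(m; x+p, y+q)
              weight_diff[((n * M + m) * K + p) * K + q] +=
                  g * bottom[(m * Y + (y + q)) * X + (x + p)];
            }
}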
Convolutional Layer: Backward
How this is implemented:

// im2col data to col_data
im2col_cpu(bottom_data, CHANNELS_, HEIGHT_, WIDTH_, KSIZE_, PAD_, STRIDE_, col_data);
// gradient w.r.t. weight:
caffe_cpu_gemm(CblasNoTrans, CblasTrans, M_, K_, N_, 1., top_diff, col_data, 1., weight_diff);
// gradient w.r.t. bottom data:
caffe_cpu_gemm(CblasTrans, CblasNoTrans, K_, N_, M_, 1., weight, top_diff, 0., col_diff);
// col2im back to the data
col2im_cpu(col_diff, CHANNELS_, HEIGHT_, WIDTH_, KSIZE_, PAD_, STRIDE_, bottom_diff);
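The im2col trick unrolls every KxK receptive field into one column so that convolution and its backward pass become matrix-matrix multiplies. In this code (roughly, ignoring Caffe's grouping option) M_ is the number of output feature maps, K_ is CHANNELS_ * KSIZE_ * KSIZE_, and N_ is the number of output pixels, so weight is M_ x K_, col_data is K_ x N_, and top_diff is M_ x N_; the two gemm calls above are the transposed versions of the forward product weight * col_data. Below is a hedged, simplified C++ sketch of such an unrolling (no padding, stride 1), not Caffe's actual im2col_cpu.

// Simplified im2col: unroll each KxK patch of a (C x H x W) image into one
// column of col (rows: C*K*K, cols: Ho*Wo). No padding, stride 1.
void im2col_simple(const float* data, int C, int H, int W, int K, float* col) {
  int Ho = H - K + 1, Wo = W - K + 1;
  for (int c = 0; c < C; ++c)
    for (int p = 0; p < K; ++p)
      for (int q = 0; q < K; ++q) {
        int row = (c * K + p) * K + q;          // which row of the col matrix
        for (int y = 0; y < Ho; ++y)
          for (int x = 0; x < Wo; ++x)
            col[row * (Ho * Wo) + y * Wo + x] =
                data[(c * H + (y + p)) * W + (x + q)];
      }
}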
Convolutional Layer: im2col
The implementation is based on the reduction of the convolution layer to a matrix-matrix multiply (see Chellapilla et al., "High Performance Convolutional Neural Networks for Document Processing").

Convolutional Layer: im2col
[Figure: image patches unrolled into columns so that convolution becomes a matrix-matrix multiply.]

CIFAR-10 Training
http://www.cs.toronto.edu/kriz/cifar.html
60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

Exercises
- Look at the definition (Backward) of the following layers: sigmoid, tanh
- Implement a new layer: softplus $y_l = \log(1 + e^{y_{l-1}})$
- Train CIFAR-10 with different topologies
- Port CIFAR-100 to caffe
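For the softplus exercise, here is a hedged C++ sketch of what the forward and backward passes could look like, using the fact that the derivative of log(1 + e^x) is the logistic sigmoid. It is one possible solution outline with illustrative names, not a provided reference implementation.

#include <cmath>
#include <vector>

// Softplus layer: y = log(1 + exp(x)).  dE/dx = dE/dy * sigmoid(x),
// since d/dx log(1 + e^x) = e^x / (1 + e^x) = 1 / (1 + e^{-x}).
void softplus_forward(const std::vector<float>& x, std::vector<float>* y) {
  y->resize(x.size());
  for (size_t i = 0; i < x.size(); ++i) (*y)[i] = std::log(1.f + std::exp(x[i]));
}

void softplus_backward(const std::vector<float>& x, const std::vector<float>& top_diff,
                       std::vector<float>* bottom_diff) {
  bottom_diff->resize(x.size());
  for (size_t i = 0; i < x.size(); ++i)
    (*bottom_diff)[i] = top_diff[i] / (1.f + std::exp(-x[i]));
}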