Deep Neural Networks
[GBC] Chap. 6, 7, 8
CS 486/686, University of Waterloo
Lecture 18: June 28, 2017
Outline
- Deep neural networks
- Vanishing gradients
- Rectified linear units
- Overfitting
- Dropout
- Breakthroughs
  - Acoustic modeling in speech recognition
  - Image recognition
Deep Neural Network
- Definition: a neural network with many hidden layers
- Advantage: high expressivity
- Challenges:
  - How should we train a deep neural network?
  - How can we avoid overfitting?
Expressivity
- Neural networks with a single hidden layer of sigmoid/hyperbolic units can approximate, arbitrarily closely, neural networks with several hidden layers of sigmoid/hyperbolic units
- However, as we increase the number of layers, the number of units needed to represent the same function may decrease exponentially (in the number of layers)
Example: Parity Function
[Figure: network computing the parity of its inputs with a single layer of hidden nodes]
Example: Parity Function
[Figure: network computing the parity of its inputs with several layers of hidden nodes; each stage only has to distinguish the 2 odd subsets of its pair of inputs]
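A minimal sketch (not from the slides) of why depth helps for parity: parity is an iterated XOR, so a deep network of 2-input XOR units uses O(n) units arranged in O(log n) layers, whereas the single-hidden-layer construction hinted at above (one hidden unit per odd subset of the inputs) needs 2^(n-1) units.

```python
# Illustrative sketch: parity of n bits computed by a deep network of pairwise
# XOR "units", instead of the 2^(n-1) hidden units a single hidden layer would
# need with one unit per odd subset of the inputs.
import numpy as np

def xor_unit(a, b):
    # A 2-input XOR is itself computable by a handful of threshold/sigmoid units.
    return a ^ b

def deep_parity(bits):
    """Compute parity with O(n) XOR units arranged in O(log n) layers."""
    layer = list(bits)
    while len(layer) > 1:
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            nxt.append(xor_unit(layer[i], layer[i + 1]))
        if len(layer) % 2 == 1:        # carry the unpaired bit to the next layer
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

bits = np.random.randint(0, 2, size=8)
assert deep_parity(bits) == bits.sum() % 2
```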
The Power of Depth (Practice)
- Challenge: how do we train deep neural networks?
Speech
- 2006 (Hinton): first effective algorithm for training deep NNs
  - layerwise training of Stacked Restricted Boltzmann Machines (SRBMs)
- 2009: breakthrough in acoustic modeling
  - replace Gaussian mixture models by SRBMs
  - improved speech recognition at Google, Microsoft, IBM
- 2013-today: recurrent neural nets (LSTMs)
  - Google error rate: 23% (2013) down to 8% (2015)
  - Microsoft error rate: 5.9% (Oct 17, 2016), the same as human performance
Image Classification
- ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
[Figure: classification error (%) by entry; features + SVMs achieve 28.2 and 25.8, while deep convolutional neural nets of depth 8, 19, 22, and 152 bring the error down through 16.4, 11.7, 7.3, 6.7, 3.57, and 3.07, compared with about 5.1 for humans]
Vanishing Gradients
- Deep neural networks of sigmoid and hyperbolic units often suffer from vanishing gradients
[Figure: deep network annotated with large gradients near the output, medium gradients in the middle layers, and small gradients near the input]
Sigmoid and Hyperbolic Units
- The derivative is always less than 1
[Figure: sigmoid and hyperbolic tangent functions and their derivatives]
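For reference, these are the standard derivative identities behind that claim (not reproduced on the slide): the sigmoid's slope never exceeds 1/4 and tanh's never exceeds 1.

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4}, \qquad
\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x) \le 1
```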
Simple Example
- Common weight initialization: weights in (-1, 1)
- The sigmoid function and its derivative are always less than 1
- Backpropagation multiplies one such factor per layer, so the gradient reaching the early layers shrinks geometrically with depth; this leads to vanishing gradients (see the sketch below)
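A minimal numerical sketch of that effect, assuming a toy chain of single-unit sigmoid layers with weights drawn uniformly from (-1, 1), as on the slide:

```python
# Each backward step multiplies the gradient by w_k * sigmoid'(z_k), and both
# factors are smaller than 1, so the gradient vanishes as we move toward the input.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

depth = 20
weights = rng.uniform(-1.0, 1.0, size=depth)

# Forward pass through the chain x -> sigmoid(w1*x) -> sigmoid(w2*h1) -> ...
h = 0.5
activations = []
for w in weights:
    h = sigmoid(w * h)
    activations.append(h)

# Backward pass from the output toward the input.
grad = 1.0
for w, a in zip(reversed(weights), reversed(activations)):
    grad *= w * a * (1.0 - a)          # sigmoid'(z) = a * (1 - a)
    print(f"|gradient| after this layer: {abs(grad):.2e}")
```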
Avoiding Vanishing Gradients
Two popular solutions:
- Pre-training
- Rectified linear units and maxout units
Rectified Linear Units (ReLU)
- Rectified linear: relu(x) = max(0, x)
  - gradient is 0 or 1
  - sparse computation
- Soft version: softplus(x) = log(1 + e^x)
  - Warning: softplus does not prevent vanishing gradients (its gradient is always < 1)
[Figure: rectified linear and softplus activation functions]
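A minimal sketch (not from the slides) contrasting the two activations and their gradients:

```python
# ReLU's gradient is exactly 0 or 1, so it does not shrink when multiplied
# across layers; softplus's gradient is a sigmoid, always strictly less than 1.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)           # gradient is exactly 0 or 1

def softplus(x):
    return np.log1p(np.exp(x))             # log(1 + e^x)

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))        # sigmoid(x), always < 1

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x), relu_grad(x))
print(softplus(x), softplus_grad(x))
```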
Maxout Units
- Generalization of rectified linear units
[Figure: a maxout unit takes the max of several linear (identity-activation) units]
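A minimal sketch of a single maxout unit, assuming k linear pieces parameterized by a weight matrix W and bias vector b (these names are chosen here for illustration):

```python
# A maxout unit outputs the maximum over k linear (identity-activation)
# pre-activations; ReLU is the special case with one piece fixed at zero.
import numpy as np

def maxout(x, W, b):
    """x: (d,) input; W: (k, d) weights; b: (k,) biases. Returns max_i (W_i x + b_i)."""
    return np.max(W @ x + b, axis=0)

rng = np.random.default_rng(1)
x = rng.normal(size=3)
W = rng.normal(size=(4, 3))    # k = 4 linear pieces
b = rng.normal(size=4)
print(maxout(x, W, b))
```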
Overfitting
- High expressivity increases the risk of overfitting
  - the number of parameters is often larger than the amount of data
- Solutions:
  - regularization
  - dropout
  - data augmentation
Dropout
- Idea: randomly drop some units from the network when training
- Training: at each iteration of gradient descent
  - each hidden unit is dropped with probability 0.5
  - each input unit is dropped with probability 0.2
- Prediction (testing):
  - multiply the output of each unit by one minus its drop probability (see the sketch below)
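A minimal sketch of a dropout layer under these drop probabilities (names and shapes are illustrative, not from the slides):

```python
# During training each unit is kept with probability (1 - p_drop); at test time
# no units are dropped and each output is scaled by (1 - p_drop) instead.
import numpy as np

rng = np.random.default_rng(2)

def dropout_layer(h, p_drop, training):
    if training:
        mask = (rng.random(h.shape) >= p_drop).astype(float)   # 0 = dropped
        return h * mask
    # Prediction: multiply the output of each unit by one minus its drop probability.
    return h * (1.0 - p_drop)

h = rng.normal(size=5)                                  # hidden activations
print(dropout_layer(h, p_drop=0.5, training=True))      # hidden units: p = 0.5
print(dropout_layer(h, p_drop=0.5, training=False))
```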
Intuition
- Dropout can be viewed as an approximate form of ensemble learning
- In each training iteration, a different subnetwork is trained
- At test time, these subnetworks are merged by averaging their weights
Robustness
- In sexual reproduction, half of the genes of two individuals are dropped and the remaining genes are merged to produce a new individual
- Genes are forced to evolve independently so that most combinations yield functional individuals
- Similarly, units in a neural net are forced to capture features that are largely independent of the other units
Applications of Deep Neural Networks
- Speech recognition
- Image recognition
- Machine translation
- Control
- Any application of shallow neural networks
Acoustic Modeling in Speech Recognition
Image Recognition
- Convolutional neural network (a small sketch combining these ingredients follows)
  - with rectified linear units and dropout
  - data augmentation for transformation invariance
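A minimal sketch using PyTorch, assuming 32x32 RGB inputs and 10 classes; this is not the network from the slides, just the listed ingredients (convolutions, rectified linear units, dropout) put together. Data augmentation (e.g. random crops and flips) would be applied to the training images before they reach this network.

```python
# Tiny convolutional network with ReLU activations and dropout.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),    # 3x32x32 -> 16x32x32
    nn.ReLU(),
    nn.MaxPool2d(2),                                # -> 16x16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),    # -> 32x16x16
    nn.ReLU(),
    nn.MaxPool2d(2),                                # -> 32x8x8
    nn.Flatten(),
    nn.Dropout(p=0.5),                              # dropout on hidden units
    nn.Linear(32 * 8 * 8, 10),                      # class scores
)

x = torch.randn(4, 3, 32, 32)    # a batch of 4 images
print(model(x).shape)            # torch.Size([4, 10])
```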
ImageNet Breakthrough
- Results: ILSVRC-2012
- From Krizhevsky, Sutskever, Hinton
ImageNet Breakthrough
- From Krizhevsky, Sutskever, Hinton