Tricks from Deep Neural Network
Tong Wang
AI Lab, University of Massachusetts Boston
http://www.cs.umb.edu/~twang/
Advisor: Dr. Ping Chen
November 17, 2016
Tong Wang Tricks from Deep Neural Network 1 / 29
Outline
1 Neural Network Architectures
2 Preprocessing and Initialization
3 Activation Function
4 Objective Function and Optimization
5 Regularization
6 Deep Learning Frameworks and Tools
Neural Network Architectures
Different neural network architectures suit different tasks. There are too many architectures to list them all; here are the most common and popular ones.
I Feedforward
  Multilayer Perceptron (traditional classification or regression problems)
  Autoencoders (unsupervised learning, data representation, dimensionality reduction)
  Restricted Boltzmann machines (unsupervised or supervised learning, probabilistic generative model)
I Convolutional
  Deep Convolutional Networks (image/video classification, object detection)
I Recurrent
  Recurrent Neural Network (speech recognition, language modeling)
  Long Short-Term Memory/GRU (RNN with memory cell)
More Advanced Architectures
I CNN+RNN (video classification, image question answering)
I Recursive (natural language processing)
I Sequence to Sequence Model (Sutskever et al. 2014) (machine translation)
I Neural Turing Machine (Graves et al. 2014) (copy, sorting...)
I Memory Network (Weston et al. 2014) (question answering, dialog)
I Generative Adversarial Network (Goodfellow et al. 2014) (generative model)
I Residual Network (He et al. 2015) (very deep neural networks)
I And more ...
Which architecture should we use?
I It depends on your task (e.g., use a CNN for image classification, an RNN for speech recognition) and on the size of your dataset (the more data, the more layers, neurons and kernels you can afford).
I How many hidden layers?
  In theory, two fully connected hidden layers are enough to approximate any function. A few more layers can make learning more efficient, but making a fully connected network too deep does not help. Convolutional networks can be very deep.
I How many neurons in each hidden layer?
  Too few will result in underfitting; too many will result in overfitting. The number of hidden neurons should not be much larger than the number of neurons in the input layer (e.g., input layer: 10, hidden layer: 512. Not good!). Roughly compute the total number of parameters; it should not be much larger than the size of the dataset (e.g., total parameters: 1 million, data: 1000 examples. Not good!).
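The rough parameter count mentioned above can be sketched in a few lines of Python. The function name `count_mlp_parameters` and the layer sizes are made up for illustration; the count covers only fully connected layers with biases.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total


# Hypothetical network: 10 inputs, two hidden layers of 512 units, 2 outputs.
n_params = count_mlp_parameters([10, 512, 512, 2])
n_examples = 1000  # hypothetical dataset size

print("parameters:", n_params)
print("parameters per training example:", n_params / n_examples)
```

A ratio far above 1, as in this made-up case, is the situation the rule of thumb above warns about.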
AlexNet
I Dataset: ImageNet Large-Scale Visual Recognition Challenge, 1.2 million training images, 50,000 validation images, and 150,000 testing images, 1000 class labels.
I Architecture: 8 layers in total, 5 convolutional and 3 fully connected; the output of the last fully connected layer is fed to a 1000-way softmax.
I Parameters: 60 million parameters.
Figure 1 : AlexNet (Krizhevsky et al. 2012)
Preprocess Data
I Split the data into train/val/test splits. The validation set is used for tuning hyper-parameters, which is extremely important when training neural networks.
I Normalize the features in your data to have zero mean and unit variance. This puts all features on the same scale.
I If your data is very high-dimensional, consider using a dimensionality reduction technique such as PCA.
I If the size of your dataset is small, you can also do data augmentation (e.g., CV: horizontal flipping, random crops and color jittering; NLP: synonym substitution).
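A minimal NumPy sketch of the split and normalization steps above. The 80/10/10 split ratio, the random toy data and the helper names `fit_standardizer` and `standardize` are assumptions for illustration. The statistics are computed on the training split only and then reused for the other splits, so no information from validation or test data leaks into preprocessing.

```python
import numpy as np


def fit_standardizer(x_train, eps=1e-8):
    """Compute per-feature mean and standard deviation on the training split."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0) + eps  # eps guards against constant features
    return mean, std


def standardize(x, mean, std):
    """Shift and scale features with previously computed statistics."""
    return (x - mean) / std


rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # toy data, already shuffled

# Assumed 80/10/10 train/val/test split.
x_train, x_val, x_test = data[:800], data[800:900], data[900:]

mean, std = fit_standardizer(x_train)
x_train_n = standardize(x_train, mean, std)
x_val_n = standardize(x_val, mean, std)    # reuse the training statistics
x_test_n = standardize(x_test, mean, std)  # reuse the training statistics
```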
Weight Initialization
I All-zero initialization: not good. Every neuron computes the same output and receives the same gradient during back-propagation.
I Zero-mean Gaussian with 0.01 stddev
I Uniform Distribution
I Glorot normal/uniform: Gaussian or uniform initialization with variance scaled by fan in + fan out (Glorot et al., 2010), keeping the signal in a reasonable range of values through many layers
I He normal/uniform: Gaussian or uniform initialization with variance scaled by fan in (He et al., 2015)
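I Orthogonal: The singular values of an orthogonally initialized matrix are all one (its eigenvalues have unit magnitude). This helps with vanishing and exploding gradients, because repeated multiplication neither shrinks nor blows up the signal. Especially good for RNN/LSTM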
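The schemes in the list above can be sketched with NumPy as follows. The function names and matrix shapes are illustrative; the scaling constants follow the commonly used forms (variance 2 / (fan_in + fan_out) for Glorot, 2 / fan_in for He), and the orthogonal initializer uses a QR decomposition, which is one common way to obtain an orthogonal matrix.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Uniform on [-limit, limit], giving variance 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def he_normal(fan_in, fan_out, rng):
    """Zero-mean Gaussian with variance 2 / fan_in."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))


def orthogonal(n, rng):
    """Square orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    a = rng.normal(size=(n, n))
    q, r = np.linalg.qr(a)
    # Flip column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))


rng = np.random.default_rng(0)
w_glorot = glorot_uniform(256, 128, rng)
w_he = he_normal(256, 128, rng)
w_orth = orthogonal(64, rng)

# All singular values of an orthogonal matrix equal one.
print(np.linalg.svd(w_orth, compute_uv=False)[:3])
```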
Fine-tune
We can use a well-trained Convolutional Neural Network (like AlexNet) as an initialization or as a fixed feature extractor, and only replace the later fully connected layers (Transfer Learning).
I Retrain the whole neural network on your dataset
I Fix the convolutional layers and only retrain the fully connected layers on your dataset
I However, if your dataset is large and very different from the original dataset, you have to build and train your own CNN from scratch
Activation Function
Figure 2 : Activation Function (figure from Stanford cs231n)
I Each neuron performs a dot product with the input and its weights, adds the bias and applies the non-linearity (or activation function)
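The computation of a single neuron described above can be written out directly. The input, weight and bias values below are arbitrary numbers chosen for illustration, and `neuron` is a hypothetical helper name.

```python
import numpy as np


def neuron(x, w, b, activation):
    """Dot product of inputs and weights, plus bias, passed through a non-linearity."""
    return activation(np.dot(w, x) + b)


x = np.array([1.0, -2.0, 0.5])   # inputs
w = np.array([0.2, 0.4, -0.6])   # weights
b = 0.1                          # bias

pre_activation = neuron(x, w, b, lambda z: z)          # identity: the raw weighted sum
print("weighted sum + bias:", pre_activation)
print("with tanh:", neuron(x, w, b, np.tanh))
print("with ReLU:", neuron(x, w, b, lambda z: np.maximum(0.0, z)))
```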
Sigmoid
Figure 3 : Sigmoid $\sigma(x) = \frac{1}{1+e^{-x}}$
Pros:
I The output is between 0 and 1, so it can be used in the output layer.
I Easy to compute derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
Cons:
I Saturates when receiving strong signals: the derivative is close to 0 at both ends, which drives the gradients in previous layers towards 0.
I Exploding gradient problems. Trick: clip the gradients (if a gradient exceeds a threshold, push it down to that threshold)
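A small NumPy sketch of the points above: the sigmoid, its derivative (which nearly vanishes at both ends), and gradient clipping. The clipping shown here is element-wise by value, which matches the description on this slide; clipping by the gradient norm is another common variant. The function names and the threshold are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    """Derivative sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def clip_by_value(grad, threshold):
    """Push every gradient component back into [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)


xs = np.array([-10.0, 0.0, 10.0])
print(sigmoid_grad(xs))  # tiny at both ends, largest at x = 0

grad = np.array([-5.0, 0.3, 7.0])
print(clip_by_value(grad, threshold=1.0))
```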
Tanh
Figure 4 : $\tanh(x) = \frac{1-e^{-2x}}{1+e^{-2x}} = 2\sigma(2x) - 1$
Pros:
I The output is centered around 0
I Easy to compute derivative: $(\tanh(x))' = 1 - \tanh^2(x)$. Converges faster than Sigmoid
Cons:
I Saturating, vanishing or exploding gradient problems
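The relation between tanh and the sigmoid stated in the caption of Figure 4, and the derivative listed under Pros, can be checked in a few steps using only $\sigma(x) = 1/(1+e^{-x})$ and $\sigma'(x) = \sigma(x)(1-\sigma(x))$:

```latex
\begin{aligned}
2\sigma(2x) - 1 &= \frac{2}{1 + e^{-2x}} - 1
                 = \frac{2 - (1 + e^{-2x})}{1 + e^{-2x}}
                 = \frac{1 - e^{-2x}}{1 + e^{-2x}} = \tanh(x), \\[4pt]
\tanh'(x) &= 4\,\sigma'(2x) = 4\,\sigma(2x)\bigl(1 - \sigma(2x)\bigr)
           = 4 \cdot \frac{1 + \tanh(x)}{2} \cdot \frac{1 - \tanh(x)}{2}
           = 1 - \tanh^2(x).
\end{aligned}
```

The second line substitutes $\sigma(2x) = (1 + \tanh(x))/2$, which is the first identity rearranged.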
Rectified Linear Unit (ReLU)
Figure 5 : ReLU(x)=max(0,x)
Pros:
I Sparse representation
I AlexNet paper reports ReLUs train six times faster than equivalent network with tanh neurons
I Derivative is a constant of either 0 or 1, no vanishing gradient problem.
Cons:
I Dead neurons. A large gradient can push the weights to a point where the neuron outputs 0 for every input; from then on its gradient is 0 forever and the weights no longer update.
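The dead-neuron effect can be reproduced with a toy example. The inputs, weights and the strongly negative bias are made-up values; the bias stands in for the state a neuron might reach after one very large update.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def relu_grad(z):
    """Derivative of ReLU: 1 where the pre-activation is positive, else 0."""
    return (z > 0).astype(float)


rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))       # hypothetical batch of inputs
w = np.array([0.1, -0.2, 0.3])
b_dead = -100.0                     # bias pushed far negative by a large update

z = x @ w + b_dead                  # pre-activation is negative for every input
print(relu(z).max())                # 0.0: the neuron never fires
print(relu_grad(z).sum())           # 0.0: no gradient flows back to w or b
```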
Objective Function/Loss Function
You should choose the right loss function based on your problem and your data.
I Classification
  Cross-entropy loss
  Binary cross-entropy loss
  Hinge loss (max-margin loss)
I Regression
  Mean square loss
  Mean absolute loss
If the loss is minimized but accuracy is low, you should check the loss function. Maybe it is not appropriate for your task.
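For reference, the losses listed above can be sketched in NumPy as below. These are common textbook forms, not necessarily the exact variants the slides have in mind; in particular the hinge loss shown is one multiclass formulation, and all function names are illustrative.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Softmax cross-entropy for integer class labels, averaged over the batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def multiclass_hinge(scores, labels, margin=1.0):
    """Sum of margin violations against the correct class, averaged over the batch."""
    idx = np.arange(len(labels))
    correct = scores[idx, labels][:, None]
    margins = np.maximum(0.0, scores - correct + margin)
    margins[idx, labels] = 0.0  # the correct class does not count as a violation
    return margins.sum(axis=1).mean()


def mean_squared_error(pred, target):
    return np.mean((pred - target) ** 2)


def mean_absolute_error(pred, target):
    return np.mean(np.abs(pred - target))


logits = np.zeros((2, 4))          # uniform predictions over 4 classes
labels = np.array([0, 3])
print(cross_entropy(logits, labels))  # equals log(4) for uniform predictions
```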
Optimization
I Stochastic Gradient Descent
I Adagrad (Duchi 2011)
I RMSProp (Hinton)
I Adadelta (Zeiler 2012)
I Adam (Kingma et al. 2014)
Parameters: learning rate (the most important parameter), momentum (helps to converge faster), decay, etc.
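As a minimal sketch of how the learning rate and momentum parameters enter the update, here is classical SGD with momentum on a toy quadratic objective. The objective, the hyper-parameter values and the function name `sgd_momentum_step` are assumptions for illustration; the adaptive methods in the list (Adagrad, RMSProp, Adadelta, Adam) use different update rules.

```python
import numpy as np


def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One update: the velocity accumulates past gradients, then moves the weights."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([5.0, -3.0])
velocity = np.zeros_like(w)

for _ in range(200):
    grad = w
    w, velocity = sgd_momentum_step(w, grad, velocity)

print(np.linalg.norm(w))  # close to the minimum at the origin
```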
Learning rates
Figure 6 : Learning rates (from Stanford cs231n)
Try a large learning rate first, then decrease it.
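One simple way to start large and then decrease is a step decay schedule. The halving factor, the 10-epoch interval and the name `step_decay` are illustrative choices, not values from the slides.

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(0.1, epoch))
```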
Overfitting and Regularization
I Models with a large number of free parameters can easily fit the available training data well, but may fail to generalize to the test data.
Figure 7 : (a) Neural network capacity, (b) regularization. We should use a big neural network and use regularization to control overfitting. Figure from Stanford cs231n
L1/L2 regularization
The basic idea is to penalize large weights, which tends to improve generalization. The objective function becomes: $E(x) + \lambda L_p(w)$
I L1 regularization: $\sum_{i=1}^{m} |w_i|$, also known as Lasso, produces sparse results.
I L2 regularization: $\sum_{i=1}^{m} w_i^2$, also known as ridge, or weight decay. The most widely used regularization method.
I L1+L2 regularization: $\lambda_1 L_1 + \lambda_2 L_2$, also known as elastic net regularization
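The penalties above translate directly into code. In this sketch `data_loss` stands for the unregularized objective E(x), and the function names and the lambda values are illustrative; the gradient helper uses the sign function as the usual subgradient of the absolute value.

```python
import numpy as np


def l1_penalty(w):
    return np.sum(np.abs(w))


def l2_penalty(w):
    return np.sum(w ** 2)


def regularized_loss(data_loss, w, lam1=0.0, lam2=0.0):
    """Elastic net form; set lam2 = 0 for pure L1 or lam1 = 0 for pure L2."""
    return data_loss + lam1 * l1_penalty(w) + lam2 * l2_penalty(w)


def penalty_grad(w, lam1=0.0, lam2=0.0):
    """Extra gradient term added to the data gradient by the penalties."""
    return lam1 * np.sign(w) + 2.0 * lam2 * w


w = np.array([1.0, -2.0, 0.5])
print(regularized_loss(1.0, w, lam1=0.1, lam2=0.01))
print(penalty_grad(w, lam1=0.1, lam2=0.01))
```

The L2 gradient term is proportional to the weight itself, which is why the method is also called weight decay.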
Dropout
I Combining the predictions of many different models is a very successful way to reduce test errors, e.g., Bagging
I Randomly set 50% (the rate is configurable) of the inputs to each neuron to 0. The neurons which are "dropped out" do not contribute to the forward pass or to backpropagation.
I Prevents overfitting, but roughly doubles the convergence time.
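A minimal sketch of dropout in NumPy. It uses the "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at test time; this is a common implementation choice and an assumption here, not something the slides specify. The function name and shapes are illustrative.

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return activations  # at test time the layer is the identity
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
a = np.ones((1000, 100))
out = dropout(a, drop_prob=0.5, rng=rng)

print("fraction dropped:", np.mean(out == 0.0))
print("mean activation:", out.mean())
```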
Dropout
Figure 8 : Dropout (Srivastava et al. 2014)
Batch Normalization
The distribution of each layer's inputs changes during training, so the layers need to continuously adapt to the new distribution. Batch Normalization (Ioffe et al. 2015) is a good way to address this.
I Normalizing each layer, for each mini-batch
I Greatly accelerate training
I Less sensitive to initialization
I Improve regularization
Remember to put the BatchNorm layer immediately after fully connected layers (or convolutional layers), and before the activation.
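The placement described above (linear layer, then BatchNorm, then activation) can be sketched as follows. Only the training-time forward pass is shown; the running averages used at test time and the backward pass are omitted, and the names, shapes and the `eps` value are assumptions for illustration.

```python
import numpy as np


def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mean) / np.sqrt(var + eps)
    return gamma * z_hat + beta


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))            # mini-batch of 32 examples
w = rng.normal(size=(8, 4))
b = np.zeros(4)
gamma, beta = np.ones(4), np.zeros(4)   # learnable scale and shift

z = x @ w + b                                # fully connected layer
z_bn = batch_norm_forward(z, gamma, beta)    # BatchNorm right after it
h = np.maximum(0.0, z_bn)                    # activation comes last
```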
Early stopping
Use the parameters that give the best validation error. Stop training before the model overfits.
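A common way to implement this is to track the best validation error and stop once it has not improved for a fixed number of epochs. The patience value, the function name and the list of validation errors below are made up for illustration; in practice the model parameters would be saved at each new best epoch.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors."""
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch  # checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch               # no improvement for `patience` epochs
    return best_epoch, len(val_errors) - 1


# Hypothetical validation curve: improves, then starts to rise (overfitting).
val_errors = [0.50, 0.40, 0.35, 0.33, 0.34, 0.36, 0.37, 0.40]
best_epoch, stopped_at = early_stopping_epoch(val_errors)
print(best_epoch, stopped_at)
```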
Figure 9 : (a) Training error and (b) validation error
Deep Learning Frameworks and Tools
GPU is a must!
I Theano: Python, supported by the University of Montreal. Many academic researchers in the field of deep learning rely on Theano.
I TensorFlow: Python/C/C++, supported by Google. It is becoming more and more popular, and many companies have started to use it.
I Torch: Lua, supported by Facebook.
I Caffe: C/C++, popular framework, especially in the computer vision community.
I Keras/Lasagne/Blocks: Built on top of Theano or TensorFlow, high-level wrappers.