Tricks from Deep Neural Network

Tong Wang

AI Lab, University of Massachusetts Boston

http://www.cs.umb.edu/~twang/

Advisor: Dr. Ping Chen

November 17, 2016

Outline

1 Neural Network Architectures

2 Preprocessing and Initialization

3 Activation Function

4 Objective Function and Optimization

5 Regularization

6 Deep Learning Frameworks and Tools


Neural Network Architectures


Neural Network Architectures

Different neural network architectures suit different tasks. There are too many architectures to list, so here are the most common and popular ones.

- Feedforward
  - Multilayer Perceptron (traditional classification or regression problems)
  - Autoencoders (unsupervised learning, data representation, dimensionality reduction)
  - Restricted Boltzmann Machines (unsupervised or supervised learning, probabilistic generative model)

- Convolutional
  - Deep Convolutional Networks (image/video classification, object detection)

- Recurrent
  - Recurrent Neural Network (speech recognition, language modeling)
  - Long Short-Term Memory/GRU (RNN with memory cell)


More Advanced Architectures

- CNN+RNN (video classification, image question answering)

- Recursive (natural language processing)

- Sequence to Sequence Model (Sutskever et al. 2014) (machine translation)

- Neural Turing Machine (Graves et al. 2014) (copy, sorting...)

- Memory Network (Weston et al. 2014) (question answering, dialog)

- Generative Adversarial Network (Goodfellow et al. 2014) (generative model)

- Residual Network (He et al. 2015) (very deep neural networks)

- And more ...


Which architecture should we use?

- It depends on your task (e.g., use a CNN for image classification, an RNN for speech recognition) and on the size of your dataset (the more data, the more layers, neurons and kernels you can afford).

- How many hidden layers?
  - In theory, two fully connected hidden layers with enough neurons can represent any function.
  - A few more layers can make the model learn more efficiently, but going too deep does not help.
  - Convolutional layers can be stacked very deep.

- How many neurons in each hidden layer?
  - Too few results in underfitting; too many results in overfitting.
  - The number of hidden neurons should not be much larger than the number of neurons in the input layer (e.g., input layer: 10, hidden layer: 512. Not good!).
  - Roughly compute the total number of parameters. It should not be much larger than the dataset size (e.g., total parameters: 1 million, data: 1000. Not good!).
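The rough parameter count suggested above can be sketched in a few lines of Python. The function name and the example layer sizes are illustrative assumptions, not part of the original slides; the count covers only the weights and biases of fully connected layers.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input first, output last.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix between two layers
        total += fan_out           # one bias per unit in the next layer
    return total


# Hypothetical network: 10 inputs, one hidden layer of 512 units, 2 outputs.
n_params = count_mlp_parameters([10, 512, 2])
n_examples = 1000  # assumed dataset size
print("parameters:", n_params)
print("parameters per training example:", n_params / n_examples)
```

Comparing the printed ratio across candidate layer sizes gives a quick sanity check before any training is run.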


AlexNet

- Dataset: ImageNet Large-Scale Visual Recognition Challenge, 1.2 million training images, 50,000 validation images, and 150,000 testing images, 1000 class labels.

- Architecture: 8 layers in total, 5 convolutional and 3 fully connected; the output of the last fully connected layer is fed to a 1000-way softmax.

- Parameters: 60 million parameters.

Figure 1: AlexNet (Krizhevsky et al. 2012)


Preprocessing and Initialization


Preprocess Data

- Split the data into train/val/test splits. The validation set is used for tuning hyper-parameters, which is extremely important for training neural networks.

- Normalize the features in your data to have zero mean and unit variance. This puts all features on the same scale.

- If your data is very high-dimensional, consider using a dimensionality reduction technique such as PCA.

- If your dataset is small, you can also do data augmentation (e.g., CV: horizontal flipping, random crops and color jittering; NLP: synonym substitution).
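A minimal NumPy sketch of the split and normalization steps above. The 80/10/10 split ratio, the synthetic data and the helper names are assumptions made for illustration. The sketch also assumes the common practice of computing the mean and standard deviation on the training split only and reusing them for the other splits.

```python
import numpy as np


def fit_standardizer(train_features, eps=1e-8):
    """Return per-feature mean and standard deviation of the training split."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0) + eps  # eps avoids division by zero
    return mean, std


def standardize(features, mean, std):
    """Shift and scale features with previously fitted statistics."""
    return (features - mean) / std


rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # synthetic features

# Assumed 80/10/10 train/val/test split.
train, val, test = data[:800], data[800:900], data[900:]

mean, std = fit_standardizer(train)
train_n = standardize(train, mean, std)
val_n = standardize(val, mean, std)
test_n = standardize(test, mean, std)

print(train_n.mean(axis=0), train_n.std(axis=0))
```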


Weight Initialization

- All-zero initialization: not good. Every neuron computes the same output and receives the same gradient during back-propagation.

- Zero-mean Gaussian with 0.01 standard deviation

- Uniform distribution

- Glorot normal/uniform: initialization with variance scaled by fan-in + fan-out, $\mathrm{Var}(w) = \frac{2}{\text{fan\_in} + \text{fan\_out}}$ (Glorot et al., 2010), keeping the signal in a reasonable range of values through many layers

- He normal/uniform: initialization with variance scaled by fan-in, $\mathrm{Var}(w) = \frac{2}{\text{fan\_in}}$ (He et al., 2015)

- Orthogonal: the eigenvalues of an orthogonally initialized matrix all have absolute value one, so repeated multiplication neither blows up nor shrinks the signal. This helps against exploding and vanishing gradients. Especially good for RNN/LSTM
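The schemes listed above can be sketched in NumPy as follows. The function names are illustrative and do not come from any particular framework. The orthogonal variant uses a QR decomposition of a Gaussian matrix, which is one common construction and is assumed here.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Uniform on [-limit, limit], which has variance 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def he_normal(fan_in, fan_out, rng):
    """Zero-mean Gaussian with variance 2 / fan_in."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))


def orthogonal(n, rng):
    """Square orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    a = rng.normal(size=(n, n))
    q, r = np.linalg.qr(a)
    # Flip column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))


rng = np.random.default_rng(0)
w_glorot = glorot_uniform(256, 128, rng)
w_he = he_normal(256, 128, rng)
w_orth = orthogonal(64, rng)

print(w_glorot.var(), 2.0 / (256 + 128))
print(w_he.var(), 2.0 / 256)
print(np.allclose(w_orth.T @ w_orth, np.eye(64)))
```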


Fine-tune

A well-trained convolutional neural network (such as AlexNet) can serve as an initialization or as a fixed feature extractor; only the later fully connected layers are replaced (transfer learning).

- Retrain the whole neural network on your dataset.

- Fix the convolutional layers and retrain only the fully connected layers on your dataset.

- However, if your dataset is large and very different from the original dataset, you may need to build and train your own CNN from scratch.


Activation Functions


Activation Function

Figure 2: Activation function (figure from Stanford CS231n)

- Each neuron performs a dot product of the input with its weights, adds the bias and applies the non-linearity (or activation function).
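The computation of a single neuron described above, as a small NumPy sketch. The input, weight and bias values are made up for illustration, and the sigmoid is used only as an example non-linearity.

```python
import numpy as np


def sigmoid(z):
    """Logistic non-linearity."""
    return 1.0 / (1.0 + np.exp(-z))


def neuron(x, w, b, activation):
    """Dot product of input and weights, plus bias, then the activation."""
    return activation(np.dot(w, x) + b)


x = np.array([1.0, -2.0, 0.5])    # hypothetical input
w = np.array([0.5, 0.25, -1.0])   # hypothetical weights
b = 0.5                           # hypothetical bias

output = neuron(x, w, b, sigmoid)
print(output)
```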


Sigmoid

Figure 3: Sigmoid, $\sigma(x) = \frac{1}{1 + e^{-x}}$

Pros:

- The output is between 0 and 1, so it can be used in the output layer.

- The derivative is easy to compute: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$

Cons:

- Saturates on strong signals: the derivative approaches 0 at both ends, which drives the gradients in earlier layers towards 0.

- Exploding gradient problems. Trick: clip the gradients (if a gradient exceeds a threshold, push it down to that threshold).
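A short sketch of the sigmoid, its derivative, and the clipping trick. The slide does not say whether the threshold applies to each gradient entry or to the gradient norm. The version below assumes norm-based clipping, and the function names are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    """Derivative sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold (assumed variant)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad


# Saturation: the derivative peaks at 0 and shrinks towards both ends.
print(sigmoid_grad(np.array([-10.0, 0.0, 10.0])))

# A gradient of norm 5 is scaled down to norm 1; a small one is left alone.
clipped = clip_by_norm(np.array([3.0, 4.0]), threshold=1.0)
untouched = clip_by_norm(np.array([0.3, 0.4]), threshold=1.0)
print(clipped, untouched)
```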


Tanh

Figure 4: $\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} = 2\sigma(2x) - 1$

Pros:

- The output is centered around 0

- Easy to compute derivative: $(\tanh(x))' = 1 - \tanh^2(x)$. Converges faster than sigmoid

Cons:

- Saturating, vanishing or exploding gradient problems
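The two formulas on this slide can be checked numerically. The sketch below compares tanh with the rescaled sigmoid, and the analytic derivative with a central finite difference; the grid of test points and the step size are arbitrary choices.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


x = np.linspace(-3.0, 3.0, 13)

# Identity: tanh(x) = 2 * sigmoid(2x) - 1
identity_gap = np.max(np.abs(np.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)))

# Derivative: 1 - tanh(x)^2, compared with a central finite difference.
h = 1e-5
numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2.0 * h)
analytic = 1.0 - np.tanh(x) ** 2
derivative_gap = np.max(np.abs(numeric - analytic))

print(identity_gap, derivative_gap)
```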


Rectified Linear Unit (ReLU)

Figure 5: $\mathrm{ReLU}(x) = \max(0, x)$

Pros:

- Sparse representation

- The AlexNet paper reports that ReLUs train six times faster than an equivalent network with tanh neurons

- The derivative is either 0 or 1, so there is no vanishing gradient problem for active units

Cons:

- Dead neurons. A large gradient flowing through a neuron can push its weights to a point where the neuron never activates again; its gradient is then 0 forever and the weights no longer update.
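A small sketch of ReLU, its derivative, and a dead neuron. The inputs and weights are made-up values chosen so that the pre-activation is negative for every example; the upstream gradient is assumed to be 1 for simplicity.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 where the input is positive, 0 elsewhere."""
    return (x > 0).astype(float)


print(relu(np.array([-1.0, 0.0, 2.0])))

# Hypothetical dead neuron: non-negative inputs, strongly negative weights.
inputs = np.array([[0.5, 1.0], [1.5, 0.2], [0.3, 0.8]])
w = np.array([-2.0, -3.0])
b = -0.1

pre_activation = inputs @ w + b  # negative for every example
# Gradient of the summed output w.r.t. w, assuming an upstream gradient of 1.
grad_w = (relu_grad(pre_activation)[:, None] * inputs).sum(axis=0)
print(pre_activation, grad_w)
```

Because the gradient is exactly zero for every example, no gradient step can move the weights back into the active region.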


Objective Function and Optimization


Objective Function/Loss Function

You should choose the right loss function based on your problem and your data.

- Classification
  - Cross-entropy loss
  - Binary cross-entropy loss
  - Hinge loss (max-margin loss)

- Regression
  - Mean squared loss
  - Mean absolute loss

If the loss is minimized but accuracy is low, check the loss function. It may not be appropriate for your task.
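The losses listed above, written as plain NumPy functions. This is a sketch under a few assumptions: predictions for the cross-entropy losses are already probabilities, hinge labels are encoded as -1/+1, and every loss is averaged over the batch. The function names are illustrative.

```python
import numpy as np


def cross_entropy(probs, labels, eps=1e-12):
    """probs: (n, classes) predicted probabilities; labels: (n,) class indices."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked + eps))


def binary_cross_entropy(p, y, eps=1e-12):
    """p: predicted probability of class 1; y: 0/1 targets."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def hinge(scores, y):
    """scores: raw classifier outputs; y: targets encoded as -1 or +1."""
    return np.mean(np.maximum(0.0, 1.0 - y * scores))


def mean_squared(pred, target):
    return np.mean((pred - target) ** 2)


def mean_absolute(pred, target):
    return np.mean(np.abs(pred - target))


print(cross_entropy(np.array([[0.5, 0.5]]), np.array([0])))
print(binary_cross_entropy(np.array([0.5]), np.array([1.0])))
print(hinge(np.array([2.0, -0.5]), np.array([1.0, -1.0])))
print(mean_squared(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
print(mean_absolute(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
```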


Optimization

- Stochastic Gradient Descent

- Adagrad (Duchi et al. 2011)

- RMSProp (Hinton)

- Adadelta (Zeiler 2012)

- Adam (Kingma et al. 2014)

Hyper-parameters: learning rate (the most important one), momentum (helps to converge faster), decay, etc.
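To make the roles of the learning rate and momentum concrete, here is a minimal sketch of gradient descent with classical momentum on a toy quadratic. The objective, the starting point and the hyper-parameter values are assumptions chosen for illustration; the full gradient is used in place of a mini-batch estimate.

```python
import numpy as np


def sgd_momentum(grad_fn, w0, learning_rate=0.1, momentum=0.9, steps=300):
    """Gradient descent with classical momentum.

    grad_fn returns the gradient of the objective at w.
    """
    w = np.array(w0, dtype=float)
    velocity = np.zeros_like(w)
    for _ in range(steps):
        # The velocity accumulates a decaying sum of past gradient steps.
        velocity = momentum * velocity - learning_rate * grad_fn(w)
        w = w + velocity
    return w


# Toy objective 0.5 * ||w - target||^2, whose gradient is (w - target).
target = np.array([3.0, -2.0])
w_final = sgd_momentum(lambda w: w - target, w0=[0.0, 0.0])
print(w_final)
```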


Learning rates

Figure 6: Learning rates (from Stanford CS231n)

Try a large learning rate first, then decrease it.
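One simple way to start large and then decrease is a step schedule. The sketch below halves the rate every 10 epochs; both numbers are arbitrary assumptions, and other schedules (exponential decay, reducing on a validation plateau) follow the same idea.

```python
def step_decay(initial_rate, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_rate * drop ** (epoch // epochs_per_drop)


for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(0.1, epoch))
```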


Regularization


Overfitting and Regularization

- Models with a large number of free parameters can easily fit the available training data, yet fail to generalize to the test data.

Figure 7: (a) Neural network capacity, (b) regularization. We should use a big neural network and control overfitting with regularization. Figure from Stanford CS231n.


L1/L2 regularization

The basic idea is to penalize large weights, which tends to improve generalization. The objective function becomes: $E(x) + \lambda L_p(w)$

- L1 regularization: $L_1(w) = \sum_{i=1}^{m} |w_i|$, also known as Lasso; produces sparse results.

- L2 regularization: $L_2(w) = \sum_{i=1}^{m} w_i^2$, also known as ridge or weight decay. The most widely used regularization method.

- L1 + L2 regularization: $\lambda_1 L_1 + \lambda_2 L_2$, also known as elastic net regularization.
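The penalties above translate directly into code. In this sketch the data loss, the weight vector and the two coefficients are made-up values; setting only one coefficient gives pure L1 or pure L2, and setting both gives the elastic net form.

```python
import numpy as np


def l1_penalty(w):
    """Sum of absolute weights."""
    return np.sum(np.abs(w))


def l2_penalty(w):
    """Sum of squared weights."""
    return np.sum(w ** 2)


def regularized_loss(data_loss, w, lam1=0.0, lam2=0.0):
    """Data loss plus lam1 * L1(w) + lam2 * L2(w)."""
    return data_loss + lam1 * l1_penalty(w) + lam2 * l2_penalty(w)


w = np.array([0.5, -1.0, 2.0])  # hypothetical weights
print(l1_penalty(w), l2_penalty(w))
print(regularized_loss(1.0, w, lam1=0.1, lam2=0.01))
```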


Dropout

- Combining the predictions of many different models is a very successful way to reduce test error, e.g., bagging.

- Randomly set the output of each hidden neuron to 0 with probability 0.5 (the rate is a tunable hyper-parameter). The neurons which are "dropped out" do not contribute to the forward pass or to backpropagation.

- Prevents overfitting, but roughly doubles the convergence time.
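A minimal sketch of dropout on a layer's activations. It uses the "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at test time; this variant is an assumption here, not something the slide specifies, and the function name is illustrative.

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero units with probability drop_prob, rescale the rest."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
h = np.ones((4, 1000))  # hypothetical hidden activations

h_train = dropout(h, 0.5, rng, training=True)
h_test = dropout(h, 0.5, rng, training=False)

print("fraction dropped:", np.mean(h_train == 0.0))
print("mean activation while training:", h_train.mean())
```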


Dropout

Figure 8: Dropout (Srivastava et al. 2014)


Batch Normalization

The distribution of each layer's inputs changes during training, so the layers need to continuously adapt to the new distribution. Batch Normalization (Ioffe et al. 2015) is a good way to address this.

- Normalizes each layer's inputs, for each mini-batch

- Greatly accelerates training

- Makes the network less sensitive to initialization

- Acts as a regularizer

Remember to put the BatchNorm layer immediately after fully connected layers (or convolutional layers), and before the activation.
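A sketch of the training-time forward pass and of the placement described above (linear layer, then BatchNorm, then activation). The layer sizes and data are made up. The running averages used at inference time and the backward pass are left out of this sketch.

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 5))  # hypothetical mini-batch
W = rng.normal(size=(5, 3))                       # hypothetical weights
b = np.zeros(3)
gamma, beta = np.ones(3), np.zeros(3)             # learnable scale and shift

pre_activation = x @ W + b                        # fully connected layer
normalized = batch_norm_forward(pre_activation, gamma, beta)
hidden = np.maximum(0.0, normalized)              # activation comes last

print(normalized.mean(axis=0), normalized.var(axis=0))
```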


Early stopping

Use the parameters that give the best validation error. Stop training before the model overfits.

Figure 9: (a) Training error, (b) validation error
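Early stopping is often implemented with a "patience" counter: training stops once the validation error has not improved for a fixed number of epochs, and the best epoch seen so far is kept. The patience value and the list of validation errors below are made-up assumptions.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, best_error) under patience-based early stopping."""
    best_error = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error, best_epoch, waited = error, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, best_error


# Hypothetical validation errors, one per epoch.
val_errors = [0.50, 0.40, 0.35, 0.33, 0.34, 0.36, 0.39, 0.30]
print(early_stopping_epoch(val_errors, patience=3))
```

With these numbers training stops before the final epoch is reached, so a later improvement is never seen; a larger patience trades extra training time for a lower chance of stopping too early.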


Deep Learning Frameworks and Tools


Deep Learning Frameworks and Tools

GPU is a must!

- Theano: Python, supported by the University of Montreal. Many academic researchers in the field of deep learning rely on Theano.

- TensorFlow: Python/C/C++, supported by Google. Increasingly popular; many companies have started to use it.

- Torch: Lua, supported by Facebook.

- Caffe: C/C++, a popular framework, especially in the computer vision community.

- Keras/Lasagne/Blocks: high-level wrappers built on top of Theano or TensorFlow.

