
Tricks from Deep Neural Network

Tong Wang

AI Lab, University of Massachusetts Boston
http://www.cs.umb.edu/~twang/

Advisor: Dr. Ping Chen

November 17, 2016


Outline

1 Neural Network Architectures

2 Preprocessing and Initialization

3 Activation Function

4 Objective Function and Optimization

5 Regularization

6 Deep Learning Frameworks and Tools


Neural Network Architectures


Neural Network Architectures

Different neural network architectures suit different tasks. There are too many architectures to cover them all; here are the most common and popular ones.

I Feedforward

Multilayer Perceptron (traditional classification or regression problems)
Autoencoders (unsupervised learning, data representation, dimensionality reduction)
Restricted Boltzmann Machines (unsupervised or supervised learning, probabilistic generative model)

I Convolutional

Deep Convolutional Networks (image/video classification, object detection)

I Recurrent

Recurrent Neural Network (speech recognition, language modeling)
Long Short-Term Memory/GRU (RNN with memory cells)


More Advanced Architectures

I CNN+RNN (video classification, image question answering)

I Recursive (natural language processing)

I Sequence to Sequence Model (Sutskever et al. 2014) (machine translation)

I Neural Turing Machine (Graves et al. 2014) (copy, sorting...)

I Memory Network (Weston et al. 2014) (question answering, dialog)

I Generative Adversarial Network (Goodfellow et al. 2014) (generative model)

I Residual Network (He et al. 2015) (very deep neural networks)

I And more ...


Which architecture should we use?

I It depends on your task (e.g., use a CNN for image classification, an RNN for speech recognition) and on the size of your dataset (the more data, the more layers, neurons, and kernels you can afford)

I How many hidden layers?

In theory, two fully connected hidden layers can represent any function. A few more layers can help the model learn more efficiently, but going much deeper does not help for fully connected networks. Convolutional layers, by contrast, can be stacked very deep.

I How many neurons in each hidden layer?

Too few will result in underfitting; too many will result in overfitting. The number of hidden neurons should not be much larger than the number of neurons in the input layer (e.g., input layer: 10, hidden layer: 512 is not good!). Roughly compute the total number of parameters; it should not be much larger than the size of the dataset (e.g., total parameters: 1 million, data: 1,000 examples is not good!).
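A minimal sketch of this sanity check for a fully connected network (the layer sizes below are made-up examples, not recommendations):

    # Rough parameter count for a fully connected network.
    layer_sizes = [10, 64, 64, 2]   # hypothetical input, two hidden layers, output

    total_params = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total_params += n_in * n_out + n_out   # weights + biases per layer

    print(total_params)   # compare against the number of training examples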


AlexNet

I Dataset: ImageNet Large-Scale Visual Recognition Challenge; 1.2 million training images, 50,000 validation images, and 150,000 testing images, 1000 class labels.

I Architecture: 8 layers in total, 5 convolutional and 3 fully connected layers; the output of the last fully connected layer is fed to a 1000-way softmax.

I Parameters: 60 million parameters.

Figure 1 : AlexNet (Krizhevsky et al. 2012)


Preprocessing and Initialization


Preprocess Data

I Split the data into train/val/test splits. The validation set is used for tuning hyper-parameters, which is extremely important for training neural networks.

I Normalize the features in your data to have zero mean and unit variance; this puts the features on the same scale (see the sketch after this list).

I If your data is very high-dimensional, consider using a dimensionality reduction technique such as PCA.

I If the size of your dataset is small, you can also do data augmentation (e.g., CV: horizontal flipping, random crops, and color jittering; NLP: synonym substitution).
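A minimal NumPy sketch of the normalization step and an optional PCA projection (X here is a stand-in for a real training matrix; compute the statistics on the training split only):

    import numpy as np

    X = np.random.randn(1000, 50)            # hypothetical data: 1000 examples, 50 features

    # Zero mean, unit variance per feature.
    mean, std = X.mean(axis=0), X.std(axis=0) + 1e-8
    X_norm = (X - mean) / std

    # Optional PCA: project onto the top 20 principal components.
    cov = np.dot(X_norm.T, X_norm) / X_norm.shape[0]
    U, S, _ = np.linalg.svd(cov)
    X_reduced = np.dot(X_norm, U[:, :20])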


Weight Initialization

I All-zero initialization: not good; every neuron computes the same output and receives the same gradients during back-propagation

I Zero-mean Gaussian with 0.01 stddev

I Uniform Distribution

I Glorot normal/uniform: Gaussian initialization with variance scaled by fan_in + fan_out (Glorot et al., 2010), keeping the signal in a reasonable range of values through many layers (see the sketch after this list)

I He normal/uniform: Gaussian initialization scaled by fan_in (He et al., 2015)

I Orthogonal: the eigenvalues of an orthogonal matrix all have absolute value one, so repeated multiplication neither amplifies nor shrinks the signal, which helps with vanishing and exploding gradients. Especially good for RNNs/LSTMs
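A NumPy sketch of these initializations for a single weight matrix (fan_in/fan_out are example layer sizes; the 0.01 standard deviation matches the plain Gaussian scheme above):

    import numpy as np

    fan_in, fan_out = 256, 128   # hypothetical layer sizes

    W_gaussian = 0.01 * np.random.randn(fan_in, fan_out)                               # zero-mean Gaussian, 0.01 stddev
    W_glorot   = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / (fan_in + fan_out))  # Glorot normal
    W_he       = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)              # He normal, suits ReLU layers

    # Orthogonal: Q from a QR decomposition of a random matrix has orthonormal columns.
    W_orth, _ = np.linalg.qr(np.random.randn(fan_in, fan_out))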


Fine-tune

We can use a well-trained convolutional neural network (such as AlexNet) as an initialization or as a fixed feature extractor, replacing only the later fully connected layers (transfer learning).

I Retrain the whole neural network on your dataset

I Fix the convolutional layers and only retrain the fully connected layers on your dataset

I However, if your dataset is large and very different from the original dataset, you have to build and train your own CNN from scratch


Activation Functions


Activation Function

Figure 2 : Activation function (figure from Stanford CS231n)

I Each neuron computes a dot product of the input with its weights, adds the bias, and applies the non-linearity (or activation function)


Sigmoid

Figure 3 : Sigmoid σ(x) = 1 / (1 + e^(-x))

Pros:

I The output is between 0 and 1, so it can be used as the output layer.
I The derivative is easy to compute: σ'(x) = σ(x)(1 - σ(x))

Cons:

I Saturates when receiving strong signals: the derivative is 0 at both ends, which drives the gradients in previous layers towards 0.

I Exploding gradient problems. Trick: clip the gradients (if the gradient exceeds a threshold, scale it down to that threshold).
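A minimal NumPy sketch of norm-based gradient clipping (the threshold value is only an example):

    import numpy as np

    def clip_gradient(grad, threshold=5.0):
        """Rescale the gradient if its L2 norm exceeds the threshold."""
        norm = np.linalg.norm(grad)
        if norm > threshold:
            grad = grad * (threshold / norm)
        return grad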


Tanh

Figure 4 : tanh(x) = (1 - e^(-2x)) / (1 + e^(-2x)) = 2σ(2x) - 1

Pros:

I The output is centered around 0

I The derivative is easy to compute: tanh'(x) = 1 - tanh^2(x). Converges faster than sigmoid

Cons:

I Saturating, vanishing or exploding gradient problems


Rectifier Linear Unit (ReLU)

Figure 5 : ReLU(x)=max(0,x)

Pros:

I Sparse representation
I The AlexNet paper reports ReLUs train six times faster than an equivalent network with tanh neurons
I The derivative is a constant of either 0 or 1, so no vanishing gradient problem.

Cons:

I Dead neurons. A large gradient flowing through a ReLU neuron can push its weights so that the neuron never activates again; its gradient is then 0 forever and the weights stop updating.
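For reference, a NumPy sketch of the three activations above and their derivatives:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)            # saturates: near 0 for large |x|

    def tanh_grad(x):
        return 1.0 - np.tanh(x) ** 2    # also saturates at both ends

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        return (x > 0).astype(x.dtype)  # 0 or 1; stays 0 for inputs that remain negative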


Objective Function and Optimization


Objective Function/Loss Function

You should choose the right loss function based on your problem and your data.

I Classification

Cross-entropy loss
Binary cross-entropy loss
Hinge loss (max-margin loss)

I Regression

Mean square loss
Mean absolute loss

If the loss is minimized but accuracy is low, you should check the loss function; maybe it is not appropriate for your task.
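A minimal NumPy sketch of two of these losses, cross-entropy for classification and mean square loss for regression (the array shapes are assumptions for illustration):

    import numpy as np

    def cross_entropy(probs, labels):
        """probs: (n, k) softmax outputs; labels: (n,) integer class ids."""
        n = probs.shape[0]
        return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

    def mean_square_loss(pred, target):
        return np.mean((pred - target) ** 2)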


Optimization

I Stochastic Gradient Descent

I Adagrad (Duchi 2011)

I RMSProp (Hinton)

I Adadelta (Zeiler 2012)

I Adam (Kingma et al. 2014)

Parameters: learning rate (the most important parameter), momentum (helps the model converge faster), decay, etc.
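A sketch of a single SGD-with-momentum update in NumPy (the learning rate, momentum, and decay schedule below are illustrative values only, not recommendations):

    def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
        """One parameter update; velocity accumulates an exponentially weighted history of gradients."""
        velocity = momentum * velocity - lr * grad
        return w + velocity, velocity

    # A simple step decay of the learning rate, e.g. halve it every 10 epochs:
    # lr = initial_lr * (0.5 ** (epoch // 10))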


Learning rates

Figure 6 : Learning rates (from Stanford CS231n)

Try a big learning rate first, then decrease it.


Regularization


Overfitting and Regularization

I Models with a large number of free parameters can easily fit the available training data, but they will fail to generalize to the test data.

(a) Neural network capacity (b) Regularization

Figure 7 : We should use a big neural network and use regularization to control overfitting. Figure from Stanford CS231n


L1/L2 regularization

The basic idea is to penalize large weights, which tends to improve generalization. The objective function becomes E(x) + λ·Lp(w).

I L1 regularization: Σ_{i=1}^{m} |w_i|, also known as lasso; produces sparse results.

I L2 regularization: Σ_{i=1}^{m} w_i^2, also known as ridge, or weight decay. The most widely used regularization method.

I L1+L2 regularization: λ1·L1 + λ2·L2, also known as elastic net regularization
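In practice the penalty just adds a term to each weight gradient; a minimal sketch, where data_grad is the gradient of the unregularized objective E(x):

    import numpy as np

    def l2_regularized_grad(w, data_grad, lam=1e-4):
        """Gradient of E(x) + lam * sum(w_i^2): weight decay adds 2*lam*w."""
        return data_grad + 2.0 * lam * w

    def l1_subgradient(w, data_grad, lam=1e-4):
        """Subgradient of E(x) + lam * sum(|w_i|): pushes small weights toward exactly zero."""
        return data_grad + lam * np.sign(w)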


Dropout

I Combining the predictions of many different models is a very successful way to reduce test error, e.g., bagging.

I Randomly set 50% (you can choose the rate yourself) of each layer's activations to 0. The neurons which are "dropped out" do not contribute to the forward pass or to back-propagation (a sketch follows this list).

I Prevents overfitting, but roughly doubles the time to converge.
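A NumPy sketch of the "inverted" dropout variant, which scales the kept activations at training time so that the test-time forward pass needs no change (p_keep is the keep probability):

    import numpy as np

    def dropout_forward(activations, p_keep=0.5, train=True):
        """Inverted dropout on one layer's activations."""
        if not train:
            return activations                                      # no dropout at test time
        mask = (np.random.rand(*activations.shape) < p_keep) / p_keep
        return activations * mask                                   # dropped units contribute nothing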


Dropout

Figure 8 : Dropout (Srivastava et al. 2014)


Batch Normalization

The distribution of each layer's inputs changes during training, so the layers need to continuously adapt to the new distribution. Batch Normalization (Ioffe et al. 2015) is a good way to address this.

I Normalizing each layer, for each mini-batch

I Greatly accelerate training

I Less sensitive to initialization

I Improve regularization

Remember to put the BatchNorm layer immediately after fully connected layers (or convolutional layers), and before the activation; a simplified forward pass is sketched below.
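A simplified NumPy sketch of the batch-normalization forward pass for one mini-batch (gamma and beta are the learned scale and shift; the running statistics used at test time are omitted):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        """x: (batch_size, features); normalize each feature over the mini-batch."""
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance per feature
        return gamma * x_hat + beta             # learned scale and shift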


Early stopping

Use the parameters that give the best validation error; stop the training before overfitting sets in (a sketch of the stopping rule follows the figure).

(a) Training error (b) Validation error

Figure 9 : Training error and validation error
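A sketch of early stopping with a patience counter; train_epoch and validate are caller-supplied placeholders (one epoch of training and one validation pass), so only the stopping rule itself is shown:

    def train_with_early_stopping(train_epoch, validate, max_epochs=200, patience=10):
        """Stop when the validation error has not improved for `patience` epochs."""
        best_error, best_epoch, wait = float('inf'), 0, 0
        for epoch in range(max_epochs):
            train_epoch()
            error = validate()
            if error < best_error:
                best_error, best_epoch, wait = error, epoch, 0   # remember the best checkpoint
            else:
                wait += 1
                if wait >= patience:
                    break                                        # no improvement; stop training
        return best_epoch, best_error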


Deep Learning Frameworks and Tools


Deep Learning Frameworks and Tools

GPU is a must!

I Theano: Python, supported by the University of Montreal. Many academic researchers in the field of deep learning rely on Theano.

I TensorFlow: Python/C/C++, supported by Google. Becoming more and more popular; many companies in industry have started to use it.

I Torch: Lua, supported by Facebook.

I Caffe: C/C++, a popular framework, especially in the computer vision community.

I Keras/Lasagne/Blocks: high-level wrappers built on top of Theano or TensorFlow.
