Machine Learning for Language Modelling

Marek Rei

Part 4: Neural network LM optimisation



Recap

Neural network training

How do we find values for the weight matrices?

Gradient descent training:

Start with random weight values.

For each training example:

1. Calculate the network prediction
2. Measure the error between the prediction and the correct label
3. Update all the parameters, so that they would make a smaller error on this training example
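A minimal sketch of this loop, training a single linear neuron with squared error on toy data (all names and values here are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # toy training inputs
y = X @ np.array([1.0, -2.0, 0.5])       # toy correct labels

w = rng.normal(scale=0.1, size=3)        # start with random weight values
eta = 0.1                                # learning rate

for x_k, y_k in zip(X, y):               # for each training example:
    pred = x_k @ w                       # 1. calculate the network prediction
    error = pred - y_k                   # 2. measure the error against the correct label
    w -= eta * error * x_k               # 3. update the parameters towards a smaller error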


Loss function

The loss function is used to calculate the error between the predicted value and the correct value.

For example, mean squared error:

L = (1/N) · Σ_k (ŷ_k - y_k)²

ŷ_k - predicted value
y_k - correct value (gold standard)
N - number of training examples, indexed by k
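A minimal sketch of this computation (the function name is illustrative):

import numpy as np

def mean_squared_error(y_pred, y_true):
    # average squared difference over the N training examples
    return np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

print(mean_squared_error([0.9, 0.2], [1.0, 0.0]))   # 0.025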


Loss function

We want to update our model parameters so that the model has a smaller loss L.

For each parameter w, we can calculate the derivative of L with respect to w: ∂L/∂w


Derivatives

Partial derivative - the derivative with respect to one parameter, while keeping the others constant.

∂L/∂w - how much L changes when we slightly increase w

∂L/∂w > 0 - if we increase w, L will also increase

∂L/∂w < 0 - if we increase w, L will decrease


Gradient descent

We always want to move any parameter w towards a smaller loss L. The parameters are updated following the rule:

w(t+1) = w(t) - η · ∂L/∂w

η - the learning rate; it controls how big an update step we take
w(t) - parameter w at time step t
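One update step under this rule, with made-up numbers:

eta = 0.01             # learning rate
w = 0.5                # parameter w at time step t
dL_dw = 2.0            # partial derivative of L with respect to w
w = w - eta * dL_dw    # parameter w at time step t+1, now 0.48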


Gradient descent

Let's calculate the derivative of L with respect to w0.

We can use the chain rule:


Gradient descent

We have calculated the partial derivative

Now we can update the weight parameter

We do this for every parameter in the neural net, although this process can be simplified


Gradient

Gradient - a vector containing partial derivatives for all the model parameters. We can also call it the "error" vector.

In a high-dimensional parameter space, it shows us the direction in which we want to move.


Gradient descent

Move in the direction of the negative gradient


Language model loss function

L_k = - Σ_j t_j · log(o_j)

L_k - loss for training example k
j - possible words in the output layer
t_j - binary variable for the correct answer: 1 if j is the correct word, 0 otherwise
o_j - our predicted probability for word j
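A sketch of this loss; with one-hot targets t, the sum reduces to the log-probability of the single correct word (the function name is illustrative):

import numpy as np

def lm_loss(o, correct_j):
    # o: predicted probability distribution over the output words
    # correct_j: index of the correct word, so t_j = 1 only there
    return -np.log(o[correct_j])

o = np.array([0.1, 0.7, 0.2])     # toy distribution over a 3-word vocabulary
print(lm_loss(o, 1))              # ≈ 0.357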


Backpropagation

Activation: we start with the input and move through the network to make a prediction

Backpropagation: we start from the output layer and move towards the input layer, calculating partial derivatives and weight updates


Backpropagation

We’ll go through backpropagation for a neural language model

The derivatives can be found the same way as we did for a simple neuron

The graph explicitly shows stages z and s for clarity


Backpropagation

1. Take a training example and perform a feedforward pass through the network.

Values in output vector o will be word probabilities.


Backpropagation

2. Calculate the error in the output layer

- Vector of correct answers: position j will be 1 if word j is the correct word, 0 otherwise.


Backpropagation

3. Calculate the error for the output weights


Backpropagation

4. Calculate the error at hidden layer z

⊙ - element-wise multiplication
σ'(z) - derivative of the sigmoid; this is the partial derivative of h with respect to z


Backpropagation

5. Calculate error for hidden weights


Backpropagation

6. Calculate error for input vectors


Backpropagation

7. Update weight matrices


Backpropagation

8. Update the input vectors
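The slides show these steps on a specific graph whose matrices are not reproduced here. As a sketch under assumed shapes, here are all eight steps in numpy for one training example, with one sigmoid hidden layer, a softmax output, and illustrative names (E for the input vectors, W_h and W_out for the weights):

import numpy as np

rng = np.random.default_rng(0)
V, D, H = 10, 4, 5                        # vocabulary, embedding and hidden sizes
n = 2                                     # number of context words
eta = 0.1                                 # learning rate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

E = rng.normal(scale=0.1, size=(V, D))    # word representations (input vectors)
W_h = rng.normal(scale=0.1, size=(H, n * D))
W_out = rng.normal(scale=0.1, size=(V, H))

context = [3, 7]                          # indices of the previous words
target = 5                                # index of the correct next word

# 1. Feedforward pass; o will contain word probabilities
x = E[context].reshape(-1)                # concatenated input vectors
z = W_h @ x                               # hidden pre-activation
h = sigmoid(z)                            # hidden layer
a = W_out @ h
o = np.exp(a - a.max()); o /= o.sum()     # softmax

# 2. Error in the output layer (softmax with cross-entropy loss)
t = np.zeros(V); t[target] = 1.0          # vector of correct answers
d_o = o - t

# 3. Error for the output weights
dW_out = np.outer(d_o, h)

# 4. Error at hidden layer z: output error through the weights,
#    element-wise multiplied with the sigmoid derivative h * (1 - h)
d_z = (W_out.T @ d_o) * h * (1.0 - h)

# 5. Error for the hidden weights
dW_h = np.outer(d_z, x)

# 6. Error for the input vectors
d_x = W_h.T @ d_z

# 7. Update the weight matrices (errors were all computed first)
W_out -= eta * dW_out
W_h -= eta * dW_h

# 8. Update the input vectors: only the rows of E used in this example
for i, w_idx in enumerate(context):
    E[w_idx] -= eta * d_x[i * D:(i + 1) * D]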


The general pattern

⚬ Error at a hidden layer = error at the next layer, multiplied by the weights between the layers, and element-wise multiplied with the derivative of the activation function

⚬ Error on the weights = error at the next layer, multiplied by the values from the previous layer


Representation learning

Can think of word representations as just another weight matrix

We update E using the same logic as Wout, W0, W1, and W2


Backpropagation tips

⚬ The error matrices have the same dimensionality as the parameter matrices! This can help when debugging.

If W is a 10x20 matrix, then its error matrix ∂L/∂W is also a 10x20 matrix.

⚬ Calculate the errors before updating the weights. Otherwise you'll backpropagate with wrong (updated) weights.
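A cheap sanity check while debugging, using illustrative shapes:

import numpy as np

W = np.zeros((10, 20))          # a 10x20 parameter matrix
dW = np.zeros_like(W)           # its error matrix, also 10x20
assert dW.shape == W.shape      # catches transposed or mis-sized gradients early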


Activation functions

There are different possible activation functions:
⚬ It should be differentiable
⚬ We can choose the best one using a development set

Linear: f(x) = x


Activation functions

Logistic function (sigmoid): σ(x) = 1 / (1 + e^(-x))

Tanh: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))


Activation functions

Rectifier (ReLU): f(x) = max(0, x)

Softplus: f(x) = log(1 + e^x)
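The activation functions from these slides, sketched in numpy:

import numpy as np

def linear(x):   return x                          # derivative: 1
def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))   # derivative: sigmoid(x) * (1 - sigmoid(x))
def tanh(x):     return np.tanh(x)                 # derivative: 1 - tanh(x)**2
def relu(x):     return np.maximum(0.0, x)         # derivative: 1 if x > 0, else 0
def softplus(x): return np.log1p(np.exp(x))        # derivative: sigmoid(x)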


Overfitting


Regularisation

One way to prevent overfitting is to encourage the model to keep all weight values small.

We can add this into our loss function, e.g. as an L2 penalty: L' = L + λ · Σ w²

The higher the weight values, the larger the loss that we want to minimise.
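A sketch of this penalty, assuming the common L2 form shown above (λ and the shapes are illustrative):

import numpy as np

lam = 0.01                                   # regularisation strength
W = np.random.default_rng(0).normal(size=(5, 5))

penalty = lam * np.sum(W ** 2)               # added to the loss L
dW_reg = 2.0 * lam * W                       # its contribution to the gradient of W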


Hyperparameters

Initial weights
⚬ Set randomly to some small values around 0
⚬ Strategy 1:
⚬ Strategy 2:

Learning rate
⚬ Choose an initial value, often 0.1 or 0.01
⚬ Decrease the learning rate during training
⚬ Stop training when performance on the development data has not improved
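A sketch of such a schedule with early stopping; the development loss below is a stand-in formula, not a real model:

eta = 0.1                                # initial learning rate
best, bad_epochs, patience = float("inf"), 0, 3
for epoch in range(100):
    # ... train one epoch with learning rate eta here ...
    dev_loss = 1.0 / (epoch + 1) + (0.01 if epoch > 20 else 0.0)  # stand-in number
    if dev_loss < best:
        best, bad_epochs = dev_loss, 0   # development performance improved
    else:
        bad_epochs += 1
        eta *= 0.5                       # decrease the learning rate during training
        if bad_epochs >= patience:
            break                        # stop: no improvement on development data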


Deep learning

Deep learning - learning with many non-linear layers.

Each consecutive layer learns more complicated patterns from the output of the previous layer.

We could add an extra hidden layer into our language model, and it would be a deep network.
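A forward-pass sketch of adding that extra layer (shapes and names are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=8)                     # input, e.g. concatenated word vectors
W1 = rng.normal(scale=0.1, size=(5, 8))
W2 = rng.normal(scale=0.1, size=(5, 5))    # the extra hidden layer's weights

h1 = sigmoid(W1 @ x)                       # first non-linear layer
h2 = sigmoid(W2 @ h1)                      # second non-linear layer: a deep network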


Neural nets vs traditional models

Negative:
⚬ More unstable; settings need to be just right
⚬ Less explainable (black box)
⚬ Computationally demanding

Positive:
⚬ Less feature engineering (representation learning)
⚬ More expressive and more flexible
⚬ Have been shown to outperform traditional models on many tasks


Gradient checking

A way to check that your gradient descent is implemented correctly. There are two ways to calculate the gradient:

Method 1: from your weight updates

Method 2: actually changing the weight a bit and measuring the change in loss, e.g. (L(w + ε) - L(w - ε)) / (2ε)
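A sketch of comparing the two methods on a toy loss, using the central difference above:

import numpy as np

def loss(w):
    return np.sum((w - 1.0) ** 2)          # any differentiable toy loss

w = np.array([0.3, -0.7])
analytic = 2.0 * (w - 1.0)                 # method 1: from your weight updates

eps = 1e-5                                 # method 2: nudge each weight slightly
numeric = np.zeros_like(w)
for i in range(w.size):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)   # the two should agree closely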


References

Pattern Recognition and Machine Learning
Christopher Bishop (2007)

Machine Learning: A Probabilistic Perspective
Kevin Murphy (2012)

Machine Learning
Andrew Ng (2012)
https://www.coursera.org/course/ml

Using Neural Networks for Modelling and Representing Natural Languages
Tomas Mikolov (2014)
http://www.coling-2014.org/COLING%202014%20Tutorial-fix%20-%20Tomas%20Mikolov.pdf

Deep Learning for Natural Language Processing (without Magic)
Richard Socher, Christopher Manning (2013)
http://nlp.stanford.edu/courses/NAACL2013/


Extra materials


One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Philipp Koehn, Tony Robinson

Google, University of Edinburgh, Cantab Research Ltd (2014)


Dataset

⚬ A new benchmark dataset for LM evaluation
⚬ Cleaned and tokenised
⚬ 0.8 billion words for training
⚬ 160K words used for testing
⚬ Discarding words with count < 3
⚬ Vocabulary V = 793471
⚬ Using <S>, </S> and <UNK> tokens


Results