Machine Learning
Lecture 10: Recurrent Neural Networks

Nevin L. Zhang ([email protected])
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

This set of notes is based on internet resources and the references listed at the end.
Outline
1 Introduction
2 Recurrent Neural Networks
3 Long Short-Term Memory (LSTM) RNN
4 RNN Architectures
5 Attention
Introduction

So far, we have been talking about neural network models for labelled data:

    {x_i, y_i}_{i=1}^N  -->  P(y | x),

where each training example consists of one input x_i and one output y_i.

Next, we will talk about neural network models for sequential data:

    {(x_i^(1), ..., x_i^(τ_i)), (y_i^(1), ..., y_i^(τ_i))}_{i=1}^N  -->  P(y^(t) | x^(1), ..., x^(t)),

where each training example consists of a sequence of inputs (x_i^(1), ..., x_i^(τ_i)) and a sequence of outputs (y_i^(1), ..., y_i^(τ_i)), and the current output y^(t) depends not only on the current input, but also on all previous inputs.
Introduction: Language Modeling

Data: A collection of sentences.

For each sentence, create an output sequence by shifting it:

    ("what", "is", "the", "problem"),  ("is", "the", "problem", -)

From the training pairs, we can learn a neural language model:

It is used to predict the next word: P(w_k | w_1, ..., w_{k-1}).

It also defines a probability distribution over sentences:

    P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2, w_1) P(w_4 | w_3, w_2, w_1) ...
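As a concrete illustration, the shifted training pairs can be built with a few lines of Python. This is only a sketch: the toy sentences, the whitespace tokenizer and the "-" padding symbol are assumptions for the example, not part of the original notes.

    sentences = ["what is the problem", "how are you"]

    def make_pair(sentence, pad="-"):
        # Input sequence: the sentence's tokens.
        # Output sequence: the same tokens shifted left by one, padded at the end.
        tokens = sentence.split()
        return tokens, tokens[1:] + [pad]

    pairs = [make_pair(s) for s in sentences]
    print(pairs[0])
    # (['what', 'is', 'the', 'problem'], ['is', 'the', 'problem', '-'])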
Introduction: Dialogue and Machine Translation

Data: A collection of matched pairs.

    "How are you?" ; "I am fine."

We can still think of having an input and an output at each time point, except that some inputs and outputs are dummies:

    ("How", "are", "you", -, -, -),  (-, -, -, "I", "am", "fine").

From the training pairs, we can learn a neural model for dialogue or machine translation.
2 Recurrent Neural Networks
Recurrent Neural Networks

An RNN can be drawn in two ways (figure omitted):

A circuit diagram, aka recurrent graph (left), where a black square indicates time-delayed dependence, or

An unfolded computational graph, aka unrolled graph (right). The length of the unrolled graph is determined by the length of the input. In other words, the unrolled graphs for different sequences can be of different lengths.
Recurrent Neural Networks

The input tokens x^(t) are represented as embedding vectors, which are determined together with other model parameters during learning.

The hidden states h^(t) are also vectors.

The current state h^(t) depends on the current input x^(t) and the previous state h^(t-1) as follows:

    a^(t) = b + W h^(t-1) + U x^(t)
    h^(t) = tanh(a^(t))

where b, W and U are model parameters. They are independent of time t.
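The state update above translates directly into code. The following NumPy sketch is only an illustration under assumed sizes (hidden dimension 4, embedding dimension 3); the parameter names mirror the slide, while the random inputs and initialization are made up for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, emb = 4, 3
    W = rng.normal(size=(hidden, hidden))   # recurrent weights
    U = rng.normal(size=(hidden, emb))      # input weights
    b = np.zeros(hidden)                    # bias

    def rnn_step(h_prev, x):
        # a^(t) = b + W h^(t-1) + U x^(t);  h^(t) = tanh(a^(t))
        return np.tanh(b + W @ h_prev + U @ x)

    h = np.zeros(hidden)
    for x in rng.normal(size=(5, emb)):     # a sequence of 5 embedding vectors
        h = rnn_step(h, x)
    print(h)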
Recurrent Neural Networks

The output sequence is produced as follows:

    o^(t) = c + V h^(t)
    ŷ^(t) = softmax(o^(t))

where c and V are model parameters. They are independent of time t.
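The output computation can be sketched the same way. A small self-contained NumPy illustration with a hypothetical vocabulary of 10 tokens (all sizes and the placeholder hidden state are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, vocab = 4, 10
    V = rng.normal(size=(vocab, hidden))    # output weights
    c = np.zeros(vocab)                     # output bias
    h = rng.normal(size=hidden)             # stands in for a hidden state h^(t)

    o = c + V @ h                           # o^(t) = c + V h^(t)
    y_hat = np.exp(o - o.max())             # numerically stable softmax
    y_hat /= y_hat.sum()                    # ŷ^(t): a distribution over the vocabulary
    print(y_hat, y_hat.sum())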
Recurrent Neural Networks

This is the loss for one training pair:

    L({x^(1), ..., x^(τ)}, {y^(1), ..., y^(τ)}) = - Σ_{t=1}^{τ} log P_model(y^(t) | x^(1), ..., x^(t)),

where log P_model(y^(t) | x^(1), ..., x^(t)) is obtained by reading the entry for y^(t) from the model's output vector ŷ^(t).

When there are multiple input-target sequence pairs, the losses are added up.

Training objective: Minimize the total loss of all training pairs w.r.t. the model parameters and embedding vectors:

    W, U, V, b, c, θ_em,

where θ_em are the embedding vectors.
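To make the loss concrete, here is a small self-contained NumPy sketch of the negative log-likelihood of one (input, target) sequence pair. The per-step distributions ŷ^(t) are random placeholders and the target indices are made up; only the final two lines mirror the formula above.

    import numpy as np

    rng = np.random.default_rng(0)
    tau, vocab = 4, 10
    logits = rng.normal(size=(tau, vocab))
    y_hat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # ŷ^(1..τ)
    targets = np.array([2, 7, 0, 5])        # vocabulary indices of y^(1..τ)

    # L = - Σ_t log P_model(y^(t) | x^(1..t)) = - Σ_t log ŷ^(t)[y^(t)]
    loss = -np.log(y_hat[np.arange(tau), targets]).sum()
    print(loss)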
Training RNNs

RNNs are trained using stochastic gradient descent.

We need the gradients:

    ∇_W L, ∇_U L, ∇_V L, ∇_b L, ∇_c L, ∇_{θ_em} L

They are computed using Backpropagation Through Time (BPTT), which is an adaptation of backpropagation to the unrolled computational graph.

BPTT is implemented in deep learning packages such as TensorFlow.
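In practice one rarely codes BPTT by hand: an automatic-differentiation package unrolls the recurrence and backpropagates through it. The notes mention TensorFlow; the sketch below uses PyTorch purely as an illustration (any autodiff framework works the same way), with made-up sizes and a placeholder loss.

    import torch

    torch.manual_seed(0)
    hidden, emb, steps = 4, 3, 5
    W = torch.randn(hidden, hidden, requires_grad=True)
    U = torch.randn(hidden, emb, requires_grad=True)
    b = torch.zeros(hidden, requires_grad=True)

    h = torch.zeros(hidden)
    loss = torch.zeros(())
    for x in torch.randn(steps, emb):       # unroll the recurrence over 5 steps
        h = torch.tanh(b + W @ h + U @ x)
        loss = loss + (h ** 2).sum()        # placeholder loss; a real model uses the NLL

    loss.backward()                         # BPTT: gradients flow back through all steps
    print(W.grad.shape, U.grad.shape, b.grad.shape)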
RNN and Self-Supervised Learning

Self-supervised learning is a learning technique where the training data is automatically labelled.

It is still supervised learning, but the datasets do not need to be manually labelled by a human; instead, labels can be obtained, for example, by finding and exploiting relations between different input signals.

RNN training is self-supervised learning: as in language modeling, the targets are simply the shifted inputs, so no manual labels are needed.
3 Long Short-Term Memory (LSTM) RNN
Basic Idea

The Long Short-Term Memory (LSTM) unit is a widely used technique to address long-term dependencies.

The key idea is to use memory cells and gates:

    c^(t) = f_t c^(t-1) + i_t a^(t)

c^(t): memory state at t; a^(t): new input at t.

If the forget gate f_t is open (i.e., 1) and the input gate i_t is closed (i.e., 0), the current memory is kept.

If the forget gate f_t is closed (i.e., 0) and the input gate i_t is open (i.e., 1), the current memory is erased and replaced by the new input.

If we can learn f_t and i_t from data, then we can automatically determine how much history to remember/forget.
Basic Idea

In the case of vectors, c^(t) = f_t ⊗ c^(t-1) + i_t ⊗ a^(t), where ⊗ denotes the pointwise product.

f_t is called the forget gate vector because it determines which components of the previous state to remember/forget, and how much of them.

i_t is called the input gate vector because it determines which components of the input computed from h^(t-1) and x^(t) should go into the current state, and how much of them.

If we can learn f_t and i_t from data, then we can automatically determine which components to remember/forget and how much of them to remember/forget.
LSTM Cell

In a standard RNN: a^(t) = b + W h^(t-1) + U x^(t), h^(t) = tanh(a^(t)).

In an LSTM, we introduce a cell state vector c_t and set

    c_t = f_t ⊗ c_{t-1} + i_t ⊗ a^(t)
    h^(t) = tanh(c_t)

where f_t and i_t are gate vectors.
LSTM Cell: Learning the Gates

f_t is determined from the current input x^(t) and the previous hidden unit h^(t-1):

    f_t = σ(W_f x^(t) + U_f h^(t-1) + b_f),

where W_f, U_f, b_f are parameters to be learned from data.

i_t is also determined from the current input x^(t) and the previous hidden unit h^(t-1):

    i_t = σ(W_i x^(t) + U_i h^(t-1) + b_i),

where W_i, U_i, b_i are parameters to be learned from data.

Note that the sigmoid activation function is used for the gates so that their values are often close to 0 or 1. In contrast, tanh is used for the output h^(t) so as to have a strong gradient signal during backprop.
LSTM Cell: Output Gate

We can also have an output gate to control which components of the state vector c_t should be output, and how much of them:

    o_t = σ(W_q x^(t) + U_q h^(t-1) + b_q),

where W_q, U_q, b_q are learnable parameters, and set

    h^(t) = o_t ⊗ tanh(c_t)
LSTM Cell: Summary

A standard RNN cell: a^(t) = b + W h^(t-1) + U x^(t), h^(t) = tanh(a^(t))

An LSTM cell:

    f_t = σ(W_f x^(t) + U_f h^(t-1) + b_f)                     (forget gate)
    i_t = σ(W_i x^(t) + U_i h^(t-1) + b_i)                     (input gate)
    o_t = σ(W_q x^(t) + U_q h^(t-1) + b_q)                     (output gate)
    c_t = f_t ⊗ c_{t-1} + i_t ⊗ tanh(U x^(t) + W h^(t-1) + b)  (update memory)
    h^(t) = o_t ⊗ tanh(c_t)                                    (next hidden unit)
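The summary translates almost line for line into code. Below is a minimal, self-contained NumPy sketch of one LSTM step under assumed sizes (hidden dimension 4, embedding dimension 3); the parameter names follow the slide, and the random initialization and inputs are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, emb = 4, 3

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def make_params():
        # Returns (matrix applied to x^(t), matrix applied to h^(t-1), bias).
        return rng.normal(size=(hidden, emb)), rng.normal(size=(hidden, hidden)), np.zeros(hidden)

    Wf, Uf, bf = make_params()   # forget gate parameters
    Wi, Ui, bi = make_params()   # input gate parameters
    Wq, Uq, bq = make_params()   # output gate parameters
    U,  W,  b  = make_params()   # candidate input parameters (U acts on x, W on h)

    def lstm_step(x, h_prev, c_prev):
        f = sigmoid(Wf @ x + Uf @ h_prev + bf)                  # forget gate
        i = sigmoid(Wi @ x + Ui @ h_prev + bi)                  # input gate
        o = sigmoid(Wq @ x + Uq @ h_prev + bq)                  # output gate
        c = f * c_prev + i * np.tanh(U @ x + W @ h_prev + b)    # update memory
        h = o * np.tanh(c)                                      # next hidden unit
        return h, c

    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in rng.normal(size=(5, emb)):
        h, c = lstm_step(x, h, c)
    print(h)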
Gated Recurrent Unit

The Gated Recurrent Unit (GRU) is another gating mechanism that allows RNNs to efficiently learn long-range dependencies.

It is similar to an LSTM unit but has no separate memory cell and no output gate, and hence fewer parameters; it can be viewed as a simplified version of an LSTM unit.

Its performance is also similar to that of LSTM, except that it tends to do better on small datasets.
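For comparison, here is a minimal NumPy sketch of one GRU step. The notes do not spell out the GRU equations, so the formulation below (update gate z_t, reset gate r_t, candidate state h_tilde) is an assumption based on the standard GRU, with made-up sizes and names.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, emb = 4, 3

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def make_params():
        return rng.normal(size=(hidden, emb)), rng.normal(size=(hidden, hidden)), np.zeros(hidden)

    Wz, Uz, bz = make_params()   # update gate
    Wr, Ur, br = make_params()   # reset gate
    Wh, Uh, bh = make_params()   # candidate state

    def gru_step(x, h_prev):
        z = sigmoid(Wz @ x + Uz @ h_prev + bz)                # how much to update
        r = sigmoid(Wr @ x + Ur @ h_prev + br)                # how much of the past to expose
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)    # candidate new state
        return (1 - z) * h_prev + z * h_tilde                 # no separate cell state

    h = np.zeros(hidden)
    for x in rng.normal(size=(5, emb)):
        h = gru_step(x, h)
    print(h)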
4 RNN Architectures
So Far ...

So far, we have concentrated on the following architecture, which can be used to model the relationship between pairs of sequences (x^(1), ..., x^(τ)) and (y^(1), ..., y^(τ)) of the same length.

It is used for language modeling.
Deep RNNs

Sometimes, we might want to use multiple layers of hidden units. This leads to deep recurrent neural networks.

More layers usually imply better performance.
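A deep RNN simply feeds the hidden states of one recurrent layer to the next layer as its input sequence. A minimal NumPy sketch of a two-layer stack (sizes and names are illustrative assumptions, and the cell is the plain RNN cell from earlier):

    import numpy as np

    rng = np.random.default_rng(0)
    emb, h1_dim, h2_dim = 3, 4, 4

    def make_layer(in_dim, out_dim):
        W = rng.normal(size=(out_dim, out_dim))   # recurrent weights
        U = rng.normal(size=(out_dim, in_dim))    # input weights
        b = np.zeros(out_dim)
        return lambda h, x: np.tanh(b + W @ h + U @ x)

    layer1 = make_layer(emb, h1_dim)
    layer2 = make_layer(h1_dim, h2_dim)   # layer 2 takes layer 1's states as inputs

    s1, s2 = np.zeros(h1_dim), np.zeros(h2_dim)
    for x in rng.normal(size=(5, emb)):
        s1 = layer1(s1, x)
        s2 = layer2(s2, s1)
    print(s2)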
Bidirectional RNNs

In a one-directional RNN, h^(t) only captures information from the past (x^(1), ..., x^(t)), and we use h^(t) to predict y^(t).

Sometimes, we might need information from the future to predict y^(t).

In such cases, we can use bidirectional RNNs.

They have been found useful in handwriting recognition, speech recognition, and bioinformatics.
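A bidirectional RNN runs one recurrence left-to-right and another right-to-left over the same sequence, and concatenates the two states at each position. A minimal NumPy sketch (sizes, names and the plain RNN cell are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, emb, steps = 4, 3, 5

    def make_cell():
        W = rng.normal(size=(hidden, hidden))
        U = rng.normal(size=(hidden, emb))
        b = np.zeros(hidden)
        return lambda h, x: np.tanh(b + W @ h + U @ x)

    fwd, bwd = make_cell(), make_cell()
    xs = rng.normal(size=(steps, emb))

    h_f, h_b = np.zeros(hidden), np.zeros(hidden)
    states_f, states_b = [], []
    for x in xs:                     # forward pass over x^(1..τ)
        h_f = fwd(h_f, x)
        states_f.append(h_f)
    for x in xs[::-1]:               # backward pass over x^(τ..1)
        h_b = bwd(h_b, x)
        states_b.append(h_b)
    states_b = states_b[::-1]

    # The state used to predict y^(t) sees both past and future context.
    h_bi = [np.concatenate([f, bb]) for f, bb in zip(states_f, states_b)]
    print(h_bi[0].shape)             # (2 * hidden,)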
The Encoder-Decoder Architecture

The encoder-decoder or sequence-to-sequence architecture is used for learning to generate an output sequence (y^(1), ..., y^(n_y)) in response to an input sequence (x^(1), ..., x^(n_x)).

The context variable C represents a semantic summary of the input sequence.

The decoder defines a conditional distribution P(y^(1), ..., y^(n_y) | C) over sequences, from which output sequences are generated.

The architecture is used in dialogue systems and machine translation.
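A bare-bones sketch of the idea in NumPy: the encoder's final state plays the role of the context C, which initializes the decoder, and the decoder then emits one token per step. Everything here (sizes, the start symbol, greedy decoding, the fixed output length) is an illustrative assumption rather than the exact model in the notes.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, emb, vocab = 4, 3, 10

    def make_cell(in_dim):
        W = rng.normal(size=(hidden, hidden))
        U = rng.normal(size=(hidden, in_dim))
        b = np.zeros(hidden)
        return lambda h, x: np.tanh(b + W @ h + U @ x)

    encode_step, decode_step = make_cell(emb), make_cell(emb)
    V, c = rng.normal(size=(vocab, hidden)), np.zeros(vocab)
    embeddings = rng.normal(size=(vocab, emb))

    # Encoder: compress the input sequence into the context C.
    C = np.zeros(hidden)
    for x in rng.normal(size=(6, emb)):      # a length-6 input sequence
        C = encode_step(C, x)

    # Decoder: generate greedily, feeding each predicted token back in.
    s, token, outputs = C, 0, []             # token 0 plays the role of a start symbol
    for _ in range(5):
        s = decode_step(s, embeddings[token])
        token = int(np.argmax(c + V @ s))    # most probable next token
        outputs.append(token)
    print(outputs)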
Seq2seq for machine translation

The generated sequence can be a translation of the input sequence, as in machine translation.

Note that this model assumes: P(s_2 | C, s_1, y^(1)) = P(s_2 | s_1, y^(1)), ...
Seq2seq for ChatBot

The generated sequence can also be a reply to the input sequence, as in dialogue systems.
Seq2seq in Action

http://jalammar.github.io/images/seq2seq_6.mp4
Map a Vector to a Sequence

This architecture maps a vector x to a sequence (y^(1), ..., y^(τ)).

It has been found useful in image captioning.
5 Attention
Motivation

In the seq2seq model, the context variable C is shared at all time steps of the decoder.

So, we are looking at the same summary of the input sequence when generating the output at every time step.

This is not desirable.
Motivation

In fact, we might want to focus on different parts of the input at different times:

    Focus on h_1 at time step 1,
    Focus on h_4 at time step 2,
    Focus on h_6 at time step 4,
    ...
Attention

So, we introduce a context variable C_t for each time step of the decoder:

    C_t = Σ_j α_{t,j} h_j

It is a linear combination of the hidden states from the encoder.

For different sequences, we use different values for the weights α_{t,j}, and consequently focus on different parts of the input.
Attention

For a particular sequence, suppose we have finished step t-1 of decoding. What information do we use to determine the attention weights α_{t,j} for step t?

c_{t-1}: the context vector for step t-1. It tells us what to look for next.

    If "held" was attended to at step t-1, we are likely to need "talk", "meeting", etc. at step t.
    If "drink" was attended to at step t-1, we are likely to need "water", "juice", etc. at step t.

In general, c_{t-1} is known as the query.

h_j: the latent state of step j of encoding. It tells us whether h_j is what we need next. In general, it is known as the key.

The attention weight is determined by assessing how compatible the query and the key are.
Compatibility Function

There are several possible ways to compute the compatibility between the query and the key:

    α_{t,j} = h_j^T c_{t-1}, assuming h_j and c_{t-1} have equal dimensionality. This is called dot-product attention.

    α_{t,j} = h_j^T W c_{t-1} for some matrix W that can be learned jointly with the other parameters.

    α_{t,j} = f(c_{t-1}, h_j), where f is a function represented by a simple neural network that can be learned jointly with the other parameters. This is called additive attention.
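A minimal NumPy sketch of dot-product attention for one decoder step. The slides write both the compatibility value and the weight as α_{t,j}; below, the raw scores are normalized with a softmax so the weights sum to 1, which is the usual convention. All sizes and names are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, src_len = 4, 6
    H = rng.normal(size=(src_len, hidden))   # encoder states h_1, ..., h_6 (keys and values)
    c_prev = rng.normal(size=hidden)         # previous decoder context c_{t-1} (the query)

    scores = H @ c_prev                      # dot-product compatibility h_j^T c_{t-1}
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                     # attention weights α_{t,j}, summing to 1

    C_t = alpha @ H                          # context C_t = Σ_j α_{t,j} h_j
    print(alpha.round(2), C_t.shape)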
Attention

The weight α_{t,j} is obtained using the query c_{t-1} and the key h_j.

It tells us how relevant the information at step j is.

Then we retrieve the value h_j at each step j and compute the weighted sum:

    C_t = Σ_j α_{t,j} h_j

Note that, for different input sequences, the weights α_{t,j} are different. So, attention means dynamic weights. (In a standard NN, weights are static and do not depend on the input.)
Alignments between source and target sentences

(Figure omitted: the attention weights visualized as alignments between source-sentence and target-sentence words.)
Attention in Action

http://jalammar.github.io/images/attention_tensor_dance.mp4
References

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Devlin, J., et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. www.deeplearningbook.org

Jay Alammar: http://jalammar.github.io/

Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems.