
Machine Learning
Lecture 10: Recurrent Neural Networks

Nevin L. Zhang
[email protected]

Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

This set of notes is based on internet resources and references listed at the end.


Introduction

Outline

1 Introduction

2 Recurrent Neural Networks

3 Long Short-Term Memory (LSTM) RNN

4 RNN Architectures

5 Attention


Introduction

Introduction

So far, we have been talking about neural network models for labelled data:

$\{x_i, y_i\}_{i=1}^{N} \longrightarrow P(y \mid x),$

where each training example consists of one input $x_i$ and one output $y_i$.

Next, we will talk about neural network models for sequential data:

$\{(x_i^{(1)}, \ldots, x_i^{(\tau_i)}),\; (y_i^{(1)}, \ldots, y_i^{(\tau_i)})\}_{i=1}^{N} \longrightarrow P(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}),$

where each training example consists of a sequence of inputs $(x_i^{(1)}, \ldots, x_i^{(\tau_i)})$ and a sequence of outputs $(y_i^{(1)}, \ldots, y_i^{(\tau_i)})$,

and the current output $y^{(t)}$ depends not only on the current input, but also on all previous inputs.


Introduction

Introduction: Language Modeling

Data: A collection of sentences.

For each sentence, create an output sequence by shifting it:

(”what”, ”is”, ”the”, ”problem”), (”is”, ”the”, ”problem”,−)

From the training pairs, we can learn a neural language model:

It is used to predict the next word: $P(w_k \mid w_1, \ldots, w_{k-1})$.

It also defines a probability distribution over sentences:

$P(w_1, w_2, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2, w_1)\,P(w_4 \mid w_3, w_2, w_1)\cdots$
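
To make the data preparation concrete, here is a minimal Python sketch (not from the original slides) of turning a tokenized sentence into a training pair by shifting; the dummy symbol passed as `pad` mirrors the "−" placeholder above.

```python
def make_lm_pair(tokens, pad="-"):
    """Build an (input sequence, output sequence) pair by shifting the sentence.

    The final position has no real next word, so a dummy symbol is used,
    mirroring the "-" placeholder on the slide.
    """
    inputs = list(tokens)                 # ("what", "is", "the", "problem")
    targets = list(tokens[1:]) + [pad]    # ("is", "the", "problem", "-")
    return inputs, targets

print(make_lm_pair(["what", "is", "the", "problem"]))
```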


Introduction

Introduction: Dialogue and Machine Translation

Data: A collection of matched pairs.

”How are you?” ; ”I am fine.”

We can still think of having an input and an output at each time point, except that some inputs and outputs are dummies.

("How", "are", "you", −, −, −), (−, −, −, "I", "am", "fine").

From the training pairs, we can learn a neural model for dialogue or machine translation.


Recurrent Neural Networks

Outline

1 Introduction

2 Recurrent Neural Networks

3 Long Short-Term Memory (LSTM) RNN

4 RNN Architectures

5 Attention


Recurrent Neural Networks

Recurrent Neural Networks

A circuit diagram, aka recurrent graph (left), where a black square indicates time-delayed dependence, or

An unfolded computational graph, aka unrolled graph (right). The length of the unrolled graph is determined by the length of the input. In other words, the unrolled graphs for different sequences can be of different lengths.


Recurrent Neural Networks

Recurrent Neural Networks

The input tokens x(t) are represented as embedding vectors, which are determined together with the other model parameters during learning.

The hidden states h(t) are also vectors.

The current state h(t) depends on the current input x(t) and the previous state h(t−1) as follows:

a(t) = b + Wh(t−1) + Ux(t)

h(t) = tanh(a(t))

where b, W and U are model parameters. They are independent of time t.
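
A minimal NumPy sketch of this state update (the dimensions, random initialization, and function name are illustrative assumptions, not part of the slides):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, b):
    """One recurrence step: a(t) = b + W h(t-1) + U x(t); h(t) = tanh(a(t))."""
    return np.tanh(b + W @ h_prev + U @ x_t)

# Illustrative sizes: 4-dimensional embeddings, 3-dimensional hidden state.
rng = np.random.default_rng(0)
U, W, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)                          # h(0)
for x in rng.normal(size=(5, 4)):        # a sequence of 5 embedded tokens
    h = rnn_step(x, h, U, W, b)
```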


Recurrent Neural Networks

Recurrent Neural Networks

The output sequence is produced as follows:

o(t) = c + Vh(t)

y(t) = softmax(o(t))

where c and V are model parameters.

They are independent of time t.
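
Continuing the sketch above, the output step could look like this (the vocabulary size and shapes are again assumptions):

```python
import numpy as np

def rnn_output(h_t, V, c):
    """o(t) = c + V h(t); y(t) = softmax(o(t)), a distribution over the vocabulary."""
    o_t = c + V @ h_t
    e = np.exp(o_t - o_t.max())           # numerically stabilized softmax
    return e / e.sum()

rng = np.random.default_rng(1)
V, c = rng.normal(size=(6, 3)), np.zeros(6)    # hidden size 3, vocabulary size 6
y_t = rnn_output(np.tanh(rng.normal(size=3)), V, c)
assert abs(y_t.sum() - 1.0) < 1e-9             # a valid probability vector
```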


Recurrent Neural Networks

Recurrent Neural Networks

This is the loss for one training pair:

$L(\{x^{(1)}, \ldots, x^{(\tau)}\}, \{y^{(1)}, \ldots, y^{(\tau)}\}) = -\sum_{t=1}^{\tau} \log P_{\text{model}}(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}),$

where $\log P_{\text{model}}(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)})$ is obtained by reading the entry for $y^{(t)}$ from the model's output vector at step $t$.

When there are multiple input-target sequence pairs, the losses are added up.

Training objective: Minimize the total loss over all training pairs w.r.t. the model parameters and embedding vectors W, U, V, b, c, θ_em, where θ_em denotes the embedding vectors.
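
As a sketch, the loss for one training pair could be computed as follows, assuming the per-step output vectors have been computed as above and each target is given as a vocabulary index:

```python
import numpy as np

def sequence_nll(output_vectors, target_indices):
    """-sum_t log P_model(y(t) | x(1), ..., x(t)): read the probability of the
    observed target from each output vector and add up the negative logs."""
    return -sum(np.log(y_t[target])
                for y_t, target in zip(output_vectors, target_indices))
```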


Recurrent Neural Networks

Training RNNs

RNNs are trained using stochastic gradient descent.

We need gradients:

$\nabla_{W}L,\; \nabla_{U}L,\; \nabla_{V}L,\; \nabla_{b}L,\; \nabla_{c}L,\; \nabla_{\theta_{em}}L$

They are computed using Backpropagation Through Time (BPTT), which is an adaptation of Backpropagation to the unrolled computational graph.

BPTT is implemented in deep learning packages such as TensorFlow.
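
For concreteness, a minimal TensorFlow/Keras sketch of such a model; the layer sizes and vocabulary size are illustrative assumptions, and BPTT is applied automatically when the model is fit.

```python
import tensorflow as tf

vocab_size, embed_dim, hidden_dim = 10000, 64, 128      # assumed sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),              # theta_em
    tf.keras.layers.SimpleRNN(hidden_dim, return_sequences=True),  # U, W, b
    tf.keras.layers.Dense(vocab_size, activation="softmax"),       # V, c
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
# model.fit(input_ids, shifted_target_ids, ...)   # gradients computed by BPTT
```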


Recurrent Neural Networks

RNN and Self-Supervised Learning

Self-supervised learning is a learning technique where the training data is automatically labelled.

It is still supervised learning, but the datasets do not need to be manually labelled by a human; instead, labels can be obtained, e.g., by finding and exploiting relations between different input signals.

RNN training is self-supervised learning.


Long Short-Term Memory (LSTM) RNN

Outline

1 Introduction

2 Recurrent Neural Networks

3 Long Short-Term Memory (LSTM) RNN

4 RNN Architectures

5 Attention


Long Short-Term Memory (LSTM) RNN

Basic Idea

The Long Short-Term Memory (LSTM) unit is a widely used technique to address long-term dependencies.

The key idea is to use memory cells and gates:

c(t) = f_t c(t−1) + i_t a(t)

c(t): memory state at t; a(t): new input at t.
If the forget gate f_t is open (i.e., 1) and the input gate i_t is closed (i.e., 0), the current memory is kept.
If the forget gate f_t is closed (i.e., 0) and the input gate i_t is open (i.e., 1), the current memory is erased and replaced by the new input.
If we can learn f_t and i_t from data, then we can automatically determine how much history to remember/forget.


Long Short-Term Memory (LSTM) RNN

Basic Idea

In the case of vectors, c(t) = f_t ⊗ c(t−1) + i_t ⊗ a(t), where ⊗ denotes the pointwise product.

f_t is called the forget gate vector because it determines which components of the previous state, and how much of them, to remember/forget.

i_t is called the input gate vector because it determines which components of the input from h(t−1) and x(t), and how much of them, should go into the current state.

If we can learn f_t and i_t from data, then we can automatically determine which components to remember/forget and how much of them to remember/forget.
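
A tiny NumPy illustration of this elementwise gating (the numbers are made up):

```python
import numpy as np

c_prev = np.array([0.9, -0.5, 0.3])   # previous memory state c(t-1)
a_t    = np.array([0.1,  0.8, -0.6])  # new input a(t)
f_t    = np.array([1.0,  0.0, 0.5])   # forget gate: keep, erase, half-keep
i_t    = np.array([0.0,  1.0, 0.5])   # input gate: block, admit, half-admit

c_t = f_t * c_prev + i_t * a_t        # pointwise products and sum
# Component 1 keeps the old memory, component 2 is replaced by the new input,
# and component 3 blends the two.
```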


Long Short-Term Memory (LSTM) RNN

LSTM Cell

In a standard RNN: a(t) = b + W h(t−1) + U x(t), h(t) = tanh(a(t)).

In an LSTM, we introduce a cell state vector c_t and set

c_t = f_t ⊗ c_{t−1} + i_t ⊗ a(t)

h(t) = tanh(c_t)

where f_t and i_t are vectors.


Long Short-Term Memory (LSTM) RNN

LSTM Cell: Learning the Gates

f_t is determined based on the current input x(t) and the previous hidden state h(t−1):

f_t = σ(W_f x(t) + U_f h(t−1) + b_f),

where W_f, U_f, b_f are parameters to be learned from data.

i_t is also determined based on the current input x(t) and the previous hidden state h(t−1):

i_t = σ(W_i x(t) + U_i h(t−1) + b_i),

where W_i, U_i, b_i are parameters to be learned from data.

Note that the sigmoid activation function is used for the gates so that their values are often close to 0 or 1. In contrast, tanh is used for the output h(t) so as to have a strong gradient signal during backprop.


Long Short-Term Memory (LSTM) RNN

LSTM Cell: Output Gate

We can also have an output gate to control which components of the state vector c_t, and how much of them, should be output:

o_t = σ(W_q x(t) + U_q h(t−1) + b_q),

where W_q, U_q, b_q are the learnable parameters, and set

h(t) = o_t ⊗ tanh(c_t).


Long Short-Term Memory (LSTM) RNN

LSTM Cell: Summary

A standard RNN cell: a(t) = b + W h(t−1) + U x(t), h(t) = tanh(a(t))

An LSTM Cell:

f_t = σ(W_f x(t) + U_f h(t−1) + b_f)   (forget gate, ⓪ in the figure)

i_t = σ(W_i x(t) + U_i h(t−1) + b_i)   (input gate, ① in the figure)

o_t = σ(W_q x(t) + U_q h(t−1) + b_q)   (output gate, ③ in the figure)

c_t = f_t ⊗ c_{t−1} + i_t ⊗ tanh(U x(t) + W h(t−1) + b)   (update memory)

h(t) = o_t ⊗ tanh(c_t)   (next hidden unit)
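
A minimal NumPy sketch of one LSTM step following these equations; the parameter dictionary and its key names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p holds the weight matrices and biases named as above."""
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # forget gate
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # input gate
    o_t = sigmoid(p["Wq"] @ x_t + p["Uq"] @ h_prev + p["bq"])  # output gate
    a_t = np.tanh(p["U"] @ x_t + p["W"] @ h_prev + p["b"])     # candidate input
    c_t = f_t * c_prev + i_t * a_t                             # update memory
    h_t = o_t * np.tanh(c_t)                                   # next hidden unit
    return h_t, c_t
```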


Long Short-Term Memory (LSTM) RNN

Gated Recurrent Unit

The Gated Recurrent Unit (GRU) is another gating mechanism that allows RNNs to efficiently learn long-range dependencies.

It is a simplified version of the LSTM unit (no separate memory cell and no output gate) and hence has fewer parameters.

Its performance is also similar to that of the LSTM, and it can be better on small datasets.
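
The slide does not spell out the GRU equations; for reference, here is a sketch of the standard GRU formulation with an update gate z_t and a reset gate r_t, included as general background rather than content transcribed from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One step of a standard GRU; p holds the parameter matrices and biases."""
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])   # update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])   # reset gate
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r_t * h_prev) + p["bh"])
    return (1.0 - z_t) * h_prev + z_t * h_cand                  # new hidden state
```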


RNN Architectures

Outline

1 Introduction

2 Recurrent Neural Networks

3 Long Short-Term Memory (LSTM) RNN

4 RNN Architectures

5 Attention


RNN Architectures

So Far ....

So far, we have concentrated on the following architecture, which can be used to model the relationship between pairs of sequences (x(1), . . . , x(τ)) and (y(1), . . . , y(τ)) of the same length.

It is used for language modeling.


RNN Architectures

Deep RNNs

Sometimes, we might want to use multiple layers of hidden units. This leads to deep recurrent neural networks.

Using more layers usually leads to better performance.
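
A minimal sketch of the stacking, reusing the rnn_step function from the earlier sketch: at each time step, layer l takes the hidden state of layer l−1 as its input (this wiring is the common convention and an assumption here).

```python
def deep_rnn_step(x_t, prev_states, layer_params):
    """One time step of a stacked RNN: each layer feeds the layer above it."""
    new_states, layer_input = [], x_t
    for h_prev, (U, W, b) in zip(prev_states, layer_params):
        h = rnn_step(layer_input, h_prev, U, W, b)   # from the earlier sketch
        new_states.append(h)
        layer_input = h                              # input to the next layer up
    return new_states
```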


RNN Architectures

Bidirectional RNNs

In a one-directional RNN, h(t) only captures information from the past (x(1), . . . , x(t)), and we use h(t) to predict y(t).

Sometimes, we might need information from the future to predict y(t).

In such cases, we can use bidirectional RNNs.

They have been found useful in handwriting recognition, speech recognition, and bioinformatics.
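
A sketch of the idea, again reusing rnn_step from the earlier sketch: run one RNN over the input left-to-right and another right-to-left, then combine the two states at each time step (concatenation is the assumed combination).

```python
import numpy as np

def bidirectional_states(xs, fwd_params, bwd_params, hidden_dim):
    """Return, for each time step, [forward state ; backward state]."""
    forward, h = [], np.zeros(hidden_dim)
    for x in xs:                               # past -> future
        h = rnn_step(x, h, *fwd_params)
        forward.append(h)

    backward, h = [], np.zeros(hidden_dim)
    for x in reversed(xs):                     # future -> past
        h = rnn_step(x, h, *bwd_params)
        backward.append(h)
    backward.reverse()

    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```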


RNN Architectures

The Encoder-Decoder Architecture

The encoder-decoder, or sequence-to-sequence, architecture is used for learning to generate an output sequence (y(1), . . . , y(ny)) in response to an input sequence (x(1), . . . , x(nx)).

The context variable C represents a semantic summary of the input sequence.

The decoder defines a conditional distribution P(y(1), . . . , y(ny) | C) over sequences, from which output sequences are generated.

The architecture is used in dialogue systems and machine translation.


RNN Architectures

Seq2seq for machine translation

The generated sequence can be a translation of the input sequence in machine translation.

Note that this model assumes p(s_2 | C, s_1, y(1)) = p(s_2 | s_1, y(1)), . . . ; that is, the context C influences the decoder only through its initial state.
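
A sketch of how this assumption plays out at generation time: the context C enters only as the initial decoder state, after which each state depends only on the previous state and the previously generated word. All names here (decoder_step, output_layer, embed, start/end ids) are illustrative assumptions.

```python
def greedy_decode(C, decoder_step, output_layer, embed, start_id, eos_id, max_len=50):
    """C initializes the decoder state; each generated word is fed back in."""
    s, y, result = C, start_id, []
    for _ in range(max_len):
        s = decoder_step(embed(y), s)      # s_t depends on s_{t-1} and y(t-1) only
        y = int(output_layer(s).argmax())  # greedy choice of the next word
        if y == eos_id:
            break
        result.append(y)
    return result
```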


RNN Architectures

Seq2seq for ChatBot

The generated sequence can also be a reply to the input sequence in dialogue systems.


RNN Architectures

Seq2seq in Action

http://jalammar.github.io/images/seq2seq_6.mp4


RNN Architectures

Map a Vector to a Sequence

Maps a vector x to a sequence (y(1), . . . , y(τ)).

Found useful in image captioning.


Attention

Outline

1 Introduction

2 Recurrent Neural Networks

3 Long Short-Term Memory (LSTM) RNN

4 RNN Architectures

5 Attention


Attention

Motivation

In the seq2seq model, the context variable C is shared across all time steps of the decoder.

So, we look at the same summary of the input sequence when generating the output at every time step.

This is not desirable.


Attention

Motivation

In fact, we might want to focus on different parts of the input at different times:

Focus on h1 at time step 1,
Focus on h4 at time step 2,
Focus on h6 at time step 4,
. . .


Attention

Attention

So, we introduce a context variable C_t for each time step of the decoder:

$C_t = \sum_j \alpha_{t,j} h_j$

It is a linear combination of the hidden states from the encoder.

For different sequences, we use different values for the weights α_{t,j}, and consequently focus on different parts of the input.
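
In NumPy terms, with the encoder states stacked as rows of a matrix and the weights for decoder step t given, the context vector is just a weighted sum (an illustrative sketch with made-up numbers):

```python
import numpy as np

H = np.random.default_rng(2).normal(size=(6, 8))         # 6 encoder states h_j, dim 8
alpha_t = np.array([0.05, 0.7, 0.1, 0.05, 0.05, 0.05])   # weights for step t (sum to 1)

C_t = alpha_t @ H        # C_t = sum_j alpha_{t,j} h_j, a vector of dimension 8
```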


Attention

Attention Example


Attention

Attention

For a particular sequence, we have finished step t − 1 of decoding. What information do we use to determine the attention weights α_{t,j} for step t?

c_{t−1}: the context vector for step t − 1. It tells us what to look for next.

If "held" was attended to at step t − 1, we are likely to need "talk", "meeting", etc. at step t.
If "drink" was attended to at step t − 1, we are likely to need "water", "juice", etc. at step t.

In general, c_{t−1} is known as the query.

h_j: the latent state of step j of encoding. It tells us whether h_j is what we need next. In general, it is known as the key.

The attention weight is determined by assessing how compatible the query and the key are.


Attention

Compatibility Function

There are several possible ways to compute the compatibility function.

$\alpha_{t,j} = h_j^{\top} c_{t-1}$, assuming $h_j$ and $c_{t-1}$ have equal dimensionality. This is called dot-product attention.

$\alpha_{t,j} = h_j^{\top} W c_{t-1}$ for some matrix W that can be learned jointly with the other parameters.

$\alpha_{t,j} = f(c_{t-1}, h_j)$, where f is a function represented by a simple neural network that can be learned jointly with the other parameters. This is called additive attention.
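
A sketch of the three variants in NumPy; in practice the raw scores are usually normalized with a softmax so the weights are nonnegative and sum to one, and the additive form shown is one common choice for the small network f (both points are standard practice rather than details from this slide).

```python
import numpy as np

def dot_product_scores(H, c_prev):            # h_j^T c_{t-1} for every j
    return H @ c_prev

def bilinear_scores(H, c_prev, W):            # h_j^T W c_{t-1}
    return H @ (W @ c_prev)

def additive_scores(H, c_prev, W1, W2, v):    # v^T tanh(W1 h_j + W2 c_{t-1})
    return np.tanh(H @ W1.T + c_prev @ W2.T) @ v

def attention_weights(scores):                # softmax-normalize scores into weights
    e = np.exp(scores - scores.max())
    return e / e.sum()
```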


Attention

Attention

The weight α_{t,j} is obtained using the query c_{t−1} and the key h_j.

It tells us how relevant the information at step j is.

Then we retrieve the value h_j at each step j and compute the weighted sum to get:

$C_t = \sum_j \alpha_{t,j} h_j$

Note that, for different input sequences, the weights α_{t,j} are different. So, attention means dynamic weights. (In a standard NN, weights are static and do not depend on the input.)


Attention

Alignments between source and target sentences


Attention

Attention in Action

http://jalammar.github.io/images/attention_tensor_dance.mp4


Attention

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. www.deeplearningbook.org

Jay Alammar: http://jalammar.github.io/

Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
