Machine Learning
Lecture 10: Recurrent Neural Networks

Nevin L. Zhang ([email protected])
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

This set of notes is based on internet resources and the references listed at the end.
Outline
1 Introduction
2 Recurrent Neural Networks
3 Long Short-Term Memory (LSTM) RNN
4 RNN Architectures
5 Attention
Introduction

So far, we have been talking about neural network models for labelled data:

    {x_i, y_i}_{i=1}^N  -->  P(y | x),

where each training example consists of one input x_i and one output y_i.

Next, we will talk about neural network models for sequential data:

    {(x_i^(1), ..., x_i^(τ_i)), (y_i^(1), ..., y_i^(τ_i))}_{i=1}^N  -->  P(y^(t) | x^(1), ..., x^(t)),

where each training example consists of a sequence of inputs (x_i^(1), ..., x_i^(τ_i)) and a sequence of outputs (y_i^(1), ..., y_i^(τ_i)), and the current output y^(t) depends not only on the current input, but also on all previous inputs.
Introduction: Language Modeling

Data: A collection of sentences.

For each sentence, create an output sequence by shifting it:

    ("what", "is", "the", "problem"),  ("is", "the", "problem", -)

From the training pairs, we can learn a neural language model:

It is used to predict the next word: P(w_k | w_1, ..., w_{k-1}).

It also defines a probability distribution over sentences:

    P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2, w_1) P(w_4 | w_3, w_2, w_1) ...
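As a concrete illustration, the shifted training pairs can be built with a few lines of Python. This is only a sketch: the toy sentences, the whitespace tokenizer and the "-" padding symbol are assumptions for the example, not part of the original notes.

    sentences = ["what is the problem", "how are you"]

    def make_pair(sentence, pad="-"):
        # Input sequence: the sentence's tokens.
        # Output sequence: the same tokens shifted left by one, padded at the end.
        tokens = sentence.split()
        return tokens, tokens[1:] + [pad]

    pairs = [make_pair(s) for s in sentences]
    print(pairs[0])
    # (['what', 'is', 'the', 'problem'], ['is', 'the', 'problem', '-'])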
Introduction: Dialogue and Machine Translation

Data: A collection of matched pairs.

    "How are you?" ; "I am fine."

We can still think of having an input and an output at each time point, except that some inputs and outputs are dummies:

    ("How", "are", "you", -, -, -),  (-, -, -, "I", "am", "fine").

From the training pairs, we can learn a neural model for dialogue or machine translation.
2 Recurrent Neural Networks
Recurrent Neural Networks

An RNN can be drawn in two ways (figure omitted):

A circuit diagram, aka recurrent graph (left), where a black square indicates time-delayed dependence, or

An unfolded computational graph, aka unrolled graph (right). The length of the unrolled graph is determined by the length of the input. In other words, the unrolled graphs for different sequences can be of different lengths.
Recurrent Neural Networks

The input tokens x^(t) are represented as embedding vectors, which are determined together with other model parameters during learning.

The hidden states h^(t) are also vectors.

The current state h^(t) depends on the current input x^(t) and the previous state h^(t-1) as follows:

    a^(t) = b + W h^(t-1) + U x^(t)
    h^(t) = tanh(a^(t))

where b, W and U are model parameters. They are independent of time t.
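The state update above translates directly into code. The following NumPy sketch is only an illustration under assumed sizes (hidden dimension 4, embedding dimension 3); the parameter names mirror the slide, while the random inputs and initialization are made up for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, emb = 4, 3
    W = rng.normal(size=(hidden, hidden))   # recurrent weights
    U = rng.normal(size=(hidden, emb))      # input weights
    b = np.zeros(hidden)                    # bias

    def rnn_step(h_prev, x):
        # a^(t) = b + W h^(t-1) + U x^(t);  h^(t) = tanh(a^(t))
        return np.tanh(b + W @ h_prev + U @ x)

    h = np.zeros(hidden)
    for x in rng.normal(size=(5, emb)):     # a sequence of 5 embedding vectors
        h = rnn_step(h, x)
    print(h)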
Recurrent Neural Networks

The output sequence is produced as follows:

    o^(t) = c + V h^(t)
    ŷ^(t) = softmax(o^(t))

where c and V are model parameters. They are independent of time t.
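The output computation can be sketched the same way. A small self-contained NumPy illustration with a hypothetical vocabulary of 10 tokens (all sizes and the placeholder hidden state are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, vocab = 4, 10
    V = rng.normal(size=(vocab, hidden))    # output weights
    c = np.zeros(vocab)                     # output bias
    h = rng.normal(size=hidden)             # stands in for a hidden state h^(t)

    o = c + V @ h                           # o^(t) = c + V h^(t)
    y_hat = np.exp(o - o.max())             # numerically stable softmax
    y_hat /= y_hat.sum()                    # ŷ^(t): a distribution over the vocabulary
    print(y_hat, y_hat.sum())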
Recurrent Neural Networks

This is the loss for one training pair:

    L({x^(1), ..., x^(τ)}, {y^(1), ..., y^(τ)}) = - Σ_{t=1}^{τ} log P_model(y^(t) | x^(1), ..., x^(t)),

where log P_model(y^(t) | x^(1), ..., x^(t)) is obtained by reading the entry for y^(t) from the model's output vector ŷ^(t).

When there are multiple input-target sequence pairs, the losses are added up.

Training objective: Minimize the total loss of all training pairs w.r.t. the model parameters and embedding vectors:

    W, U, V, b, c, θ_em,

where θ_em are the embedding vectors.
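To make the loss concrete, here is a small self-contained NumPy sketch of the negative log-likelihood of one (input, target) sequence pair. The per-step distributions ŷ^(t) are random placeholders and the target indices are made up; only the final two lines mirror the formula above.

    import numpy as np

    rng = np.random.default_rng(0)
    tau, vocab = 4, 10
    logits = rng.normal(size=(tau, vocab))
    y_hat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # ŷ^(1..τ)
    targets = np.array([2, 7, 0, 5])        # vocabulary indices of y^(1..τ)

    # L = - Σ_t log P_model(y^(t) | x^(1..t)) = - Σ_t log ŷ^(t)[y^(t)]
    loss = -np.log(y_hat[np.arange(tau), targets]).sum()
    print(loss)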
Training RNNs

RNNs are trained using stochastic gradient descent.

We need the gradients:

    ∇_W L, ∇_U L, ∇_V L, ∇_b L, ∇_c L, ∇_{θ_em} L

They are computed using Backpropagation Through Time (BPTT), which is an adaptation of backpropagation to the unrolled computational graph.

BPTT is implemented in deep learning packages such as TensorFlow.
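In practice one rarely codes BPTT by hand: an automatic-differentiation package unrolls the recurrence and backpropagates through it. The notes mention TensorFlow; the sketch below uses PyTorch purely as an illustration (any autodiff framework works the same way), with made-up sizes and a placeholder loss.

    import torch

    torch.manual_seed(0)
    hidden, emb, steps = 4, 3, 5
    W = torch.randn(hidden, hidden, requires_grad=True)
    U = torch.randn(hidden, emb, requires_grad=True)
    b = torch.zeros(hidden, requires_grad=True)

    h = torch.zeros(hidden)
    loss = torch.zeros(())
    for x in torch.randn(steps, emb):       # unroll the recurrence over 5 steps
        h = torch.tanh(b + W @ h + U @ x)
        loss = loss + (h ** 2).sum()        # placeholder loss; a real model uses the NLL

    loss.backward()                         # BPTT: gradients flow back through all steps
    print(W.grad.shape, U.grad.shape, b.grad.shape)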
RNN and Self-Supervised Learning

Self-supervised learning is a learning technique where the training data is automatically labelled.

It is still supervised learning, but the datasets do not need to be manually labelled by a human; instead, labels can be obtained, for example, by finding and exploiting relations between different input signals.

RNN training is self-supervised learning: as in language modeling, the targets are simply the shifted inputs, so no manual labels are needed.
3 Long Short-Term Memory (LSTM) RNN
Basic Idea

The Long Short-Term Memory (LSTM) unit is a widely used technique to address long-term dependencies.

The key idea is to use memory cells and gates:

    c^(t) = f_t c^(t-1) + i_t a^(t)

c^(t): memory state at t; a^(t): new input at t.

If the forget gate f_t is open (i.e., 1) and the input gate i_t is closed (i.e., 0), the current memory is kept.

If the forget gate f_t is closed (i.e., 0) and the input gate i_t is open (i.e., 1), the current memory is erased and replaced by the new input.

If we can learn f_t and i_t from data, then we can automatically determine how much history to remember/forget.
Basic Idea

In the case of vectors, c^(t) = f_t ⊗ c^(t-1) + i_t ⊗ a^(t), where ⊗ denotes the pointwise product.

f_t is called the forget gate vector because it determines which components of the previous state to remember/forget, and how much of them.

i_t is called the input gate vector because it determines which components of the input computed from h^(t-1) and x^(t) should go into the current state, and how much of them.

If we can learn f_t and i_t from data, then we can automatically determine which components to remember/forget and how much of them to remember/forget.
LSTM Cell

In a standard RNN: a^(t) = b + W h^(t-1) + U x^(t), h^(t) = tanh(a^(t)).

In an LSTM, we introduce a cell state vector c_t and set

    c_t = f_t ⊗ c_{t-1} + i_t ⊗ a^(t)
    h^(t) = tanh(c_t)

where f_t and i_t are gate vectors.
LSTM Cell: Learning the Gates

f_t is determined from the current input x^(t) and the previous hidden unit h^(t-1):

    f_t = σ(W_f x^(t) + U_f h^(t-1) + b_f),

where W_f, U_f, b_f are parameters to be learned from data.

i_t is also determined from the current input x^(t) and the previous hidden unit h^(t-1):

    i_t = σ(W_i x^(t) + U_i h^(t-1) + b_i),

where W_i, U_i, b_i are parameters to be learned from data.

Note that the sigmoid activation function is used for the gates so that their values are often close to 0 or 1. In contrast, tanh is used for the output h^(t) so as to have a strong gradient signal during backprop.
LSTM Cell: Output Gate

We can also have an output gate to control which components of the state vector c_t should be output, and how much of them:

    o_t = σ(W_q x^(t) + U_q h^(t-1) + b_q),

where W_q, U_q, b_q are learnable parameters, and set

    h^(t) = o_t ⊗ tanh(c_t)
LSTM Cell: Summary

A standard RNN cell: a^(t) = b + W h^(t-1) + U x^(t), h^(t) = tanh(a^(t))

An LSTM cell:

    f_t = σ(W_f x^(t) + U_f h^(t-1) + b_f)                     (forget gate)
    i_t = σ(W_i x^(t) + U_i h^(t-1) + b_i)                     (input gate)
    o_t = σ(W_q x^(t) + U_q h^(t-1) + b_q)                     (output gate)
    c_t = f_t ⊗ c_{t-1} + i_t ⊗ tanh(U x^(t) + W h^(t-1) + b)  (update memory)
    h^(t) = o_t ⊗ tanh(c_t)                                    (next hidden unit)
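The summary translates almost line for line into code. Below is a minimal, self-contained NumPy sketch of one LSTM step under assumed sizes (hidden dimension 4, embedding dimension 3); the parameter names follow the slide, and the random initialization and inputs are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, emb = 4, 3

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def make_params():
        # Returns (matrix applied to x^(t), matrix applied to h^(t-1), bias).
        return rng.normal(size=(hidden, emb)), rng.normal(size=(hidden, hidden)), np.zeros(hidden)

    Wf, Uf, bf = make_params()   # forget gate parameters
    Wi, Ui, bi = make_params()   # input gate parameters
    Wq, Uq, bq = make_params()   # output gate parameters
    U,  W,  b  = make_params()   # candidate input parameters (U acts on x, W on h)

    def lstm_step(x, h_prev, c_prev):
        f = sigmoid(Wf @ x + Uf @ h_prev + bf)                  # forget gate
        i = sigmoid(Wi @ x + Ui @ h_prev + bi)                  # input gate
        o = sigmoid(Wq @ x + Uq @ h_prev + bq)                  # output gate
        c = f * c_prev + i * np.tanh(U @ x + W @ h_prev + b)    # update memory
        h = o * np.tanh(c)                                      # next hidden unit
        return h, c

    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in rng.normal(size=(5, emb)):
        h, c = lstm_step(x, h, c)
    print(h)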
Gated Recurrent Unit

The Gated Recurrent Unit (GRU) is another gating mechanism that allows RNNs to efficiently learn long-range dependencies.

It is similar to an LSTM unit but has no separate memory cell and no output gate, and hence fewer parameters; it can be viewed as a simplified version of an LSTM unit.

Its performance is also similar to that of LSTM, except that it tends to do better on small datasets.
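For comparison, here is a minimal NumPy sketch of one GRU step. The notes do not spell out the GRU equations, so the formulation below (update gate z_t, reset gate r_t, candidate state h_tilde) is an assumption based on the standard GRU, with made-up sizes and names.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, emb = 4, 3

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def make_params():
        return rng.normal(size=(hidden, emb)), rng.normal(size=(hidden, hidden)), np.zeros(hidden)

    Wz, Uz, bz = make_params()   # update gate
    Wr, Ur, br = make_params()   # reset gate
    Wh, Uh, bh = make_params()   # candidate state

    def gru_step(x, h_prev):
        z = sigmoid(Wz @ x + Uz @ h_prev + bz)                # how much to update
        r = sigmoid(Wr @ x + Ur @ h_prev + br)                # how much of the past to expose
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)    # candidate new state
        return (1 - z) * h_prev + z * h_tilde                 # no separate cell state

    h = np.zeros(hidden)
    for x in rng.normal(size=(5, emb)):
        h = gru_step(x, h)
    print(h)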
4 RNN Architectures
So Far ...

So far, we have concentrated on the following architecture, which can be used to model the relationship between pairs of sequences (x^(1), ..., x^(τ)) and (y^(1), ..., y^(τ)) of the same length.

It is used for language modeling.
Deep RNNs

Sometimes, we might want to use multiple layers of hidden units. This leads to deep recurrent neural networks.

More layers usually imply better performance.
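A deep RNN simply feeds the hidden states of one recurrent layer to the next layer as its input sequence. A minimal NumPy sketch of a two-layer stack (sizes and names are illustrative assumptions, and the cell is the plain RNN cell from earlier):

    import numpy as np

    rng = np.random.default_rng(0)
    emb, h1_dim, h2_dim = 3, 4, 4

    def make_layer(in_dim, out_dim):
        W = rng.normal(size=(out_dim, out_dim))   # recurrent weights
        U = rng.normal(size=(out_dim, in_dim))    # input weights
        b = np.zeros(out_dim)
        return lambda h, x: np.tanh(b + W @ h + U @ x)

    layer1 = make_layer(emb, h1_dim)
    layer2 = make_layer(h1_dim, h2_dim)   # layer 2 takes layer 1's states as inputs

    s1, s2 = np.zeros(h1_dim), np.zeros(h2_dim)
    for x in rng.normal(size=(5, emb)):
        s1 = layer1(s1, x)
        s2 = layer2(s2, s1)
    print(s2)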
Bidirectional RNNs

In a one-directional RNN, h^(t) only captures information from the past (x^(1), ..., x^(t)), and we use h^(t) to predict y^(t).

Sometimes, we might need information from the future to predict y^(t).

In such cases, we can use bidirectional RNNs.

They have been found useful in handwriting recognition, speech recognition, and bioinformatics.
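A bidirectional RNN runs one recurrence left-to-right and another right-to-left over the same sequence, and concatenates the two states at each position. A minimal NumPy sketch (sizes, names and the plain RNN cell are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, emb, steps = 4, 3, 5

    def make_cell():
        W = rng.normal(size=(hidden, hidden))
        U = rng.normal(size=(hidden, emb))
        b = np.zeros(hidden)
        return lambda h, x: np.tanh(b + W @ h + U @ x)

    fwd, bwd = make_cell(), make_cell()
    xs = rng.normal(size=(steps, emb))

    h_f, h_b = np.zeros(hidden), np.zeros(hidden)
    states_f, states_b = [], []
    for x in xs:                     # forward pass over x^(1..τ)
        h_f = fwd(h_f, x)
        states_f.append(h_f)
    for x in xs[::-1]:               # backward pass over x^(τ..1)
        h_b = bwd(h_b, x)
        states_b.append(h_b)
    states_b = states_b[::-1]

    # The state used to predict y^(t) sees both past and future context.
    h_bi = [np.concatenate([f, bb]) for f, bb in zip(states_f, states_b)]
    print(h_bi[0].shape)             # (2 * hidden,)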
The Encoder-Decoder Architecture

The encoder-decoder or sequence-to-sequence architecture is used for learning to generate an output sequence (y^(1), ..., y^(n_y)) in response to an input sequence (x^(1), ..., x^(n_x)).

The context variable C represents a semantic summary of the input sequence.

The decoder defines a conditional distribution P(y^(1), ..., y^(n_y) | C) over sequences, from which output sequences are generated.

The architecture is used in dialogue systems and machine translation.
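A bare-bones sketch of the idea in NumPy: the encoder's final state plays the role of the context C, which initializes the decoder, and the decoder then emits one token per step. Everything here (sizes, the start symbol, greedy decoding, the fixed output length) is an illustrative assumption rather than the exact model in the notes.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, emb, vocab = 4, 3, 10

    def make_cell(in_dim):
        W = rng.normal(size=(hidden, hidden))
        U = rng.normal(size=(hidden, in_dim))
        b = np.zeros(hidden)
        return lambda h, x: np.tanh(b + W @ h + U @ x)

    encode_step, decode_step = make_cell(emb), make_cell(emb)
    V, c = rng.normal(size=(vocab, hidden)), np.zeros(vocab)
    embeddings = rng.normal(size=(vocab, emb))

    # Encoder: compress the input sequence into the context C.
    C = np.zeros(hidden)
    for x in rng.normal(size=(6, emb)):      # a length-6 input sequence
        C = encode_step(C, x)

    # Decoder: generate greedily, feeding each predicted token back in.
    s, token, outputs = C, 0, []             # token 0 plays the role of a start symbol
    for _ in range(5):
        s = decode_step(s, embeddings[token])
        token = int(np.argmax(c + V @ s))    # most probable next token
        outputs.append(token)
    print(outputs)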
Seq2seq for machine translation

The generated sequence can be a translation of the input sequence, as in machine translation.

Note that this model assumes: P(s_2 | C, s_1, y^(1)) = P(s_2 | s_1, y^(1)), ...
Seq2seq for ChatBot

The generated sequence can also be a reply to the input sequence, as in dialogue systems.
Seq2seq in Action

http://jalammar.github.io/images/seq2seq_6.mp4
Map a Vector to a Sequence

This architecture maps a vector x to a sequence (y^(1), ..., y^(τ)).

It has been found useful in image captioning.
5 Attention
Motivation

In the seq2seq model, the context variable C is shared at all time steps of the decoder.

So, we are looking at the same summary of the input sequence when generating the output at every time step.

This is not desirable.
Motivation

In fact, we might want to focus on different parts of the input at different times:

    Focus on h_1 at time step 1,
    Focus on h_4 at time step 2,
    Focus on h_6 at time step 4,
    ...
Attention

So, we introduce a context variable C_t for each time step of the decoder:

    C_t = Σ_j α_{t,j} h_j

It is a linear combination of the hidden states from the encoder.

For different sequences, we use different values for the weights α_{t,j}, and consequently focus on different parts of the input.
Attention

For a particular sequence, suppose we have finished step t-1 of decoding. What information do we use to determine the attention weights α_{t,j} for step t?

c_{t-1}: the context vector for step t-1. It tells us what to look for next.

    If "held" was attended to at step t-1, we are likely to need "talk", "meeting", etc. at step t.
    If "drink" was attended to at step t-1, we are likely to need "water", "juice", etc. at step t.

In general, c_{t-1} is known as the query.

h_j: the latent state of step j of encoding. It tells us whether h_j is what we need next. In general, it is known as the key.

The attention weight is determined by assessing how compatible the query and the key are.
Compatibility Function

There are several possible ways to compute the compatibility between the query and the key:

    α_{t,j} = h_j^T c_{t-1}, assuming h_j and c_{t-1} have equal dimensionality. This is called dot-product attention.

    α_{t,j} = h_j^T W c_{t-1} for some matrix W that can be learned jointly with the other parameters.

    α_{t,j} = f(c_{t-1}, h_j), where f is a function represented by a simple neural network that can be learned jointly with the other parameters. This is called additive attention.
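A minimal NumPy sketch of dot-product attention for one decoder step. The slides write both the compatibility value and the weight as α_{t,j}; below, the raw scores are normalized with a softmax so the weights sum to 1, which is the usual convention. All sizes and names are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, src_len = 4, 6
    H = rng.normal(size=(src_len, hidden))   # encoder states h_1, ..., h_6 (keys and values)
    c_prev = rng.normal(size=hidden)         # previous decoder context c_{t-1} (the query)

    scores = H @ c_prev                      # dot-product compatibility h_j^T c_{t-1}
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                     # attention weights α_{t,j}, summing to 1

    C_t = alpha @ H                          # context C_t = Σ_j α_{t,j} h_j
    print(alpha.round(2), C_t.shape)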
Attention

The weight α_{t,j} is obtained using the query c_{t-1} and the key h_j.

It tells us how relevant the information at step j is.

Then we retrieve the value h_j at each step j and compute the weighted sum:

    C_t = Σ_j α_{t,j} h_j

Note that, for different input sequences, the weights α_{t,j} are different. So, attention means dynamic weights. (In a standard NN, weights are static and do not depend on the input.)
Alignments between source and target sentences

(Figure omitted: the attention weights visualized as alignments between source-sentence and target-sentence words.)
Attention in Action

http://jalammar.github.io/images/attention_tensor_dance.mp4
References

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Devlin, J., et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. www.deeplearningbook.org

Jay Alammar: http://jalammar.github.io/

Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems.