
Page 1:

Recurrent Neural Networks (RNN)

CSC321: Intro to Machine Learning and Neural Networks, Winter 2016

Michael Guerzhoy

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Some slides from Richard Socher, Geoffrey Hinton, Andrej Karpathy

Page 2:

Motivating Example: Language Models

• Want to assign probability to a sentence
  • "Dafjdkf adkjfhalj fadlag dfah" – zero probability
  • "Furiously sleep ideas green colorless" – very low probability
  • "Colorless green ideas sleep furiously" – slightly higher
  • "The quick brown fox jumped over the lazy dog" – even higher

https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously

Page 3:

Applications of Language Models

• Applications:
  • OCR gives several hypotheses; need to choose the most probable one
  • Choose a plausible translation from English to French
  • Complete the sentence "A core objective of a learner is to generalize from its […]"

• In every case, a language model can be used to evaluate all the possible hypotheses, and select the one with the highest probability

Page 4:

Sentence Completion

• Suppose a language model $M$ can compute $P_M(w_1, w_2, \ldots, w_k)$

• For an incomplete sentence $w_1 w_2 w_3 \ldots w_{k-1}$, find $\arg\max_{w_k} P(w_1, w_2, w_3, \ldots, w_k)$ to complete the sentence

• Now, fix $w_k$, and find $\arg\max_{w_{k+1}} P(w_1, w_2, w_3, \ldots, w_k, w_{k+1})$ (see the sketch below)
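
To make the completion loop concrete, here is a minimal sketch (not from the slides) of the greedy procedure; `sentence_prob` is a hypothetical stand-in for the model's $P_M(w_1, \ldots, w_k)$.

```python
def complete_greedily(prefix, vocab, sentence_prob, n_words=3):
    """Greedy completion: repeatedly pick the w_k that maximizes
    P(w_1, ..., w_k), fix it, and move on to w_{k+1}.

    sentence_prob(words) is a hypothetical callable returning the language
    model's probability of the whole word sequence `words`."""
    words = list(prefix)
    for _ in range(n_words):
        # argmax over the vocabulary of the extended sentence's probability
        best_word = max(vocab, key=lambda w: sentence_prob(words + [w]))
        words.append(best_word)
    return words
```

This always picks the single most probable continuation at each step; the next slide replaces the argmax with sampling.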

Page 5:

Probabilistic Sentence Generation

β€’ 𝑃 π‘€π‘˜ 𝑀1, 𝑀2, β€¦π‘€π‘˜βˆ’1 =𝑃(𝑀1 β€¦π‘€π‘˜βˆ’1π‘€π‘˜)

𝑃(𝑀1β€¦π‘€π‘˜βˆ’1)∝ 𝑃 𝑀1 … π‘€π‘˜βˆ’1π‘€π‘˜

• Choose word $w^{(j)}$ according to $\dfrac{\exp\!\big(\alpha\, P(w_1 w_2 \ldots w_{k-1} w^{(j)})\big)}{\sum_{j} \exp\!\big(\alpha\, P(w_1 w_2 \ldots w_{k-1} w^{(j)})\big)}$

• (Question: higher $\alpha$ ⇒ ?)

• Generally, take $P$ to be the input to the softmax that produces the probability in the RNN (better), or simply the probability of the sentence (see the sampling sketch below)
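
A minimal sketch of the sampling rule above, assuming `score(words)` is a hypothetical callable returning the quantity $P$ used inside the exponent (the softmax input from the RNN, or simply the sentence probability); $\alpha$ acts as an inverse temperature.

```python
import numpy as np

def sample_next_word(words_so_far, vocab, score, alpha=1.0, rng=None):
    """Sample w^(j) with probability proportional to
    exp(alpha * P(w_1 ... w_{k-1} w^(j))).

    score(words) is a hypothetical callable returning P for a candidate
    sentence; larger alpha concentrates the distribution on the
    highest-scoring words, smaller alpha flattens it toward uniform."""
    if rng is None:
        rng = np.random.default_rng()
    scores = np.array([score(words_so_far + [w]) for w in vocab])
    z = alpha * scores
    z = z - z.max()                      # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()  # softmax over the candidate words
    return rng.choice(vocab, p=probs)
```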

Page 6:

Generating "Shakespeare" character-by-character with RNN

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

(Here, the w's are characters, not words)

Page 7:

Generating Strings with Finite State Machines

[Figure: state diagram of a finite state machine with states A, B, and E]

Page 8:

Generating Strings with Finite State Machines

[Figure: the same FSM with states A, B, and E]

ABAAEEAAEABEEEEEAAABEEAAB – Not allowed
AABEA – Allowed
ABBA – Not allowed

Page 9:

Markov Models: Probabilistic Finite State Machines

[Figure: the FSM with states A, B, and E, with transition probabilities 0.9, 0.1, 0.5, 0.5, 0.8, 0.2 on its edges]

Page 10:

Markov Models: Probabilistic Finite State Machines

[Figure: the same FSM with states A, B, and E, with transition probabilities 0.9, 0.1, 0.5, 0.5, 0.8, 0.2 on its edges]

𝑃 π‘€π‘˜ = 𝐴 π‘€π‘˜βˆ’1 = 𝐸 = 0.8

𝑃 π‘€π‘˜ = 𝐡 π‘€π‘˜βˆ’1 = 𝐡 = 0

Page 11:

A More Complicated FSM

• Suppose our alphabet is {"a", "b", "c", "e", "i"}

• Encode the rule: "i" before "e," except after "c"
(think "receive," "believe," "deceit"…)
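
One possible way to attack the exercise, sketched as a checker rather than a drawn state diagram: keep just enough recent context (the previous one or two characters) to enforce the rule. The exact reading of the rule below is an assumption; the slide leaves the FSM itself as an exercise.

```python
def allowed(s):
    """Check '"i" before "e", except after "c"' over the alphabet
    {a, b, c, e, i}: "ei" is only allowed immediately after "c", and
    "ie" is not allowed immediately after "c" (one reading of the rule)."""
    for k in range(1, len(s)):
        after_c = k >= 2 and s[k - 2] == "c"
        if s[k - 1] == "e" and s[k] == "i" and not after_c:
            return False  # "ei" without a preceding "c"
        if s[k - 1] == "i" and s[k] == "e" and after_c:
            return False  # "cie": "ie" right after "c"
    return True

print(allowed("abcei"), allowed("abei"), allowed("acie"))  # True False False
```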

Page 12:

Recurrent Neural Networks

π‘₯𝑑-the tth character of the string (β€œthe character at time t”)β„Žπ‘‘-the tth character of the string (β€œthe character at time t”)

Page 13:

RNN for Language Modelling

• Given a list of word vectors (e.g., one-hot encodings of words) $x_1, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_T$

At a single time step:
• $h_t = \sigma\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$
• $\hat{y}_t = \mathrm{softmax}\!\left(W^{(S)} h_t\right)$

• $P(x_{t+1} = v_j \mid x_1, x_2, \ldots, x_t) = \hat{y}_{t,j}$

• $h$ is the state (e.g., the previous word vector could be part of $h$)

• $x_t$ is the data

• $\hat{y}_t$ is the predicted output (see the NumPy sketch below)

Page 14:

Cost Function

• Same as before: negative log-probability of the right answer:

$J^{(t)} = -\sum_{j=1}^{V} y_{t,j} \log \hat{y}_{t,j}$

$J = \sum_{t} J^{(t)}$

• $y_{t,j} = 1$ iff $x_{t+1} = v_j$ (i.e., $y_t$ is the one-hot encoding of the observed next word)

Page 15:

RNN "=" feedforward net with shared weights

[Figure: the RNN unrolled through time=0, 1, 2, 3 as a layered feedforward net that reuses the same weights w1, w2, w3, w4 at every time step]

Assume that there is a time delay of 1 in using each connection.

The recurrent net is just a layered net that keeps reusing the same weights.

Page 16:

Reminder: Multivariate Chain Rule

𝑓1(π‘₯) 𝑓2(π‘₯)

π‘₯

𝑓3(π‘₯)

𝑔(𝑓1 π‘₯ , 𝑓2 π‘₯ , 𝑓3 π‘₯ ) πœ•π‘”

πœ•π‘₯= Οƒ

πœ•π‘”

πœ•π‘“π‘–

πœ•π‘“π‘–πœ•π‘₯

Page 17:

Gradient

β€’πœ•π½

πœ•π‘Š= σ𝑑

πœ•π½(𝑑)

πœ•π‘Š

β€’πœ•π½(𝑑)

πœ•π‘Š=Οƒπ‘˜=1

𝑑 πœ•π½(𝑑)

πœ•β„Žπ‘˜

πœ•β„Žπ‘˜

πœ•π‘Š= Οƒπ‘˜=1

𝑑 πœ•π½(𝑑)

πœ•β„Žπ‘‘

πœ•β„Žπ‘‘

πœ•β„Žπ‘˜

πœ•β„Žπ‘˜

πœ•π‘Š

Page 18:

Vanishing and Exploding Gradients

β€’πœ•π½(𝑑)

πœ•π‘Š=Οƒπ‘˜=1

𝑑 πœ•π½(𝑑)

πœ•β„Žπ‘˜

πœ•β„Žπ‘˜

πœ•π‘Š= Οƒπ‘˜=1

𝑑 πœ•π½(𝑑)

πœ•β„Žπ‘‘

πœ•β„Žπ‘‘

πœ•β„Žπ‘˜

πœ•β„Žπ‘˜

πœ•π‘Š

β€’πœ•β„Žπ‘‘

πœ•β„Žπ‘˜= ς𝑗=π‘˜+1

𝑑 πœ•β„Žπ‘—

πœ•β„Žπ‘—βˆ’1

β€’ Assume 1-d h’s β€’ In fact, they are vectors

β€’ Often, πœ•β„Žπ‘—

πœ•β„Žπ‘—βˆ’1< 1 for all j or

πœ•β„Žπ‘—

πœ•β„Žπ‘—βˆ’1> 1 for all j

β€’ They are all related by the same weight to each other

β€’ This means that πœ•β„Žπ‘‘

πœ•β„Žπ‘˜is either very close to 0 (vanishing

grad.) or very large (exploding grad.) for large |𝑑 βˆ’ π‘˜|
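
A one-line numerical illustration of the claim, using the slide's 1-d simplification: if every factor $\partial h_j / \partial h_{j-1}$ is a bit below 1 the product collapses toward 0, and if every factor is a bit above 1 it blows up (0.9 and 1.1 are arbitrary illustrative values).

```python
import numpy as np

t_minus_k = 50                       # number of factors in the product, i.e. |t - k|
small = np.full(t_minus_k, 0.9)      # each dh_j/dh_{j-1} a bit below 1
large = np.full(t_minus_k, 1.1)      # each dh_j/dh_{j-1} a bit above 1

print(np.prod(small))   # about 5e-3: vanishing gradient
print(np.prod(large))   # about 1.2e+2: exploding gradient
```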

Page 19:

Vanishing Gradient is a Problem

Problem: if the gradient vanishes, we can't figure out that the weight needs to be changed, based on what's happening at time=0, in order to make the cost function at time=n smaller

[Figure: the unrolled network from time=0 to time=3 again]

Page 20:

Exploding Gradient is a Problem

• Why?

• "Hacky" solution: clip the gradients (see the sketch below)
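
A minimal sketch of the "hacky" fix: rescale the gradient whenever its norm exceeds a threshold, so one exploding step can't throw the weights far off; the threshold of 5.0 is an arbitrary illustrative value.

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Norm clipping: rescale grad so that its L2 norm is at most max_norm.
    max_norm is an illustrative threshold, not a value from the slides."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```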

Page 21:

Vanishing Gradient in Language Models

• In the case of language modeling or question answering, words from steps far away are not taken into consideration when training to predict the next word

• Example: Jane walked into the room. John walked in too. It was late in the day. Jane said hi to ____

Page 22:

Visualizing the hidden state*

Karpathy et al., "Visualizing and Understanding Recurrent Networks", http://arxiv.org/abs/1506.02078

*The RNN there is somewhat more complicated than what we saw so far

Page 23:

Karpathy et al., "Visualizing and Understanding Recurrent Networks", http://arxiv.org/abs/1506.02078

Page 24:

Karpathy et al., "Visualizing and Understanding Recurrent Networks", http://arxiv.org/abs/1506.02078