Recurrent Neural Networks (RNN)
CSC321: Intro to Machine Learning and Neural Networks, Winter 2016
Michael Guerzhoy
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Some slides from Richard Socher, Geoffrey Hinton, Andrej Karpathy
Motivating Example: Language Models
• Want to assign a probability to a sentence
  • "Dafjdkf adkjfhalj fadlag dfah" → zero probability
  • "Furiously sleep ideas green colorless" → very low probability
  • "Colorless green ideas sleep furiously" → slightly higher
  • "The quick brown fox jumped over the lazy dog" → even higher
https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously
Application for Language Models
• Applications
  • OCR gives several hypotheses; need to choose the most probable one
  • Choose a plausible translation from English to French
  • Complete the sentence "A core objective of a learner is to generalize from its […]"
• In every case, a language model can be used to evaluate all the possible hypotheses and select the one with the highest probability
Sentence Completion
• Suppose a language model M can compute $P_M(w_1, w_2, \dots, w_n)$
• For an incomplete sentence $w_1 w_2 w_3 \dots w_{n-1}$, find $\arg\max_{w_n} P(w_1, w_2, w_3, \dots, w_n)$ to complete the sentence
• Now, fix $w_n$, and find $\arg\max_{w_{n+1}} P(w_1, w_2, w_3, \dots, w_n, w_{n+1})$ (a greedy completion loop, sketched below)
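A minimal sketch of this greedy completion loop. The scoring function `sentence_prob` is a hypothetical stand-in for the language model $P_M$; nothing here is specific to RNNs yet.

```python
# Greedy sentence completion: repeatedly append the word that maximizes
# the probability of the whole sentence under a (hypothetical) model.
def complete_greedily(prefix, vocab, sentence_prob, n_words=5):
    words = list(prefix)
    for _ in range(n_words):
        best = max(vocab, key=lambda w: sentence_prob(words + [w]))
        words.append(best)
    return words
```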
Probabilistic Sentence Generation
• $P(w_n \mid w_1, w_2, \dots, w_{n-1}) = \dfrac{P(w_1 \dots w_{n-1} w_n)}{P(w_1 \dots w_{n-1})} \propto P(w_1 \dots w_{n-1} w_n)$
• Choose word $w^{(i)}$ with probability $\dfrac{\exp\left(\alpha \, P(w_1 w_2 \dots w_{n-1} w^{(i)})\right)}{\sum_j \exp\left(\alpha \, P(w_1 w_2 \dots w_{n-1} w^{(j)})\right)}$
• (Question: higher $\alpha$ ⇒ ?)
• Generally, take $P$ to be the input to the softmax that produces the probability in the RNN (better), or simply the probability of the sentence (see the sampling sketch below)
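A minimal NumPy sketch of this sampling step. The `scores` argument is a hypothetical vector of per-candidate values of $P$ (one entry per candidate next word); as the question above hints, a higher $\alpha$ concentrates the distribution on the highest-scoring candidate.

```python
import numpy as np

def sample_next(scores, alpha=1.0):
    # Softmax over alpha-scaled scores, then sample a candidate index.
    z = alpha * np.asarray(scores, dtype=float)
    z -= z.max()                          # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return np.random.choice(len(probs), p=probs)
```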
Generating "Shakespeare" character-by-character with an RNN
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
(Here, the w's are characters, not words)
Generating Strings with Finite State Machines
[Diagram: a finite state machine over the symbols A, B, E]
ABAAEEAAEABEEEEEAAABEEAAB - Not allowed
AABEA - Allowed
ABBA - Not allowed
Markov Models: Probabilistic Finite State Machines
[Diagram: a probabilistic FSM over the symbols A, B, E, with transition probabilities 0.9, 0.1, 0.5, 0.5, 0.8, 0.2 on its edges]
$P(w_t = A \mid w_{t-1} = E) = 0.8$
$P(w_t = B \mid w_{t-1} = B) = 0$
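A minimal sketch of generating a string from such a Markov chain. The full transition matrix is not given on the slide, so the values below are illustrative placeholders except for $P(A \mid E) = 0.8$ and $P(B \mid B) = 0$.

```python
import numpy as np

symbols = ["A", "B", "E"]
# Rows: current symbol; columns: next symbol (order A, B, E).
T = np.array([[0.5, 0.4, 0.1],   # from A: placeholder probabilities
              [0.5, 0.0, 0.5],   # from B: P(B|B) = 0, as on the slide
              [0.8, 0.0, 0.2]])  # from E: P(A|E) = 0.8, as on the slide

def generate(start="A", length=10, seed=0):
    rng = np.random.default_rng(seed)
    s = [start]
    for _ in range(length - 1):
        row = T[symbols.index(s[-1])]
        s.append(symbols[rng.choice(3, p=row)])
    return "".join(s)

print(generate())
```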
A More Complicated FSM
• Suppose our alphabet is {"a", "b", "c", "e", "i"}
• Encode the rule: "i" before "e", except after "c" (one possible encoding is sketched below)
  (think "receive", "believe", "deceit"…)
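One possible way to encode this rule as a small state machine (an assumption about the intended encoding, since the slide does not draw the machine): remember the last two characters, and reject "ei" that is not preceded by "c" as well as "cie".

```python
def allowed(s):
    prev2, prev1 = "", ""                  # state: the last two characters seen
    for ch in s:
        if prev1 == "e" and ch == "i" and prev2 != "c":
            return False                   # "ei" not preceded by "c"
        if prev2 == "c" and prev1 == "i" and ch == "e":
            return False                   # "ie" directly after "c"
        prev2, prev1 = prev1, ch
    return True

print(allowed("cei"), allowed("bei"), allowed("cie"))   # True False False
```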
Recurrent Neural Networks
$x_t$: the t-th character of the string ("the character at time t")
$h_t$: the hidden state at time t
RNN for Language Modelling
• Given a list of word vectors (e.g., one-hot encodings of words) $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$
• At a single time step:
  • $h_t = \sigma\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$
  • $\hat{y}_t = \mathrm{softmax}\left(W^{(S)} h_t\right)$
• $P(x_{t+1} = v_j \mid x_1, x_2, \dots, x_t) = \hat{y}_{t,j}$
• $h$ is the state (e.g., the previous word vector could be part of $h$)
• $x_t$ is the data
• $\hat{y}_t$ is the predicted output (a one-step sketch follows)
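A minimal NumPy sketch of one such time step. The sizes (vocabulary V, hidden dimension H) and the choice of tanh for $\sigma$ are assumptions for illustration.

```python
import numpy as np

V, H = 5, 8                                    # assumed vocabulary / hidden sizes
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(H, H))      # hidden-to-hidden weights W^(hh)
W_hx = rng.normal(scale=0.1, size=(H, V))      # input-to-hidden weights W^(hx)
W_S  = rng.normal(scale=0.1, size=(V, H))      # hidden-to-output weights W^(S)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t)  # h_t = sigma(W^(hh) h_{t-1} + W^(hx) x_t)
    y_hat_t = softmax(W_S @ h_t)               # distribution over the next word
    return h_t, y_hat_t

# Usage: one-hot input for word v_2, zero initial state.
h1, y_hat1 = rnn_step(np.zeros(H), np.eye(V)[2])
```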
Cost Function
• Same as before: the negative log-probability of the right answer:
  $J^{(t)} = -\sum_j y_{t,j} \log \hat{y}_{t,j}$
  $J = \sum_t J^{(t)}$
• $y_{t,j} = 1$ iff $x_{t+1} = v_j$ (i.e., $y_t$ is the one-hot encoding of the actual next word; a numerical sketch follows)
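A minimal sketch of the per-step cost: with a one-hot target $y_t$, the cross-entropy reduces to the negative log-probability assigned to the actual next word. The example distribution below is made up for illustration.

```python
import numpy as np

def step_cost(y_hat_t, target_index):
    # J^(t) = -sum_j y_{t,j} log y_hat_{t,j}; with one-hot y_t this is just
    # -log of the probability the model gave to the correct next word.
    return -np.log(y_hat_t[target_index])

y_hat_t = np.array([0.1, 0.2, 0.1, 0.5, 0.1])   # predicted distribution over 5 words
J_t = step_cost(y_hat_t, target_index=3)        # total cost: J = sum over t of J^(t)
```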
RNN "=" feedforward net with shared weights
[Diagram: the RNN unrolled into a layered net over time=0, 1, 2, 3, with the same weights w1, w2, w3, w4 reused at every time step]
Assume that there is a time delay of 1 in using each connection.
The recurrent net is just a layered net that keeps reusing the same weights.
Reminder: Multivariate Chain Rule
[Diagram: $f(g_1(x), g_2(x), g_3(x))$, where each $g_i$ depends on $x$]
$\dfrac{df}{dx} = \sum_i \dfrac{\partial f}{\partial g_i} \dfrac{d g_i}{dx}$
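A short worked instance of the rule (illustrative, not from the slides):

```latex
% f(g_1, g_2) = g_1 g_2, with g_1(x) = x^2 and g_2(x) = \sin x:
\frac{df}{dx}
  = \frac{\partial f}{\partial g_1}\frac{dg_1}{dx}
  + \frac{\partial f}{\partial g_2}\frac{dg_2}{dx}
  = g_2 \cdot 2x + g_1 \cdot \cos x
  = 2x\sin x + x^2\cos x .
```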
Gradient
• $\dfrac{\partial J}{\partial W} = \sum_t \dfrac{\partial J^{(t)}}{\partial W}$
• $\dfrac{\partial J^{(t)}}{\partial W} = \sum_{k=1}^{t} \dfrac{\partial J^{(t)}}{\partial h_k}\dfrac{\partial h_k}{\partial W} = \sum_{k=1}^{t} \dfrac{\partial J^{(t)}}{\partial h_t}\dfrac{\partial h_t}{\partial h_k}\dfrac{\partial h_k}{\partial W}$
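A minimal 1-d sketch of accumulating this sum (backpropagation through time). The setup is an assumption, not the full language model: a scalar hidden state $h_t = \tanh(w h_{t-1} + u x_t)$ and a squared-error cost at the final step.

```python
import numpy as np

def grad_w(w, u, xs, y_target):
    hs = [0.0]
    for x in xs:                                   # forward pass
        hs.append(np.tanh(w * hs[-1] + u * x))
    t = len(xs)
    dJ_dht = hs[t] - y_target                      # dJ^(t)/dh_t for J^(t) = 0.5*(h_t - y)^2
    dht_dhk = 1.0                                  # running product dh_t/dh_k, k starts at t
    total = 0.0
    for k in range(t, 0, -1):
        dhk_dw = (1 - hs[k] ** 2) * hs[k - 1]      # local term dh_k/dw
        total += dJ_dht * dht_dhk * dhk_dw
        dht_dhk *= (1 - hs[k] ** 2) * w            # extend the product to dh_t/dh_{k-1}
    return total

print(grad_w(w=0.9, u=0.5, xs=[1.0, 0.0, 1.0], y_target=1.0))
```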
Vanishing and Exploding Gradient
• $\dfrac{\partial J^{(t)}}{\partial W} = \sum_{k=1}^{t} \dfrac{\partial J^{(t)}}{\partial h_k}\dfrac{\partial h_k}{\partial W} = \sum_{k=1}^{t} \dfrac{\partial J^{(t)}}{\partial h_t}\dfrac{\partial h_t}{\partial h_k}\dfrac{\partial h_k}{\partial W}$
• $\dfrac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \dfrac{\partial h_j}{\partial h_{j-1}}$
• Assume 1-d h's
  • In fact, they are vectors
• Often, $\dfrac{\partial h_j}{\partial h_{j-1}} < 1$ for all j, or $\dfrac{\partial h_j}{\partial h_{j-1}} > 1$ for all j
  • They are all related to each other by the same weight
• This means that $\dfrac{\partial h_t}{\partial h_k}$ is either very close to 0 (vanishing grad.) or very large (exploding grad.) for large $|t - k|$ (illustrated numerically below)
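A tiny numerical illustration of the last point (made-up numbers, not from the slides): a product of many factors that are each a bit below or a bit above 1 quickly becomes negligible or huge.

```python
for factor in (0.9, 1.1):          # each factor stands for one dh_j/dh_{j-1}
    for gap in (10, 50, 100):      # gap stands for |t - k|
        print(f"factor={factor}, |t-k|={gap}: product={factor ** gap:.3g}")
# factor 0.9 -> the product vanishes; factor 1.1 -> it explodes.
```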
Vanishing Gradient is a Problem
Problem: if the gradient vanishes, we can't figure out that the weight needs to be changed because of what's happening at time=0 in order to make the cost function at time=n smaller.
[Diagram: the unrolled RNN from time=0 to time=3]
Exploding Gradient is a Problem
• Why?
• "Hacky" solution: clip the gradients (a sketch follows)
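A minimal sketch of one common clipping scheme, clipping by the gradient's norm (the slide does not specify which variant):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)    # rescale so the norm equals max_norm
    return grad

g = np.array([30.0, -40.0])                # norm 50, rescaled down to norm 5
print(clip_gradient(g))
```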
Vanishing Gradient in Language Models
• In the case of language modeling or question answering, words from steps far away are not taken into consideration when training to predict the next word
• Example: Jane walked into the room. John walked in too. It was late in the day. Jane said hi to ____
Visualizing the hidden state*
Karpathy et al., "Visualizing and Understanding Recurrent Networks", http://arxiv.org/abs/1506.02078
*The RNN there is somewhat more complicated than what we saw so far