Slides credited from Hung-Yi Lee & Richard Socher

Outline
Language Modeling
◦ N-gram Language Model
◦ Feed-Forward Neural Language Model
◦ Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network
◦ Definition
◦ Training via Backpropagation through Time (BPTT)
◦ Training Issue

Applications
◦ Sequential Input
◦ Sequential Output
◦ Aligned Sequential Pairs (Tagging)
◦ Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)


Language Modeling
Goal: estimate the probability of a word sequence

Example task: determine whether a sequence is grammatical or makes more sense

Example: given the two hypotheses "recognize speech" and "wreck a nice beach", output "recognize speech" if P(recognize speech) > P(wreck a nice beach).
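For reference, the factorization behind this probability (not spelled out on the slide) is the standard chain rule of language modeling; later slides apply it with a truncated or recurrent history:

```latex
P(w_1, w_2, \dots, w_m) \;=\; \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
```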


N-Gram Language Modeling
Goal: estimate the probability of a word sequence

N-gram language model
◦ The probability is conditioned on a window of (n-1) previous words
◦ Estimate the probability based on the training data

  P(beach | nice) = C("nice beach") / C("nice")

  where C("nice beach") is the count of "nice beach" in the training data and C("nice") is the count of "nice" in the training data.

Issue: some sequences may not appear in the training data
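A minimal sketch of this count-based estimate for a bigram model; the tiny corpus and the `bigram_prob` helper are made up purely for illustration:

```python
from collections import Counter

corpus = "wreck a nice beach it was a nice day".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("nice", "beach"))   # C("nice beach") / C("nice") = 1/2
```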

N-Gram Language Modeling
Training data:
◦ The dog ran ……
◦ The cat jumped ……

P( jumped | dog ) = 0 → give some small probability, e.g. 0.0001 (smoothing)
P( ran | cat ) = 0 → give some small probability, e.g. 0.0001 (smoothing)

➢ The probability estimates are not accurate.
➢ This happens because we cannot collect all the possible text in the world as training data.
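The slide does not name a specific smoothing method; as one common illustration, add-one (Laplace) smoothing gives every unseen n-gram a small nonzero probability. The sketch below reuses the hypothetical counts from the previous snippet:

```python
def smoothed_bigram_prob(prev, word, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

vocab = set(corpus)
print(smoothed_bigram_prob("dog", "jumped", len(vocab)))  # nonzero even though "dog jumped" never occurs
```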


Neural Language Modeling
Idea: estimate the probability not from counts, but from the prediction of a neural network

vector of "START" → Neural Network → P(next word is "wreck")
vector of "wreck" → Neural Network → P(next word is "a")
vector of "a"     → Neural Network → P(next word is "nice")
vector of "nice"  → Neural Network → P(next word is "beach")

P("wreck a nice beach") = P(wreck | START) P(a | wreck) P(nice | a) P(beach | nice)

Neural Language Modeling
Bengio et al., "A Neural Probabilistic Language Model," in JMLR, 2003.

[Figure: feed-forward architecture — the input layer takes the context word vectors, a hidden layer builds the context vector, and the output layer gives the probability distribution of the next word]
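A rough sketch of the forward pass of such a feed-forward neural language model. All sizes, parameter names, and the softmax details below are illustrative assumptions, not taken from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim, context_size = 1000, 64, 128, 3

# Parameters: word embeddings, hidden layer, output layer.
E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, context_size * embed_dim))
W_o = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(context_ids):
    """P(next word | previous context_size words) for a fixed-size context window."""
    x = E[context_ids].reshape(-1)      # concatenated context word vectors
    h = np.tanh(W_h @ x)                # hidden layer (context vector)
    return softmax(W_o @ h)             # probability distribution over the vocabulary

p = next_word_distribution([12, 7, 42])  # arbitrary word ids
print(p.shape, p.sum())                  # (1000,) ~1.0
```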

Neural Language Modeling
The input-layer (or hidden-layer) representations of related words are close to each other
◦ If P(jump | dog) is large, P(jump | cat) increases accordingly (even if "… cat jump …" never appears in the data)

[Figure: "dog", "cat", and "rabbit" lie close together in the (h1, h2) hidden space]

→ Smoothing is automatically done

Issue: the context window used for conditioning has a fixed size


Recurrent Neural Network
Idea: condition the neural network on all previous words and tie the weights at each time step

Assumption: temporal information matters

RNN Language Modeling

vector of "START" → RNN → P(next word is "wreck")
vector of "wreck" → RNN → P(next word is "a")
vector of "a"     → RNN → P(next word is "nice")
vector of "nice"  → RNN → P(next word is "beach")

[Figure: at each step, the input word vector feeds the hidden layer (context vector), and the output layer gives the probability distribution over the next word]

Idea: pass the information from the previous hidden layer to leverage all contexts


RNNLM Formulation
At each time step t:

  s_t = f(U x_t + W s_{t-1})     ← hidden state (context vector)
  o_t = softmax(V s_t)           ← probability distribution of the next word

where x_t is the vector of the current word and o_t gives the probability of the next word.
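A minimal NumPy sketch of this forward recurrence. The parameter names U, W, V follow the formulation above; the dimensions, tanh choice for f, and one-hot inputs are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 1000, 128

U = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))   # input-to-hidden
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # hidden-to-hidden (tied across time)
V = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))   # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnnlm_forward(word_ids):
    """Return the next-word distribution o_t at every time step."""
    s = np.zeros(hidden_dim)                 # s_0: initial hidden state
    outputs = []
    for w in word_ids:
        x = np.zeros(vocab_size)
        x[w] = 1.0                           # one-hot vector of the current word
        s = np.tanh(U @ x + W @ s)           # s_t = f(U x_t + W s_{t-1})
        outputs.append(softmax(V @ s))       # o_t = softmax(V s_t)
    return outputs

probs = rnnlm_forward([3, 17, 256])
print(len(probs), probs[-1].shape)           # 3 (1000,)
```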


Recurrent Neural Network Definition
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

  s_t = f(U x_t + W s_{t-1}), where the activation function f is typically tanh or ReLU

Model Training
All model parameters can be updated by comparing the predicted output with the target at each time step (y_{t-1}, y_t, y_{t+1}, …) and applying gradient-based optimization.
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
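The training objective is not rendered in the extracted text; assuming the usual cross-entropy cost over time steps (as in the linked wildml tutorial), the update can be written as:

```latex
C(y, o) \;=\; \sum_t C^{(t)} \;=\; -\sum_t y_t^{\top} \log o_t,
\qquad
\theta \leftarrow \theta - \eta \,\frac{\partial C}{\partial \theta},
\quad \theta \in \{U, V, W\}
```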


Backpropagation

Forward Pass: compute each layer's activations, so that a_j^(l-1) (the activation of unit j in layer l-1) and z_i^(l) (the pre-activation of unit i in layer l) are available.

Backward Pass: compute the error signal δ_i^(l) for every unit i of layer l.

The gradient of the cost with respect to the weight w_ij^(l) connecting unit j in layer l-1 to unit i in layer l is

  ∂C/∂w_ij^(l) = a_j^(l-1) · δ_i^(l)

Backpropagation

Backward Pass: the error signal starts at the output layer L and is propagated back layer by layer:

  δ^(L) = σ'(z^(L)) ⊙ ∇_y C
  δ^(l) = σ'(z^(l)) ⊙ (W^(l+1))^T δ^(l+1)      for l = L-1, …, 1

Backpropagation through Time (BPTT)
Unfold the recurrent network through time:
◦ Input: init, x1, x2, …, xt
◦ Output: ot
◦ Target: yt

[Figure: the unfolded network — the initial state init and inputs x1 … xt produce hidden states s1 … st; the output ot at step t is compared against the target yt to give the cost C(y)]


Backpropagation through Time (BPTT)
In the unfolded network, the weights at different time steps are tied together — they point to the same memory, so their gradients are accumulated into the same shared parameters.


BPTT

Forward Pass: compute s1, s2, s3, s4, … together with the outputs o1, o2, o3, o4 and their costs C(1), C(2), C(3), C(4) against the targets y1, y2, y3, y4.
Backward Pass: backpropagate the error for C(1), for C(2), for C(3), and for C(4) through the unfolded network, accumulating the gradients of the tied weights.
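Putting the previous slides together, here is a compact NumPy sketch of BPTT for the RNNLM forward pass shown earlier (it reuses U, W, V, softmax, hidden_dim, and vocab_size from that sketch, and assumes a cross-entropy cost C(t) at every step):

```python
def rnnlm_bptt(word_ids, target_ids):
    """Forward pass over the whole sequence, then backpropagation through time.
    Returns gradients for the tied parameters U, W, V."""
    T = len(word_ids)
    xs, ss, os = [], [np.zeros(hidden_dim)], []
    for w in word_ids:                                  # forward pass: compute s_1 ... s_T
        x = np.zeros(vocab_size); x[w] = 1.0
        xs.append(x)
        ss.append(np.tanh(U @ x + W @ ss[-1]))
        os.append(softmax(V @ ss[-1]))

    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ds_next = np.zeros(hidden_dim)
    for t in reversed(range(T)):                        # backward pass: one cost C(t) per step
        do = os[t].copy(); do[target_ids[t]] -= 1.0     # dC(t)/d(V s_t) for softmax + cross-entropy
        dV += np.outer(do, ss[t + 1])
        ds = V.T @ do + ds_next                         # error flowing into s_t (from output and future)
        dz = ds * (1.0 - ss[t + 1] ** 2)                # through the tanh nonlinearity
        dU += np.outer(dz, xs[t])
        dW += np.outer(dz, ss[t])                       # ss[t] is s_{t-1}; tied weights accumulate
        ds_next = W.T @ dz                              # pass error back to the previous time step
    return dU, dW, dV

grads = rnnlm_bptt([3, 17, 256], [17, 256, 42])         # predict each next word id
```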


RNN Training Issue
The gradient is a product of Jacobian matrices, each associated with a step in the forward computation.

The same matrix is multiplied in at every time step during backpropagation, so the gradient becomes very small or very large quickly → vanishing or exploding gradient.

Bengio et al., "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, 1994. [link]
Pascanu et al., "On the difficulty of training recurrent neural networks," in ICML, 2013. [link]
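Written out (following Pascanu et al.), the gradient of the cost at step t with respect to the recurrent weights contains a product of Jacobians whose norm can shrink or grow exponentially with the distance t − k:

```latex
\frac{\partial C^{(t)}}{\partial W}
  \;=\; \sum_{k=1}^{t}
    \frac{\partial C^{(t)}}{\partial s_t}
    \left( \prod_{j=k+1}^{t} \frac{\partial s_j}{\partial s_{j-1}} \right)
    \frac{\partial s_k}{\partial W}
```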

Rough Error Surface

[Figure: the cost surface over parameters w1 and w2]

The error surface is either very flat or very steep.

Bengio et al., "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, 1994. [link]
Pascanu et al., "On the difficulty of training recurrent neural networks," in ICML, 2013. [link]

Vanishing/Exploding Gradient Example

[Figure: six panels showing the gradient after 1, 2, 5, 10, 20, and 50 steps, illustrating how it vanishes or explodes as the number of steps grows]

Possible Solutions (Recurrent Neural Network)

Exploding Gradient: Clipping
Idea: control the gradient value to avoid exploding

[Figure: cost surface over w1 and w2 — the clipped gradient takes a smaller, safer step off the steep cliff]

Parameter setting: a clipping threshold from half to ten times the average gradient norm can still yield convergence.

Pascanu et al., "On the difficulty of training recurrent neural networks," in ICML, 2013. [link]
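A minimal sketch of gradient norm clipping in this spirit; the threshold value here is an arbitrary placeholder:

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient if its norm exceeds the threshold (norm clipping)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

clipped = clip_gradient(np.array([30.0, -40.0]))   # norm 50 → rescaled to norm 5
```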

Vanishing Gradient: Initialization + ReLU
IRNN
◦ initialize the recurrent weight matrix W as the identity matrix I
◦ use ReLU as the activation function

Le et al., "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units," arXiv, 2015. [link]
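A sketch of the IRNN recipe (identity-initialized recurrent matrix, zero bias, ReLU activation); the dimensions and the `x_proj` input are illustrative assumptions:

```python
import numpy as np

hidden_dim = 128
W_rec = np.eye(hidden_dim)          # recurrent weights initialized to the identity matrix I
b = np.zeros(hidden_dim)            # biases initialized to zero

def irnn_step(x_proj, s_prev):
    """One IRNN step: ReLU instead of tanh; x_proj stands for the projected input U @ x_t."""
    return np.maximum(0.0, x_proj + W_rec @ s_prev + b)
```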

Vanishing Gradient: Gating Mechanism
RNNs model temporal sequence information
◦ they can handle "long-term dependencies" in theory

  "I grew up in France … I speak fluent French."

Issue: in practice, RNNs cannot handle such long-term dependencies due to the vanishing gradient
→ apply a gating mechanism to directly encode the long-distance information

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Extension (Recurrent Neural Network)

Bidirectional RNN

h_t = [→h_t ; ←h_t] — the concatenation of the forward and backward hidden states represents (summarizes) the past and future around a single token.

Deep Bidirectional RNN

Each memory layer passes an intermediate representation to the next.


How to Frame the Learning Problem?
The learning algorithm f maps the input domain X to the output domain Y:

  f : X → Y

Input domain: word, word sequence, audio signal, click logs
Output domain: single label, sequence of tags, tree structure, probability distribution

Network design should leverage the properties of the input and output domains.


Input Domain – Sequence Modeling
Idea: aggregate the meaning of all words into a single vector

Method (see the sketches after this slide):
◦ Basic combination: average, sum
◦ Neural combination:
  ✓ Recursive neural network (RvNN)
  ✓ Recurrent neural network (RNN)
  ✓ Convolutional neural network (CNN)

[Figure: how to compute an N-dim vector for the sentence 這 (this) 規格 (specification) 有 (have) 誠意 (sincerity) from its word vectors]
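A trivial sketch of the "basic combination" options (average or sum of word vectors); the word vectors here are random stand-ins, not learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
word_vectors = {w: rng.normal(size=50) for w in ["這", "規格", "有", "誠意"]}

sentence = ["這", "規格", "有", "誠意"]
vecs = np.stack([word_vectors[w] for w in sentence])

sum_vec = vecs.sum(axis=0)           # sum combination
avg_vec = vecs.mean(axis=0)          # average combination
print(sum_vec.shape, avg_vec.shape)  # (50,) (50,)
```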

Sentiment Analysis
Encode the sequential input into a vector using an RNN

[Figure: input words x1, x2, …, xN are encoded by the RNN; the resulting sentence vector feeds a classifier that produces outputs y1, y2, …, yM]

The RNN considers temporal information to learn sentence vectors, which serve as the input of classification tasks.
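A sketch of the neural combination for this setup: run an RNN over the word vectors and feed the final hidden state to a softmax classifier. The dimensions, the two-class output, and the random parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim, hidden_dim, num_classes = 50, 64, 2

U = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_cls = rng.normal(scale=0.1, size=(num_classes, hidden_dim))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_sentence(word_vecs):
    """Encode the word-vector sequence with an RNN, then classify the final hidden state."""
    s = np.zeros(hidden_dim)
    for x in word_vecs:
        s = np.tanh(U @ x + W @ s)        # the sentence vector accumulates temporal information
    return softmax(W_cls @ s)             # e.g. P(positive), P(negative)

print(classify_sentence([rng.normal(size=embed_dim) for _ in range(5)]))
```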


Output Domain – Sequence Prediction
POS Tagging
  "推薦我台大後門的餐廳" ("recommend me a restaurant near NTU's back gate")
  → 推薦/VV 我/PN 台大/NR 後門/NN 的/DEG 餐廳/NN

Speech Recognition
  audio → "大家好" ("hello everyone")

Machine Translation
  "How are you doing today?" → "你好嗎?" ("how are you?")

The output can be viewed as a sequence of classifications.


POS Tagging
Tag each word at each time step
◦ Input: word sequence
◦ Output: corresponding POS tag sequence

  四樓 好 專業 ("the fourth floor is so professional")
  N    VA  AD

Natural Language Understanding (NLU)
Tag each word at each time step
◦ Input: word sequence
◦ Output: IOB-format slot tags and an intent tag

  <START> just sent email to bob            about fishing   this      weekend   <END>
          O    O    O     O  B-contact_name O     B-subject I-subject I-subject O + send_email (intent)

  → send_email(contact_name="bob", subject="fishing this weekend")

The temporal order of the input and the output is the same.
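For such aligned input/output pairs, the same RNN cell can emit one tag distribution per time step instead of a single sentence-level label; a minimal sketch (tag-set size, dimensions, and random parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
embed_dim, hidden_dim, num_tags = 50, 64, 5

U = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
V = rng.normal(scale=0.1, size=(num_tags, hidden_dim))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tag_sequence(word_vecs):
    """Aligned prediction: one distribution over tags per input token."""
    s = np.zeros(hidden_dim)
    tag_probs = []
    for x in word_vecs:
        s = np.tanh(U @ x + W @ s)
        tag_probs.append(softmax(V @ s))
    return tag_probs
```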



Machine Translation
Cascade two RNNs, one for encoding and one for decoding
◦ Input: word sequence in the source language
◦ Output: word sequence in the target language

[Figure: the encoder RNN reads the source sentence; the decoder RNN generates the target sentence]
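A bare-bones sketch of the encoder-decoder idea: one RNN encodes the source sequence into a vector, and a second RNN, initialized with that vector, generates target tokens until an end symbol. All sizes, the greedy decoding, and the EOS handling below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
src_vocab, tgt_vocab, hidden_dim = 500, 500, 64
EOS = 0   # hypothetical end-of-sequence id

# Encoder and decoder keep separate parameters; weights are tied across time inside each RNN.
U_enc = rng.normal(scale=0.1, size=(hidden_dim, src_vocab))
W_enc = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
U_dec = rng.normal(scale=0.1, size=(hidden_dim, tgt_vocab))
W_dec = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
V_dec = rng.normal(scale=0.1, size=(tgt_vocab, hidden_dim))

def one_hot(i, n):
    v = np.zeros(n); v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def translate(src_ids, max_len=20):
    # Encoder: compress the source sequence into the final hidden state.
    s = np.zeros(hidden_dim)
    for w in src_ids:
        s = np.tanh(U_enc @ one_hot(w, src_vocab) + W_enc @ s)

    # Decoder: generate target tokens one by one (greedy decoding for simplicity).
    out, prev = [], EOS                     # feed EOS/START as the first decoder input
    for _ in range(max_len):
        s = np.tanh(U_dec @ one_hot(prev, tgt_vocab) + W_dec @ s)
        prev = int(np.argmax(softmax(V_dec @ s)))
        if prev == EOS:
            break
        out.append(prev)
    return out

print(translate([3, 42, 7]))
```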

Chit-Chat Dialogue Modeling
Cascade two RNNs, one for encoding and one for decoding
◦ Input: word sequence of the question
◦ Output: word sequence of the response

The temporal order of the input and the output may be different.

Sci-Fi Short Film - SUNSPRING
https://www.youtube.com/watch?v=LY7x2Ihqj

Concluding Remarks
Language Modeling
◦ RNNLM

Recurrent Neural Networks
◦ Definition
◦ Backpropagation through Time (BPTT)
◦ Vanishing/Exploding Gradient

Applications
◦ Sequential Input: Sequence-Level Embedding
◦ Sequential Output: Tagging / Seq2Seq (Encoder-Decoder)
