
The Vanilla RNN Forward

[Figure: the RNN unrolled over three time steps. At each step t, the cell takes input x_t and the previous state h_{t−1}, and produces the new state h_t, the output y_t, and the loss C_t.]

h_t = tanh( W [x_t ; h_{t−1}] )

y_t = F(h_t)

C_t = Loss(y_t, GT_t)

The same weight matrix W is shared across all time steps.
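A minimal NumPy sketch of this forward pass (all names, sizes, and the weight scale are illustrative assumptions):

```python
import numpy as np

def rnn_step(x_t, h_prev, W):
    """One vanilla RNN step: h_t = tanh(W [x_t ; h_{t-1}])."""
    return np.tanh(W @ np.concatenate([x_t, h_prev]))

# Unroll over a toy sequence; the SAME W is reused at every time step.
D, H = 8, 16                          # input / hidden sizes (assumed)
W = 0.1 * np.random.randn(H, D + H)
h = np.zeros(H)                       # h_0
for x_t in np.random.randn(5, D):     # a 5-step random input sequence
    h = rnn_step(x_t, h, W)           # y_t = F(h_t) and C_t would follow here
```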

The Vanilla RNN Backward


∂C_t/∂h_1 = (∂C_t/∂y_t) (∂y_t/∂h_1)

          = (∂C_t/∂y_t) (∂y_t/∂h_t) (∂h_t/∂h_{t−1}) ⋯ (∂h_2/∂h_1)
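The point of this expansion: the gradient reaching h_1 is a product of many Jacobians ∂h_t/∂h_{t−1}, so it shrinks (or blows up) exponentially with sequence length. A minimal numeric sketch of the shrinking case (sizes and weight scale are assumptions):

```python
import numpy as np

np.random.seed(0)
H = 16
W = 0.5 / np.sqrt(H) * np.random.randn(H, H)   # recurrent weights (assumed scale)

h = np.tanh(np.random.randn(H))
grad = np.eye(H)                               # accumulates the product of Jacobians
for t in range(1, 51):
    h = np.tanh(W @ h)                         # state update (inputs omitted)
    grad = (np.diag(1.0 - h ** 2) @ W) @ grad  # chain one more dh_t/dh_{t-1}
    if t % 10 == 0:
        print(t, np.linalg.norm(grad))         # norm decays toward 0
```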

The Popular LSTM Cell

[Figure: the LSTM cell. The input gate i_t, forget gate f_t, and output gate o_t (with weights W_i, W_f, W_o) all read x_t and h_{t−1} and gate the cell state c_t. Dashed lines indicate a time-lag, i.e. values carried over from the previous step.]

c_t = f_t ⊗ c_{t−1} + i_t ⊗ tanh( W [x_t ; h_{t−1}] )

f_t = σ( W_f [x_t ; h_{t−1}] + b_f )    (similarly for i_t, o_t)

h_t = o_t ⊗ tanh(c_t)
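A minimal NumPy sketch of one LSTM step implementing these equations (function and parameter names are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, Wi, Wf, Wo, bi, bf, bo):
    """One LSTM step: the gates read [x_t ; h_{t-1}] and gate the cell state."""
    z = np.concatenate([x_t, h_prev])           # [x_t ; h_{t-1}]
    i_t = sigmoid(Wi @ z + bi)                  # input gate
    f_t = sigmoid(Wf @ z + bf)                  # forget gate
    o_t = sigmoid(Wo @ z + bo)                  # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W @ z)   # cell update (the CEC)
    h_t = o_t * np.tanh(c_t)                    # gated output
    return h_t, c_t
```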

LSTM – Forward/Backward

See: http://arunmallya.github.io/writeups/nn/lstm/index.html#/

Multi-layer RNNs

• We can of course design RNNs with multiple hidden layers

[Figure: an RNN with multiple stacked hidden layers, unrolled over inputs x_1 … x_6 with outputs y_1 … y_6.]

• Think exotic: Skip connections across layers, across time, …

Bi-directional RNNs

• RNNs can process the input sequence in both the forward and the reverse direction

[Figure: a bi-directional RNN unrolled over inputs x_1 … x_6 and outputs y_1 … y_6, with one hidden chain running forward and one running backward.]

• Popular in speech recognition

Recap


• RNNs allow for processing of variable-length inputs and outputs by maintaining state information across time steps

• Various Input-Output scenarios are possible (Single/Multiple)

• RNNs can be stacked, or bi-directional

• Vanilla RNNs are improved upon by LSTMs, which address the vanishing gradient problem through the Constant Error Carousel (CEC)

• Exploding gradients are handled by gradient clipping
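Clipping is typically applied to the global gradient norm (Pascanu et al., 2013). A minimal sketch, with a hypothetical threshold:

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global norm is <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```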


Extension I: Peephole LSTM

[Figure: the LSTM cell as above, with additional peephole connections from the cell state to the three gates. Dashed lines indicate a time-lag.]

c_t = f_t ⊗ c_{t−1} + i_t ⊗ tanh( W [x_t ; h_{t−1}] )

f_t = σ( W_f [x_t ; h_{t−1} ; c_{t−1}] + b_f )    (similarly for i_t; o_t uses c_t instead of c_{t−1})

h_t = o_t ⊗ tanh(c_t)
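The change relative to the standard forget gate is just a wider gate input. A sketch (shapes and names are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def peephole_forget_gate(x_t, h_prev, c_prev, Wf, bf):
    """Peephole variant: the gate also sees the previous cell state c_{t-1}."""
    z = np.concatenate([x_t, h_prev, c_prev])   # [x_t ; h_{t-1} ; c_{t-1}]
    return sigmoid(Wf @ z + bf)                 # W_f gains extra columns for c
```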

Peephole LSTM

• In the basic LSTM, the gates see only the output h_{t−1}, which is close to 0 whenever the output gate is closed, even though these gates are supposed to control the CEC cell. Peephole connections let the gates read the cell state directly.

• Helped the LSTM learn better timing for the problems tested: spike timing and counting spike time delays

Recurrent nets that time and count, Gers et al., 2000

Other minor variants

• Coupled Input and Forget Gate:

f_t = 1 − i_t

• Full Gate Recurrence: every gate also sees all gate activations from the previous step:

f_t = σ( W_f [x_t ; h_{t−1} ; c_{t−1} ; i_{t−1} ; f_{t−1} ; o_{t−1}] + b_f )

LSTM: A Search Space Odyssey

• Tested the following variants, using Peephole LSTM as standard:

1. No Input Gate (NIG)
2. No Forget Gate (NFG)
3. No Output Gate (NOG)
4. No Input Activation Function (NIAF)
5. No Output Activation Function (NOAF)
6. No Peepholes (NP)
7. Coupled Input and Forget Gate (CIFG)
8. Full Gate Recurrence (FGR)

• On the tasks of:
– TIMIT Speech Recognition: audio frame to 1 of 61 phonemes
– IAM Online Handwriting Recognition: sketch to characters
– JSB Chorales: next-step music frame prediction

LSTM: A Search Space Odyssey, Greff et al., 2015

LSTM: A Search Space Odyssey

• The standard LSTM performed reasonably well on multiple datasets and none of the modifications significantly improved the performance

• Coupling gates and removing peephole connections simplified the LSTM without hurting performance much

• The forget gate and output activation are crucial

• Found interaction between learning rate and network size to be minimal – indicates calibration can be done using a small network first

LSTM: A Search Space Odyssey, Greff et al., 2015

Gated Recurrent Unit (GRU)

• A very simplified version of the LSTM:
– Merges the forget and input gates into a single 'update' gate
– Merges the cell and hidden state

• Has fewer parameters than an LSTM and has been shown to outperform LSTM on some tasks

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho et al., 2014

GRU

[Figure: the GRU cell. The update gate z_t (weights W_z) and reset gate r_t (weights W_r) both read x_t and h_{t−1}; the candidate state h′_t feeds the final interpolation that produces h_t.]

r_t = σ( W_r [x_t ; h_{t−1}] + b_r )

z_t = σ( W_z [x_t ; h_{t−1}] + b_z )

h′_t = tanh( W [x_t ; r_t ⊗ h_{t−1}] )

h_t = (1 − z_t) ⊗ h_{t−1} + z_t ⊗ h′_t
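A minimal NumPy sketch of one GRU step following these equations (function and weight names are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, Wr, Wz, br, bz):
    """One GRU step: reset gate masks the old state, update gate interpolates."""
    z_in = np.concatenate([x_t, h_prev])                        # [x_t ; h_{t-1}]
    r_t = sigmoid(Wr @ z_in + br)                               # reset gate
    z_t = sigmoid(Wz @ z_in + bz)                               # update gate
    h_cand = np.tanh(W @ np.concatenate([x_t, r_t * h_prev]))   # candidate h'_t
    return (1.0 - z_t) * h_prev + z_t * h_cand                  # new state h_t
```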


An Empirical Exploration of Recurrent Network Architectures

• Given the rather ad-hoc design of the LSTM, the authors try to determine if the architecture of the LSTM is optimal

• They use an evolutionary search for better architectures

An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015

Evolutionary Architecture Search

• A list of top-100 architectures so far is maintained, initialized with the LSTM and the GRU

• The GRU is considered the baseline to beat
• New architectures are proposed, and retained based on their performance ratio with the GRU
• All architectures are evaluated on 3 problems:
– Arithmetic: compute the digits of the sum or difference of two numbers provided as input. Inputs contain distractor characters to increase difficulty, e.g. 3e36d9-h1h39f94eeh43keg3c = 3369 − 13994433 = −13991064
– XML Modeling: predict the next character in valid XML
– Penn Tree-Bank Language Modeling: predict distributions over words

An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015

Evolutionary Architecture Search

• At each step, either:
– Select 1 architecture at random and evaluate it on 20 randomly chosen hyperparameter settings, or
– Propose a new architecture by mutating an existing one: choose a probability p uniformly from [0, 1] and apply a transformation to each node with probability p
• If the node is a non-linearity, replace it with one of {tanh(x), sigmoid(x), ReLU(x), Linear(0, x), Linear(1, x), Linear(0.9, x), Linear(1.1, x)}
• If the node is an elementwise op, replace it with one of {multiplication, addition, subtraction}
• Insert a random activation function between the node and one of its parents
• Replace the node with one of its ancestors (i.e., remove the node)
• Randomly select a node A, and replace the current node with the sum, product, or difference of a random ancestor of the current node and a random ancestor of A
– Add the architecture to the list based on its minimum relative accuracy w.r.t. the GRU on the 3 tasks

An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015

Evolutionary Architecture Search

• 3 novel architectures are presented in the paper
• They are very similar to the GRU, but slightly outperform it

• LSTM initialized with a large positive forget gate bias outperformed both the basic LSTM and the GRU!

An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015

LSTM initialized with large positive forget gate bias?

• Recall:

c_t = f_t ⊗ c_{t−1} + i_t ⊗ tanh( W [x_t ; h_{t−1}] )

f_t = σ( W_f [x_t ; h_{t−1}] + b_f )

δc_{t−1} = δc_t ⊗ f_t

• Gradients will vanish if f_t is close to 0. A large positive bias b_f ensures that f_t has values close to 1, especially when training begins

• Helps learn long-range dependencies

• Originally stated in Learning to forget: Continual prediction with LSTM, Gers et al., 2000, but forgotten over time

An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015
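In practice this is a one-line change at initialization time. A sketch, with a hypothetical bias value:

```python
import numpy as np

hidden = 128                    # hidden size (assumed)
b_f = np.full(hidden, 1.0)      # forget-gate bias: large positive instead of 0
# At the start of training, f_t = sigmoid(W_f [x_t ; h_{t-1}] + b_f) is close to
# sigmoid(1.0) ~ 0.73, so the backward signal delta_c_{t-1} = delta_c_t * f_t
# passes through with little decay.
```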

Summary


• LSTMs can be modified with Peephole Connections, Full Gate Recurrence, etc. based on the specific task at hand

• Architectures like the GRU have fewer parameters than the LSTM and might perform better

• An LSTM with large positive forget gate bias works best!

Other Useful Resources / References


• http://cs231n.stanford.edu/slides/winter1516_lecture10.pdf
• http://www.cs.toronto.edu/~rgrosse/csc321/lec10.pdf

• R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural networks, ICML 2013

• S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 1997, 9(8), pp. 1735–1780

• F.A. Gers and J. Schmidhuber, Recurrent nets that time and count, IJCNN 2000

• K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, and J. Schmidhuber, LSTM: A search space odyssey, IEEE Transactions on Neural Networks and Learning Systems, 2016

• K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, EMNLP 2014

• R. Jozefowicz, W. Zaremba, and I. Sutskever, An empirical exploration of recurrent network architectures, ICML 2015
