
Recurrent Neural Networks (RNNs)

Keith L. Downing

The Norwegian University of Science and Technology (NTNU), Trondheim, Norway
[email protected]

November 7, 2017


Recurrent Networks for Sequence Learning

[Figure: a strictly feed-forward network next to a recurrent network with feedback connections.]

Feedback connections create directed cycles, which produce dynamic memory: current neural activation states reflect past activation states.


History-Dependent Activation Patterns

[Figure: hidden-layer activation patterns for "Grandma made me breakfast" and "Trump made me nauseous".]

The same two inputs ("made", "me") produce different internal patterns due to their different prior contexts: Grandma vs. Trump.


Sequence Completion = Prediction

[Figure: the internal states produced by "Trump made me" and "Grandma made me" map to different predicted continuations: "nauseous" and "breakfast".]

By modifying this mapping from internal activation states to outputs (via weight modification, i.e. learning), the net can predict the next word in a sentence.


Time-Delay Nets

Include history in the input, so no recurrence is needed; similar to NetTalk's use of context.

[Figure: "Grandma", "made", "me" presented simultaneously as Input(T-2), Input(T-1), Input(T); the net outputs "breakfast".]

But this requires n-grams as input, and there are many of them! It also creates over-sensitivity to word position: "Trump made me nauseous" and "Trump made me very nauseous" would be handled very differently, because they yield different n-grams.
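To make the windowing concrete, here is a minimal sketch (hypothetical helper, one-hot word vectors assumed) of how a time-delay net's fixed input window could be built; note how inserting a single word shifts every other word into a different slot.

```python
import numpy as np

def window_inputs(token_ids, vocab_size, context=3):
    """Concatenate one-hot vectors for the last `context` tokens.

    Each input covers positions (T, T-1, ..., T-context+1), padded with
    zero vectors at the start of the sequence.
    """
    one_hot = np.eye(vocab_size)
    inputs = []
    for t in range(len(token_ids)):
        slots = []
        for d in range(context):                     # T, T-1, T-2, ...
            idx = t - d
            slots.append(one_hot[token_ids[idx]] if idx >= 0
                         else np.zeros(vocab_size))
        inputs.append(np.concatenate(slots))
    return np.array(inputs)                          # (T, context * vocab_size)

# "Trump made me nauseous" vs. "Trump made me very nauseous": the extra
# word pushes "made"/"me" into different slots, so the net sees two quite
# different input vectors for the word preceding "nauseous".
```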


Elman -vs- Jordan Nets

[Figure: Jordan net, whose context layer copies the output layer, vs. Elman net, whose context layer copies the hidden layer; in both, the context layer feeds back into the hidden layer alongside the input layer.]

The Elman net should be more versatile, since its internal context is free to assume many activation patterns (i.e., representations), whereas the context of the Jordan net is restricted to that of the target patterns.
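A one-step sketch of the two feedback schemes (NumPy, toy sizes, illustrative weight names; not the lecture's own code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 5, 8, 5                     # illustrative sizes
U  = rng.normal(scale=0.1, size=(n_hid, n_in))   # input   -> hidden
W  = rng.normal(scale=0.1, size=(n_hid, n_hid))  # context -> hidden (Elman)
Wj = rng.normal(scale=0.1, size=(n_hid, n_out))  # context -> hidden (Jordan)
V  = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden  -> output

def elman_step(x, h_prev):
    # Elman: the context layer holds the previous HIDDEN activations.
    h = np.tanh(U @ x + W @ h_prev)
    return h, V @ h

def jordan_step(x, o_prev):
    # Jordan: the context layer holds the previous OUTPUT activations,
    # so the context is tied to the (target-shaped) output space.
    h = np.tanh(U @ x + Wj @ o_prev)
    return h, V @ h
```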


Backpropagation Through Time (BPTT)

[Figure: Unfolding in Time. The recurrent net (Input, Hidden, Output with weights U, V, W) is unrolled over steps T-k, ..., T-1, T. Backpropagating Error: the error E from Target(T), Target(T-1), ..., Target(T-k) is propagated back through the unrolled copies.]

For each training case, take the average of the updates recommended for U, V, and W across the unrolled time steps.
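A minimal BPTT sketch of that averaging (toy tanh Elman RNN with a linear output and squared-error loss; all names and choices are illustrative, not the lecture's own code):

```python
import numpy as np

def bptt_averaged_updates(xs, ys, U, V, W, h0):
    """One unrolled forward/backward pass; the shared-weight gradients
    are averaged over the unrolled time steps."""
    T = len(xs)
    hs, os = [h0], []
    for t in range(T):                         # forward, unrolled in time
        hs.append(np.tanh(U @ xs[t] + W @ hs[-1]))
        os.append(V @ hs[-1])

    dU = np.zeros_like(U); dV = np.zeros_like(V); dW = np.zeros_like(W)
    dh_next = np.zeros_like(h0)
    for t in reversed(range(T)):               # backpropagation through time
        do = os[t] - ys[t]                     # dL_t / dO_t (squared error)
        dV += np.outer(do, hs[t + 1])
        dh = V.T @ do + dh_next                # error from output + future steps
        ds = (1.0 - hs[t + 1] ** 2) * dh       # through the tanh
        dU += np.outer(ds, xs[t])
        dW += np.outer(ds, hs[t])
        dh_next = W.T @ ds
    return dU / T, dV / T, dW / T              # average over time steps
```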


Applications of Recurrent NNs

1 Language Modeling - Given a sequence of words, predict the next one.

2 Text Generation - Given a few words, produce a meaningful sequence.

3 Machine Translation - Given a sequence of words in one language, produce a semantically equivalent sequence in another language.

4 Speech Recognition - Given a sequence of acoustic signals, produce a sequence of phonemes.

5 Image Captioning - Given an image, produce a short text description. This typically requires RNNs + Convolution Nets.

Most of the top RNN systems for natural language (NL) applications now use Long Short-Term Memory (LSTM; Hochreiter + Schmidhuber, 1997).

RNNs used for any NL task yield useful internal vector representations (a.k.a. embeddings) for NL expressions, wherein semantically similar expressions have similar vectors.


Recurrent Architectures: Elman Net

[Figure: an Elman net unrolled over four steps, with layers X (Input), H (Hidden), O (Output), L (Loss), and Y (Target). Inputs: "Well" "we" "busted" "out"; targets: "we" "busted" "out" "of"; outputs: "you" "know" "down" "of". Each hidden state H_t feeds into H_{t+1}.]


Recurrent Architectures: Jordan Net

[Figure: a Jordan net unrolled over four steps, with the same layers, inputs, targets, and outputs as the Elman figure, except that each output O_t (rather than the hidden state) feeds into H_{t+1}.]


Recurrent Architectures: Teacher Forcing

[Figure: a teacher-forcing net unrolled over four steps, with the same layers, inputs, targets, and outputs as above, except that during training each target Y_t feeds into H_{t+1}.]

During testing, the output (not the target) feeds back to H.

Since Y is independent of H, the gradients do not propagate backwards in time: the full power of BPTT is not needed for learning.
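A minimal sketch of the training-time vs. test-time loops (hypothetical Jordan-style cell; R names the feedback weights, and the linear output is an arbitrary choice):

```python
import numpy as np

def train_forward_teacher_forced(xs, ys, U, V, R, h0):
    """Training: the previous TARGET y[t-1] (not the output) feeds back."""
    h, fed_back = h0, np.zeros_like(ys[0])
    outputs = []
    for x, y in zip(xs, ys):
        h = np.tanh(U @ x + R @ fed_back)   # R weights the fed-back signal
        outputs.append(V @ h)
        fed_back = y                        # teacher forcing: feed the target
    return outputs

def generate(xs, U, V, R, h0, y_dim):
    """Testing: the model's own output feeds back instead of the target."""
    h, fed_back = h0, np.zeros(y_dim)
    outputs = []
    for x in xs:
        h = np.tanh(U @ x + R @ fed_back)
        o = V @ h
        outputs.append(o)
        fed_back = o                        # test time: feed the output back
    return outputs
```

Because the fed-back signal during training is a fixed target rather than a function of earlier hidden states, the loss at step t depends on H_t only through that single step, which is why no gradient needs to flow backwards in time.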


Recurrent Architectures: Delayed Supervision

[Figure: delayed supervision. Inputs "Well" "we" "busted" ... "surrender" are processed through hidden states H_1 ... H_n, with a single output, loss, and target at the end (e.g. classifying the lyric's author: "Springsteen" vs. "Lady Gaga").]


Recurrent Architectures: Captioning

[Figure: captioning. A single image-derived input X_1 initializes H_1; the subsequent hidden states H_2, H_3, H_4 produce the caption words, compared against targets such as "Skinny" "nerd" "on" "Halloween" or "Sketchy" "dude" "hacking" "Mastercard".]


Calculating RNN Gradients

Standard calculations once the function graph is unrolled

[Figure: the unrolled function graph X_t → H_t → O_t → L_t (with targets Y_t) for inputs "Well", "we", ..., "surrender". U, V, W are the shared input-to-hidden, hidden-to-output, and hidden-to-hidden weights.]

$$\frac{\partial L}{\partial H_1} = \frac{\partial O_1}{\partial H_1} \cdot \frac{\partial L_1}{\partial O_1} + \frac{\partial H_2}{\partial H_1} \cdot \frac{\partial L}{\partial H_2}$$


Calculating RNN Gradients

The effects of $H_i$ upon $L$ involve all paths from $H_i$ to $L_i, L_{i+1}, \ldots, L_n$.

[Figure: the paths from $H_1$ to the loss: directly through $O_1$ (via V) to $L_1$, and through $H_2$ (via W) onward to $L_2, \ldots, L_n$.]

$$\frac{\partial L}{\partial H_1} = \frac{\partial O_1}{\partial H_1} \cdot \frac{\partial L_1}{\partial O_1} + \frac{\partial H_2}{\partial H_1} \cdot \frac{\partial L}{\partial H_2}$$


Jacobian involving the Hidden Layer

Derivatives of Loss w.r.t. Hidden Layer

Note: the hidden layer's activation levels change with each recurrent loop, so there is no single Jacobian $\frac{\partial L}{\partial H}$. We must calculate $\frac{\partial L}{\partial H_i}$ for all $i$:

$$\frac{\partial L}{\partial H_i} = \frac{\partial O_i}{\partial H_i} \cdot \frac{\partial L_i}{\partial O_i} + \frac{\partial H_{i+1}}{\partial H_i} \cdot \frac{\partial L}{\partial H_{i+1}}$$

This recurrent relationship can create a long product of hidden-to-hidden Jacobians:

$$\frac{\partial L}{\partial H_i} = \cdots + \frac{\partial H_{i+1}}{\partial H_i} \cdot \frac{\partial H_{i+2}}{\partial H_{i+1}} \cdots \frac{\partial H_n}{\partial H_{n-1}} \cdot \frac{\partial L}{\partial H_n} + \cdots$$

Each term includes the weights (W):

$$\frac{\partial H_{k+1}}{\partial H_k} = \frac{\partial s_{k+1}}{\partial H_k} \times \frac{\partial H(s_{k+1})}{\partial s_{k+1}} = W^T \times \frac{\partial H(s_{k+1})}{\partial s_{k+1}}$$

where $s_{k+1}$ = sum of weighted inputs to H at time k+1.


The Jacobian $\partial H_{k+1} / \partial H_k$

$$\frac{\partial H_{k+1}}{\partial H_k} =
\begin{bmatrix}
\frac{\partial h_{1,k+1}}{\partial h_{1,k}} & \frac{\partial h_{1,k+1}}{\partial h_{2,k}} & \cdots & \frac{\partial h_{1,k+1}}{\partial h_{m,k}} \\
\frac{\partial h_{2,k+1}}{\partial h_{1,k}} & \frac{\partial h_{2,k+1}}{\partial h_{2,k}} & \cdots & \frac{\partial h_{2,k+1}}{\partial h_{m,k}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial h_{m,k+1}}{\partial h_{1,k}} & \frac{\partial h_{m,k+1}}{\partial h_{2,k}} & \cdots & \frac{\partial h_{m,k+1}}{\partial h_{m,k}}
\end{bmatrix}$$


Finishing the Calculation of $\partial H_{k+1}/\partial H_k$

Assume H uses tanh as its activation function: $\frac{\partial \tanh(x)}{\partial x} = 1 - \tanh(x)^2$.

$s_{i,k+1}$ = sum of weighted inputs to hidden node $h_i$ on the k+1 recursion cycle.

$h_{i,k+1}$ = output of hidden node $i$ on cycle k+1.

$$\frac{\partial H(s_{k+1})}{\partial s_{k+1}} =
\begin{bmatrix}
\frac{\partial h_{1,k+1}}{\partial s_{1,k+1}} \\
\frac{\partial h_{2,k+1}}{\partial s_{2,k+1}} \\
\vdots
\end{bmatrix}
=
\begin{bmatrix}
1 - h_{1,k+1}^2 \\
1 - h_{2,k+1}^2 \\
\vdots
\end{bmatrix}$$

For each node $h_i$ in H, there is (a) one row in $\frac{\partial H(s_{k+1})}{\partial s_{k+1}}$ and (b) one row in $W^T$ (holding all incoming weights to $h_i$).

So multiply each weight in the $i$th row of $W^T$ by the $i$th member of the column vector to produce $\frac{\partial H_{k+1}}{\partial H_k}$.


Finishing the Calculation of $\partial H_{k+1}/\partial H_k$

$$\frac{\partial H_{k+1}}{\partial H_k} =
\begin{bmatrix}
(1-h_{1,k+1}^2)\,w_{11} & (1-h_{1,k+1}^2)\,w_{21} & \cdots & (1-h_{1,k+1}^2)\,w_{m1} \\
(1-h_{2,k+1}^2)\,w_{12} & (1-h_{2,k+1}^2)\,w_{22} & \cdots & (1-h_{2,k+1}^2)\,w_{m2} \\
\vdots & \vdots & \ddots & \vdots
\end{bmatrix}$$

Since row-by-row multiplication is not a standard array operation, the column vector can be converted into a diagonal matrix (with the only non-zero values along the main diagonal), denoted $\mathrm{diag}(1-H^2)$. Then we can use standard matrix multiplication:

$$\frac{\partial H_{k+1}}{\partial H_k} = \mathrm{diag}(1-H^2) \cdot W^T =
\begin{bmatrix}
1-h_{1,k+1}^2 & 0 & \cdots & 0 \\
0 & 1-h_{2,k+1}^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots
\end{bmatrix}
\begin{bmatrix}
w_{11} & w_{21} & w_{31} & \cdots \\
w_{12} & w_{22} & w_{32} & \cdots \\
\vdots & \vdots & \vdots &
\end{bmatrix}$$
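A small numerical check of that equivalence (toy sizes and random values, chosen only for illustration): scaling the $i$th row of $W^T$ by $1-h_{i,k+1}^2$ gives the same matrix as the product $\mathrm{diag}(1-H^2)\cdot W^T$.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4                                      # hidden-layer size (toy)
W = rng.normal(size=(m, m))                # hidden-to-hidden weights
h = np.tanh(rng.normal(size=m))            # H_{k+1} activations

row_scaled = (1.0 - h**2)[:, None] * W.T   # scale row i of W^T by 1 - h_i^2
via_diag   = np.diag(1.0 - h**2) @ W.T     # diag(1 - H^2) . W^T

assert np.allclose(row_scaled, via_diag)   # same Jacobian dH_{k+1}/dH_k
```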


Jacobians of Loss w.r.t. Weights

Notice the sums across all recurrent cycles (t).

$$\frac{\partial L}{\partial V} = \sum_{t=1}^{n} \frac{\partial O_t}{\partial V} \cdot \frac{\partial L_t}{\partial O_t}$$

where $\frac{\partial O_t}{\partial V}$ is a 3-d array containing the values of $H_t$; the details of $\frac{\partial L_t}{\partial O_t}$ depend upon the actual loss function and activation function of O.

$$\frac{\partial L}{\partial U} = \sum_{t=1}^{n} \frac{\partial H_t}{\partial U} \cdot \frac{\partial L}{\partial H_t}$$

where $\frac{\partial H_t}{\partial U}$ is a 3-d array containing the values of $X_t$.

$$\frac{\partial L}{\partial W} = \sum_{t=1}^{n-1} \frac{\partial H_{t+1}}{\partial W} \cdot \frac{\partial L}{\partial H_{t+1}}$$

where $\frac{\partial H_{t+1}}{\partial W}$ is a 3-d array containing the values of $H_t$.

These gradient products include the activations of X and H. When these approach 0, the gradients vanish. When the sigmoid or tanh is near saturation (0 or 1 for sigmoid; -1 or 1 for tanh), the diagonal matrix in $\frac{\partial H_{k+1}}{\partial H_k}$ approaches zero, and the gradients vanish again.
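A toy illustration of that vanishing: chain the Jacobians $\frac{\partial H_{k+1}}{\partial H_k} = \mathrm{diag}(1-H^2)\cdot W^T$ over many steps and watch the product's norm shrink. The weight scale below is an arbitrary choice that keeps the recurrent dynamics contractive, and inputs are omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 50                                              # hidden-layer size (toy)
W = rng.normal(scale=0.5 / np.sqrt(m), size=(m, m)) # smallish recurrent weights
h = np.tanh(rng.normal(size=m))

J = np.eye(m)                                       # running product dH_k / dH_1
for k in range(1, 31):
    h = np.tanh(W @ h)                              # H_{k+1} (external inputs omitted)
    J = np.diag(1.0 - h**2) @ W.T @ J               # multiply in dH_{k+1}/dH_k
    print(k, np.linalg.norm(J))                     # norm typically decays toward 0
```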


Handling Uncooperative Gradients

Recurrence creates long gradient products, which often vanish (and occasionally explode).

How can we combat this?

Innovative Network Architectures

Reservoir Computing: Echo-State Machines and Liquid-State Machines

Gated RNNs: Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU)

Explicit Memory - Neural Turing Machine

Supplementary Techniques

Multiple Timescales - using different degrees of leak/forgetting and skip connections.

Gradient Clipping - setting an upper bound on gradient magnitude.

Regularizing - adding penalty for gradient decay to the loss function.


Reservoir Computing

[Figure: a reservoir of recurrently connected neurons with random, fixed weights; only the reservoir-to-output weights are modifiable.]

Training cases: (a, a) (a, b) (b, a) (a, c) (c, e)

Testing: given a, produce (a, b, a, c, e).

Echo State Machines (ESM) and Liquid State Machines (LSM)

MANY reservoir neurons with random and recurrent connections allow sequences of inputs to create complex internal states (a.k.a. contexts).

Only tuneable synapses: reservoir to outputs → no long gradients.

Drawback: initializing reservoir weights is tricky.
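A minimal echo-state-style sketch (toy sizes, an arbitrary spectral-radius target of 0.9, and a placeholder next-step-prediction task): the reservoir weights stay random and fixed, and only the readout is fit, here by least squares.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_res = 3, 200

W_in  = rng.uniform(-0.5, 0.5, size=(n_res, n_in))        # random, fixed
W_res = rng.normal(size=(n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # keep dynamics stable

def run_reservoir(inputs):
    """Drive the fixed reservoir with an input sequence; collect its states."""
    h, states = np.zeros(n_res), []
    for x in inputs:
        h = np.tanh(W_in @ x + W_res @ h)  # reservoir weights are never trained
        states.append(h.copy())
    return np.array(states)

# Only the reservoir-to-output weights are fit, so no gradients ever flow
# through the recurrent loop.
X = rng.normal(size=(100, n_in))           # placeholder input sequence
Y = np.roll(X, -1, axis=0)                 # toy target: predict the next input
S = run_reservoir(X)
W_out = np.linalg.lstsq(S, Y, rcond=None)[0].T   # readout weights (least squares)
```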


Even RNNs Forget

Repeated multiplication by the hidden-to-hidden weights, along with the integration of new inputs, causes memories (i.e. hidden-node states) to fade over time.

For some tasks, e.g. reinforcement learning, this is desirable: older information should have a diminishing effect upon current decision-making.

In other tasks, such as natural-language processing, some key pieces of old information need to retain influence.

"I found my bicycle in the same town and on the same street, but not on the same front porch from which I had reported to the police that it had been stolen."

What was stolen? A clear memory of bicycle must be maintained throughout processing of the entire sentence (or paragraph, or story).


Long Short-Term Memory (LSTM)

Invented by Hochreiter and Schmidhuber in 1997.

Breakthrough in 2009: won the Int'l Conf. on Document Analysis and Recognition (ICDAR) competition.

2013: Record for speech recognition on DARPA’s TIMIT dataset.

Currently used in commercial speech-recognition systems by major players: Google, Apple, Microsoft...

Critical basis of RNN success, which is otherwise very limited.

Hundreds, if not thousands, of LSTM variations exist. One of the most popular is the GRU (Gated Recurrent Unit), which has fewer gates but still good performance. Phased LSTMs handle asynchronous input.

Easily combined with other NN components, e.g., convolution.

Increased memory retention → hidden layers continue to produce (some) high values even when external inputs are low → gradients do not vanish.

See "Understanding LSTM Networks", Colah's Blog (August 2015).


The LSTM Module

[Figure: the LSTM module unrolled across time steps. Legend: σ = sigmoid gate layer; C_t = cell state; H_t = cell output; X_t = cell input; yellow boxes = layers of nodes; orange ovals = pointwise operations. The module contains a forget gate, an input gate, and an output gate, plus tanh layers; C_{t-1} and H_{t-1} enter from the previous step, and C_t and H_t are passed on to the next.]


Long Short-Term Memory (LSTM)

Yellow boxes are full layers of nodes that are preceded by a matrix of learnable weights and that produce a vector of values.

Orange ovals have unweighted vectors as inputs. They perform their operation (+ or ×) pointwise: either individually on each item of a vector, or pairwise with corresponding elements of two vectors (e.g. the state vector and the forget vector). A.k.a. the Hadamard product.

The cell's internal state, C (a vector), runs along the top (red line) of the diagram.

C is selectively modified, as controlled by the forget and input gates.

The yellow tanh produces suggested contributions to $C_{t-1}$.

The modified state ($C_t$) is preserved (along the red line) and passed to the next time step.

$C_t$ also passes through the (orange) tanh, whose result is also gated to produce the cell's output value, $H_t$.

$H_t$ along with the next input $X_{t+1}$ comprise the complete set of weighted inputs to the 3 gates and (yellow) tanh at time t+1.
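The description above maps onto a single-step sketch like the following (NumPy; illustrative weight names, with each weight matrix assumed to act on the concatenation of the previous output and the current input):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM step; all parameter names are illustrative."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)          # forget gate: what to drop from C
    i = sigmoid(Wi @ z + bi)          # input gate: how much new content to admit
    c_tilde = np.tanh(Wc @ z + bc)    # suggested contribution to the cell state
    c = f * c_prev + i * c_tilde      # pointwise (Hadamard) update of C
    o = sigmoid(Wo @ z + bo)          # output gate
    h = o * np.tanh(c)                # cell output H_t
    return h, c
```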


Supplementary Techniques for Gradient Management

These are minor adjustments, applicable to many NN architectures.

Gradient Clipping

Oversized gradients → large jumps in search space → risky in rugged error landscapes.

Reduce the problem by restricting gradients to a pre-defined range (see the sketch below).
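A minimal norm-clipping sketch (the threshold of 5.0 is an arbitrary choice; clipping each value to a fixed interval is an equally common variant):

```python
import numpy as np

def clip_by_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their joint L2 norm
    does not exceed max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```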

Gradient-Sensitive Regularization

Develop a metric for amount of gradient decay across recurrent steps.

Include this value as a penalty in the loss function.

Facilitate Multiple Time Scales

Skip connections over time - implement delays by linking the output of a layer at time t directly to its input at time t+d.

Variable leak rates - neurons have diverse leak parameters. More leak → more forgetting → shorter timescale of information integration. A minimal leaky-update sketch follows.
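A sketch of a leaky-integrator hidden update (hypothetical function; the per-neuron leak vector is an assumption used to illustrate mixed timescales):

```python
import numpy as np

def leaky_step(x, h_prev, U, W, leak):
    """Leaky hidden update; `leak` is a per-neuron vector in [0, 1], so the
    layer mixes fast-forgetting and slow, long-timescale units."""
    h_new = np.tanh(U @ x + W @ h_prev)
    # leak near 1: mostly replace the old state (fast forgetting);
    # leak near 0: mostly retain it (long timescale).
    return (1.0 - leak) * h_prev + leak * h_new
```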
