
Page 1:

Natural Language Processing

Pushpak Bhattacharyya
CSE Dept, IIT Patna and Bombay

Introduction to DL-NLP
12 June, 2017

Page 2:

Motivation for DL-NLP

Page 3:

Text classification using Deep NN

- Text classification is at the heart of many NLP tasks
- Soft decisions are needed: a piece of text may belong to multiple classes
- Example: "Recent curbs on H1-B visas in the US, promised in Trump's election speeches, are causing IT industries to re-orient their business plans"
- This text belongs to BOTH Economics and Politics, more to the former
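
To make the soft, multi-class decision concrete, here is a tiny illustrative Python sketch: independent per-class sigmoid scores over a bag of features. The features, weights and the 0.5 cut-off are made-up values for illustration, not the lecture's model.

    import math

    def sigmoid(z):
        return 1 / (1 + math.exp(-z))

    # Bag-of-words features from the example sentence (counts assumed to be 1)
    features = {"visa": 1, "H1-B": 1, "industries": 1, "election": 1}
    weights = {"Economics": {"visa": 0.8, "H1-B": 0.9, "industries": 1.2},
               "Politics":  {"election": 1.0, "H1-B": 0.4}}

    for label, w in weights.items():
        score = sigmoid(sum(w.get(f, 0) * v for f, v in features.items()) - 1.0)
        print(label, round(score, 2))  # both scores exceed 0.5; Economics is higher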

Page 4:

An example SMS complaint

"I have purchased a 80 litre Videocon fridge about 4 months ago when the freeze go to sleep that time compressor give a sound (khat khat khat khat ....) what is possible fault over it is normal I can't understand please help me give me a suitable answer."

(The SMS is quoted verbatim; it is deliberately noisy user-generated text.)

Page 5:

Significant words after stop-word removal

The same SMS, with the significant content words (e.g., "purchased", "Videocon", "fridge", "compressor", "sound", "fault") highlighted in red on the original slide; the colour is not preserved in this text version.

Page 6:

SMS classification

[Figure: a feedforward network whose input neurons encode the SMS feature vector and whose hidden neurons feed output classes such as "Action complaint" and "Emergency"; the drawn connections include -1 bias weights and thresholds of 1.5.]

Page 7:

Basic Neural Models

- Precursor to Deep Learning
- Models:
  - Perceptron and PTA
  - Feedforward Network and Backpropagation
  - Boltzmann Machine
  - Self-Organization and Kohonen's Map
  - Neocognitron

Page 8:

Perceptron

Page 9:

The Perceptron Model

[Figure: inputs x1, ..., x(n-1), xn enter through weights w1, ..., w(n-1), wn into a single unit with threshold θ; output = y.]

Page 10:

Step function / threshold function:

  y = 1 if Σ w_i x_i ≥ θ
  y = 0 otherwise

[Figure: y plotted against Σ w_i x_i, stepping from 0 to 1 at θ.]
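
A minimal Python sketch of this decision rule (the AND-gate weights below are an illustrative choice, not from the slide):

    def perceptron_output(weights, inputs, theta):
        # y = 1 if sum_i w_i * x_i >= theta, else 0
        s = sum(w * x for w, x in zip(weights, inputs))
        return 1 if s >= theta else 0

    # Example: an AND gate with w = (1, 1) and theta = 1.5
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, perceptron_output((1, 1), x, 1.5))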

Page 11:

Perceptron Training Algorithm (PTA)

Preprocessing:

1. The computation law is modified to
   y = 1 if Σ w_i x_i > θ
   y = 0 if Σ w_i x_i < θ

[Figure: the unit with inputs x1, x2, x3, ..., xn and weights w1, w2, w3, ..., wn; its threshold test changes from the non-strict form (θ, ≤) to the strict form (θ, <).]

Page 12:

PTA preprocessing (contd.)

2. Absorb θ as a weight: add a new input x0 = -1 with weight w0 = θ, so the unit now tests Σ w_i x_i > 0.

3. Negate all the zero-class examples.

[Figure: the unit with inputs x1, x2, x3, ..., xn and weights w1, w2, w3, ..., wn gains the extra input x0 = -1 with weight w0 = θ.]

Page 13:

Example to demonstrate preprocessing

OR perceptron:
  1-class: <1,1>, <1,0>, <0,1>
  0-class: <0,0>

Augmented x vectors:
  1-class: <-1,1,1>, <-1,1,0>, <-1,0,1>
  0-class: <-1,0,0>

Negate the 0-class: <1,0,0>

Page 14:

Example to demonstrate preprocessing (contd.)

Now the vectors are:

        x0  x1  x2
    X1  -1   0   1
    X2  -1   1   0
    X3  -1   1   1
    X4   1   0   0

Page 15:

Perceptron Training Algorithm

1. Start with a random value of w, e.g., <0,0,0,...>.
2. Test for w·x_i > 0. If the test succeeds for i = 1, 2, ..., n, then return w.
3. Otherwise modify w: w_next = w_prev + x_fail, and go to step 2.
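
A minimal Python sketch of PTA, run on the preprocessed OR vectors of Page 14 (picking the first failing vector each round is one arbitrary but valid choice):

    def pta(vectors):
        w = [0.0] * len(vectors[0])              # step 1: start from <0,0,0>
        while True:
            failed = next((x for x in vectors
                           if sum(wi * xi for wi, xi in zip(w, x)) <= 0), None)
            if failed is None:
                return w                         # step 2 succeeded for every vector
            w = [wi + xi for wi, xi in zip(w, failed)]  # step 3: w_next = w_prev + x_fail

    vectors = [(-1, 0, 1), (-1, 1, 0), (-1, 1, 1), (1, 0, 0)]
    print(pta(vectors))  # converges, since OR is linearly separable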

Page 16:

Convergence of PTA

Statement: whatever be the initial choice of weights and whatever be the vector chosen for testing, PTA converges if the vectors are from a linearly separable function.

Page 17:

Feedforward Network and Backpropagation

Page 18:

Example: XOR

[Figure: a two-layer threshold network computing XOR. Two hidden units compute x1·x̄2 and x̄1·x2, with cross-connection weights of -1 and thresholds of 1.5; the output unit has w1 = w2 = 1 and θ = 0.5.]
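
A runnable sketch of such a network is below. The figure fixes the cross weights (-1), the hidden thresholds (1.5) and the output unit (w1 = w2 = 1, θ = 0.5); the direct weights of 2 used here are an assumed assignment consistent with those values.

    def step(s, theta):
        return 1 if s >= theta else 0

    def xor_net(x1, x2):
        h1 = step(2 * x1 - x2, 1.5)  # fires only for x1 AND (NOT x2)
        h2 = step(2 * x2 - x1, 1.5)  # fires only for (NOT x1) AND x2
        return step(h1 + h2, 0.5)    # output unit: w1 = w2 = 1, theta = 0.5

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, xor_net(*x))        # prints 0, 1, 1, 0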

Page 19:

Gradient Descent Technique

Let E be the error at the output layer:

  E = (1/2) Σ_{j=1..p} Σ_{i=1..n} (t_i - o_i)²

- t_i = target output; o_i = observed output
- i is the index going over the n neurons in the outermost layer
- j is the index going over the p patterns (1 to p)
- Example, XOR: p = 4 and n = 1

Page 20:

Weights in a Feedforward NN

- w_mn is the weight of the connection from the n-th neuron to the m-th neuron
- E is a complex surface in the space defined by the weights w_ij
- -δE/δw_mn gives the direction in which a movement of the operating point in the w_mn co-ordinate space will result in the maximum decrease in error:

  Δw_mn ∝ -δE/δw_mn

[Figure: neurons m and n connected by the weight w_mn.]

Page 21:

Backpropagation algorithm

- Fully connected feedforward network
- Pure FF network (no jumping of connections over layers)

[Figure: input layer (n i/p neurons), hidden layers, output layer (m o/p neurons); w_ji is the weight on the connection from neuron i to neuron j.]

Page 22:

Gradient Descent Equations

  Δw_ji = -η δE/δw_ji      (η = learning rate, 0 ≤ η ≤ 1)

  δE/δw_ji = (δE/δnet_j) × (δnet_j/δw_ji)      (net_j = input at the j-th layer)

Define δ_j = -δE/δnet_j. Since net_j = Σ_i w_ji o_i, we get δnet_j/δw_ji = o_i, and hence

  Δw_ji = η δ_j o_i

Page 23:

Backpropagation for the outermost layer

  δ_j = -δE/δnet_j = -(δE/δo_j) × (δo_j/δnet_j)      (net_j = input at the j-th layer)

  E = (1/2) Σ_{p=1..m} (t_p - o_p)²

  δE/δo_j = -(t_j - o_j)

Hence, with a sigmoid unit, δo_j/δnet_j = o_j(1 - o_j), and

  δ_j = (t_j - o_j) o_j (1 - o_j)

  Δw_ji = η (t_j - o_j) o_j (1 - o_j) o_i

Page 24:

Backpropagation for hidden layers

δ_k is propagated backwards to find the value of δ_j.

[Figure: input layer (n i/p neurons), hidden layers, output layer (m o/p neurons); hidden neuron j feeds neurons k in the next layer.]

Page 25:

Backpropagation for hidden layers

  δ_j = -δE/δnet_j = -(δE/δo_j) × (δo_j/δnet_j) = -(δE/δo_j) × o_j(1 - o_j)

  δE/δo_j = Σ_{k ∈ next layer} (δE/δnet_k) × (δnet_k/δo_j) = -Σ_{k ∈ next layer} δ_k w_kj

Hence

  δ_j = o_j (1 - o_j) Σ_{k ∈ next layer} δ_k w_kj

and, as before, Δw_ji = η δ_j o_i.

The repeated multiplication by o(1 - o), which is at most 1/4, is the cause of the Vanishing Gradient problem!

Page 26:

General Backpropagation Rule

- General weight updating rule:  Δw_ji = η δ_j o_i

- where
    δ_j = (t_j - o_j) o_j (1 - o_j)                    for the outermost layer
    δ_j = o_j (1 - o_j) Σ_{k ∈ next layer} δ_k w_kj    for hidden layers
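
A compact Python sketch of this rule, training a 2-2-1 sigmoid network on XOR. The layer sizes, learning rate, seed and epoch count are illustrative choices (a different seed may settle in a local minimum):

    import math, random

    random.seed(0)
    eta = 0.5
    W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden: 2 inputs + bias
    W2 = [random.uniform(-1, 1) for _ in range(3)]                      # output: 2 hidden + bias

    def sig(x):
        return 1.0 / (1.0 + math.exp(-x))

    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    for epoch in range(10000):
        for (x1, x2), t in data:
            xs = [x1, x2, 1.0]                                 # bias input fixed at 1
            hs = [sig(sum(w * x for w, x in zip(row, xs))) for row in W1]
            o = sig(sum(w * h for w, h in zip(W2, hs + [1.0])))
            d_o = (t - o) * o * (1 - o)                        # outermost-layer delta
            d_h = [h * (1 - h) * d_o * W2[j] for j, h in enumerate(hs)]  # hidden deltas
            W2 = [w + eta * d_o * h for w, h in zip(W2, hs + [1.0])]     # Δw = η δ_j o_i
            for j in range(2):
                W1[j] = [w + eta * d_h[j] * x for w, x in zip(W1[j], xs)]

    for (x1, x2), t in data:
        xs = [x1, x2, 1.0]
        hs = [sig(sum(w * x for w, x in zip(row, xs))) for row in W1]
        print((x1, x2), round(sig(sum(w * h for w, h in zip(W2, hs + [1.0]))), 2))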

Page 27:

How does it work?

- Input propagation forward and error propagation backward (e.g., XOR)

[Figure: the XOR network of Page 18 again: hidden units x1·x̄2 and x̄1·x2 with cross weights -1 and thresholds 1.5; output unit with w1 = w2 = 1, θ = 0.5.]

Page 28:

Recurrent Neural Network

Acknowledgements:
1. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ by Denny Britz
2. Introduction to RNN by Geoffrey Hinton, http://www.cs.toronto.edu/~hinton/csc2535/lectures.html

Page 29:

Sequence processing machine

Page 30:

E.g., POS Tagging

  Purchased   Videocon   machine
  VBD         NNP        NN

Page 31:

Decision on a piece of text, e.g., Sentiment Analysis

[Figure: an unrolled network reading the first word "I"; hidden states h0, h1; outputs o1, o2, o3, o4; context c1 built with attention weights a11, a12, a13, a14.]

Page 32:

[Figure: the same network after reading "I like"; hidden states up to h2; context c2 with attention weights a21, a22, a23, a24.]

Page 33:

[Figure: after reading "I like the"; hidden states up to h3; context c3 with attention weights a31, a32, a33, a34.]

Page 34:

[Figure: after reading "I like the camera"; hidden states up to h4; context c4 with attention weights a41, a42, a43, a44.]

Page 35:

[Figure: after reading "I like the camera <EOS>"; hidden states up to h5; context c5 with attention weights a51, a52, a53, a54. The network outputs: positive sentiment.]

Page 36:

Back to the RNN model

Page 37:

Notation: input and state

- x_t is the input at time step t. For example, x_t could be a one-hot vector corresponding to the second word of a sentence.
- s_t is the hidden state at time step t. It is the "memory" of the network.
- s_t = f(U·x_t + W·s_{t-1}); the matrices U and W are learnt.
- f is a function of the input and the previous state, usually tanh or ReLU (approximated by softplus).

Page 38:

tanh, ReLU (rectified linear unit) and softplus

  tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x})

  ReLU:     f(x) = max(0, x)

  Softplus: g(x) = ln(1 + e^x)

Page 39:

Notation: output

- o_t is the output at step t.
- For example, if we wanted to predict the next word in a sentence, o_t would be a vector of probabilities across our vocabulary.
- o_t = softmax(V·s_t)
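
A minimal numpy sketch of these two equations, s_t = tanh(U·x_t + W·s_{t-1}) and o_t = softmax(V·s_t); the vocabulary size, hidden size and the toy word ids are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, hidden = 5, 4
    U = rng.normal(size=(hidden, vocab))    # input-to-hidden
    W = rng.normal(size=(hidden, hidden))   # hidden-to-hidden
    V = rng.normal(size=(vocab, hidden))    # hidden-to-output

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    s = np.zeros(hidden)                    # s_0
    for t, word_id in enumerate([2, 0, 3]): # a toy sentence of word ids
        x = np.zeros(vocab)
        x[word_id] = 1.0                    # one-hot x_t
        s = np.tanh(U @ x + W @ s)          # s_t: the "memory" of the network
        o = softmax(V @ s)                  # o_t: distribution over the vocabulary
        print(t, o.round(3))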

Page 40:

Operation of the RNN

- The RNN shares the same parameters (U, V, W) across all steps
- Only the input changes
- Sometimes the output at each time step is not needed, e.g., in sentiment analysis
- Main point: the hidden states!

Page 41:

The equivalence between feedforward nets and recurrent nets

- Assume that there is a time delay of 1 in using each connection.
- The recurrent net is then just a layered net that keeps reusing the same weights.

[Figure: a recurrent net with weights w1, w2, w3, w4 unrolled over time = 0, 1, 2, 3 into a layered net in which every layer reuses w1, w2, w3, w4.]

Page 42:

Reminder: Backpropagation with weight constraints

- Linear constraints between the weights.
- Compute the gradients as usual, then modify the gradients so that they satisfy the constraints.
- So if the weights started off satisfying the constraints, they will continue to satisfy them.

Example: to constrain w1 = w2, we need Δw1 = Δw2.
Compute δE/δw1 and δE/δw2, and use δE/δw1 + δE/δw2 to update both w1 and w2.

Page 43:

Backpropagation through time (the BPTT algorithm)

- The forward pass builds up a stack of the activities at each time step.
- The backward pass computes the error derivatives at each time step.
- After the backward pass we add together the derivatives at all the different times for each weight.

Page 44:

Binary addition using a recurrent network (from Geoffrey Hinton's lecture)

- A feedforward n/w could do this, but faces the problem of variable-length input.

[Figure: the two addends 00100110 and 10100110 feed through hidden units to produce the sum 11001100.]

Page 45:

The algorithm for binary addition

[Figure: a finite state automaton with four states: (no carry, print 1), (carry, print 1), (no carry, print 0), (carry, print 0); the transitions are labelled by the input columns 11, 10, 01 and 00.]

This is a finite state automaton. It decides what transition to make by looking at the next column. It prints after making the transition. It moves from right to left over the two input numbers.
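
A small Python sketch of this automaton: the only state is the carry, the machine scans the columns right to left, and it "prints" one bit per transition. The inputs are the addends drawn on Page 44.

    def add_binary(a, b):
        a, b = a.zfill(len(b)), b.zfill(len(a))      # pad to equal length
        carry, out = 0, []
        for x, y in zip(reversed(a), reversed(b)):   # right to left over columns
            s = int(x) + int(y) + carry
            out.append(str(s % 2))                   # the bit printed in this state
            carry = s // 2                           # next state: carry / no carry
        if carry:
            out.append("1")
        return "".join(reversed(out))

    print(add_binary("00100110", "10100110"))        # prints 11001100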

Page 46:

A recurrent net for binary addition

- Two input units and one output unit.
- Two input digits are given at each time step.
- The desired output at each time step is the output for the column that was provided as input two time steps ago:
  - it takes one time step to update the hidden units based on the two input digits;
  - it takes another time step for the hidden units to cause the output.

[Figure: the two input streams and the output stream laid out over time.]

Page 47:

The connectivity of the network

- 3 fully interconnected hidden units.
- The input units have feedforward connections that allow them to vote for the next hidden activity pattern.

Page 48:

What the network learns

- It learns four distinct patterns of activity for the 3 hidden units.
- The patterns correspond to the nodes in the finite state automaton.
- Nodes in the FSA are like activity vectors.
- The automaton is restricted to be in exactly one state at each time.
- The hidden units are restricted to have exactly one vector of activity at each time.

Page 49:

The backward pass is linear

- The backward pass is completely linear: if you double the error derivatives at the final layer, all the error derivatives will double.
- The forward pass determines the slope of the linear function used for backpropagating through each neuron.

Page 50:

Recall: Backpropagation Rule

- General weight updating rule:  Δw_ji = η δ_j o_i

- where
    δ_j = (t_j - o_j) o_j (1 - o_j)                    for the outermost layer
    δ_j = o_j (1 - o_j) Σ_{k ∈ next layer} δ_k w_kj    for hidden layers

Page 51:

The problem of exploding or vanishing gradients (1/2)

- If the weights are small, the gradients shrink exponentially.
- If the weights are big, the gradients grow exponentially.
- Typical feedforward neural nets can cope with these exponential effects because they only have a few hidden layers.

Page 52:

The problem of exploding or vanishing gradients (2/2)

- In an RNN trained on long sequences (e.g., a sentence with 20 words) the gradients can easily explode or vanish.
  - We can avoid this by initializing the weights very carefully.
- Even with good initial weights, it is very hard to detect that the current target output depends on an input from many time steps ago.
  - So RNNs have difficulty dealing with long-range dependencies.

Page 53:

Vanishing/exploding gradient: the solution

- LSTM
- The error becomes "trapped" in the memory portion of the block
- This is referred to as an "error carousel"
- It continuously feeds the error back to each of the gates until they become trained to cut off the value
- (to be expanded)

Page 54:

Boltzmann Machine

Page 55:

Illustration of the basic idea of the Boltzmann Machine

- Illustrative task: to learn the identity function
- The setting is probabilistic: x = 1 or x = -1, each with uniform probability, i.e., P(x=1) = 0.5, P(x=-1) = 0.5
- For x = 1: y = 1 with P = 0.9
- For x = -1: y = -1 with P = 0.9

[Figure: two neurons, 1 (x) and 2 (y), connected by the weight w12; the table pairs x = 1 with y = 1 and x = -1 with y = -1.]

Page 56:

Illustration of the basic idea of the Boltzmann Machine (contd.)

Let
  α       = output neuron states
  β       = input neuron states
  P_{α|β} = observed probability distribution
  Q_{α|β} = desired probability distribution
  Q_β     = probability distribution on the input states β

Page 57:

Illustration of the basic idea of the Boltzmann Machine (contd.)

The divergence D is the KL divergence:

  D = Σ_α Σ_β Q_{α|β} Q_β ln(Q_{α|β} / P_{α|β})
    ≥ Σ_α Σ_β Q_{α|β} Q_β (1 - P_{α|β}/Q_{α|β})      (since ln z ≥ 1 - 1/z)
    = Σ_α Σ_β Q_{α|β} Q_β - Σ_α Σ_β P_{α|β} Q_β
    = Σ_α Σ_β Q_{αβ} - Σ_α Σ_β P_{αβ}                (Q_{αβ} and P_{αβ} are joint distributions)
    = 1 - 1 = 0

So D ≥ 0, with equality only when P = Q. (A precursor to the softmax layer.)
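
A two-line numeric check of D ≥ 0 in Python (the distributions are arbitrary made-up values for a single clamped input β):

    import math

    Q = [0.9, 0.1]   # desired Q(alpha | beta)
    P = [0.6, 0.4]   # observed P(alpha | beta)
    D = sum(q * math.log(q / p) for q, p in zip(Q, P))
    print(D)         # positive; it is 0 only when P equals Q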

Page 58:

Gradient descent for finding the weight change rule

  P(S_α) ∝ exp(-E(S_α)/T)
  P(S_α) = exp(-E(S_α)/T) / Σ_{β ∈ all states} exp(-E(S_β)/T)
  ln P(S_α) = -E(S_α)/T - ln Z
  D = Σ_α Σ_β Q_{α|β} Q_β ln(Q_{α|β} / P_{α|β})
  Δw_ij = -η δD/δw_ij      (gradient descent)

Page 59:

Calculating the gradient: 1/2

  δD/δw_ij = δ/δw_ij [Σ_α Σ_β Q_{α|β} Q_β ln(Q_{α|β} / P_{α|β})]
           = δ/δw_ij [Σ_α Σ_β Q_{α|β} Q_β ln Q_{α|β} - Σ_α Σ_β Q_{α|β} Q_β ln P_{α|β}]

The first term is constant with respect to w_ij. For the second,

  δ(ln P_{α|β})/δw_ij = δ/δw_ij [-E(S_α)/T - ln Z],   where Z = Σ_β exp(-E(S_β)/T)

Page 60:

Calculating the gradient: 2/2

  δ[-E(S_α)/T]/δw_ij = (-1/T) δ/δw_ij [-Σ_i Σ_{j>i} w_ij s_i s_j] = (1/T) s_i s_j|α

  δ(ln Z)/δw_ij = (1/Z) δZ/δw_ij,  with  Z = Σ_β exp(-E(S_β)/T)

  δZ/δw_ij = Σ_β exp(-E(S_β)/T) · δ(-E(S_β)/T)/δw_ij = (1/T) Σ_β exp(-E(S_β)/T) · s_i s_j|β

Page 61:

Final formula for Δw_ij

  Δw_ij = (1/T) [ s_i s_j|α - (1/Z) Σ_β exp(-E(S_β)/T) · s_i s_j|β ]
        = (1/T) [ s_i s_j|α - Σ_β P(S_β) · s_i s_j|β ]

The second term is the expectation of the i-th and j-th neurons being on together.

Page 62:

Issue of hidden neurons

- Boltzmann machines
  - can come with hidden neurons
  - are equivalent to a Markov Random Field
  - with hidden neurons are like Hidden Markov Machines
- Training a Boltzmann machine is equivalent to running the Expectation Maximization algorithm

Page 63:

Use of the Boltzmann machine

- Computer vision: understanding a scene involves what is called "relaxation search", which gradually minimizes a cost function under progressive relaxation of constraints
- The Boltzmann machine has been found to be slow to train
  - Boltzmann training is NP-hard

Page 64:

Questions

- Does the Boltzmann machine reach the global minimum? What ensures it?
- Why is simulated annealing applied to the Boltzmann machine?
  - local minimum → increase T → run the network → gradually reduce T → reach the global minimum
- Understand the effect of varying T:
  - higher T → small differences between energy states are ignored; convergence to a local minimum is fast

Page 65:

Self-Organization and the Kohonen Net

Page 66:

Mapping of the brain

- Side areas: auditory information processing
- Back of the brain: vision
- Lots of resilience: the visual and auditory areas can do each other's job

Page 67:

Character recognition: A, A, A, ...

[Figure: input neurons feeding an output grid; several differently drawn versions of the character A are presented.]

Page 68:

Kohonen Net

- Self-organization, i.e., a Kohonen network, fires a group of neurons instead of a single one.
- The group "somehow" produces a "picture" of the cluster.
- Fundamentally, SOM is competitive learning.
- But weight changes are incorporated over a neighbourhood.
- Find the winner neuron, then apply the weight change to the winner and its "neighbours".

Page 69:

[Figure: the winner neuron at the centre; the neurons on the surrounding contour are the "neighbourhood" neurons.]

Page 70:

Weight change rule for SOM

  W(n+1) = W(n) + η(n) (I(n) - W(n))

applied to the winner neuron and all neurons within its neighbourhood δ(n), where:

- the neighbourhood δ(n) is a decreasing function of n
- the learning rate η(n) is also a decreasing function of n:  0 < η(n) < η(n-1) ≤ 1
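
A minimal numpy sketch of this update on a 1-D grid of output neurons; the grid size, the data and the linear decay schedules for η(n) and δ(n) are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.random((10, 2))                 # 10 output neurons, 2-D inputs
    data = rng.random((200, 2))

    for n, x in enumerate(data):
        eta = 0.5 * (1 - n / len(data))           # eta(n): decreasing learning rate
        radius = int(3 * (1 - n / len(data)))     # delta(n): shrinking neighbourhood
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        lo, hi = max(0, winner - radius), min(len(weights), winner + radius + 1)
        weights[lo:hi] += eta * (x - weights[lo:hi])  # W(n+1) = W(n) + eta(n) (I(n) - W(n))

    print(weights.round(2))  # neighbouring neurons end up with similar weights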

Page 71:

Pictorially

[Figure: the winner neuron with a neighbourhood of radius δ(n) around it.]

Convergence of the Kohonen net has not been proved, except in the uni-dimensional case.

Page 72:

[Figure: n input neurons fully connected (weights W_p) to a p-neuron output layer; the output layer organizes into clusters labelled A, B, C, ...]

Page 73:

Clustering algorithms

1. Competitive learning
2. K-means clustering
3. Counter-propagation

Page 74:

K-means clustering

K output neurons are required, from the knowledge that K clusters are present.

[Figure: n input neurons fully connected to the output neurons (the slide draws 26 of them).]

Page 75:

Steps

1. Initialize the weights randomly.
2. I_k is the vector presented at the k-th iteration.
3. Find W* such that |W* - I_k| < |W_j - I_k| for all j.
4. Make W*(new) = W*(old) + η(I_k - W*).
5. k ← k + 1.
6. Go to step 2 until the error is below a threshold.
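
A minimal numpy sketch of these steps (online, winner-take-all k-means); K, η, the pass count and the two synthetic blobs of data are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    K, eta = 2, 0.1
    W = rng.random((K, 2))                                      # step 1: random weights
    data = np.concatenate([rng.normal(0.0, 0.1, (50, 2)),
                           rng.normal(1.0, 0.1, (50, 2))])

    for _ in range(20):                                         # step 6: repeat until the error is small
        for Ik in data:                                         # step 2: present I_k
            w_star = np.argmin(np.linalg.norm(W - Ik, axis=1))  # step 3: the nearest weight W*
            W[w_star] += eta * (Ik - W[w_star])                 # step 4: W*(new) = W*(old) + eta (I_k - W*)

    print(W.round(2))  # approximately the two cluster centres, (0, 0) and (1, 1)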

Page 76:

Two-part assignment

[Figure: the supervised part of the assignment, a network with a hidden layer.]

Page 77:

[Figure: 4 input neurons (A1, A2, A3, A4) feeding cluster discovery by a SOM/Kohonen net.]

Page 78:

Neocognitron (Fukushima et al., 1980)

Page 79:

Based on hierarchical feature extraction

Page 80:

[Figure only; the image did not survive extraction.]

Page 81:

S-Layer

- Each S-layer in the neocognitron is intended for the extraction of features from the corresponding stage of the hierarchy.
- Particular S-layers are formed by a distinct number of S-planes; the number of S-planes depends on the number of extracted features.

Page 82:

V-Layer

- Each V-layer in the neocognitron is intended for obtaining information about the average activity in the previous C-layer (or the input layer).
- Particular V-layers are always formed by only one V-plane.

Page 83:

C-Layer

- Each C-layer in the neocognitron is intended to ensure tolerance to shifts of the features extracted in the previous S-layer.
- Particular C-layers are formed by a distinct number of C-planes; their number depends on the number of features extracted in the previous S-layer.

Page 84:

Hands-on for the afternoon

- Implement the binary adder on an RNN
- Create a FF-BP POS tagger