NLP Programming Tutorial 10 – Neural Networks
Graham Neubig, Nara Institute of Science and Technology (NAIST)
Prediction Problems
Given x, predict y
Example we will use:
● Given an introductory sentence from Wikipedia
● Predict whether the article is about a person
● This is binary classification (of course!)
Given: "Gonso was a Sanron sect priest (754–827) in the late Nara and early Heian periods."
Predict: Yes!
Given: "Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture."
Predict: No!
Linear Classifiers
y = sign(w·φ(x)) = sign(∑_{i=1}^I w_i·φ_i(x))
● x: the input
● φ(x): the vector of feature functions {φ_1(x), φ_2(x), …, φ_I(x)}
● w: the weight vector {w_1, w_2, …, w_I}
● y: the prediction, +1 if “yes”, -1 if “no”
● (sign(v) is +1 if v ≥ 0, -1 otherwise)
Example Feature Functions: Unigram Features
● Equal to “the number of times a particular word appears”

x = A site , located in Maizuru , Kyoto

φ_unigram “A”(x) = 1      φ_unigram “site”(x) = 1      φ_unigram “,”(x) = 2
φ_unigram “located”(x) = 1    φ_unigram “in”(x) = 1
φ_unigram “Maizuru”(x) = 1    φ_unigram “Kyoto”(x) = 1
φ_unigram “the”(x) = 0    φ_unigram “temple”(x) = 0    … the rest are all 0

● For convenience, we use feature names (φ_unigram “A”) instead of feature indexes (φ_1)
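As a concrete sketch, the create_features(x) used in the training pseudocode later might look like this in Python (the “UNI:” key prefix is one possible naming convention, not fixed by the slides):

from collections import defaultdict

def create_features(x):
    # map each unigram to the number of times it appears in x
    phi = defaultdict(int)
    for word in x.split(" "):
        phi["UNI:" + word] += 1
    return phi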
Calculating the Weighted Sum

x = A site , located in Maizuru , Kyoto

φ_unigram “A”(x) = 1        w_unigram “a” = 0        → 0
φ_unigram “site”(x) = 1     w_unigram “site” = -3    → -3
φ_unigram “,”(x) = 2        w_unigram “,” = 0        → 0
φ_unigram “located”(x) = 1  w_unigram “located” = 0  → 0
φ_unigram “in”(x) = 1       w_unigram “in” = 0       → 0
φ_unigram “Maizuru”(x) = 1  w_unigram “Maizuru” = 0  → 0
φ_unigram “Kyoto”(x) = 1    w_unigram “Kyoto” = 0    → 0
φ_unigram “priest”(x) = 0   w_unigram “priest” = 2   → 0
φ_unigram “black”(x) = 0    w_unigram “black” = 0    → 0
…

Adding up the products: 0 + (-3) + 0 + … + 0 = -3 → No!
The Perceptron
● Think of it as a “machine” to calculate a weighted sum and take its sign:

sign(∑_{i=1}^I w_i·φ_i(x))

[Figure: the example features (φ_“A” = 1, φ_“site” = 1, φ_“,” = 2, φ_“located” = 1, φ_“in” = 1, φ_“Maizuru” = 1, φ_“Kyoto” = 1, φ_“priest” = 0, φ_“black” = 0) feed a single node with weights {0, -3, 0, 0, 0, 0, 0, 2, 0}, which outputs -1]
Problem: Linear Constraint
● A perceptron cannot achieve high accuracy on non-linear functions

[Figure: four points in an XOR-like arrangement (X O on top, O X below) that no single line can separate]
Neural Networks
● Neural networks connect multiple perceptrons together
[Figure: the same example features feed several perceptrons in parallel, and their outputs feed a further perceptron, which outputs -1]

● Motivation: neural networks can express non-linear functions
Example:
● Build two classifiers:

[Figure: points φ(x_1) = {-1, 1} (X), φ(x_2) = {1, 1} (O), φ(x_3) = {-1, -1} (O), φ(x_4) = {1, -1} (X)]

Each classifier takes φ_1, φ_2, and a constant bias input 1:
y_1 = sign(w_1·{φ_1, φ_2, 1}), with w_1 = {1, 1, -1}
y_2 = sign(w_2·{φ_1, φ_2, 1}), with w_2 = {-1, -1, -1}
Example:
● These classifiers map the points to a new space

[Figure: applying y_1 (w_1 = {1, 1, -1}) and y_2 (w_2 = {-1, -1, -1}) to each point gives:]

y(x_1) = {-1, -1} → X
y(x_2) = {1, -1} → O
y(x_3) = {-1, 1} → O
y(x_4) = {-1, -1} → X

In the new (y_1, y_2) space, the two O points land at {1, -1} and {-1, 1}, while both X points land on {-1, -1}.
Example:
● In the new space, the examples are classifiable!

[Figure: in the (y_1, y_2) space, a single line separates the O points {1, -1} and {-1, 1} from the X point {-1, -1}]

y_3 = sign(w_3·{y_1, y_2, 1}), with w_3 = {1, 1, 1}
Example:
● Final neural network:

[Figure: φ_1, φ_2, and a bias 1 feed two hidden perceptrons with w_1 = {1, 1, -1} and w_2 = {-1, -1, -1}; their outputs y_1, y_2 and a bias 1 feed an output perceptron with w_3 = {1, 1, 1}]
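As a sanity check, here is a minimal Python sketch (using sign(v) = +1 for v ≥ 0, as defined earlier) that hard-codes these three perceptrons and confirms the network labels all four example points correctly:

def sign(v):
    return 1 if v >= 0 else -1

def net(phi1, phi2):
    # hidden layer: two perceptrons over {phi1, phi2, bias}
    y1 = sign(1 * phi1 + 1 * phi2 + -1 * 1)     # w1 = {1, 1, -1}
    y2 = sign(-1 * phi1 + -1 * phi2 + -1 * 1)   # w2 = {-1, -1, -1}
    # output layer over {y1, y2, bias}
    return sign(1 * y1 + 1 * y2 + 1 * 1)        # w3 = {1, 1, 1}

for (p1, p2), label in [((-1, 1), -1), ((1, 1), 1), ((-1, -1), 1), ((1, -1), -1)]:
    assert net(p1, p2) == label   # O = +1, X = -1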
Representing a Neural Network
● Assume the network is fully connected and organized in layers
● Each perceptron has:
● a layer ID
● a weight vector

network = [ (1, w0),
            (1, w1),
            (1, w2),
            (2, w3) ]

[Figure: the example features feed perceptrons 0, 1, and 2 in layer 1, whose outputs feed perceptron 3 in layer 2]
Neural Network Prediction Process
● Predict one perceptron at a time, using the previous layer's outputs as input

[Figure, shown over several slides: with the example features, the layer-1 perceptrons 0, 1, and 2 are evaluated in turn and output -1, 1, and 1; layer-2 perceptron 3 then reads these three values and outputs the final answer, -1]
Review: Pseudocode for Perceptron Prediction

def predict_one(w, phi):
    score = 0
    for name, value in phi.items():   # score = w·φ(x)
        if name in w:
            score += value * w[name]
    if score >= 0:
        return 1
    else:
        return -1
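For example, with the two nonzero weights from the weighted-sum slide (the “UNI:” key prefix follows the hypothetical create_features sketch above):

w = {"UNI:site": -3, "UNI:priest": 2}
phi = create_features("A site , located in Maizuru , Kyoto")
print(predict_one(w, phi))   # score = 1 * -3 = -3 → prints -1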
Pseudo-Code for NN Prediction

def predict_nn(network, phi):
    # y[0] holds the input features; y[l] holds the outputs of layer l's nodes
    num_layers = max(layer for layer, weight in network)
    y = [phi] + [{} for _ in range(num_layers)]
    for i, (layer, weight) in enumerate(network):
        # predict the answer with this perceptron, using the previous layer as input
        answer = predict_one(weight, y[layer - 1])
        # save the answer as a feature for the next layer
        y[layer][i] = answer
    return answer   # the answer of the last perceptron
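A hypothetical usage, building the four-node network from the representation slide (the weight values here are made up for illustration; layer-1 weights are keyed by feature name, layer-2 weights by node id):

w0 = {"UNI:site": -3}
w1 = {"UNI:priest": 2}
w2 = {"UNI:Kyoto": 1}
w3 = {0: 1, 1: 1, 2: 1}
network = [(1, w0), (1, w1), (1, w2), (2, w3)]
phi = create_features("A site , located in Maizuru , Kyoto")
print(predict_nn(network, phi))   # nodes 0-2 output -1, 1, 1; node 3 outputs sign(1) = 1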
Neural Network Activation Functions
● The previously described NN uses the step function:
y = sign(w·φ(x))
● The step function is not differentiable → use tanh:
y = tanh(w·φ(x))

[Figure: plots of sign(w·φ(x)) and tanh(w·φ(x)) over the range -4 to 4; tanh is a smooth curve saturating at -1 and 1]

Python:
from math import tanh
tanh(x)
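A minimal sketch of the tanh variant of predict_one, which the training code below assumes (the name predict_one_tanh is my own):

from math import tanh

def predict_one_tanh(w, phi):
    # same sparse weighted sum as predict_one, squashed with tanh
    score = 0
    for name, value in phi.items():
        if name in w:
            score += value * w[name]
    return tanh(score)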
Learning a Perceptron w/ tanh
● First, calculate the error:
δ = y' - y   (y' = the correct tag, y = the system output)
● Update each weight with:
w ← w + λ·δ·φ(x)
where λ is the learning rate
● (For the step-function perceptron, δ = -2 or +2 and λ = 1/2)
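Put together as code, a sketch of this update for a single tanh perceptron (the learning-rate value 0.1 is an assumption):

LAM = 0.1   # learning rate λ (value assumed for illustration)

def update_one(w, phi, y_prime):
    delta = y_prime - predict_one_tanh(w, phi)   # δ = y' - y
    for name, value in phi.items():              # w ← w + λ·δ·φ(x)
        w[name] = w.get(name, 0) + LAM * delta * value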
Problem: We Don't Know the Correct Answer!
● For NNs, we only know the correct tag for the last layer

[Figure: in the example network, output perceptron 3 has a known correct tag y' = 1 but output y = -1; the layer-1 perceptrons each have a known output y but an unknown correct tag (y' = ?)]
Answer: Back-Propagation
● Pass the error backwards along the network: node j collects ∑_i δ_i·w_{j,i} from the downstream nodes i it feeds into

[Figure: node j feeds three downstream nodes with errors δ = -0.9, 0.2, 0.4 through weights w = 0.1, 1, -0.3]

● Also consider the gradient of tanh:
d tanh(w·φ(x)) = 1 - tanh²(w·φ(x)) = 1 - y_j²
● Combine the two:
δ_j = (1 - y_j²) ∑_i δ_i·w_{j,i}
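For instance, assuming the figure's arrows pair up as (δ_i, w_{j,i}) = (-0.9, 0.1), (0.2, 1), (0.4, -0.3) (the pairing is my reading of the diagram), the back-propagated error would be:

∑_i δ_i·w_{j,i} = (-0.9)(0.1) + (0.2)(1) + (0.4)(-0.3) = -0.09 + 0.2 - 0.12 = -0.01

so δ_j = (1 - y_j²)·(-0.01).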
Back-Propagation Code
(a runnable Python rendering of the slide's pseudocode; the forward pass uses the tanh perceptron from the activation-function slide, since the gradient term needs tanh outputs)

def update_nn(network, phi, y_prime, lam=0.1):
    # forward pass with tanh (same loop as predict_nn), keeping all activations
    num_layers = max(layer for layer, w in network)
    y = [phi] + [{} for _ in range(num_layers)]
    for i, (layer, weight) in enumerate(network):
        y[layer][i] = predict_one_tanh(weight, y[layer - 1])
    delta = [0.0] * len(network)
    for j in reversed(range(len(network))):      # backward pass
        layer, w = network[j]
        if j == len(network) - 1:
            delta[j] = y_prime - y[layer][j]     # last node: δ = y' - y
        else:                                    # δ_j = (1 - y_j²) Σ_i δ_i w_{j,i}
            delta[j] = (1 - y[layer][j] ** 2) * sum(
                delta[i] * w_i.get(j, 0)
                for i, (layer_i, w_i) in enumerate(network) if layer_i == layer + 1)
    for j, (layer, w) in enumerate(network):     # w ← w + λ·δ_j·(node j's input)
        for name, val in y[layer - 1].items():
            w[name] = w.get(name, 0) + lam * delta[j] * val
Training Process
● For the perceptron in earlier tutorials, we initialized the weights to zero
● In a NN, we randomly initialize the weights (so that the perceptrons are not all identical)

create network
randomize network weights
for I iterations:
    for each labeled pair x, y' in the data:
        phi = create_features(x)
        update_nn(network, phi, y')
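A minimal sketch of the “create network / randomize weights” step for one hidden layer; the two-node width (matching the exercise) and the ±0.1 initialization range are assumptions:

import random

def init_network(feature_names, num_hidden=2):
    def rand_w(names):
        # small random weights so the hidden perceptrons start out different
        return {name: random.uniform(-0.1, 0.1) for name in names}
    hidden = [(1, rand_w(feature_names)) for _ in range(num_hidden)]
    output = (2, rand_w(range(num_hidden)))   # output weights keyed by node id
    return hidden + [output]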
Exercise
Exercise (1)
● Write two programs:
● train-nn: creates a neural network model
● test-nn: reads a neural network model
● Test train-nn:
● Input: test/03-train-input.txt
● Use one iteration, one hidden layer, two hidden nodes
● Calculate the updates by hand and make sure they are correct
Exercise (2)
● Train a model on data/titles-en-train.labeled
● Predict the labels of data/titles-en-test.word
● Grade your answers:
script/grade-prediction.py data-en/titles-en-test.labeled your_answer
● Compare:
● with single perceptron/SVM classifiers
● with different neural network structures
Thank You!