Computational Linguistics Week 5
Neural Networks and Neural Language Models
By Mark Chang
Outlines
• Machine Learning
• Neural Networks
• Training Neural Networks
• Vector Space of Semantics
• Neural Language Models (word2vec)
Machine Learning
Training: Training Data → Machine Learning Model → Output, compared against the Answer; the error is fed back to the model.
Testing: after training, Testing Data → Machine Learning Model → Output.
Machine Learning
Training data: $X, Y$ with examples $(x^{(i)}, y^{(i)})$
Model: $h$ with parameters $w$
Output: $h(X)$
Answer: $Y$
Cost function: $E(h(X), Y)$, whose value is fed back to update the parameters
Logistic Regression
Training Data:

X            Y
-0.47241379  0
-0.35344828  0
-0.30148276  0
 0.33448276  1
 0.35344828  1
 0.37241379  1
 0.39137931  1
 0.41034483  1
 0.44931034  1
 0.49827586  1
 0.51724138  1
 ….          ….
Model
• Sigmoid function: $h(x) = \frac{1}{1 + e^{-(w_0 + w_1 x)}}$
$w_0 + w_1 x < 0 \Rightarrow h(x) \approx 0$
$w_0 + w_1 x > 0 \Rightarrow h(x) \approx 1$
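A minimal numpy sketch of this sigmoid model; the parameter values are assumptions chosen only to make the two regimes visible, not fitted values:

```python
import numpy as np

def h(x, w0, w1):
    """Sigmoid model: squashes w0 + w1*x into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))

# Hypothetical parameters: the boundary w0 + w1*x = 0 sits at x = 0.05.
print(h(-0.4, -1.0, 20.0))  # w0 + w1*x = -9 < 0  ->  h(x) close to 0
print(h(0.4, -1.0, 20.0))   # w0 + w1*x =  7 > 0  ->  h(x) close to 1
```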
Cost Function
• Cross Entropy
$E(h(X), Y) = -\frac{1}{m}\sum_{i}^{m}\left(y^{(i)}\log(h(x^{(i)})) + (1 - y^{(i)})\log(1 - h(x^{(i)}))\right)$
If $y^{(i)} = 1$: $E(h(x^{(i)}), y^{(i)}) = -\log(h(x^{(i)}))$
  $h(x^{(i)}) \approx 0 \Rightarrow E(h(x^{(i)}), y^{(i)}) \approx \infty$
  $h(x^{(i)}) \approx 1 \Rightarrow E(h(x^{(i)}), y^{(i)}) \approx 0$
If $y^{(i)} = 0$: $E(h(x^{(i)}), y^{(i)}) = -\log(1 - h(x^{(i)}))$
  $h(x^{(i)}) \approx 0 \Rightarrow E(h(x^{(i)}), y^{(i)}) \approx 0$
  $h(x^{(i)}) \approx 1 \Rightarrow E(h(x^{(i)}), y^{(i)}) \approx \infty$
Cost Function
• Cross Entropy
$E(h(X), Y) = -\frac{1}{m}\sum_{i}^{m}\left(y^{(i)}\log(h(x^{(i)})) + (1 - y^{(i)})\log(1 - h(x^{(i)}))\right)$
$h(x^{(i)}) \approx 0$ and $y^{(i)} = 0 \Rightarrow E(h(X), Y) \approx 0$
$h(x^{(i)}) \approx 1$ and $y^{(i)} = 1 \Rightarrow E(h(X), Y) \approx 0$
$h(x^{(i)}) \approx 0$ and $y^{(i)} = 1 \Rightarrow E(h(X), Y) \approx \infty$
$h(x^{(i)}) \approx 1$ and $y^{(i)} = 0 \Rightarrow E(h(X), Y) \approx \infty$
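A short sketch of this cost, assuming numpy arrays of predictions and labels; the two calls illustrate the near-0 and near-$\infty$ cases above:

```python
import numpy as np

def cross_entropy(h_x, y):
    """Mean cross-entropy E(h(X), Y) over m examples."""
    return -np.mean(y * np.log(h_x) + (1 - y) * np.log(1 - h_x))

y = np.array([0.0, 1.0])
print(cross_entropy(np.array([0.01, 0.99]), y))  # confident and correct -> about 0.01
print(cross_entropy(np.array([0.99, 0.01]), y))  # confident and wrong   -> about 4.6
```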
Feedback
• Gradient Descent:
$w_0 \leftarrow w_0 - \eta \frac{\partial E(h(X), Y)}{\partial w_0}$
$w_1 \leftarrow w_1 - \eta \frac{\partial E(h(X), Y)}{\partial w_1}$
[Plot: the cost surface over $(w_0, w_1)$; each update steps along $\left(-\frac{\partial E(h(X), Y)}{\partial w_0}, -\frac{\partial E(h(X), Y)}{\partial w_1}\right)$, the direction of steepest descent]
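A minimal gradient-descent loop for this logistic regression, assuming the standard cross-entropy gradients $\partial E/\partial w_0 = \text{mean}(h - y)$ and $\partial E/\partial w_1 = \text{mean}((h - y)x)$; the data echo the table above and the learning rate is an assumption:

```python
import numpy as np

# Training data shaped like the table above (values abbreviated).
X = np.array([-0.472, -0.353, -0.301, 0.334, 0.353, 0.372,
              0.391, 0.410, 0.449, 0.498, 0.517])
Y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], dtype=float)

w0, w1, eta = 0.0, 0.0, 1.0   # assumed starting point and learning rate

for _ in range(5000):
    h = 1.0 / (1.0 + np.exp(-(w0 + w1 * X)))
    # Gradients of the mean cross-entropy with respect to w0 and w1.
    dw0 = np.mean(h - Y)
    dw1 = np.mean((h - Y) * X)
    w0 -= eta * dw0            # step against the gradient
    w1 -= eta * dw1

print(w0, w1)  # the learned boundary w0 + w1*x = 0 falls between the classes
```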
Neural Networks
Neurons & Action Potential
http://humanphisiology.wikispaces.com/file/view/neuron.png/216460814/neuron.png
http://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Action_potential.svg/1037px-Action_potential.svg.png
Synapse
http://www.quia.com/files/quia/users/lmcgee/Systems/endocrine-nervous/synapse.gif
Artificial Neurons
[Diagram: a neuron $n$ with inputs $x_1, x_2$ weighted by $w_1, w_2$, a bias input $b$ weighted by $w_b$, and output $y$]
$n_{in} = w_1 x_1 + w_2 x_2 + w_b$
$n_{out} = \frac{1}{1 + e^{-n_{in}}}$
$y = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + w_b)}}$
[Plot: in the $(x_1, x_2)$ plane, $n_{out}$ rises from 0 on one side of the boundary, through 0.5 on it, to 1 on the other side]
Artificial Neurons
$n_{in} = w_1 x_1 + w_2 x_2 + w_b$, $n_{out} = \frac{1}{1 + e^{-n_{in}}}$
The line $w_1 x_1 + w_2 x_2 + w_b = 0$ is the decision boundary:
$w_1 x_1 + w_2 x_2 + w_b > 0 \Rightarrow n_{out} \approx 1$
$w_1 x_1 + w_2 x_2 + w_b < 0 \Rightarrow n_{out} \approx 0$
Binary Classification: AND Gate

x1  x2  y
0   0   0
0   1   0
1   0   0
1   1   1

A single neuron with weights 20, 20 and bias weight -30:
$y = \frac{1}{1 + e^{-(20 x_1 + 20 x_2 - 30)}}$
The decision boundary $20 x_1 + 20 x_2 - 30 = 0$ separates $(1,1)$ (output 1) from $(0,0)$, $(0,1)$, $(1,0)$ (output 0).
Binary Classification: OR Gate

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   1

A single neuron with weights 20, 20 and bias weight -10:
$y = \frac{1}{1 + e^{-(20 x_1 + 20 x_2 - 10)}}$
The decision boundary $20 x_1 + 20 x_2 - 10 = 0$ separates $(0,0)$ (output 0) from the other three points (output 1).
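A quick check of both gate neurons with the weights from the two slides above; `neuron` is a small helper assumed here, not something named in the slides:

```python
import numpy as np

def neuron(x1, x2, w1, w2, wb):
    """Single sigmoid neuron: y = 1 / (1 + exp(-(w1*x1 + w2*x2 + wb)))."""
    return 1.0 / (1.0 + np.exp(-(w1 * x1 + w2 * x2 + wb)))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    and_y = neuron(x1, x2, 20, 20, -30)  # fires only when both inputs are 1
    or_y = neuron(x1, x2, 20, 20, -10)   # fires when at least one input is 1
    print(x1, x2, round(and_y), round(or_y))
```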
XOR Gate?

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

[Plot: the four points cannot be separated by a single line, so no single neuron can implement XOR]
Binary Classification: XOR Gate
Combine the AND and OR neurons:
$n_1$: the AND neuron (weights 20, 20, bias weight -30)
$n_2$: the OR neuron (weights 20, 20, bias weight -10)
Output neuron: weight -20 from $n_1$, weight 20 from $n_2$, bias weight -10:
$y = \frac{1}{1 + e^{-(-20 n_1 + 20 n_2 - 10)}}$

x1  x2  n1  n2  y
0   0   0   0   0
0   1   0   1   1
1   0   0   1   1
1   1   1   1   0

[Plots: $n_1$ fires only above the AND boundary, $n_2$ above the OR boundary; $y$ is 1 exactly where $n_2 = 1$ and $n_1 = 0$]
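The same `neuron` helper composed into this two-layer network reproduces the XOR table:

```python
import numpy as np

def neuron(x1, x2, w1, w2, wb):
    return 1.0 / (1.0 + np.exp(-(w1 * x1 + w2 * x2 + wb)))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    n1 = neuron(x1, x2, 20, 20, -30)  # AND gate
    n2 = neuron(x1, x2, 20, 20, -10)  # OR gate
    y = neuron(n1, n2, -20, 20, -10)  # OR and not AND -> XOR
    print(x1, x2, round(y))
```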
Neural Networks
[Diagram: inputs $x, y$ and a bias $b$ feed hidden neurons $n_{11}, n_{12}$ through weights $w_{11,x}, w_{11,y}, w_{11,b}$ and $w_{12,x}, w_{12,y}, w_{12,b}$; the hidden neurons and a bias feed output neurons $n_{21}, n_{22}$ through weights $w_{21,11}, w_{21,12}, w_{21,b}$ and $w_{22,11}, w_{22,12}, w_{22,b}$; the targets are $z_1, z_2$]
Input Layer → Hidden Layer → Output Layer
Visual Pathway
http://www.nature.com/neuro/journal/v8/n8/images/nn0805-975-F1.jpg
Training Neural Networks
Training Neural Networks
Training Data → Neural Networks → Output, compared against the Answer.
Steps: Initialization → Forward Propagation → Error Function → Backward Propagation
Initialization
• Randomly sample each weight $w$ from the interval $[-N, N]$.
[Diagram: the network above with all weights $w_{11,x}, \ldots, w_{22,b}$ set to random initial values]
Forward Propagation
Error Function
$J = -(z_1 \log(n_{21(out)}) + (1 - z_1)\log(1 - n_{21(out)})) - (z_2 \log(n_{22(out)}) + (1 - z_2)\log(1 - n_{22(out)}))$
$n_{out} \approx 0$ and $z = 0 \Rightarrow J \approx 0$
$n_{out} \approx 1$ and $z = 1 \Rightarrow J \approx 0$
$n_{out} \approx 0$ and $z = 1 \Rightarrow J \approx \infty$
$n_{out} \approx 1$ and $z = 0 \Rightarrow J \approx \infty$
Gradient Descent
Every weight is updated along the negative gradient of $J$.
Output layer:
$w_{21,11} \leftarrow w_{21,11} - \eta \frac{\partial J}{\partial w_{21,11}}$, $w_{21,12} \leftarrow w_{21,12} - \eta \frac{\partial J}{\partial w_{21,12}}$, $w_{21,b} \leftarrow w_{21,b} - \eta \frac{\partial J}{\partial w_{21,b}}$
$w_{22,11} \leftarrow w_{22,11} - \eta \frac{\partial J}{\partial w_{22,11}}$, $w_{22,12} \leftarrow w_{22,12} - \eta \frac{\partial J}{\partial w_{22,12}}$, $w_{22,b} \leftarrow w_{22,b} - \eta \frac{\partial J}{\partial w_{22,b}}$
Hidden layer:
$w_{11,x} \leftarrow w_{11,x} - \eta \frac{\partial J}{\partial w_{11,x}}$, $w_{11,y} \leftarrow w_{11,y} - \eta \frac{\partial J}{\partial w_{11,y}}$, $w_{11,b} \leftarrow w_{11,b} - \eta \frac{\partial J}{\partial w_{11,b}}$
$w_{12,x} \leftarrow w_{12,x} - \eta \frac{\partial J}{\partial w_{12,x}}$, $w_{12,y} \leftarrow w_{12,y} - \eta \frac{\partial J}{\partial w_{12,y}}$, $w_{12,b} \leftarrow w_{12,b} - \eta \frac{\partial J}{\partial w_{12,b}}$
[Plot: as before, each step follows $\left(-\frac{\partial J}{\partial w_0}, -\frac{\partial J}{\partial w_1}\right)$ on the cost surface]
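A compact numpy sketch of the whole loop (initialization, forward propagation, error feedback, backward propagation) for the 2-2-2 network above; the toy targets, seed, and learning rate are assumptions for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
N = 1.0                              # weights drawn from [-N, N]
W1 = rng.uniform(-N, N, (2, 3))      # hidden layer: rows n11, n12; columns x, y, bias
W2 = rng.uniform(-N, N, (2, 3))      # output layer: rows n21, n22; columns n11, n12, bias

# Toy task: targets (z1, z2) = (XOR, AND) of the two inputs.
data = [((0, 0), (0, 0)), ((0, 1), (1, 0)), ((1, 0), (1, 0)), ((1, 1), (0, 1))]
eta = 0.5

for _ in range(10000):
    for (x, y), z in data:
        inp = np.array([x, y, 1.0])              # forward propagation
        n1 = sigmoid(W1 @ inp)                   # hidden activations n11, n12
        hid = np.append(n1, 1.0)
        n2 = sigmoid(W2 @ hid)                   # outputs n21, n22

        d2 = n2 - np.array(z)                    # dJ/d(pre-activation) at the output
        d1 = (W2[:, :2].T @ d2) * n1 * (1 - n1)  # backward propagation to the hidden layer
        W2 -= eta * np.outer(d2, hid)            # gradient-descent updates
        W1 -= eta * np.outer(d1, inp)

for (x, y), z in data:
    n1 = sigmoid(W1 @ np.array([x, y, 1.0]))
    print((x, y), np.round(sigmoid(W2 @ np.append(n1, 1.0)), 2))
```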
Backward Propagation
http://cpmarkchang.logdown.com/posts/277349-neural-network-backward-propagation
Vector Space of Semantics
Distributional Semantics
• The meaning of a word can be inferred from its context.
The meanings of dog and cat are similar:
The dog run. A cat run. A dog sleep. The cat sleep. A dog bark. The cat meows.
Semantic Vectors
The dog run. A cat run. A dog sleep. The cat sleep. A dog bark. The cat meows.

      the  a  run  sleep  bark  meow
dog   1    2  2    2      1     0
cat   2    1  2    2      0     1
Semantic Vectors
dog (1, 2, ..., xn)
cat (2, 1, ..., xn)
car (0, 0, ..., xn)
Cosine Similarity
• Cosine similarity between A & B is: $\frac{A \cdot B}{|A||B|}$
dog $(a_1, a_2, \ldots, a_n)$, cat $(b_1, b_2, \ldots, b_n)$
Cosine similarity between dog & cat is:
$\frac{a_1 b_1 + a_2 b_2 + \cdots + a_n b_n}{\sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}\,\sqrt{b_1^2 + b_2^2 + \cdots + b_n^2}}$
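A direct numpy translation, using the dog and cat count vectors from the table above:

```python
import numpy as np

def cosine_similarity(a, b):
    """A . B / (|A| |B|)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Count vectors from the table above (contexts: the, a, run, sleep, bark, meow).
dog = np.array([1, 2, 2, 2, 1, 0])
cat = np.array([2, 1, 2, 2, 0, 1])

print(cosine_similarity(dog, cat))  # about 0.857: similar contexts
```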
Operation of Vectors
Woman + King - Man = Queen
[Diagram: the displacement King - Man added to Woman lands on Queen]
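A toy illustration of this analogy arithmetic; the 2-d vectors are invented for the example (real word vectors are learned, not hand-set):

```python
import numpy as np

# Invented 2-d vectors: one axis loosely "royalty", the other loosely "gender".
vecs = {
    "man":   np.array([1.0,  1.0]),
    "woman": np.array([1.0, -1.0]),
    "king":  np.array([5.0,  1.0]),
    "queen": np.array([5.0, -1.0]),
}

target = vecs["woman"] + vecs["king"] - vecs["man"]
# The nearest word by cosine similarity recovers the analogy.
nearest = max(vecs, key=lambda w: np.dot(vecs[w], target)
              / (np.linalg.norm(vecs[w]) * np.linalg.norm(target)))
print(nearest)  # queen
```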
Neural Language Models (word2vec)
Dimension is too LARGE
dog (x1 = the, x2 = a, ..., xn)
The dimension of semantic vectors equals the size of the vocabulary.
Compressed Vectors
One-Hot Encoding → Neural Network → Compressed Vector
dog: (1, 0, 0, 0) → (1.2, 0.7, 0.5)
One-Hot Encoding
dog = (1, 0, 0, 0)
cat = (0, 1, 0, 0)
run = (0, 0, 1, 0)
fly = (0, 0, 0, 1)
Initialize Weights
Rows of both matrices correspond to dog, cat, run, fly:
$W = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix} \qquad V = \begin{bmatrix} v_{11} & v_{12} & v_{13} \\ v_{21} & v_{22} & v_{23} \\ v_{31} & v_{32} & v_{33} \\ v_{41} & v_{42} & v_{43} \end{bmatrix}$
Compressed Vectors
Multiplying the one-hot vector for dog, $(1, 0, 0, 0)$, by $V$ selects row 1: $(v_{11}, v_{12}, v_{13})$.
High dimension (vocabulary size) → low dimension (compressed vector).
Compressed Vectors
dog → $(v_{11}, v_{12}, v_{13})$
cat → $(v_{21}, v_{22}, v_{23})$
run → $(v_{31}, v_{32}, v_{33})$
fly → $(v_{41}, v_{42}, v_{43})$
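A small demonstration that multiplying a one-hot vector by $V$ is just row selection; $V$ here is randomly initialized, as in the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 3))    # rows correspond to dog, cat, run, fly

dog = np.array([1, 0, 0, 0])   # one-hot vector for dog
# A one-hot vector times V picks out a single row of V: the word's
# 3-dimensional compressed vector (v11, v12, v13).
print(dog @ V)
print(V[0])                    # the same row, accessed directly
```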
Context Word
dog appears with run: the input word dog, one-hot $(1, 0, 0, 0)$, selects $V_1 = (v_{11}, v_{12}, v_{13})$; the context word run, one-hot $(0, 0, 1, 0)$, selects $W_3 = (w_{31}, w_{32}, w_{33})$.
$V_1 \cdot W_3 = v_{11} w_{31} + v_{12} w_{32} + v_{13} w_{33}$
Training drives $\frac{1}{1 + e^{-V_1 \cdot W_3}} \approx 1$.
Context Word
cat appears with run: cat selects $V_2 = (v_{21}, v_{22}, v_{23})$; run selects $W_3 = (w_{31}, w_{32}, w_{33})$.
$V_2 \cdot W_3 = v_{21} w_{31} + v_{22} w_{32} + v_{23} w_{33}$
Training drives $\frac{1}{1 + e^{-V_2 \cdot W_3}} \approx 1$.
Non-context Word
dog does not appear with fly: dog selects $V_1 = (v_{11}, v_{12}, v_{13})$; fly selects $W_4 = (w_{41}, w_{42}, w_{43})$.
$V_1 \cdot W_4 = v_{11} w_{41} + v_{12} w_{42} + v_{13} w_{43}$
Training drives $\frac{1}{1 + e^{-V_1 \cdot W_4}} \approx 0$.
Non-context Word
cat does not appear with fly: cat selects $V_2 = (v_{21}, v_{22}, v_{23})$; fly selects $W_4 = (w_{41}, w_{42}, w_{43})$.
$V_2 \cdot W_4 = v_{21} w_{41} + v_{22} w_{42} + v_{23} w_{43}$
Training drives $\frac{1}{1 + e^{-V_2 \cdot W_4}} \approx 0$.
Result
After training, the compressed vectors are:
dog → $(v_{11}, v_{12}, v_{13})$
cat → $(v_{21}, v_{22}, v_{23})$
run → $(v_{31}, v_{32}, v_{33})$
fly → $(v_{41}, v_{42}, v_{43})$
Because dog and cat share context words, their compressed vectors $V_1$ and $V_2$ end up close together.
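A minimal word2vec-style training sketch of the procedure above, on the four-word toy vocabulary; the pair list, seed, and hyperparameters are assumptions for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

words = ["dog", "cat", "run", "fly"]
idx = {w: i for i, w in enumerate(words)}

# (input word, context word, label): 1 = seen together, 0 = non-context pair.
pairs = [("dog", "run", 1), ("cat", "run", 1),
         ("dog", "fly", 0), ("cat", "fly", 0)]

rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(4, 3))  # compressed (input) vectors
W = rng.normal(scale=0.1, size=(4, 3))  # context vectors
eta = 0.5

for _ in range(2000):
    for w_in, w_ctx, label in pairs:
        i, j = idx[w_in], idx[w_ctx]
        p = sigmoid(V[i] @ W[j])         # predicted co-occurrence probability
        g = p - label                    # cross-entropy gradient wrt the score
        V[i], W[j] = V[i] - eta * g * W[j], W[j] - eta * g * V[i]

cos = V[0] @ V[1] / (np.linalg.norm(V[0]) * np.linalg.norm(V[1]))
print(cos)  # dog and cat share contexts, so their vectors align
```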
Further Reading
• Logistic Regression 3D
  http://cpmarkchang.logdown.com/posts/189069-logisti-regression-model
• Overfitting and Regularization
  http://cpmarkchang.logdown.com/posts/193261-machine-learning-overfitting-and-regularization
• Model Selection
  http://cpmarkchang.logdown.com/posts/193914-machine-learning-model-selection
• Neural Network Back Propagation
  http://cpmarkchang.logdown.com/posts/277349-neural-network-backward-propagation

Further Reading
• Neural Probabilistic Language Model
  http://cpmarkchang.logdown.com/posts/255785-neural-network-neural-probabilistic-language-model
  http://cpmarkchang.logdown.com/posts/276263--hierarchical-probabilistic-neural-networks-neural-network-language-model
• Word2vec
  http://arxiv.org/pdf/1301.3781.pdf
  http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
  http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf