"UBC-ALM: Combining k-NN with SVD for WSD"
Michael Brutz
CU-Boulder Department of Applied Mathematics
21 March 2013
M. Brutz (CU-Boulder) LING 7800: Paper Presentation Spr13 1 / 41
UBC-ALM: Combining k-NN and SVD for WSD
This 2007 paper by Agirre and Lacalle looks at representing senses of a word with feature vectors and combining two mathematical methods to do WSD using those vectors.
Outline
1. Feature Vectors
2. SVD
3. k-Nearest Neighbors
4. Paper's Results
Feature Vectors

The idea is to represent a word with a vector (a list of numbers).

One of the simplest ways to do this is what is known as the bag-of-words approach.
If you have the sentence:

"The cat ran."

Then the bag-of-words representation for the words in this sentence is to have a vector where the only non-zero entries are:

the = 1, cat = 1, ran = 1

basically, just having a value of 1 for every position in the vector corresponding to a word that appears in the sentence.
To make this into a vector, all we do is associate each entry in the vector with a word.

For the "The cat ran." example, this vector would look like

\[
\begin{bmatrix} | \\ A_1 \\ | \end{bmatrix}
=
\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 1 \\ 0 \\ \vdots \\ 1 \end{bmatrix}
\]

Again, where the 1's are in the slots for "the", "cat", and "ran", and all the others are 0.
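The bag-of-words construction can be sketched in a few lines of Python. The five-word vocabulary and the simple tokenizer here are illustrative assumptions, not the paper's actual feature set.

```python
# Minimal bag-of-words sketch: fix a vocabulary, give each word a slot in
# the vector, and set that slot to 1 when the word appears in the sentence.
vocabulary = ["the", "cat", "ran", "dog", "sat"]  # illustrative vocabulary

def bag_of_words(sentence, vocab):
    # Crude tokenization: lowercase, strip periods, split on whitespace.
    tokens = sentence.lower().replace(".", "").split()
    return [1 if word in tokens else 0 for word in vocab]

vector = bag_of_words("The cat ran.", vocabulary)
print(vector)  # [1, 1, 1, 0, 0]
```

Real systems would use counts or weighted features rather than a plain 0/1 indicator, but the vector-per-word idea is the same.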
When we collect a set of such vectors into a single object like so

\[
\begin{bmatrix} | & | \\ A_1 & A_2 \\ | & | \end{bmatrix}
=
\begin{bmatrix}
1 & 0 \\
\vdots & \vdots \\
0 & 1 \\
\vdots & \vdots \\
1 & 0
\end{bmatrix}
\]
We have what is called a matrix.

\[
\begin{bmatrix} | & | \\ A_1 & A_2 \\ | & | \end{bmatrix}
=
\begin{bmatrix}
1 & 0 \\
\vdots & \vdots \\
0 & 1 \\
\vdots & \vdots \\
1 & 0
\end{bmatrix}
\]
I'm only using two vectors for illustration; you can include any number and still have a matrix.

\[
\begin{bmatrix} | & | & & | \\ A_1 & A_2 & \cdots & A_n \\ | & | & & | \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & \cdots & 1 \\
\vdots & \vdots & & \vdots \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 1
\end{bmatrix}
\]
This leads us into SVD.
SVD

SVD stands for Singular Value Decomposition.

In mathy terms, it's where you take a matrix, A, and break it up into the product of three special matrices.

SVD of A:
\[
A = U \Sigma V^T
\]

U contains the left singular vectors of the matrix A, Σ contains its singular values, and V contains its right singular vectors.
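A minimal NumPy sketch of the decomposition (the matrix A here is an arbitrary example). Note that np.linalg.svd returns the singular values as a 1-D array, so Σ has to be rebuilt with np.diag before multiplying the factors back together.

```python
import numpy as np

# Decompose a small matrix A into U, Sigma, V^T and verify that the
# product of the three factors recovers A.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A from the three factors: A = U * diag(s) * V^T.
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))  # True
print(s)  # singular values, sorted largest first
```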
This is what it looks like for a 3 by 3 matrix.

\[
\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{bmatrix}
=
\begin{bmatrix} | & | & | \\ u_1 & u_2 & u_3 \\ | & | & | \end{bmatrix}
\begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & \sigma_3 \end{bmatrix}
\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ - & v_3 & - \end{bmatrix}
\]
The terms in the middle matrix, the σ's, are what are called the "singular values".

\[
\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{bmatrix}
=
\begin{bmatrix} | & | & | \\ u_1 & u_2 & u_3 \\ | & | & | \end{bmatrix}
\begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & \sigma_3 \end{bmatrix}
\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ - & v_3 & - \end{bmatrix}
\]
When we delete the singular values farthest along the diagonal, we still have a good approximation to our original matrix A:

\[
\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{bmatrix}
\approx
\begin{bmatrix} | & | & | \\ u_1 & u_2 & u_3 \\ | & | & | \end{bmatrix}
\begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ - & v_3 & - \end{bmatrix}
\]
This implies we can approximate the three column vectors of the A matrix by only using the first two u vectors.

\[
\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{bmatrix}
\approx
\begin{bmatrix} | & | & | \\ u_1 & u_2 & u_3 \\ | & | & | \end{bmatrix}
\begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ - & v_3 & - \end{bmatrix}
\]

You have to know how matrix multiplication works to see this, but the idea of using SVD to represent more vectors with fewer is the important thing.
So what do these approximations do?
Let's ask Big Bird!
Grayscale images are represented with numbers that determine how bright to make each pixel, meaning this 104 by 138 pixel image is actually represented by a matrix.
[Image: reconstruction using 1 singular value]

[Image: reconstruction using 5 singular values]

[Image: reconstruction using 40 singular values]

[Image: reconstruction using all 104 singular values]
As you can see, we can get a pretty good approximation to what all the vectors comprising the image are doing, using only a fraction of the singular vectors.
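The compression experiment above can be sketched with NumPy. A random 104 by 138 matrix stands in for the actual photo, which isn't reproduced here; the point is just that the reconstruction error shrinks as more singular values are kept.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "image": a 104 x 138 matrix of pixel brightnesses.
image = rng.random((104, 138))

U, s, Vt = np.linalg.svd(image, full_matrices=False)

def rank_k_approx(U, s, Vt, k):
    """Rebuild the matrix keeping only the k largest singular values."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Reconstruction error (Frobenius norm) drops as k grows, and vanishes
# once all 104 singular values are kept.
for k in (1, 5, 40, 104):
    err = np.linalg.norm(image - rank_k_approx(U, s, Vt, k))
    print(k, err)
```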
Moreover, with regards to WSD it isn't that we want this:

[Image: reconstruction using 40 singular values]
We actually want this:

[Image: reconstruction using 5 singular values]
The reason is that while using fewer singular values blurs the image, this is tantamount to adding in unseen word counts with respect to our vector description of words.
It's a double whammy! Not only do we have fewer vectors to deal with, but we also get guesstimates of word counts we haven't seen in collecting data, but suspect are relevant.
So instead of using simple bag-of-words vectors, we use these singular vectors to describe senses of a word, as they help with data-sparsity issues.
k-Nearest Neighbors

The last piece of the puzzle is to use k-NN (k-Nearest Neighbors) to disambiguate words.
Before understanding how it works, one needs to realize that vectors have a geometric interpretation as points in space.
For example, take the vector

\[
\begin{pmatrix} 0.4 \\ 0.6 \end{pmatrix}
\]

This can be represented graphically as
[Image: graphically representing (0.4, 0.6) as a point in the plane]
For the sake of illustration, say that this "X" represents the word "bass" and we are trying to find the sense for it.
Now add in the points that have been tagged in training our WSD classifier.
Let the fish symbols represent the "fish" sense and the eighth-notes represent the music sense for feature vectors from a hand-tagged corpus.
The way k-NN works is to assign the target word the label that is shared the most by its k nearest hand-tagged neighbors.
For example, if k = 3 then in this example we would say that the target word has the fish sense.
If k = 5, then the music sense.
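The k = 3 versus k = 5 flip can be reproduced in a few lines of Python. The training coordinates and labels below are invented to mirror the slides' picture; they are not data from the paper.

```python
from collections import Counter
import math

# Toy k-NN word sense classifier. Each training item is a hand-tagged
# feature vector (a 2-D point here) with its sense label.
train = [
    ((0.3, 0.5), "fish"),
    ((0.5, 0.7), "fish"),
    ((0.2, 0.9), "music"),
    ((0.6, 0.2), "music"),
    ((0.8, 0.8), "music"),
]

def knn_label(x, train, k):
    # Sort the hand-tagged points by distance to the target, keep the k
    # nearest, and return the majority label among them.
    nearest = sorted(train, key=lambda item: math.dist(x, item[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return counts.most_common(1)[0][0]

target = (0.4, 0.6)  # the ambiguous "bass" occurrence
print(knn_label(target, train, 3))  # fish
print(knn_label(target, train, 5))  # music
```

With these points, the two nearest neighbors are fish-tagged, so k = 3 yields "fish", while widening to k = 5 pulls in three music-tagged points and flips the answer.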
Paper's Results

In the paper, all of these approaches had some extra bells and whistles thrown on them (e.g. the feature vectors weren't just bag-of-words, and they used a weighted k-NN), but nothing really worth getting into.

If you've understood what I've said so far, then you understand what they were fundamentally doing in the paper.
They compared their WSD classifier to the others entered in SemEval-2007 on the Lexical Sample and All-Words tasks, and overall did pretty well.
Questions?

Thank you for your time.