Text as Data
Justin Grimmer
Associate Professor, Department of Political Science
Stanford University
October 9th, 2014
The Vector Space Model of Text
1) Task: numerous tasks will suppose that we can measure document similarity or dissimilarity
2) Objective function: for a variety of tasks, we will impose some measure or definition of similarity, dissimilarity, or distance
   - d(X_i, X_j) = dissimilarity (distance); bigger implies further apart
   - s(X_i, X_j) = similarity; bigger implies closer together
   - Objective functions determine which points we compare and how we aggregate similarity, dissimilarity, and distance
3) Optimization: depends on the particular task, likely arranging/grouping objects to find similarity
4) Validation: are the mathematical definitions of similarity actually similar for our particular purpose?
Texts and Geometry

Consider a document-term matrix:

X = \begin{pmatrix} 1 & 2 & 0 & \dots & 0 \\ 0 & 0 & 3 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \dots & 3 \end{pmatrix}

Suppose documents live in a space: we get a rich set of results from linear algebra.

- Provides a geometry, which we can modify with word weighting
- Natural notions of distance
- Kernel trick: richer comparisons of large feature spaces
- Building block for clustering, supervised learning, and scaling
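In code, such a matrix can be built in a few lines. A minimal sketch, assuming scikit-learn is available (the slides do not prescribe a library) and using an invented toy corpus:

```python
# Hedged sketch: build a document-term matrix X (rows = documents,
# columns = vocabulary terms) with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the budget debate", "the war debate", "war and peace"]  # toy corpus
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # sparse N x J matrix of counts
print(vectorizer.get_feature_names_out())  # the J vocabulary terms
print(X.toarray())                         # each row is a document vector
```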
Texts in Space

Doc1 = (1, 1, 3, \dots, 5)
Doc2 = (2, 0, 0, \dots, 1)
Doc1, Doc2 \in \mathbb{R}^J

Inner product between documents:

Doc1 \cdot Doc2 = (1, 1, 3, \dots, 5)'(2, 0, 0, \dots, 1)
                = 1 \times 2 + 1 \times 0 + 3 \times 0 + \dots + 5 \times 1
                = 7
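A quick numerical check of the slide's inner product, assuming numpy and dropping the elided middle coordinates (which contribute zero in the example):

```python
import numpy as np

doc1 = np.array([1, 1, 3, 5])
doc2 = np.array([2, 0, 0, 1])
# Inner product: 1*2 + 1*0 + 3*0 + 5*1 = 7, matching the slide.
print(doc1 @ doc2)
```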
Vector Length

[Figure: a right triangle in the (x_1, x_2) plane, with legs a and b and hypotenuse c = (a^2 + b^2)^{1/2}]

- Pythagorean theorem: for a right triangle with sides of length a and b,
- the hypotenuse has length c = \sqrt{a^2 + b^2}
- This holds generally
Vector (Euclidean) Length

Definition
Suppose v \in \mathbb{R}^J. Then we define its length as

||v|| = (v \cdot v)^{1/2} = (v_1^2 + v_2^2 + v_3^2 + \dots + v_J^2)^{1/2}
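Numerically, the length is just the square root of a vector's inner product with itself; numpy's built-in norm agrees:

```python
import numpy as np

v = np.array([1.0, 4.0])
print(np.sqrt(v @ v))     # (v . v)^(1/2) = sqrt(1 + 16) ≈ 4.123
print(np.linalg.norm(v))  # identical: numpy's Euclidean norm
```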
Measures of Dissimilarity

Initial guess: distance metrics. Properties of a metric (distance function) d(\cdot, \cdot), for arbitrary documents X_i, X_j, X_k:

1) d(X_i, X_j) ≥ 0
2) d(X_i, X_j) = 0 if and only if X_i = X_j
3) d(X_i, X_j) = d(X_j, X_i)
4) d(X_i, X_k) ≤ d(X_i, X_j) + d(X_j, X_k)

Explore distance functions to compare documents. Do we want additional assumptions/properties?
Measuring the Distance Between Documents: Euclidean Distance

[Figure: two document vectors on a Word 1 vs. Word 2 plane, with the Euclidean distance drawn as the straight line between their endpoints]
Measuring the Distance Between Documents

Definition
The Euclidean distance between documents X_i and X_j is

||X_i - X_j|| = \sqrt{\sum_{m=1}^{J} (x_{im} - x_{jm})^2}

Suppose X_i = (1, 4) and X_j = (2, 1). The distance between the documents is:

||(1, 4) - (2, 1)|| = \sqrt{(1 - 2)^2 + (4 - 1)^2} = \sqrt{10}
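The worked example, verified with numpy:

```python
import numpy as np

xi, xj = np.array([1.0, 4.0]), np.array([2.0, 1.0])
# ||(1,4) - (2,1)|| = sqrt((1-2)^2 + (4-1)^2) = sqrt(10) ≈ 3.162
print(np.linalg.norm(xi - xj))
```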
Measuring the Distance Between Documents

There are many distance metrics. Consider the Minkowski family:

Definition
The Minkowski distance between documents X_i and X_j for value p is

d_p(X_i, X_j) = \left( \sum_{m=1}^{J} |x_{im} - x_{jm}|^p \right)^{1/p}
Members of the Minkowski Family

Manhattan metric:

d_1(X_i, X_j) = \sum_{m=1}^{J} |x_{im} - x_{jm}|

d_1((1, 4), (2, 1)) = |1 - 2| + |4 - 1| = 4

Minkowski (p) metric:

d_p(X_i, X_j) = \left( \sum_{m=1}^{J} |x_{im} - x_{jm}|^p \right)^{1/p}

d_p((1, 4), (2, 1)) = (|1 - 2|^p + |4 - 1|^p)^{1/p}
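Both members come from one function. A minimal sketch (the helper name `minkowski` is ours, not from the slides):

```python
import numpy as np

def minkowski(xi, xj, p):
    """Minkowski distance d_p between two document vectors."""
    return float(np.sum(np.abs(xi - xj) ** p) ** (1.0 / p))

xi, xj = np.array([1.0, 4.0]), np.array([2.0, 1.0])
print(minkowski(xi, xj, 1))  # Manhattan: |1-2| + |4-1| = 4
print(minkowski(xi, xj, 2))  # Euclidean: sqrt(10) ≈ 3.162
```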
What Does p Do?

Increasing p gives greater importance to the coordinates with the largest differences. If we let p → ∞ we obtain the maximum metric (Chebyshev metric):

\lim_{p \to \infty} d_p(X_i, X_j) = \max_{m=1,\dots,J} |x_{im} - x_{jm}|

In words: the distance between documents is just the biggest difference; all other differences do not contribute to the distance measure. Decreasing p gives greater importance to the coordinates with the smallest differences:

\lim_{p \to -\infty} d_p(X_i, X_j) = \min_{m=1,\dots,J} |x_{im} - x_{jm}|
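The limiting behavior is easy to see numerically. A sketch using the vectors from the next slide: as p grows, d_p falls toward the largest coordinate-wise difference.

```python
import numpy as np

xi, xj = np.array([10.0, 4.0, 3.0]), np.array([0.0, 0.0, 0.0])
for p in (1, 2, 4, 16, 64):
    print(p, np.sum(np.abs(xi - xj) ** p) ** (1.0 / p))
# The printed distances fall from 17 toward max_m |x_im - x_jm| = 10.
```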
Comparing the Metrics

Suppose X_i = (10, 4, 3), X_j = (0, 4, 3), and X_k = (0, 0, 0). Then:

d_1(X_i, X_j) = 10
d_1(X_i, X_k) = 10 + 4 + 3 = 17
d_2(X_i, X_j) = 10
d_2(X_i, X_k) = \sqrt{10^2 + 4^2 + 3^2} = \sqrt{125} = 11.18
d_4(X_i, X_j) = 10
d_4(X_i, X_k) = (10^4 + 4^4 + 3^4)^{1/4} = (10337)^{1/4} = 10.08
d_∞(X_i, X_j) = 10
d_∞(X_i, X_k) = 10
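The whole table can be reproduced in a few lines (the `minkowski` helper from above, repeated so the snippet stands alone):

```python
import numpy as np

def minkowski(xi, xj, p):
    return float(np.sum(np.abs(xi - xj) ** p) ** (1.0 / p))

xi = np.array([10.0, 4.0, 3.0])
xj = np.array([0.0, 4.0, 3.0])
xk = np.array([0.0, 0.0, 0.0])
for p in (1, 2, 4):
    print(p, minkowski(xi, xj, p), minkowski(xi, xk, p))
print(np.max(np.abs(xi - xk)))  # d_inf(X_i, X_k) = 10
```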
Are all differences equal?

The previous metrics treat all dimensions as equal, but we may want to engage in some scaling/reweighting: the Mahalanobis distance.

Definition
Suppose that we have a covariance matrix Σ. Then we can define the Mahalanobis distance between documents X_i and X_j as

d_{Mah}(X_i, X_j) = \sqrt{(X_i - X_j)' Σ^{-1} (X_i - X_j)}

More generally, Σ could be any symmetric, positive-definite matrix. What does Σ do?
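A direct transcription of the definition, assuming Σ is invertible; with Σ = I it reduces to the Euclidean distance, previewing the special cases below:

```python
import numpy as np

def mahalanobis(xi, xj, Sigma):
    """Mahalanobis distance; Sigma must be symmetric positive-definite."""
    d = xi - xj
    return float(np.sqrt(d @ np.linalg.inv(Sigma) @ d))

xi, xj = np.array([1.0, 4.0]), np.array([2.0, 1.0])
print(mahalanobis(xi, xj, np.eye(2)))            # Sigma = I: sqrt(10), Euclidean
print(mahalanobis(xi, xj, np.diag([1.0, 4.0])))  # down-weights the second word
```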
Some Intuition: The Unit Circle

[Figure: the "unit circle" of points at distance 1 under the Mahalanobis distance induced by each Σ below, plotted on Dim 1 vs. Dim 2]

Σ = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
Σ = \begin{pmatrix} 1 & 0 \\ 0 & 0.5 \end{pmatrix}, \quad
Σ = \begin{pmatrix} 0.5 & 0 \\ 0 & 1 \end{pmatrix}, \quad
Σ = \begin{pmatrix} 1 & 0.3 \\ 0.3 & 0.5 \end{pmatrix}
Measuring Distance with Mahalanobis

Special case 1: identity matrix.

Σ = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}

Then the distance is Euclidean.

Special case 2: diagonal matrix.

Σ = \begin{pmatrix} σ_1^2 & 0 & \dots & 0 \\ 0 & σ_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & σ_J^2 \end{pmatrix}

Then

d_{Mah}(X_i, X_j) = \sqrt{\sum_{m=1}^{J} \frac{(x_{im} - x_{jm})^2}{σ_m^2}}
Measuring Similarity

What properties should a similarity measure have?

- Maximum: a document with itself
- Minimum: documents that have no words in common (orthogonal)
- Increasing when more of the same words are used
- Perhaps symmetry: s(a, b) = s(b, a)?

How should additional words be treated?
Measuring Similarity

[Figure: the vectors (2, 1) and (1, 4) on a Word 1 vs. Word 2 plane]

Measure 1: inner product

(2, 1)' \cdot (1, 4) = 6
[Figure: the vectors (2, 1), (4, 2), and (1, 4) on a Word 1 vs. Word 2 plane, with the angle θ between them marked]

Problem(?): the inner product is length dependent

(4, 2)'(1, 4) = 12

a \cdot b = ||a|| \times ||b|| \times \cos θ
Cosine Similarity

\cos θ = \left( \frac{a}{||a||} \right) \cdot \left( \frac{b}{||b||} \right)

\frac{(4, 2)}{||(4, 2)||} = (0.89, 0.45)

\frac{(2, 1)}{||(2, 1)||} = (0.89, 0.45)

\frac{(1, 4)}{||(1, 4)||} = (0.24, 0.97)

(0.89, 0.45)'(0.24, 0.97) = 0.65
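The same numbers in code. Note that (4, 2) and (2, 1) point in the same direction, so both have the same cosine with (1, 4):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: inner product of the unit-length vectors."""
    return float((a / np.linalg.norm(a)) @ (b / np.linalg.norm(b)))

print(cosine(np.array([4.0, 2.0]), np.array([1.0, 4.0])))  # ≈ 0.65
print(cosine(np.array([2.0, 1.0]), np.array([1.0, 4.0])))  # same angle, ≈ 0.65
```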
Cosine Similarity

[Figure: the same document vectors projected to unit length, so all documents lie on the unit sphere with the angle θ between them]

\cos θ removes document length from the similarity measure: it projects texts to a unit-length representation on the sphere.
Von Mises-Fisher Distribution

Consider document X_i, normalized to unit length:

X_i^* = \frac{X_i}{||X_i||}

Then we might suppose:

X_i^* \sim \text{von Mises-Fisher}(κ, μ)

p(x_i^* | κ, μ) = c(κ) \exp(κ x_i^{*\prime} μ)

A normal distribution, on a sphere:

- Straightforward to maximize
- Conjugate to itself
- Useful for clustering, hierarchies of topics
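The normalization step is the same unit-length projection used for cosine similarity; a sketch (the vMF normalizing constant c(κ) is omitted):

```python
import numpy as np

X = np.array([[4.0, 2.0],
              [2.0, 1.0],
              [1.0, 4.0]])
# Project each document to unit length: X*_i = X_i / ||X_i||.
X_star = X / np.linalg.norm(X, axis=1, keepdims=True)
print(X_star)                          # rows now lie on the unit sphere
print(np.linalg.norm(X_star, axis=1))  # all 1.0
```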
Kernel Similarity

Definition
Suppose we have documents X_i and X_j. Define the Gaussian kernel as

k(X_i, X_j) = \exp\left( \frac{-||X_i - X_j||^2}{σ^2} \right)

This is the kernel of the Gaussian distribution.
σ^2 determines the sensitivity of the kernel.
If X_i = X_j then k(X_i, X_j) = 1.
As X_i and X_j become more dissimilar, k(X_i, X_j) → 0.
This result often justifies setting some kernel weights to zero.
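A direct transcription of the kernel; the σ² values below are arbitrary choices to show the decay:

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma2=1.0):
    """Gaussian kernel: exp(-||xi - xj||^2 / sigma2)."""
    d = xi - xj
    return float(np.exp(-(d @ d) / sigma2))

xi, xj = np.array([1.0, 4.0]), np.array([2.0, 1.0])
print(gaussian_kernel(xi, xi))        # identical documents: 1.0
print(gaussian_kernel(xi, xj, 10.0))  # exp(-10/10) ≈ 0.368
print(gaussian_kernel(xi, xj, 1.0))   # exp(-10) ≈ 4.5e-05, near zero
```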
The Kernel Trick

Suppose all of our documents X_i \in \mathbb{R}^J. There may be some mapping φ: \mathbb{R}^J → \mathbb{R}^M with M > J that improves our performance, a "lift" to a higher dimension. We might want, then,

s(φ(X_i), φ(X_j)) = \langle φ(X_i), φ(X_j) \rangle

- The only thing we care about, though, is the inner product of the transformed variables
- So long as we can calculate the inner product, we need not make the transformation explicit
- Kernels provide methods for capturing a wide array of transformations
- Kernel trick: calculate inner products on the untransformed data (e.g., the Gaussian kernel) and implicitly use a wide array of φ's, as in the sketch below
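A concrete illustration of the trick, using a polynomial kernel rather than the Gaussian kernel named on the slide, because its lift φ can be written out exactly: for x ∈ R², k(x, y) = (x · y)² equals the inner product after the explicit map φ(x) = (x_1², √2 x_1 x_2, x_2²).

```python
import numpy as np

def phi(x):
    """Explicit lift to R^3 whose inner product matches (x . y)^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([2.0, 1.0]), np.array([1.0, 4.0])
print((x @ y) ** 2)     # kernel on the untransformed data: 36.0
print(phi(x) @ phi(y))  # inner product in the lifted space: 36.0
```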
Weighting Words

Are all words created equal?

- Treat all words equally
  - Lots of noise
- Reweight words
  - Accentuate words that are likely to be informative
  - Make specific assumptions about the characteristics of informative words

How to generate weights?

- Assumptions about separating words
- Use a training set to identify separating words (Monroe, ideology measurement)
Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently
- But not too frequently

Example: if all statements about OBL contain "Bin Laden", then this contributes nothing to similarity/dissimilarity measures.

Inverse document frequency:

n_j = number of documents in which word j occurs

idf_j = \log \frac{N}{n_j}

idf = (idf_1, idf_2, \dots, idf_J)
Weighting Words: TF-IDF Weighting

Why log?

- Maximum at n_j = 1
- Decreases at rate 1/n_j ⇒ diminishing "penalty" for more common use
- Other functional forms are fine; they embed assumptions about the penalization of common use
Weighting Words: TF-IDF

X_{i,idf} ≡ \underbrace{X_i}_{\text{tf}} \times idf = (X_{i1} \times idf_1, X_{i2} \times idf_2, \dots, X_{iJ} \times idf_J)

X_{j,idf} ≡ X_j \times idf = (X_{j1} \times idf_1, X_{j2} \times idf_2, \dots, X_{jJ} \times idf_J)

How does this matter for measuring similarity/dissimilarity? Inner product:

X_{i,idf} \cdot X_{j,idf} = (X_i \times idf)'(X_j \times idf)
                          = (idf_1^2 \times X_{i1} \times X_{j1}) + (idf_2^2 \times X_{i2} \times X_{j2}) + \dots + (idf_J^2 \times X_{iJ} \times X_{jJ})
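A sketch of the tf-idf weighting and the weighted inner product, reusing the X and idf arrays from the idf sketch above:

```python
import numpy as np

X_idf = X * idf                     # elementwise: X_ij * idf_j, row by row
s_02 = X_idf[0] @ X_idf[2]          # inner product of documents 0 and 2
# the same quantity written as sum_j idf_j^2 * X_0j * X_2j:
s_check = np.sum(idf**2 * X[0] * X[2])
assert np.isclose(s_02, s_check)
```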
Weighting Words: Inner Product

Define Σ as the J × J diagonal matrix with the squared idf weights on its diagonal:

Σ = diag(idf_1^2, idf_2^2, . . . , idf_J^2)

If we use tf-idf for our documents, then

d_2(X_i, X_j) = sqrt( sum_{m=1}^{J} (x_im,idf − x_jm,idf)^2 )
              = sqrt( (X_i − X_j)' Σ (X_i − X_j) )
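A sketch checking that the tf-idf Euclidean distance equals the quadratic form with Σ = diag(idf^2), again reusing X and X_idf from the sketches above:

```python
import numpy as np

Sigma = np.diag(idf**2)
d_direct = np.sqrt(np.sum((X_idf[0] - X_idf[1])**2))   # distance in tf-idf space
diff = X[0] - X[1]
d_quad = np.sqrt(diff @ Sigma @ diff)                  # quadratic form on raw counts
assert np.isclose(d_direct, d_quad)
```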
Final Product

Applying some measure of distance, or similarity (if symmetric), yields:

D =
  [   0       d(1,2)   d(1,3)   . . .   d(1,N)
    d(2,1)      0      d(2,3)   . . .   d(2,N)
    d(3,1)   d(3,2)      0      . . .   d(3,N)
     ...       ...      ...     . . .    ...
    d(N,1)   d(N,2)   d(N,3)    . . .     0    ]

The lower triangle contains the unique information: N(N − 1)/2 entries.
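A sketch assembling the full matrix D from the tf-idf rows above; only the lower triangle is computed, since it carries all the unique information:

```python
import numpy as np

N = X_idf.shape[0]
D = np.zeros((N, N))
for i in range(N):
    for j in range(i):              # fill the lower triangle only
        D[i, j] = np.sqrt(np.sum((X_idf[i] - X_idf[j])**2))
D = D + D.T                         # symmetry; the diagonal stays zero
print(D)
```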
Spirling and Indian Treaties

Spirling (2013): model treaties between the US and Native Americans

Why?

- American political development
- IR theories of treaties and treaty violations
- Comparative studies of indigenous/colonialist interaction
- Political science question: how did Native Americans lose land so quickly?

The paper does a lot. We're going to focus on:

- Today: text representation and similarity calculation
- Tuesday: projecting to a low-dimensional space
Spirling and Indian Treaties

How do we preserve word order and semantic language? After stemming, stopping, and bag-of-wording:

- Peace Between Us
- No Peace Between Us

are identical. Spirling instead uses a more complicated representation of texts that preserves word order (and has broad application): a text like "Peace Between Us" is analyzed via its K-substrings (a sketch follows below).
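A hedged sketch of K-substring extraction, here meaning all contiguous character substrings of length 1 to K; this illustrates the idea rather than reproducing Spirling's exact preprocessing:

```python
def k_substrings(text, k):
    """All contiguous substrings of text with length 1..k."""
    subs = set()
    for length in range(1, k + 1):
        for start in range(len(text) - length + 1):
            subs.add(text[start:start + length])
    return subs

a = k_substrings("peace between us", 5)
b = k_substrings("no peace between us", 5)
print(a == b)                  # False: negation and word order now matter
print(len(a & b), len(a), len(b))
```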
Kernel Trick

- Kernel methods: represent texts and measure similarity simultaneously
- Compare only substrings in both documents (without explicitly quantifying entire documents)
- Problem solved:
  - Arthur gives all his money to Justin
  - Justin gives all his money to Arthur
  - Discard word order: same sentence. Kernel: different sentences.

Spirling uses kernel methods to measure similarity.
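A minimal sketch of a substring-count kernel: similarity is an inner product in substring space, summing only over substrings the two documents share, so the full feature vectors are never built. Spirling's paper uses a more sophisticated string kernel; this is the simplest version of the idea:

```python
from collections import Counter

def substring_kernel(s, t, k):
    """Inner product over counts of length-k contiguous substrings."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[sub] * ct[sub] for sub in cs.keys() & ct.keys())

s1 = "arthur gives all his money to justin"
s2 = "justin gives all his money to arthur"
print(substring_kernel(s1, s2, 4))   # high overlap, but less than...
print(substring_kernel(s1, s1, 4))   # ...self-similarity
```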
Similarity and Dissimilarity of Many Things
Throughout the course we’ll measure similarity between documentsWe’ll also (implicitly) study similarity of probability distributionsDevelop a measure of distribution dissimilarity
Justin Grimmer (Stanford University) Text as Data October 9th, 2014 33 / 44
Similarity of Probability Distributions

Definition
Suppose P is a continuous random variable with density p : ℝ → ℝ and Q is a continuous random variable with density q : ℝ → ℝ. We can define the KL-divergence between P and Q as

KL(P||Q) = ∫_{−∞}^{∞} p(x) log( p(x) / q(x) ) dx
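A sketch of the KL integral by numerical quadrature, assuming scipy is available; the bounds lo and hi should cover the support of p:

```python
import numpy as np
from scipy.integrate import quad
from scipy import stats

def kl_divergence(p_pdf, q_pdf, lo, hi):
    """KL(P||Q) = integral of p(x) log(p(x)/q(x)) over the support of p."""
    integrand = lambda x: p_pdf(x) * np.log(p_pdf(x) / q_pdf(x))
    value, _err = quad(integrand, lo, hi)
    return value
```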
Assessing Similarity of Other Things
KL-divergence measures dissimilarity between two distributions.
Consider a function, f(x) = −x^2. It maps numbers to other numbers.

[Figure: plot of f(x) = −x^2 for x ∈ [−4, 4]]

Take some input (−2 here), then obtain the value of f(−2) = −4.

[Figure: the same plot with the input x = −2 and the output f(−2) = −4 marked]
KL(q||p) is a functional. A functional takes functions as inputs and returns a real number. KL(q||p) maps from distributions q ∈ Q and p ∈ P to non-negative real numbers.

For example, we could set q = Uniform(0,1) and p = Normal(0,1):

KL(Uniform(0,1) || Normal(0,1)) = 1.09

[Figure: densities of Uniform(0,1) and Normal(0,1) on x ∈ [−4, 4]]
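Reusing the kl_divergence helper sketched above to check the number; analytically, KL = ½ log(2π) + 1/6 ≈ 1.0856:

```python
from scipy import stats

u = stats.uniform(0, 1)      # Uniform(0, 1)
n = stats.norm(0, 1)         # Normal(0, 1)
print(kl_divergence(u.pdf, n.pdf, 0.0, 1.0))   # ≈ 1.09
```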
If q and p are the same distribution, then KL(q||p) = 0.

Variational approximation (topic models!): approximate one distribution p with another, simpler distribution q. Then make this approximation the best possible: minimize the KL-divergence.
A simple example: approximate a Normal(0,1) with a symmetric uniform distribution, Uniform(−b, b). Choose b to minimize KL(Uniform(−b, b) || Normal(0,1)).

[Figure: Normal(0,1) density with a candidate Uniform(−b, b) density overlaid]
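A sketch of the minimization, using the closed form that falls out of plugging the two densities into the KL integral (the algebra is sketched after the answer below):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_uniform_normal(b):
    # KL(Uniform(-b,b) || Normal(0,1)) = -log(2b) + 0.5*log(2*pi) + b^2/6
    return -np.log(2 * b) + 0.5 * np.log(2 * np.pi) + b**2 / 6

res = minimize_scalar(kl_uniform_normal, bounds=(0.1, 5.0), method="bounded")
print(res.x, np.sqrt(3))     # both ≈ 1.732
```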
Answer: b = √3 ≈ 1.73

[Figure: Normal(0,1) density with the optimal Uniform(−√3, √3) density overlaid]
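For completeness, a derivation sketch of the closed form and the first-order condition behind b = √3; this algebra is filled in from the KL definition above and is not on the slides (φ denotes the standard normal density):

```latex
\begin{align*}
\mathrm{KL}\!\left(\mathrm{Unif}(-b,b)\,\middle\|\,\mathrm{N}(0,1)\right)
  &= \int_{-b}^{b} \frac{1}{2b}\,\log\frac{1/(2b)}{\phi(x)}\,dx
   = -\log(2b) + \tfrac{1}{2}\log(2\pi) + \frac{b^{2}}{6} \\
\frac{d\,\mathrm{KL}}{db} &= -\frac{1}{b} + \frac{b}{3} = 0
  \quad\Rightarrow\quad b^{2} = 3
  \quad\Rightarrow\quad b = \sqrt{3}
\end{align*}
```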
1) Documents in vector space ⇒ a geometry of texts

2) Many methods to measure similarity and dissimilarity