Text as Data

Justin Grimmer

Associate Professor

Department of Political Science

Stanford University

October 9th, 2014

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 1 / 44

The Vector Space Model of Text

1) Task

- Numerous tasks suppose that we can measure document similarity or dissimilarity

2) Objective Function

- For a variety of tasks, we will impose some measure or definition of similarity, dissimilarity, or distance:

$d(X_i, X_j)$ = dissimilarity (distance): bigger implies further apart

$s(X_i, X_j)$ = similarity: bigger implies closer together

- Objective functions determine which points we compare and how we aggregate similarity, dissimilarity, and distance

3) Optimization

- Depends on the particular task, likely arranging/grouping objects to find similarity

4) Validation

- Do the mathematical definitions of similarity actually capture similarity for our particular purpose?

Texts and Geometry

Consider a document-term matrix

$$
X = \begin{pmatrix}
1 & 2 & 0 & \dots & 0 \\
0 & 0 & 3 & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & 0 & 0 & \dots & 3
\end{pmatrix}
$$

Suppose documents live in a space. This gives us a rich set of results from linear algebra:

- Provides a geometry, which we can modify with word weighting

- Natural notions of distance

- Kernel Trick: richer comparisons of large feature spaces

- Building block for clustering, supervised learning, and scaling

Texts in Space

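Doc1 = (1, 1, 3, ..., 5)

Doc2 = (2, 0, 0, ..., 1)

$\text{Doc1}, \text{Doc2} \in \mathbb{R}^J$

Inner product between documents:

$$
\begin{aligned}
\text{Doc1} \cdot \text{Doc2} &= (1, 1, 3, \dots, 5)'(2, 0, 0, \dots, 1) \\
&= 1 \times 2 + 1 \times 0 + 3 \times 0 + \dots + 5 \times 1 \\
&= 7
\end{aligned}
$$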
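The inner product can be sketched in a few lines of Python. This is only an illustration: the slide elides the middle coordinates, so the 4-term vectors `doc1` and `doc2` below are stand-ins that keep just the coordinates shown, and the helper name `inner_product` is hypothetical.

```python
# Illustrative stand-ins for the slide's vectors: the elided middle
# coordinates are dropped, so these are 4-term toy documents (an assumption).
doc1 = [1, 1, 3, 5]
doc2 = [2, 0, 0, 1]


def inner_product(u, v):
    """Sum of coordinate-wise products of two equal-length count vectors."""
    if len(u) != len(v):
        raise ValueError("documents must be counted over the same vocabulary")
    return sum(a * b for a, b in zip(u, v))


print(inner_product(doc1, doc2))  # 1*2 + 1*0 + 3*0 + 5*1 = 7
```

Only terms that appear in both documents contribute, which is why the inner product is a natural building block for similarity measures.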

Vector Length

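[Figure: a right triangle drawn on axes x_1 and x_2, with legs a and b and hypotenuse c]

- Pythagorean Theorem: side with length a

- Side with length b and a right triangle

- $c = \sqrt{a^2 + b^2}$

- This is generally true (it extends to any number of dimensions)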

Vector (Euclidean) Length

Definition

Suppose $v \in \mathbb{R}^J$. Then, we will define its length as

$$
\begin{aligned}
\|v\| &= (v \cdot v)^{1/2} \\
&= (v_1^2 + v_2^2 + v_3^2 + \dots + v_J^2)^{1/2}
\end{aligned}
$$
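As a quick check of the definition, here is a minimal Python sketch that computes the length as the square root of a vector's inner product with itself. The function name `length` and the sample vectors are illustrative assumptions, not part of the slides.

```python
import math


def length(v):
    """Euclidean length ||v|| = (v . v)^(1/2)."""
    return math.sqrt(sum(x * x for x in v))


print(length([3, 4]))        # 5.0, the familiar 3-4-5 right triangle
print(length([1, 1, 3, 5]))  # sqrt(1 + 1 + 9 + 25) = 6.0
```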

Measures of Dissimilarity

Initial guess: distance metrics

Properties of a metric (distance function) $d(\cdot, \cdot)$. Consider arbitrary documents $X_i$, $X_j$, $X_k$:

1) $d(X_i, X_j) \geq 0$

2) $d(X_i, X_j) = 0$ if and only if $X_i = X_j$

3) $d(X_i, X_j) = d(X_j, X_i)$

4) $d(X_i, X_k) \leq d(X_i, X_j) + d(X_j, X_k)$

Explore distance functions to compare documents

Do we want additional assumptions/properties?
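The four properties can be spot-checked numerically on a handful of points. The sketch below is an illustration under assumptions: the documents in `docs` are made up, `satisfies_metric_properties` is a hypothetical helper, and `math.dist` requires Python 3.8 or later. A finite check like this can refute a candidate distance but cannot prove it is a metric. Squared Euclidean distance is included as a contrast because it is a reasonable dissimilarity that violates the triangle inequality.

```python
import itertools
import math

# Hypothetical two-word documents, chosen only for illustration.
docs = [(1, 4), (2, 1), (0, 0), (3, 3)]


def satisfies_metric_properties(d, points, tol=1e-12):
    """Check the four metric properties for d on every triple of points."""
    for xi, xj, xk in itertools.product(points, repeat=3):
        if d(xi, xj) < 0:                            # 1) non-negativity
            return False
        if (d(xi, xj) == 0) != (xi == xj):           # 2) zero only for identical points
            return False
        if abs(d(xi, xj) - d(xj, xi)) > tol:         # 3) symmetry
            return False
        if d(xi, xk) > d(xi, xj) + d(xj, xk) + tol:  # 4) triangle inequality
            return False
    return True


def squared_euclidean(u, v):
    """Squared Euclidean distance: a dissimilarity that is not a metric."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


print(satisfies_metric_properties(math.dist, docs))          # True
print(satisfies_metric_properties(squared_euclidean, docs))  # False
```

The squared distance fails on the triple (0, 0), (2, 1), (3, 3): going direct costs 18, while going through (2, 1) costs 5 + 5 = 10.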

Measuring the Distance Between Documents: Euclidean Distance

[Figure: documents plotted in two dimensions, with axes labeled Word 1 and Word 2 and ticks from 0 to 4]

Measuring the Distance Between Documents

Definition

The Euclidean distance between documents $X_i$ and $X_j$ is

$$
\|X_i - X_j\| = \sqrt{\sum_{m=1}^{J} (x_{im} - x_{jm})^2}
$$

Suppose $X_i = (1, 4)$ and $X_j = (2, 1)$. The distance between the documents is:

$$
\|(1, 4) - (2, 1)\| = \sqrt{(1 - 2)^2 + (4 - 1)^2} = \sqrt{10}
$$

Measuring the Distance Between Documents

Many distance metrics

Consider the Minkowski family

Definition

The Minkowski distance between documents $X_i$ and $X_j$ for value $p$ is

$$
d_p(X_i, X_j) = \left( \sum_{m=1}^{J} |x_{im} - x_{jm}|^p \right)^{1/p}
$$

Members of the Minkowski Family

Manhattan metric

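$$
d_1(X_i, X_j) = \sum_{m=1}^{J} |x_{im} - x_{jm}|
$$

$$
d_1((1, 4), (2, 1)) = |1 - 2| + |4 - 1| = 4
$$

Minkowski (p) metric

$$
d_p(X_i, X_j) = \left( \sum_{m=1}^{J} |x_{im} - x_{jm}|^p \right)^{1/p}
$$

$$
d_p((1, 4), (2, 1)) = \left( |1 - 2|^p + |4 - 1|^p \right)^{1/p}
$$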
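The family can be written as a single function of p. The following is a minimal sketch, assuming p >= 1 (the range in which the formula is a metric); the function name `minkowski` is an illustrative choice, and the two calls reuse the documents (1, 4) and (2, 1) from the slide.

```python
def minkowski(xi, xj, p):
    """Minkowski distance d_p between two equal-length vectors, for p >= 1."""
    if p < 1:
        raise ValueError("this sketch only covers p >= 1")
    return sum(abs(a - b) ** p for a, b in zip(xi, xj)) ** (1 / p)


xi, xj = (1, 4), (2, 1)
print(minkowski(xi, xj, 1))  # Manhattan: |1 - 2| + |4 - 1| = 4.0
print(minkowski(xi, xj, 2))  # Euclidean: sqrt(10), about 3.162
```

Setting p = 2 recovers the Euclidean distance, so both worked examples for this pair of documents come from the same formula.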

What Does p Do?

Increasing $p$: greater importance of coordinates with the largest differences

If we let $p \to \infty$, we obtain the maximum metric (Chebyshev's metric):

$$
\lim_{p \to \infty} d_p(X_i, X_j) = \max_{m=1}^{J} |x_{im} - x_{jm}|
$$

In words: the distance between documents is only the biggest difference

All other differences do not contribute to the distance measure

Decreasing $p$: greater importance of coordinates with the smallest differences

$$
\lim_{p \to -\infty} d_p(X_i, X_j) = \min_{m=1}^{J} |x_{im} - x_{jm}|
$$
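A short derivation makes the $p \to \infty$ limit concrete. It is a sketch of the standard argument rather than something taken from the slides, and the symbols $a_m$ and $M$ are introduced here only for convenience; it assumes the two documents differ in at least one coordinate, so that $M > 0$.

```latex
\begin{aligned}
a_m &= |x_{im} - x_{jm}|, \qquad M = \max_{m=1}^{J} a_m > 0, \\
d_p(X_i, X_j) &= \Big( \sum_{m=1}^{J} a_m^p \Big)^{1/p}
             = M \Big( \sum_{m=1}^{J} (a_m / M)^p \Big)^{1/p}, \\
1 \le \sum_{m=1}^{J} (a_m / M)^p \le J
  \quad &\Longrightarrow \quad
  M \le d_p(X_i, X_j) \le M \, J^{1/p} \xrightarrow[p \to \infty]{} M .
\end{aligned}
```

Every ratio $a_m / M$ is at most 1 and at least one equals 1, so as $p$ grows all smaller differences are driven toward zero and only the largest survives.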

Comparing the Metrics

Suppose $X_i = (10, 4, 3)$, $X_j = (0, 4, 3)$, and $X_k = (0, 0, 0)$

Then:

$d_1(X_i, X_j) = 10$

$d_1(X_i, X_k) = 10 + 4 + 3 = 17$

$d_2(X_i, X_j) = 10$

$d_2(X_i, X_k) = \sqrt{10^2 + 4^2 + 3^2} = \sqrt{125} \approx 11.18$

$d_4(X_i, X_j) = 10$

$d_4(X_i, X_k) = (10^4 + 4^4 + 3^4)^{1/4} = (10337)^{1/4} \approx 10.08$

$d_\infty(X_i, X_j) = 10$

$d_\infty(X_i, X_k) = 10$
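These values can be reproduced with a few lines of Python. The sketch below is illustrative: the function name `minkowski` and the convention of passing `math.inf` for the maximum metric are assumptions made here, not notation from the slides.

```python
import math


def minkowski(xi, xj, p):
    """Minkowski distance d_p for p >= 1; p = math.inf gives the maximum metric."""
    diffs = [abs(a - b) for a, b in zip(xi, xj)]
    if p == math.inf:
        return float(max(diffs))
    return sum(d ** p for d in diffs) ** (1 / p)


x_i, x_j, x_k = (10, 4, 3), (0, 4, 3), (0, 0, 0)

# Each row: p, d_p(X_i, X_j), d_p(X_i, X_k), rounded to two decimals.
for p in (1, 2, 4, math.inf):
    print(p, round(minkowski(x_i, x_j, p), 2), round(minkowski(x_i, x_k, p), 2))
# 1 10.0 17.0
# 2 10.0 11.18
# 4 10.0 10.08
# inf 10.0 10.0
```

The distance to $X_j$ stays at 10 for every p because the documents differ in a single coordinate, while the distance to $X_k$ shrinks toward 10 as p grows and the smaller differences matter less.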

Are all differences equal?

Previous metrics treat all dimensions as equal.

We may want to engage in some scaling/reweighting.

Mahalanobis Distance

Definition

Suppose that we have a covariance matrix $\Sigma$. Then we can define the Mahalanobis distance between documents $X_i$ and $X_j$ as

$$d_{\text{Mah}}(X_i, X_j) = \sqrt{(X_i - X_j)' \Sigma^{-1} (X_i - X_j)}$$

More generally: $\Sigma$ could be any symmetric and positive-definite matrix.

What does $\Sigma$ do?
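
To make the definition concrete, here is a small pure-Python sketch. The helper names `solve` and `mahalanobis` and the example vectors are assumptions for illustration. It computes $\Sigma^{-1}(X_i - X_j)$ by solving a linear system rather than forming the inverse explicitly.

```python
import math


def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(map(float, row)) + [float(v)] for row, v in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[r][n] / m[r][r] for r in range(n)]


def mahalanobis(x, y, sigma):
    """sqrt((x - y)' Sigma^{-1} (x - y)) for a symmetric positive-definite sigma."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve(sigma, diff)  # z = Sigma^{-1} (x - y)
    return math.sqrt(sum(d * v for d, v in zip(diff, z)))


identity = [[1.0, 0.0], [0.0, 1.0]]
sigma = [[1.0, 0.3], [0.3, 0.5]]

print(mahalanobis((3, 4), (0, 0), identity))  # plain Euclidean length
print(mahalanobis((1, 0), (0, 0), sigma))     # step along dimension 1
print(mahalanobis((0, 1), (0, 0), sigma))     # step along dimension 2
```

With this $\Sigma$, a unit step along the second (lower-variance) dimension counts as a larger distance than a unit step along the first.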

Some Intuition: The Unit Circle

[Figure: the unit circle drawn on axes Dim 1 and Dim 2, each running from -1.0 to 1.0, shown in turn for the following choices of $\Sigma$.]

$$\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

$$\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 0.5 \end{pmatrix}$$

$$\Sigma = \begin{pmatrix} 0.5 & 0 \\ 0 & 1 \end{pmatrix}$$

$$\Sigma = \begin{pmatrix} 1 & 0.3 \\ 0.3 & 0.5 \end{pmatrix}$$

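Measuring Distance with Mahalanobis

Special Case 1: Identity Matrix

$$\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

Then distance is Euclidean.

Special Case 2: Diagonal Matrix

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \dots & 0 \\ 0 & \sigma_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_J^2 \end{pmatrix}$$

Then

$$d_{\text{Mah}}(X_i, X_j) = \sqrt{\sum_{m=1}^{J} \frac{(x_{im} - x_{jm})^2}{\sigma_m^2}}$$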
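
Both special cases follow from the diagonal formula. The sketch below uses an assumed helper `diagonal_mahalanobis` and made-up variances. It shows that unit variances recover the Euclidean distance and that a large variance shrinks the contribution of its dimension.

```python
import math


def diagonal_mahalanobis(x, y, variances):
    """Mahalanobis distance when Sigma is diagonal with the given variances."""
    return math.sqrt(
        sum((a - b) ** 2 / s2 for a, b, s2 in zip(x, y, variances))
    )


x_i, x_k = (10, 4, 3), (0, 0, 0)

# Special case 1: all variances equal to 1 gives the Euclidean distance.
euclidean = diagonal_mahalanobis(x_i, x_k, (1, 1, 1))

# Special case 2: a hypothetical variance of 100 on the first word
# means its difference of 10 counts as a single unit.
downweighted = diagonal_mahalanobis(x_i, x_k, (100, 1, 1))

print(euclidean, downweighted)
```

Dividing each squared difference by its variance amounts to rescaling every word count by its standard deviation before taking the Euclidean distance.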

Measuring Similarity

What properties should a similarity measure have?

- Maximum: document with itself

- Minimum: documents have no words in common (orthogonal)

- Increasing when more of the same words are used

- Symmetric(?): s(a, b) = s(b, a)

How should additional words be treated?

Measuring Similarity

[Figure: plot with axes Word 1 and Word 2, each running from 0 to 4.]

Measure 1: Inner product

$$(2, 1)' \cdot (1, 4) = 6$$

[Figure: plot with axes Word 1 and Word 2, each running from 0 to 4, with the angle $\theta$ marked.]

Problem(?): length dependent

$$(4, 2)' (1, 4) = 12$$

$$a \cdot b = \|a\| \times \|b\| \times \cos\theta$$
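
The length dependence is easy to see in code. In this sketch the helper names `dot`, `norm`, and `angle` are assumptions for illustration. Doubling a document doubles its inner product with another document, while the angle between them stays the same.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length ||a||."""
    return math.sqrt(dot(a, a))


def angle(a, b):
    """Angle theta recovered from a . b = ||a|| ||b|| cos(theta)."""
    return math.acos(dot(a, b) / (norm(a) * norm(b)))


a, b = (2, 1), (1, 4)
a_doubled = (4, 2)  # the same document repeated twice

print(dot(a, b))          # 6
print(dot(a_doubled, b))  # 12
print(angle(a, b), angle(a_doubled, b))
```

Only the $\|a\|$ factor changes when a document is repeated, which is what motivates dividing the lengths out.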

Cosine Similarity

$$\cos\theta = \left(\frac{a}{\|a\|}\right) \cdot \left(\frac{b}{\|b\|}\right)$$

$$\frac{(4, 2)}{\|(4, 2)\|} = (0.89, 0.45)$$

$$\frac{(2, 1)}{\|(2, 1)\|} = (0.89, 0.45)$$

$$\frac{(1, 4)}{\|(1, 4)\|} = (0.24, 0.97)$$

$$(0.89, 0.45)' (0.24, 0.97) = 0.65$$
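
A minimal sketch of the same calculation follows. The function name `cosine_similarity` is an assumption, and the zero-vector check is an added safeguard that the slide does not discuss.

```python
import math


def cosine_similarity(a, b):
    """Inner product of a and b after scaling each to unit length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


short_doc = cosine_similarity((2, 1), (1, 4))
long_doc = cosine_similarity((4, 2), (1, 4))

print(round(short_doc, 2), round(long_doc, 2))
```

Both calls should agree, since (4, 2) and (2, 1) point in the same direction and differ only in length.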

Cosine Similarity

[Figure: plot with axes Word 1 and Word 2, each running from 0 to 4, with the angle $\theta$ marked.]

$\cos\theta$: removes document length from the similarity measure

Projects texts to a unit-length representation on the sphere

Von Mises-Fisher Distribution

Consider document $X_i$.

$$X_i^* = \frac{X_i}{\|X_i\|}$$

Then we might suppose:

$$X_i^* \sim \text{von Mises-Fisher}(\kappa, \mu)$$

$$p(x_i \mid \kappa, \mu) = c(\kappa) \exp\left(\kappa \, x_i^{*\prime} \mu\right)$$

Normal distribution, on a sphere

- Straightforward to maximize

- Conjugate to itself

- Useful for clustering, hierarchies of topics
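
As a rough illustration of why maximization is straightforward: the maximum-likelihood estimate of the mean direction $\mu$ is the normalized sum of the unit-length documents. The sketch below assumes that standard result; the helper names and example documents are hypothetical. It treats $\kappa$ as given and evaluates the log density only up to the constant $\log c(\kappa)$, which is not computed here.

```python
import math


def normalize(x):
    """Scale a nonzero vector to unit length."""
    length = math.sqrt(sum(v * v for v in x))
    return [v / length for v in x]


def mean_direction(docs):
    """MLE of mu: the normalized sum of the unit-length documents."""
    unit_docs = [normalize(d) for d in docs]
    total = [sum(col) for col in zip(*unit_docs)]
    return normalize(total)


def vmf_log_kernel(x, kappa, mu):
    """log p(x | kappa, mu) minus log c(kappa): kappa times the inner product of x* and mu."""
    return kappa * sum(a * b for a, b in zip(normalize(x), mu))


# Hypothetical two-word documents; kappa = 5.0 is an arbitrary choice.
docs = [(2, 1), (4, 2), (1, 4)]
mu_hat = mean_direction(docs)

for d in docs:
    print(d, round(vmf_log_kernel(d, 5.0, mu_hat), 3))
```

Documents whose direction is closer to `mu_hat` receive a larger value, and the value does not change with document length because each document is normalized first.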

Kernel Similarity

Definition

Suppose we have documents $X_i$ and $X_j$. Define the Gaussian kernel as

$$k(X_i, X_j) = \exp\left(-\frac{\|X_i - X_j\|^2}{\sigma^2}\right)$$

Kernel of the Gaussian distribution

$\sigma^2$ determines the sensitivity of the kernel

If $X_i = X_j$ then $k(X_i, X_j) = 1$

As $X_i$ and $X_j$ become more dissimilar, $k(X_i, X_j) \to 0$

Results often justify setting some kernel weights to zero
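
The role of $\sigma^2$ can be seen by evaluating the kernel at two bandwidths. This is a sketch only: the function name `gaussian_kernel` and the chosen values of $\sigma^2$ are assumptions.

```python
import math


def gaussian_kernel(x, y, sigma2):
    """exp(-||x - y||^2 / sigma2), following the definition on the slide."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma2)


x_i, x_j = (10, 4, 3), (0, 4, 3)  # squared distance is 100

same = gaussian_kernel(x_i, x_i, 1.0)      # identical documents
wide = gaussian_kernel(x_i, x_j, 100.0)    # large sigma2: still fairly similar
narrow = gaussian_kernel(x_i, x_j, 1.0)    # small sigma2: effectively zero

print(same, wide, narrow)
```

With a small $\sigma^2$ the similarity between distinct documents is so close to zero that truncating it to exactly zero changes very little.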

The Kernel Trick

Suppose all of our documents $X_i \in \mathbb{R}^J$

There may be some mapping $\phi : \mathbb{R}^J \to \mathbb{R}^M$, where $M > J$, that improves our performance by “lifting” the documents to a higher dimension

We might want, then,

$$s(\phi(X_i), \phi(X_j)) = \langle \phi(X_i), \phi(X_j) \rangle$$

- The only thing we care about, though, is the inner product of the transformed variables

- So long as we can calculate the inner product, we need not make the explicit transformation

- Kernels provide methods for capturing a wide array of transformations.

- Kernel trick: calculate inner products using only the untransformed data (e.g., the Gaussian kernel), implicitly using a wide array of $\phi$’s.
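The idea can be checked numerically with a kernel whose lift is small enough to write out. The sketch below swaps in a quadratic kernel, $(X_i \cdot X_j)^2$, rather than the Gaussian kernel, because its feature map $\phi$ (all pairwise products of coordinates, so $M = J^2$) is finite. The names `phi` and `quadratic_kernel` and the two vectors are assumptions made for illustration.

```python
def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit lift from R^J to R^(J*J): all pairwise products x_a * x_b."""
    return [a * b for a in x for b in x]

def quadratic_kernel(x, y):
    """The same inner product, computed on the untransformed vectors."""
    return dot(x, y) ** 2

# Hypothetical documents with J = 3 features.
x_i = [1.0, 2.0, 0.0]
x_j = [3.0, 1.0, 4.0]

explicit = dot(phi(x_i), phi(x_j))      # inner product in the 9-dimensional space
implicit = quadratic_kernel(x_i, x_j)   # never constructs phi
```

Both routes give the same number, but the implicit route only touches the original $J$ coordinates. That saving is what makes kernels practical when the lifted space is very large.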


Weighting Words

Are all words created equal?

- Treat all words equally

- Lots of noise

- Reweight words

- Accentuate words that are likely to be informative

- Make specific assumptions about characteristics of informative words

How to generate weights?

- Assumptions about separating words

- Use training set to identify separating words (Monroe, Ideology measurement)


Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently

- But not too frequently

Ex. If all statements about OBL contain “Bin Laden”, then this word contributes nothing to similarity/dissimilarity measures

Inverse document frequency:

$N$ = number of documents

$n_j$ = number of documents in which word $j$ occurs

$$\text{idf}_j = \log \frac{N}{n_j}$$

$$\textbf{idf} = (\text{idf}_1, \text{idf}_2, \ldots, \text{idf}_J)$$
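A small sketch of the inverse document frequency calculation, assuming the documents are stored as a list of count rows (one row per document, one column per word). The function name, the zero weight for words that never occur, and the toy counts are assumptions for illustration.

```python
import math

def inverse_document_frequency(doc_term_matrix):
    """Return idf_j = log(N / n_j) for each column j of a list-of-rows count matrix."""
    n_docs = len(doc_term_matrix)
    n_words = len(doc_term_matrix[0])
    idf = []
    for j in range(n_words):
        # n_j counts documents containing word j, not total occurrences.
        n_j = sum(1 for row in doc_term_matrix if row[j] > 0)
        # Assumption: a word that never occurs gets weight 0.0 instead of dividing by zero.
        idf.append(math.log(n_docs / n_j) if n_j > 0 else 0.0)
    return idf

# Hypothetical counts: rows are documents, columns are the words
# ("bin_laden", "raid", "pakistan").
X = [
    [3, 1, 0],
    [2, 0, 0],
    [1, 1, 2],
    [4, 0, 0],
]
idf = inverse_document_frequency(X)
```

The first word appears in every document, so its weight is zero and it drops out of any similarity calculation. The rarest word receives the largest weight.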


Weighting Words: TF-IDF Weighting

Why log?

- Maximum at $n_j = 1$

- Decreases at rate $1/n_j$ ⇒ diminishing “penalty” for more common use

- Other functional forms are fine; they embed different assumptions about the penalization of common use
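The rate in the second bullet can be read off by differentiating the weight with respect to $n_j$, treating $n_j$ as continuous (an assumption made only for this worked step):

$$\frac{\partial}{\partial n_j} \log \frac{N}{n_j} = \frac{\partial}{\partial n_j} \left( \log N - \log n_j \right) = -\frac{1}{n_j}$$

The weight therefore falls from $\log N$ at $n_j = 1$ to $0$ at $n_j = N$, and each additional document containing the word lowers the weight by less than the previous one did.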


Weighting Words: TF-IDF

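$$X_{i,\text{idf}} \equiv \underbrace{X_i}_{\text{tf}} \times \textbf{idf} = (X_{i1} \times \text{idf}_1, X_{i2} \times \text{idf}_2, \ldots, X_{iJ} \times \text{idf}_J)$$

$$X_{j,\text{idf}} \equiv X_j \times \textbf{idf} = (X_{j1} \times \text{idf}_1, X_{j2} \times \text{idf}_2, \ldots, X_{jJ} \times \text{idf}_J)$$

How Does This Matter For Measuring Similarity/Dissimilarity?

Inner Product

$$X_{i,\text{idf}} \cdot X_{j,\text{idf}} = (X_i \times \textbf{idf})'(X_j \times \textbf{idf}) = (\text{idf}_1^2 \times X_{i1} \times X_{j1}) + (\text{idf}_2^2 \times X_{i2} \times X_{j2}) + \ldots + (\text{idf}_J^2 \times X_{iJ} \times X_{jJ})$$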
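A quick numerical check of the identity above. It assumes made-up idf weights and count vectors, and the helper names are hypothetical. The values are chosen to be exactly representable in floating point so the two sides can be compared directly.

```python
def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def tfidf(x, idf):
    """Elementwise product of a term-frequency vector with the idf vector."""
    return [count * weight for count, weight in zip(x, idf)]

# Hypothetical idf weights and term-frequency vectors (J = 3).
idf = [0.0, 0.5, 2.0]
x_i = [3, 2, 1]
x_j = [1, 4, 2]

# Left-hand side: inner product of the two tf-idf vectors.
weighted = dot(tfidf(x_i, idf), tfidf(x_j, idf))

# Right-hand side: each word's contribution scaled by idf_m squared.
expanded = sum(w ** 2 * a * b for w, a, b in zip(idf, x_i, x_j))

# For comparison: the unweighted inner product treats all words equally.
unweighted = dot(x_i, x_j)
```

Because each word enters through $\text{idf}_m^2$, a word with zero weight (the first column here) contributes nothing even though both documents use it.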


Weighting Words: Inner Product

Define:

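$$\Sigma = \begin{pmatrix} \text{idf}_1^2 & 0 & 0 & \cdots & 0 \\ 0 & \text{idf}_2^2 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \text{idf}_J^2 \end{pmatrix}$$

If we use tf-idf for our documents, then

$$d_2(X_i, X_j) = \sqrt{\sum_{m=1}^{J} (x_{im,\text{idf}} - x_{jm,\text{idf}})^2} = \sqrt{(X_i - X_j)' \Sigma (X_i - X_j)}$$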
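The equality of the two forms can be verified on a toy example. The sketch below assumes hypothetical idf weights and count vectors, and builds $\Sigma$ explicitly as a nested list only to mirror the formula. In practice the diagonal structure means the full matrix is never needed.

```python
import math

def tfidf_euclidean(x_i, x_j, idf):
    """Euclidean distance between the tf-idf weighted vectors."""
    return math.sqrt(sum((a * w - b * w) ** 2 for a, b, w in zip(x_i, x_j, idf)))

def quadratic_form_distance(x_i, x_j, idf):
    """sqrt((x_i - x_j)' Sigma (x_i - x_j)) with Sigma = diag(idf_1^2, ..., idf_J^2)."""
    J = len(idf)
    sigma = [[idf[r] ** 2 if r == c else 0.0 for c in range(J)] for r in range(J)]
    diff = [a - b for a, b in zip(x_i, x_j)]
    # Matrix-vector product Sigma * diff, then the inner product with diff.
    sigma_diff = [sum(sigma[r][c] * diff[c] for c in range(J)) for r in range(J)]
    return math.sqrt(sum(d * s for d, s in zip(diff, sigma_diff)))

# Hypothetical idf weights and term-frequency vectors (J = 3).
idf = [0.0, 0.5, 2.0]
x_i = [3, 2, 1]
x_j = [1, 4, 2]

d_weighted = tfidf_euclidean(x_i, x_j, idf)
d_quadratic = quadratic_form_distance(x_i, x_j, idf)
```

Seen this way, tf-idf weighting is ordinary Euclidean distance computed in a rescaled space, where each word's axis is stretched by its idf weight.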


Final Product

Applying some measure of distance, similarity (if symmetric) yields:

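$$D = \begin{pmatrix} 0 & d(1,2) & d(1,3) & \cdots & d(1,N) \\ d(2,1) & 0 & d(2,3) & \cdots & d(2,N) \\ d(3,1) & d(3,2) & 0 & \cdots & d(3,N) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d(N,1) & d(N,2) & d(N,3) & \cdots & 0 \end{pmatrix}$$

Lower triangle contains the unique information: $N(N-1)/2$ entries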
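A minimal sketch of building $D$ and pulling out its lower triangle, assuming Euclidean distance and four made-up documents. The function names are hypothetical, and any symmetric distance could be passed in place of `euclidean`.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(docs, dist=euclidean):
    """Full N x N matrix of pairwise distances."""
    n = len(docs)
    return [[dist(docs[r], docs[c]) for c in range(n)] for r in range(n)]

def lower_triangle(D):
    """Entries strictly below the diagonal, row by row."""
    return [D[r][c] for r in range(len(D)) for c in range(r)]

# Hypothetical term-frequency vectors for N = 4 documents.
docs = [[1, 0, 2], [0, 1, 2], [3, 1, 0], [1, 1, 1]]

D = distance_matrix(docs)
unique = lower_triangle(D)
```

With $N = 4$ documents the lower triangle holds $4 \times 3 / 2 = 6$ distances. The diagonal is zero and the upper triangle mirrors the lower one, so storing those six values is enough.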


Spirling and Indian Treaties

Spirling (2013): model Treaties between US and Native Americans

Why?

- American political development

- IR Theories of Treaties and Treaty Violations

- Comparative studies of indigenous/colonialist interaction

- Political Science question: how did Native Americans lose land so quickly?

Paper does a lot. We’re going to focus on

- Today: Text representation and similarity calculation

- Tuesday: Projecting to low dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44

Page 162: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

Spirling (2013): model Treaties between US and Native AmericansWhy?

- American political development

- IR Theories of Treaties and Treaty Violations

- Comparative studies of indigenous/colonialist interaction

- Political Science question: how did Native Americans lose land soquickly?

Paper does a lot. We’re going to focus on

- Today: Text representation and similarity calculation

- Tuesday: Projecting to low dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44

Page 163: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

Spirling (2013): model Treaties between US and Native AmericansWhy?

- American political development

- IR Theories of Treaties and Treaty Violations

- Comparative studies of indigenous/colonialist interaction

- Political Science question: how did Native Americans lose land soquickly?

Paper does a lot. We’re going to focus on

- Today: Text representation and similarity calculation

- Tuesday: Projecting to low dimensional space

Spirling and Indian Treaties

How do we preserve word order and semantic language?

After stemming, stopping, and bag-of-words conversion:

- Peace Between Us

- No Peace Between Us

are identical.

Spirling uses a more complicated representation of texts to preserve word order (broad application).

Example text: Peace Between Us

Analyzes K-substrings
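To make the K-substring idea concrete, here is a minimal sketch that lists every contiguous length-k character substring of a text. The function name `k_substrings`, the lowercasing step, and the choice k = 5 are assumptions made for illustration only; the slides do not specify the preprocessing or the value of k used in the paper.

```python
from collections import Counter


def k_substrings(text: str, k: int) -> Counter:
    """Count every contiguous length-k character substring of `text`."""
    text = text.lower()  # illustrative normalization, not taken from the paper
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


a = k_substrings("Peace Between Us", k=5)
b = k_substrings("No Peace Between Us", k=5)

# Substrings that appear only in the second text: these come from the
# leading "No", which a stopped bag of words would have thrown away.
only_in_b = set(b) - set(a)
print(sorted(only_in_b))
```

Because the substrings overlap and run across word boundaries, the two treaty phrases no longer receive identical representations.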

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitly quantifying entire documents)

- Problem solved:

  - Arthur gives all his money to Justin

  - Justin gives all his money to Arthur

  - Discard word order: same sentence. Kernel: different sentences.

Spirling uses kernel methods to measure similarity
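The Arthur/Justin contrast can be sketched with a simple string kernel. The code below uses a normalized k-spectrum kernel over character substrings, which is one common string kernel and not necessarily the exact kernel in Spirling's paper. The helper names and k = 5 are assumptions. The kernel sum runs only over substrings present in both documents, as the slide describes.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Word counts with word order discarded."""
    return Counter(text.lower().split())


def k_spectrum(text: str, k: int) -> Counter:
    """Counts of all contiguous length-k character substrings."""
    text = text.lower()
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def normalized_kernel(u: Counter, v: Counter) -> float:
    """Inner product over shared features, scaled to lie in [0, 1]."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)


s1 = "Arthur gives all his money to Justin"
s2 = "Justin gives all his money to Arthur"

bow_sim = normalized_kernel(bag_of_words(s1), bag_of_words(s2))
kernel_sim = normalized_kernel(k_spectrum(s1, 5), k_spectrum(s2, 5))

print(f"bag of words: {bow_sim:.3f}")
print(f"5-spectrum kernel: {kernel_sim:.3f}")
```

The bag-of-words similarity is 1 (up to floating-point rounding) because both sentences contain the same words, while the substring kernel falls below 1 because substrings that straddle word boundaries differ between the two sentences.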

Similarity and Dissimilarity of Many Things

Throughout the course we’ll measure similarity between documents.

We’ll also (implicitly) study similarity of probability distributions.

Develop a measure of distribution dissimilarity.

Similarity of Probability Distributions

Definition

Suppose P is a continuous random variable with density $p : \mathbb{R} \to \mathbb{R}$ and Q is a continuous random variable with density $q : \mathbb{R} \to \mathbb{R}$. We can define the KL-divergence between P and Q as

$$\mathrm{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$

Assessing Similarity of Other Things

KL-divergence measures dissimilarity between two distributions.

Consider a function, $f(x) = -x^2$.

Maps numbers to other numbers.

Take some input (−2 here).

Then obtain the value $f(-2) = -4$.

[Figure: plot of f(x) = −x² for x from −4 to 4, with f(x) ranging from −15 to 0 and the input x = −2 marked.]

KL(q||p) is a functional.

A functional takes functions as inputs and returns a real number.

KL(q||p) maps from sets of distributions q ∈ Q and p ∈ P to nonnegative real numbers.

For example, we could set q = Uniform(0,1) and p = Normal(0,1).

KL(Uniform(0,1)||Normal(0,1)) = 1.09

[Figure: densities of Uniform(0,1) and Normal(0,1), x from −4 to 4, density from 0.0 to 1.2.]
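The value 1.09 can be checked numerically. The sketch below approximates the integral of q(x) log(q(x)/p(x)) over the support of q with a midpoint rule. The function names and the number of grid points are illustrative assumptions, not part of the lecture.

```python
import math


def normal_pdf(x: float) -> float:
    """Density of the standard Normal(0, 1) distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def kl_uniform_normal(a: float, b: float, n: int = 100_000) -> float:
    """Approximate KL(Uniform(a, b) || Normal(0, 1)) with the midpoint rule.

    The uniform density is zero outside [a, b], so only that interval
    contributes to the integral.
    """
    q = 1.0 / (b - a)   # constant uniform density on [a, b]
    h = (b - a) / n     # width of each integration cell
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h
        total += q * math.log(q / normal_pdf(x)) * h
    return total


kl = kl_uniform_normal(0.0, 1.0)
print(round(kl, 2))  # 1.09
```

For this pair of distributions the integral also has a closed form, 0.5 log(2π) + 1/6, which agrees with the numerical result.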

If q and p are the same distribution then KL(q||p) = 0.

Variational Approximation (topic models!): approximate one distribution p with another, simpler distribution q.

Then make this approximation the best possible: minimize the KL-divergence.

A simple example.

Approximate a Normal(0,1) with a symmetric Uniform distribution, Uniform(−b, b).

Choose b to minimize KL(Uniform(−b, b)||Normal(0,1)).

[Figure: Normal(0,1) density, x from −4 to 4, density from 0.0 to 0.4.]

Answer: $b = \sqrt{3}$

[Figure: Normal(0,1) density with the approximating Uniform(−√3, √3) density, x from −4 to 4, density from 0.0 to 0.4.]
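One way to verify this answer is to work through the integral directly. The derivation below is a sketch that follows the form of the KL-divergence definition, with q the Uniform(−b, b) density and p the Normal(0,1) density; it is not reproduced from the slides.

$$
\begin{aligned}
\mathrm{KL}(q \,\|\, p)
  &= \int_{-b}^{b} \frac{1}{2b} \log \frac{1/(2b)}{\frac{1}{\sqrt{2\pi}} e^{-x^2/2}} \, dx \\
  &= \int_{-b}^{b} \frac{1}{2b} \left[ -\log(2b) + \tfrac{1}{2}\log(2\pi) + \frac{x^2}{2} \right] dx \\
  &= -\log(2b) + \tfrac{1}{2}\log(2\pi) + \frac{b^2}{6}.
\end{aligned}
$$

Setting the derivative with respect to b to zero gives

$$
\frac{d}{db}\,\mathrm{KL}(q \,\|\, p) = -\frac{1}{b} + \frac{b}{3} = 0
\quad\Longrightarrow\quad b^2 = 3, \qquad b = \sqrt{3}.
$$

The second derivative, $1/b^2 + 1/3$, is positive, so this is a minimum. The divergence at $b = \sqrt{3}$ is approximately 0.18: small, but not zero, because no uniform distribution matches the normal exactly.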

1) Documents in vector space: geometry of texts

2) Many methods to measure similarity and dissimilarity
