Text as Data
Justin Grimmer
Associate Professor, Department of Political Science
Stanford University
October 9th, 2014
The Vector Space Model of Text
1) Task: numerous tasks will suppose that we can measure document similarity or dissimilarity
2) Objective function: for a variety of tasks, we will impose some measure or definition of similarity, dissimilarity, or distance
   - d(X_i, X_j) = dissimilarity (distance); bigger implies further apart
   - s(X_i, X_j) = similarity; bigger implies closer together
   - Objective functions determine which points we compare and how we aggregate similarity, dissimilarity, and distance
3) Optimization: depends on the particular task, likely arranging/grouping objects to find similarity
4) Validation: are the mathematical definitions of similarity actually similar for our particular purpose?
Texts and Geometry

Consider a document-term matrix:

X = \begin{pmatrix} 1 & 2 & 0 & \dots & 0 \\ 0 & 0 & 3 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \dots & 3 \end{pmatrix}

Suppose documents live in a space: we get a rich set of results from linear algebra.

- Provides a geometry, which we can modify with word weighting
- Natural notions of distance
- Kernel trick: richer comparisons of large feature spaces
- Building block for clustering, supervised learning, and scaling
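In code, such a matrix can be built in a few lines. A minimal sketch, assuming scikit-learn is available (the slides do not prescribe a library) and using an invented toy corpus:

```python
# Hedged sketch: build a document-term matrix X (rows = documents,
# columns = vocabulary terms) with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the budget debate", "the war debate", "war and peace"]  # toy corpus
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # sparse N x J matrix of counts
print(vectorizer.get_feature_names_out())  # the J vocabulary terms
print(X.toarray())                         # each row is a document vector
```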
Texts in Space

Doc1 = (1, 1, 3, \dots, 5)
Doc2 = (2, 0, 0, \dots, 1)
Doc1, Doc2 \in \mathbb{R}^J

Inner product between documents:

Doc1 \cdot Doc2 = (1, 1, 3, \dots, 5)'(2, 0, 0, \dots, 1)
                = 1 \times 2 + 1 \times 0 + 3 \times 0 + \dots + 5 \times 1
                = 7
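A quick numerical check of the slide's inner product, assuming numpy and dropping the elided middle coordinates (which contribute zero in the example):

```python
import numpy as np

doc1 = np.array([1, 1, 3, 5])
doc2 = np.array([2, 0, 0, 1])
# Inner product: 1*2 + 1*0 + 3*0 + 5*1 = 7, matching the slide.
print(doc1 @ doc2)
```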
Vector Length

[Figure: a right triangle in the (x_1, x_2) plane, with legs a and b and hypotenuse c = (a^2 + b^2)^{1/2}]

- Pythagorean theorem: for a right triangle with sides of length a and b,
- the hypotenuse has length c = \sqrt{a^2 + b^2}
- This holds generally
Vector (Euclidean) Length

Definition
Suppose v \in \mathbb{R}^J. Then we define its length as

||v|| = (v \cdot v)^{1/2} = (v_1^2 + v_2^2 + v_3^2 + \dots + v_J^2)^{1/2}
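Numerically, the length is just the square root of a vector's inner product with itself; numpy's built-in norm agrees:

```python
import numpy as np

v = np.array([1.0, 4.0])
print(np.sqrt(v @ v))     # (v . v)^(1/2) = sqrt(1 + 16) ≈ 4.123
print(np.linalg.norm(v))  # identical: numpy's Euclidean norm
```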
Measures of Dissimilarity

Initial guess: distance metrics. Properties of a metric (distance function) d(\cdot, \cdot), for arbitrary documents X_i, X_j, X_k:

1) d(X_i, X_j) ≥ 0
2) d(X_i, X_j) = 0 if and only if X_i = X_j
3) d(X_i, X_j) = d(X_j, X_i)
4) d(X_i, X_k) ≤ d(X_i, X_j) + d(X_j, X_k)

Explore distance functions to compare documents. Do we want additional assumptions/properties?
Measuring the Distance Between Documents: Euclidean Distance

[Figure: two document vectors on a Word 1 vs. Word 2 plane, with the Euclidean distance drawn as the straight line between their endpoints]
Measuring the Distance Between Documents

Definition
The Euclidean distance between documents X_i and X_j is

||X_i - X_j|| = \sqrt{\sum_{m=1}^{J} (x_{im} - x_{jm})^2}

Suppose X_i = (1, 4) and X_j = (2, 1). The distance between the documents is:

||(1, 4) - (2, 1)|| = \sqrt{(1 - 2)^2 + (4 - 1)^2} = \sqrt{10}
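The worked example, verified with numpy:

```python
import numpy as np

xi, xj = np.array([1.0, 4.0]), np.array([2.0, 1.0])
# ||(1,4) - (2,1)|| = sqrt((1-2)^2 + (4-1)^2) = sqrt(10) ≈ 3.162
print(np.linalg.norm(xi - xj))
```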
Measuring the Distance Between Documents

There are many distance metrics. Consider the Minkowski family:

Definition
The Minkowski distance between documents X_i and X_j for value p is

d_p(X_i, X_j) = \left( \sum_{m=1}^{J} |x_{im} - x_{jm}|^p \right)^{1/p}
Members of the Minkowski Family

Manhattan metric:

d_1(X_i, X_j) = \sum_{m=1}^{J} |x_{im} - x_{jm}|

d_1((1, 4), (2, 1)) = |1 - 2| + |4 - 1| = 4

Minkowski (p) metric:

d_p(X_i, X_j) = \left( \sum_{m=1}^{J} |x_{im} - x_{jm}|^p \right)^{1/p}

d_p((1, 4), (2, 1)) = (|1 - 2|^p + |4 - 1|^p)^{1/p}
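Both members come from one function. A minimal sketch (the helper name `minkowski` is ours, not from the slides):

```python
import numpy as np

def minkowski(xi, xj, p):
    """Minkowski distance d_p between two document vectors."""
    return float(np.sum(np.abs(xi - xj) ** p) ** (1.0 / p))

xi, xj = np.array([1.0, 4.0]), np.array([2.0, 1.0])
print(minkowski(xi, xj, 1))  # Manhattan: |1-2| + |4-1| = 4
print(minkowski(xi, xj, 2))  # Euclidean: sqrt(10) ≈ 3.162
```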
What Does p Do?

Increasing p gives greater importance to the coordinates with the largest differences. If we let p → ∞ we obtain the maximum metric (Chebyshev metric):

\lim_{p \to \infty} d_p(X_i, X_j) = \max_{m=1,\dots,J} |x_{im} - x_{jm}|

In words: the distance between documents is just the biggest difference; all other differences do not contribute to the distance measure. Decreasing p gives greater importance to the coordinates with the smallest differences:

\lim_{p \to -\infty} d_p(X_i, X_j) = \min_{m=1,\dots,J} |x_{im} - x_{jm}|
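The limiting behavior is easy to see numerically. A sketch using the vectors from the next slide: as p grows, d_p falls toward the largest coordinate-wise difference.

```python
import numpy as np

xi, xj = np.array([10.0, 4.0, 3.0]), np.array([0.0, 0.0, 0.0])
for p in (1, 2, 4, 16, 64):
    print(p, np.sum(np.abs(xi - xj) ** p) ** (1.0 / p))
# The printed distances fall from 17 toward max_m |x_im - x_jm| = 10.
```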
Comparing the Metrics

Suppose X_i = (10, 4, 3), X_j = (0, 4, 3), and X_k = (0, 0, 0). Then:

d_1(X_i, X_j) = 10
d_1(X_i, X_k) = 10 + 4 + 3 = 17
d_2(X_i, X_j) = 10
d_2(X_i, X_k) = \sqrt{10^2 + 4^2 + 3^2} = \sqrt{125} = 11.18
d_4(X_i, X_j) = 10
d_4(X_i, X_k) = (10^4 + 4^4 + 3^4)^{1/4} = (10337)^{1/4} = 10.08
d_∞(X_i, X_j) = 10
d_∞(X_i, X_k) = 10
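The whole table can be reproduced in a few lines (the `minkowski` helper from above, repeated so the snippet stands alone):

```python
import numpy as np

def minkowski(xi, xj, p):
    return float(np.sum(np.abs(xi - xj) ** p) ** (1.0 / p))

xi = np.array([10.0, 4.0, 3.0])
xj = np.array([0.0, 4.0, 3.0])
xk = np.array([0.0, 0.0, 0.0])
for p in (1, 2, 4):
    print(p, minkowski(xi, xj, p), minkowski(xi, xk, p))
print(np.max(np.abs(xi - xk)))  # d_inf(X_i, X_k) = 10
```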
Are all differences equal?

The previous metrics treat all dimensions as equal, but we may want to engage in some scaling/reweighting: the Mahalanobis distance.

Definition
Suppose that we have a covariance matrix Σ. Then we can define the Mahalanobis distance between documents X_i and X_j as

d_{Mah}(X_i, X_j) = \sqrt{(X_i - X_j)' Σ^{-1} (X_i - X_j)}

More generally, Σ could be any symmetric, positive-definite matrix. What does Σ do?
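A direct transcription of the definition, assuming Σ is invertible; with Σ = I it reduces to the Euclidean distance, previewing the special cases below:

```python
import numpy as np

def mahalanobis(xi, xj, Sigma):
    """Mahalanobis distance; Sigma must be symmetric positive-definite."""
    d = xi - xj
    return float(np.sqrt(d @ np.linalg.inv(Sigma) @ d))

xi, xj = np.array([1.0, 4.0]), np.array([2.0, 1.0])
print(mahalanobis(xi, xj, np.eye(2)))            # Sigma = I: sqrt(10), Euclidean
print(mahalanobis(xi, xj, np.diag([1.0, 4.0])))  # down-weights the second word
```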
Some Intuition: The Unit Circle

[Figure: the "unit circle" of points at distance 1 under the Mahalanobis distance induced by each Σ below, plotted on Dim 1 vs. Dim 2]

Σ = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
Σ = \begin{pmatrix} 1 & 0 \\ 0 & 0.5 \end{pmatrix}, \quad
Σ = \begin{pmatrix} 0.5 & 0 \\ 0 & 1 \end{pmatrix}, \quad
Σ = \begin{pmatrix} 1 & 0.3 \\ 0.3 & 0.5 \end{pmatrix}
Measuring Distance with Mahalanobis

Special case 1: identity matrix.

Σ = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}

Then the distance is Euclidean.

Special case 2: diagonal matrix.

Σ = \begin{pmatrix} σ_1^2 & 0 & \dots & 0 \\ 0 & σ_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & σ_J^2 \end{pmatrix}

Then

d_{Mah}(X_i, X_j) = \sqrt{\sum_{m=1}^{J} \frac{(x_{im} - x_{jm})^2}{σ_m^2}}
Measuring Similarity

What properties should a similarity measure have?

- Maximum: a document with itself
- Minimum: documents that have no words in common (orthogonal)
- Increasing when more of the same words are used
- Perhaps symmetry: s(a, b) = s(b, a)?

How should additional words be treated?
Measuring Similarity

[Figure: the vectors (2, 1) and (1, 4) on a Word 1 vs. Word 2 plane]

Measure 1: inner product

(2, 1)' \cdot (1, 4) = 6
[Figure: the vectors (2, 1), (4, 2), and (1, 4) on a Word 1 vs. Word 2 plane, with the angle θ between them marked]

Problem(?): the inner product is length dependent

(4, 2)'(1, 4) = 12

a \cdot b = ||a|| \times ||b|| \times \cos θ
Cosine Similarity

\cos θ = \left( \frac{a}{||a||} \right) \cdot \left( \frac{b}{||b||} \right)

\frac{(4, 2)}{||(4, 2)||} = (0.89, 0.45)

\frac{(2, 1)}{||(2, 1)||} = (0.89, 0.45)

\frac{(1, 4)}{||(1, 4)||} = (0.24, 0.97)

(0.89, 0.45)'(0.24, 0.97) = 0.65
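The same numbers in code. Note that (4, 2) and (2, 1) point in the same direction, so both have the same cosine with (1, 4):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: inner product of the unit-length vectors."""
    return float((a / np.linalg.norm(a)) @ (b / np.linalg.norm(b)))

print(cosine(np.array([4.0, 2.0]), np.array([1.0, 4.0])))  # ≈ 0.65
print(cosine(np.array([2.0, 1.0]), np.array([1.0, 4.0])))  # same angle, ≈ 0.65
```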
Cosine Similarity

[Figure: the same document vectors projected to unit length, so all documents lie on the unit sphere with the angle θ between them]

\cos θ removes document length from the similarity measure: it projects texts to a unit-length representation on the sphere.
Von Mises-Fisher Distribution

Consider document X_i, normalized to unit length:

X_i^* = \frac{X_i}{||X_i||}

Then we might suppose:

X_i^* \sim \text{von Mises-Fisher}(κ, μ)

p(x_i^* | κ, μ) = c(κ) \exp(κ x_i^{*\prime} μ)

A normal distribution, on a sphere:

- Straightforward to maximize
- Conjugate to itself
- Useful for clustering, hierarchies of topics
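The normalization step is the same unit-length projection used for cosine similarity; a sketch (the vMF normalizing constant c(κ) is omitted):

```python
import numpy as np

X = np.array([[4.0, 2.0],
              [2.0, 1.0],
              [1.0, 4.0]])
# Project each document to unit length: X*_i = X_i / ||X_i||.
X_star = X / np.linalg.norm(X, axis=1, keepdims=True)
print(X_star)                          # rows now lie on the unit sphere
print(np.linalg.norm(X_star, axis=1))  # all 1.0
```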
Kernel Similarity

Definition
Suppose we have documents X_i and X_j. Define the Gaussian kernel as

k(X_i, X_j) = \exp\left( \frac{-||X_i - X_j||^2}{σ^2} \right)

This is the kernel of the Gaussian distribution.
σ^2 determines the sensitivity of the kernel.
If X_i = X_j then k(X_i, X_j) = 1.
As X_i and X_j become more dissimilar, k(X_i, X_j) → 0.
This result often justifies setting some kernel weights to zero.
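A direct transcription of the kernel; the σ² values below are arbitrary choices to show the decay:

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma2=1.0):
    """Gaussian kernel: exp(-||xi - xj||^2 / sigma2)."""
    d = xi - xj
    return float(np.exp(-(d @ d) / sigma2))

xi, xj = np.array([1.0, 4.0]), np.array([2.0, 1.0])
print(gaussian_kernel(xi, xi))        # identical documents: 1.0
print(gaussian_kernel(xi, xj, 10.0))  # exp(-10/10) ≈ 0.368
print(gaussian_kernel(xi, xj, 1.0))   # exp(-10) ≈ 4.5e-05, near zero
```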
The Kernel Trick

Suppose all of our documents X_i \in \mathbb{R}^J. There may be some mapping φ: \mathbb{R}^J → \mathbb{R}^M with M > J that improves our performance, a "lift" to a higher dimension. We might want, then,

s(φ(X_i), φ(X_j)) = \langle φ(X_i), φ(X_j) \rangle

- The only thing we care about, though, is the inner product of the transformed variables
- So long as we can calculate the inner product, we need not make the transformation explicit
- Kernels provide methods for capturing a wide array of transformations
- Kernel trick: calculate inner products on the untransformed data (e.g., the Gaussian kernel) and implicitly use a wide array of φ's, as in the sketch below
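A concrete illustration of the trick, using a polynomial kernel rather than the Gaussian kernel named on the slide, because its lift φ can be written out exactly: for x ∈ R², k(x, y) = (x · y)² equals the inner product after the explicit map φ(x) = (x_1², √2 x_1 x_2, x_2²).

```python
import numpy as np

def phi(x):
    """Explicit lift to R^3 whose inner product matches (x . y)^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([2.0, 1.0]), np.array([1.0, 4.0])
print((x @ y) ** 2)     # kernel on the untransformed data: 36.0
print(phi(x) @ phi(y))  # inner product in the lifted space: 36.0
```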
Weighting Words

Are all words created equal?

- Treat all words equally
  - Lots of noise
- Reweight words
  - Accentuate words that are likely to be informative
  - Make specific assumptions about the characteristics of informative words

How to generate weights?

- Assumptions about separating words
- Use a training set to identify separating words (Monroe, ideology measurement)
Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently
- But not too frequently

Example: if all statements about OBL contain "Bin Laden", then this contributes nothing to similarity/dissimilarity measures.

Inverse document frequency:

n_j = number of documents in which word j occurs

idf_j = \log \frac{N}{n_j}

idf = (idf_1, idf_2, \dots, idf_J)
Weighting Words: TF-IDF Weighting

Why log?

- Maximum at n_j = 1
- Decreases at rate 1/n_j ⇒ diminishing "penalty" for more common use
- Other functional forms are fine; they embed assumptions about the penalization of common use
Weighting Words: TF-IDF

X_{i,idf} ≡ \underbrace{X_i}_{\text{tf}} \times idf = (X_{i1} \times idf_1, X_{i2} \times idf_2, \dots, X_{iJ} \times idf_J)

X_{j,idf} ≡ X_j \times idf = (X_{j1} \times idf_1, X_{j2} \times idf_2, \dots, X_{jJ} \times idf_J)

How does this matter for measuring similarity/dissimilarity? Inner product:

X_{i,idf} \cdot X_{j,idf} = (X_i \times idf)'(X_j \times idf)
                          = (idf_1^2 \times X_{i1} \times X_{j1}) + (idf_2^2 \times X_{i2} \times X_{j2}) + \dots + (idf_J^2 \times X_{iJ} \times X_{jJ})
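A sketch of the tf-idf weighting and the weighted inner product, reusing the X and idf arrays from the idf sketch above:

```python
import numpy as np

X_idf = X * idf                     # elementwise: X_ij * idf_j, row by row
s_02 = X_idf[0] @ X_idf[2]          # inner product of documents 0 and 2
# the same quantity written as sum_j idf_j^2 * X_0j * X_2j:
s_check = np.sum(idf**2 * X[0] * X[2])
assert np.isclose(s_02, s_check)
```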
Weighting Words: Inner Product

Define Σ as the J × J diagonal matrix with the squared idf weights on its diagonal:

Σ = diag(idf_1^2, idf_2^2, . . . , idf_J^2)

If we use tf-idf for our documents, then

d_2(X_i, X_j) = sqrt( sum_{m=1}^{J} (x_im,idf − x_jm,idf)^2 )
              = sqrt( (X_i − X_j)' Σ (X_i − X_j) )
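A sketch checking that the tf-idf Euclidean distance equals the quadratic form with Σ = diag(idf^2), again reusing X and X_idf from the sketches above:

```python
import numpy as np

Sigma = np.diag(idf**2)
d_direct = np.sqrt(np.sum((X_idf[0] - X_idf[1])**2))   # distance in tf-idf space
diff = X[0] - X[1]
d_quad = np.sqrt(diff @ Sigma @ diff)                  # quadratic form on raw counts
assert np.isclose(d_direct, d_quad)
```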
Final Product

Applying some measure of distance, or similarity (if symmetric), yields:

D =
  [   0       d(1,2)   d(1,3)   . . .   d(1,N)
    d(2,1)      0      d(2,3)   . . .   d(2,N)
    d(3,1)   d(3,2)      0      . . .   d(3,N)
     ...       ...      ...     . . .    ...
    d(N,1)   d(N,2)   d(N,3)    . . .     0    ]

The lower triangle contains the unique information: N(N − 1)/2 entries.
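A sketch assembling the full matrix D from the tf-idf rows above; only the lower triangle is computed, since it carries all the unique information:

```python
import numpy as np

N = X_idf.shape[0]
D = np.zeros((N, N))
for i in range(N):
    for j in range(i):              # fill the lower triangle only
        D[i, j] = np.sqrt(np.sum((X_idf[i] - X_idf[j])**2))
D = D + D.T                         # symmetry; the diagonal stays zero
print(D)
```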
Spirling and Indian Treaties

Spirling (2013): model treaties between the US and Native Americans

Why?

- American political development
- IR theories of treaties and treaty violations
- Comparative studies of indigenous/colonialist interaction
- Political science question: how did Native Americans lose land so quickly?

The paper does a lot. We're going to focus on:

- Today: text representation and similarity calculation
- Tuesday: projecting to a low-dimensional space
Spirling and Indian Treaties

How do we preserve word order and semantic language? After stemming, stopping, and bag-of-wording:

- Peace Between Us
- No Peace Between Us

are identical. Spirling instead uses a more complicated representation of texts that preserves word order (and has broad application): a text like "Peace Between Us" is analyzed via its K-substrings (a sketch follows below).
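A hedged sketch of K-substring extraction, here meaning all contiguous character substrings of length 1 to K; this illustrates the idea rather than reproducing Spirling's exact preprocessing:

```python
def k_substrings(text, k):
    """All contiguous substrings of text with length 1..k."""
    subs = set()
    for length in range(1, k + 1):
        for start in range(len(text) - length + 1):
            subs.add(text[start:start + length])
    return subs

a = k_substrings("peace between us", 5)
b = k_substrings("no peace between us", 5)
print(a == b)                  # False: negation and word order now matter
print(len(a & b), len(a), len(b))
```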
Kernel Trick

- Kernel methods: represent texts and measure similarity simultaneously
- Compare only substrings in both documents (without explicitly quantifying entire documents)
- Problem solved:
  - Arthur gives all his money to Justin
  - Justin gives all his money to Arthur
  - Discard word order: same sentence. Kernel: different sentences.

Spirling uses kernel methods to measure similarity.
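A minimal sketch of a substring-count kernel: similarity is an inner product in substring space, summing only over substrings the two documents share, so the full feature vectors are never built. Spirling's paper uses a more sophisticated string kernel; this is the simplest version of the idea:

```python
from collections import Counter

def substring_kernel(s, t, k):
    """Inner product over counts of length-k contiguous substrings."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[sub] * ct[sub] for sub in cs.keys() & ct.keys())

s1 = "arthur gives all his money to justin"
s2 = "justin gives all his money to arthur"
print(substring_kernel(s1, s2, 4))   # high overlap, but less than...
print(substring_kernel(s1, s1, 4))   # ...self-similarity
```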
Similarity and Dissimilarity of Many Things
Throughout the course we’ll measure similarity between documentsWe’ll also (implicitly) study similarity of probability distributionsDevelop a measure of distribution dissimilarity
Justin Grimmer (Stanford University) Text as Data October 9th, 2014 33 / 44
Similarity of Probability Distributions

Definition
Suppose P is a continuous random variable with density p : ℝ → ℝ and Q is a continuous random variable with density q : ℝ → ℝ. We can define the KL-divergence between P and Q as

KL(P||Q) = ∫_{−∞}^{∞} p(x) log( p(x) / q(x) ) dx
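A sketch of the KL integral by numerical quadrature, assuming scipy is available; the bounds lo and hi should cover the support of p:

```python
import numpy as np
from scipy.integrate import quad
from scipy import stats

def kl_divergence(p_pdf, q_pdf, lo, hi):
    """KL(P||Q) = integral of p(x) log(p(x)/q(x)) over the support of p."""
    integrand = lambda x: p_pdf(x) * np.log(p_pdf(x) / q_pdf(x))
    value, _err = quad(integrand, lo, hi)
    return value
```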
Assessing Similarity of Other Things
KL-divergence measures dissimilarity between two distributions.
Consider a function, f(x) = −x^2. It maps numbers to other numbers.

[Figure: plot of f(x) = −x^2 for x ∈ [−4, 4]]

Take some input (−2 here), then obtain the value of f(−2) = −4.

[Figure: the same plot with the input x = −2 and the output f(−2) = −4 marked]
KL(q||p) is a functional. A functional takes functions as inputs and returns a real number. KL(q||p) maps from distributions q ∈ Q and p ∈ P to non-negative real numbers.

For example, we could set q = Uniform(0,1) and p = Normal(0,1):

KL(Uniform(0,1) || Normal(0,1)) = 1.09

[Figure: densities of Uniform(0,1) and Normal(0,1) on x ∈ [−4, 4]]
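Reusing the kl_divergence helper sketched above to check the number; analytically, KL = ½ log(2π) + 1/6 ≈ 1.0856:

```python
from scipy import stats

u = stats.uniform(0, 1)      # Uniform(0, 1)
n = stats.norm(0, 1)         # Normal(0, 1)
print(kl_divergence(u.pdf, n.pdf, 0.0, 1.0))   # ≈ 1.09
```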
If q and p are the same distribution, then KL(q||p) = 0.

Variational approximation (topic models!): approximate one distribution p with another, simpler distribution q. Then make this approximation the best possible: minimize the KL-divergence.
A simple example: approximate a Normal(0,1) with a symmetric uniform distribution, Uniform(−b, b). Choose b to minimize KL(Uniform(−b, b) || Normal(0,1)).

[Figure: Normal(0,1) density with a candidate Uniform(−b, b) density overlaid]
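A sketch of the minimization, using the closed form that falls out of plugging the two densities into the KL integral (the algebra is sketched after the answer below):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_uniform_normal(b):
    # KL(Uniform(-b,b) || Normal(0,1)) = -log(2b) + 0.5*log(2*pi) + b^2/6
    return -np.log(2 * b) + 0.5 * np.log(2 * np.pi) + b**2 / 6

res = minimize_scalar(kl_uniform_normal, bounds=(0.1, 5.0), method="bounded")
print(res.x, np.sqrt(3))     # both ≈ 1.732
```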
Answer: b = √3 ≈ 1.73

[Figure: Normal(0,1) density with the optimal Uniform(−√3, √3) density overlaid]
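For completeness, a derivation sketch of the closed form and the first-order condition behind b = √3; this algebra is filled in from the KL definition above and is not on the slides (φ denotes the standard normal density):

```latex
\begin{align*}
\mathrm{KL}\!\left(\mathrm{Unif}(-b,b)\,\middle\|\,\mathrm{N}(0,1)\right)
  &= \int_{-b}^{b} \frac{1}{2b}\,\log\frac{1/(2b)}{\phi(x)}\,dx
   = -\log(2b) + \tfrac{1}{2}\log(2\pi) + \frac{b^{2}}{6} \\
\frac{d\,\mathrm{KL}}{db} &= -\frac{1}{b} + \frac{b}{3} = 0
  \quad\Rightarrow\quad b^{2} = 3
  \quad\Rightarrow\quad b = \sqrt{3}
\end{align*}
```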
1) Documents in vector space ⇒ a geometry of texts

2) Many methods to measure similarity and dissimilarity