Jonathan Stray, Columbia University, Fall 2014. Syllabus at http://www.compjournalism.com/?p=113
Frontiers of Computational Journalism
Columbia Journalism School
Week 2: Clustering
September 12, 2014
Classification and Clustering
“Classification is arguably one of the most central and generic of all our conceptual exercises. It is the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis in general.”
– Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to Classification Techniques
Vector representation of objects

x = [ x1
      x2
      x3
      ⋮
      xN ]

Each xi is a numerical or categorical feature. N = number of features, or “dimension.”
Examples of vector representations

Obvious:
– movies watched / items purchased
– legislative voting history for a politician (see the encoding sketch below)
– crime locations

Less obvious, but standard:
– document vector space model
– psychological survey results

Tricky research problem: disparate field types
– corporate filing document
– Wikileaks SIGACT
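For instance, here is a minimal sketch (hypothetical bills and an arbitrary aye/nay encoding, not from the lecture) of turning a voting history into a feature vector:

```python
import numpy as np

# Hypothetical voting record: 1 = aye, -1 = nay, 0 = absent/abstained.
# The bill names and the numeric encoding are illustrative choices.
votes = {"bill_101": 1, "bill_102": -1, "bill_103": 0, "bill_104": 1}

# Fix one feature order so every politician maps to the same dimensions.
feature_order = sorted(votes)
x = np.array([votes[b] for b in feature_order], dtype=float)
print(x)  # [ 1. -1.  0.  1.]  -- a point in R^4 (N = 4 features)
```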
What can we do with vectors?

Predict one variable based on others
– this is called “regression”
– supervised machine learning (sketched below)

Group similar items together
– this is classification or clustering
– we may or may not know pre-existing classes
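A possible sketch of the regression case, fitting linear weights with NumPy's least-squares solver; all the data here is synthetic, made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))                  # feature vectors (known variables)
true_w = np.array([2.0, -1.0, 0.5])           # hidden weights to recover
y = X @ true_w + rng.normal(0, 0.1, size=50)  # the variable we want to predict

w, *_ = np.linalg.lstsq(X, y, rcond=None)     # fit weights from examples
print(w)  # approximately [ 2. -1.  0.5]
```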
Distance metric

Intuitively: how (dis)similar are two items?

Formally:

d(x, y) ≥ 0
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)
Distance metric

d(x, y) ≥ 0 – distance is never negative
d(x, x) = 0 – “reflexivity”: zero distance to self
d(x, y) = d(y, x) – “symmetry”: x to y is the same as y to x
d(x, z) ≤ d(x, y) + d(y, z) – “triangle inequality”: going direct is never longer than a detour
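As a quick illustration (not part of the slides): Euclidean distance in NumPy, with a spot check of the four axioms on random points. Passing asserts is evidence, not a proof.

```python
import numpy as np

def d(x, y):
    """Euclidean distance between two feature vectors."""
    return np.linalg.norm(x - y)

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))  # three random points in R^5

assert d(x, y) >= 0                          # distance is never negative
assert d(x, x) == 0                          # zero distance to self
assert np.isclose(d(x, y), d(y, x))          # symmetry
assert d(x, z) <= d(x, y) + d(y, z) + 1e-12  # triangle inequality
```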
Distance matrix

Data matrix for M objects of N dimensions:

X = [ x1,1  x1,2  …  x1,N
      x2,1  x2,2  …  x2,N
       ⋮     ⋮         ⋮
      xM,1  xM,2  …  xM,N ]

where row i is the feature vector xi.

Distance matrix:

Dij = Dji = d(xi, xj)

D = [ d1,1  d1,2  …  d1,M
      d2,1  d2,2  …  d2,M
       ⋮     ⋮         ⋮
      dM,1  dM,2  …  dM,M ]

D is symmetric and M × M, with zeros on the diagonal.
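A sketch of building D in practice with SciPy's pairwise-distance helpers: pdist computes the condensed upper triangle, and squareform expands it into the full M × M matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))  # data matrix: M = 6 objects, N = 4 features

D = squareform(pdist(X, metric="euclidean"))  # M x M distance matrix

assert np.allclose(D, D.T)           # symmetric: D_ij = D_ji
assert np.allclose(np.diag(D), 0.0)  # zero distance to self
```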
We think of a cluster like this…
Real data isn’t so simple…
Many possible definitions of a cluster

• “every point inside is closer to the center of this cluster than to the center of any other”
• “no point outside this cluster is closer than ε to any point inside”
• “every point in this cluster is closer to all points inside than to any point outside”
Different clustering algorithms

• Partitioning – keep adjusting clusters until convergence – e.g. K-means
• Agglomerative hierarchical – start with leaves, repeatedly merge clusters – e.g. MIN and MAX approaches
• Divisive hierarchical – start with root, repeatedly split clusters – e.g. binary split
K-means demo

http://www.paused21.net/off/kmeans/bin/
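The demo's algorithm, sketched in NumPy under the usual formulation (random initial centers, then alternately assign points to the nearest center and move each center to its cluster's mean):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        # Update step: each center moves to the mean of its cluster.
        new_centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: clusters stopped changing
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0, 3, 6)])  # 3 blobs
labels, centers = kmeans(X, k=3)
```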
Agglomerative – combining clusters

put each item into a leaf node
while num clusters > 1:
    find the two closest clusters
    merge them

“Closest” depends on the cluster-distance (linkage) criterion:
• single link, or “min”
• complete link, or “max”
• average
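One way to run this in practice, sketched with SciPy's hierarchical-clustering routines: linkage builds the bottom-up merge tree, and fcluster cuts it into a requested number of clusters, for each of the three criteria above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.4, size=(20, 2)) for m in (0, 5)])  # 2 blobs

for method in ("single", "complete", "average"):     # min, max, average link
    Z = linkage(X, method=method)                    # leaf-to-root merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```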
UK House of Lords voting clusters

Algorithm instructed to separate members into five clusters. Output:

1 2 2 1 3 2 2 2 1 4
1 1 1 1 1 1 5 2 1 1
2 2 1 2 3 2 2 4 2 1
2 3 2 1 3 1 1 2 1 2
1 5 2 1 4 2 2 1 2 1
1 4 1 1 4 1 2 2 1 5
1 1 1 2 3 3 2 2 2 5
2 3 1 2 1 4 1 1 4 4
1 1 2 1 1 2 2 2 2 1
2 1 2 1 2 2 1 3 2 1
1 2 2 1 2 3 4 2 2 2
…
Voting clusters with parties

Each member's party label (Con, Lab, LDem, XB = crossbench, Bp = bishop) shown above their cluster number:

LDem XB   Lab  LDem XB   Lab  XB   Lab  Con  XB
1    2    2    1    3    2    2    2    1    4
Con  Con  LDem Con  Con  Con  LDem Lab  Con  LDem
1    1    1    1    1    1    5    2    1    1
Lab  Lab  Con  Lab  XB   XB   Lab  XB   Lab  Con
2    2    1    2    3    2    2    4    2    1
Lab  XB   Lab  Con  XB   XB   LDem Lab  XB   Lab
2    3    2    1    3    1    1    2    1    2
Con  Con  Lab  Con  XB   Lab  Lab  Con  XB   XB
1    5    2    1    4    2    2    1    2    1
Con  XB   Con  Con  XB   Con  Lab  XB   LDem Con
1    4    1    1    4    1    2    2    1    5
Con  Con  Con  Lab  Bp   XB   Lab  Lab  Lab  LDem
1    1    1    2    3    3    2    2    2    5
Lab  XB   Con  Lab  Con  XB   Con  Con  XB   XB
2    3    1    2    1    4    1    1    4    4
Con  Con  Lab  Con  Con  XB   Lab  Lab  Lab  Con
1    1    2    1    1    2    2    2    2    1
Lab  LDem Lab  Con  Lab  Lab  Con  XB   Lab  Con
2    1    2    1    2    2    1    3    2    1
Con  Lab  XB   Con  XB   XB   XB   Lab  Lab  Lab
1    2    2    1    2    3    4    2    2    2
…
Clustering Algorithm

Input: data points (feature vectors).
Output: a set of clusters, each of which is a set of points.

Visualization

Input: data points (feature vectors).
Output: a picture of the points.
Dimensionality reduction

Problem: the vector space is high-dimensional, up to thousands of dimensions, but the screen is two-dimensional. We have to go from

x ∈ R^N

to much lower-dimensional points

y ∈ R^K, with K ≪ N. Probably K = 2 or K = 3.

This is called “projection”.
Projection from 3 to 2 dimensions
Linear projections

Projects in a straight line to the closest point on the “screen.” Mathematically,

y = Px

where P is a K × N matrix.
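A tiny example: dropping the third coordinate is a linear projection from N = 3 to K = 2, with P made of two orthonormal rows.

```python
import numpy as np

P = np.array([[1.0, 0.0, 0.0],   # keep the first axis
              [0.0, 1.0, 0.0]])  # keep the second axis; the third is discarded

x = np.array([2.0, -1.0, 5.0])   # a point in R^3
y = P @ x                        # its "shadow" on the screen, in R^2
print(y)                         # [ 2. -1.]
```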
Projection from 2 to 1 dimensions

Think of this as rotating to align the “screen” with the coordinate axes, then simply throwing out the values of the higher dimensions.
Projection from 3 to 2 dimensions

Which direction should we look from? Principal components analysis: find the linear projection that preserves the greatest variance.

Take the first K eigenvectors of the covariance matrix, corresponding to the largest eigenvalues. This gives a K-dimensional subspace for the projection.
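A sketch of that recipe in NumPy: center the data, eigendecompose the covariance matrix, and project onto the top-K eigenvectors.

```python
import numpy as np

def pca_project(X, K=2):
    Xc = X - X.mean(axis=0)                    # center the data
    C = np.cov(Xc, rowvar=False)               # N x N covariance matrix
    vals, vecs = np.linalg.eigh(C)             # eigh returns ascending eigenvalues
    top = vecs[:, np.argsort(vals)[::-1][:K]]  # K highest-variance directions
    return Xc @ top                            # M x K projected coordinates

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))  # M = 100 points in N = 10 dimensions
Y = pca_project(X, K=2)         # ready to plot on a 2-D screen
```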
Sometimes overlap is unavoidable
Real data isn’t so simple…
Nonlinear projec%ons
S%ll going from high-‐dimensional x to low-‐dimensional y, but now
y = f(x) for some func%on f(), not linear. So, may not preserve rela%ve distances, angles, etc.
Fish-eye projection from 3 to 2 dimensions
Multidimensional scaling

Idea: try to preserve the distances between points “as much as possible.” If we have the distances between all pairs of points in a distance matrix

Dij = |xi − xj| for all i, j

we can recover the original coordinates {xi} exactly (up to rigid transformations). Like working out a map of a country when you know only how far each city is from every other.
Multidimensional scaling: Torgerson's “classical MDS” algorithm (1952)
Reducing dimension with MDS

Notice: the dimension N is not encoded in the distance matrix D (it's M × M, where M is the number of points). The MDS formula therefore (theoretically) allows us to recover point coordinates {x} in any number of dimensions k.
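A sketch of Torgerson's procedure: double-center the squared distance matrix, then eigendecompose to read off coordinates, which are recovered only up to rotation and reflection.

```python
import numpy as np

def classical_mds(D, k=2):
    M = len(D)
    J = np.eye(M) - np.ones((M, M)) / M  # centering matrix
    B = -0.5 * J @ (D ** 2) @ J          # double-centered (Gram) matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]     # top-k eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Round trip: distances in, coordinates out (up to a rigid transformation).
rng = np.random.default_rng(5)
X = rng.normal(size=(8, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Y = classical_mds(D, k=2)
```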
MDS stress minimization

The formula actually minimizes “stress.” Think of “springs” between every pair of points; the spring between xi and xj has rest length dij. Stress is zero if all high-dimensional distances are matched exactly in the low dimension.

stress(x) = Σi,j ( |xi − xj| − dij )²
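The stress of a candidate layout, written out directly. Halving the double sum so each unordered pair counts once is a detail of this sketch, not of the slide's formula.

```python
import numpy as np

def stress(Y, D):
    """Sum of squared mismatches between layout distances and targets d_ij."""
    layout = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    return ((layout - D) ** 2).sum() / 2.0  # each unordered pair counted once

rng = np.random.default_rng(6)
Y = rng.normal(size=(5, 2))
D = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
print(stress(Y, D))  # 0.0: stress vanishes when all distances match exactly
```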
Multidimensional scaling

Like “flattening” a stretchy structure into 2D, so that distances between points are preserved (as much as possible).
House of Lords MDS plot
Robustness of results
Regarding these analyses of congressional vo%ng, we could s%ll ask: • Are we modeling the right thing? (What about other legisla%ve work, e.g. in commijee?)
• Are our underlying assump%ons correct? (do representa%ves really have “ideal points” in a preference space?)
• What are we trying to argue? What will be the effect of poin%ng out this result?
Why do clusters have meaning?

What is the connection between mathematical and semantic properties?
No unique “right” clustering

Different distance metrics and clustering algorithms give different results. Should we sort incident reports by location, time, actor, event type, author, cost, casualties…? There is only context-specific categorization. And the computer doesn't understand your context.
Different libraries, different categories