
Topological Data Analysis for detecting Hidden Patterns in Data
(statweb.stanford.edu/~susan/talks/AIMTopDA.pdf)

Susan Holmes

Statistics, Stanford, CA 94305.

Joint work with Persi Diaconis, Mehrdad Shahshahani and

Sharad Goel.

Thanks to Harold Widom, Gunnar Carlsson, John Chakarian,

Leonid Pekelis for discussions, and NSF grant DMS 0241246

for funding.


À la recherche du temps perdu: Gradients et Ordination

Many popular multivariate methods (Multidimensional Scaling, kernel PCA, correspondence analysis, metric MDS) are based on spectral decompositions of distances or transformed distances; they aim to detect hidden underlying structure of points in high dimensions.


A first type of dependence is a hidden gradient, placing points close to a curve in high-dimensional space. Ecologists and archeologists have long known to look for horseshoes or arches, which are symptomatic of such structure.


We take a political science example with data from the 2005 U.S. House of Representatives roll call votes. MDS and kernel PCA, in this case, output two 'horseshoes' that are characteristic of dimensionality reduction techniques.


PCA: Dimension Reduction

PCA seeks to replace the original (centered) matrix X by a matrix of lower rank. This can be solved by taking the singular value decomposition of X:

X = USV′, with U′DU = In, V′QV = Ip, and S diagonal

XX′ = US²U′, with U′DU = In and S² = Λ

PCA is a linear nonparametric multivariate method for dimension reduction.
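As an illustration (not part of the original slides), the SVD route to PCA can be sketched in a few lines; the data matrix Y here is synthetic.

```python
import numpy as np

# Hypothetical data: 20 observations of 5 variables.
rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 5))
X = Y - Y.mean(axis=0)                 # center the columns

# X = U S V': singular value decomposition of the centered matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Best rank-r approximation: keep only the r largest singular values.
r = 2
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# XX' = U S^2 U': the nonzero eigenvalues of XX' are the squared
# singular values of X.
top_eigs = np.linalg.eigvalsh(X @ X.T)[::-1][:5]
```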


Ordination: Finding Time (Le temps perdu...)

Early studies in archeology have aimed for seriation in time. Guttman, Kendall and Ter Braak have pointed out and studied the arch or horseshoe effect.

Here is a linguistic example where I dated the works of Plato according to their sentence endings, using a particular distance between the books called the chi-square distance.

As an example we take data analysed by Cox and Brandwood [?], who wanted to seriate Plato's works using the proportion of sentence endings in a given book with a given stress pattern. We propose the use of correspondence analysis on the table of frequencies of sentence endings; for a detailed analysis see Charnomordic and Holmes [?].


The first 10 profiles (as percentages) look as follows:

        Rep  Laws Crit Phil Pol  Soph Tim
UUUUU   1.1  2.4  3.3  2.5  1.7  2.8  2.4
-UUUU   1.6  3.8  2.0  2.8  2.5  3.6  3.9
U-UUU   1.7  1.9  2.0  2.1  3.1  3.4  6.0
UU-UU   1.9  2.6  1.3  2.6  2.6  2.6  1.8
UUU-U   2.1  3.0  6.7  4.0  3.3  2.4  3.4
UUUU-   2.0  3.8  4.0  4.8  2.9  2.5  3.5
--UUU   2.1  2.7  3.3  4.3  3.3  3.3  3.4
-U-UU   2.2  1.8  2.0  1.5  2.3  4.0  3.4
-UU-U   2.8  0.6  1.3  0.7  0.4  2.1  1.7
-UUU-   4.6  8.8  6.0  6.5  4.0  2.3  3.3
....... etc. (there are 32 rows in all)

The eigenvalues of the chi-square distance matrix (displayed in the scree plot, see [?]) show that two axes out of a possible 6 (the matrix is of rank 6) provide a summary of 85% of the departure from independence; this suggests that a planar representation will give a good visual summary of the data.

Eigenvalue  inertia   %      cumulative %
1           0.09170   68.96   68.96
2           0.02120   15.94   84.90
3           0.00911    6.86   91.76
4           0.00603    4.53   96.29
5           0.00276    2.07   98.36
6           0.00217    1.64  100.00
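The percentage and cumulative columns of this table can be recomputed from the eigenvalues alone; a quick sketch:

```python
# Eigenvalues from the table above; inertia percentages are each
# eigenvalue over the total, and the cumulative column is their running sum.
eig = [0.09170, 0.02120, 0.00911, 0.00603, 0.00276, 0.00217]
total = sum(eig)
pct = [100 * e / total for e in eig]
cum = [sum(pct[: i + 1]) for i in range(len(pct))]
```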


[Figure: Correspondence Analysis of Plato's Works; Axis 1: 69%, Axis 2: 16%; points labeled Rep, Laws, Crit, Phil, Pol, Soph, Tim.]

We can see from the plot that there is a seriation that, as in most cases, follows a parabola or arch [?], with Laws at one extreme being the latest work and the Republic being the earliest.


Examples from Ecology

The Boomlake plant data:

Biplot representing both species and locations

Blue circles with letters are species scores

Sampling locations are green circles with numbers.


Sample 1 is actually in the lake, and sample 12 is far away. Species are located close to the samples they occur in. If you looked carefully into the data matrix, you would find that species R and Q are strictly aquatic, while species F is a dryland plant (cribs). There is an arch effect.

Reference to site on ordination:

http://ordination.okstate.edu/CA.htm


Psychological Data

Color confusion data (Ekman, 1954):

    w434 w445 w465 w472 w490 w504 w537 w555 w584 w600 w610 w628 w651 w674
1   0.00 0.86 0.42 0.42 0.18 0.06 0.07 0.04 0.02 0.07 0.09 0.12 0.13 0.16
2   0.86 0.00 0.50 0.44 0.22 0.09 0.07 0.07 0.02 0.04 0.07 0.11 0.13 0.14
3   0.42 0.50 0.00 0.81 0.47 0.17 0.10 0.08 0.02 0.01 0.02 0.01 0.05 0.03
4   0.42 0.44 0.81 0.00 0.54 0.25 0.10 0.09 0.02 0.01 0.00 0.01 0.02 0.04
5   0.18 0.22 0.47 0.54 0.00 0.61 0.31 0.26 0.07 0.02 0.02 0.01 0.02 0.00
6   0.06 0.09 0.17 0.25 0.61 0.00 0.62 0.45 0.14 0.08 0.02 0.02 0.02 0.01
7   0.07 0.07 0.10 0.10 0.31 0.62 0.00 0.73 0.22 0.14 0.05 0.02 0.02 0.00
8   0.04 0.07 0.08 0.09 0.26 0.45 0.73 0.00 0.33 0.19 0.04 0.03 0.02 0.02
9   0.02 0.02 0.02 0.02 0.07 0.14 0.22 0.33 0.00 0.58 0.37 0.27 0.20 0.23
10  0.07 0.04 0.01 0.01 0.02 0.08 0.14 0.19 0.58 0.00 0.74 0.50 0.41 0.28
11  0.09 0.07 0.02 0.00 0.02 0.02 0.05 0.04 0.37 0.74 0.00 0.76 0.62 0.55
12  0.12 0.11 0.01 0.01 0.01 0.02 0.02 0.03 0.27 0.50 0.76 0.00 0.85 0.68
13  0.13 0.13 0.05 0.02 0.02 0.02 0.02 0.02 0.20 0.41 0.62 0.85 0.00 0.76
14  0.16 0.14 0.03 0.04 0.00 0.01 0.00 0.02 0.23 0.28 0.55 0.68 0.76 0.00


Results

[Figure: scree plot of the eigenvalues (class.col$eig) and the planar MDS configuration, cmdscale(colorc)[,1] versus cmdscale(colorc)[,2], with the 14 colors labeled 1 to 14.]


Metric Multidimensional Scaling

Schoenberg (1935)


Decomposition of Distances

If we start with original data Y in R^p that are not centered, apply the centering matrix:

X = HY, with H = (I − (1/n)11′) and 1′ = (1, 1, 1, . . . , 1)

Call B = XX′. If D^(2) is the matrix of squared distances between the rows of X in Euclidean coordinates, we can show that

−(1/2) HD^(2)H = B

We can go backwards from a matrix D to X by taking the eigendecomposition of B, in much the same way that PCA provides the best rank r approximation for data by taking the singular value decomposition of X, or the eigendecomposition of XX′.

X^(r) = US^(r)V′ with S^(r) = diag(s1, . . . , sr, 0, . . . , 0), i.e. only the r largest singular values are retained.
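A minimal numerical sketch of this recovery (synthetic data, not from the talk): double-center the squared distances to get B, eigendecompose, and read coordinates off the top eigenvectors.

```python
import numpy as np

# Hypothetical configuration: 10 centered points in R^3.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
X = X - X.mean(axis=0)

# Squared Euclidean distances between rows of X.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

# B = -1/2 H D^(2) H equals the Gram matrix XX' (Schoenberg).
n = len(X)
H = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * H @ D2 @ H

# Coordinates from the eigendecomposition of B, eigenvalues descending.
lam, U = np.linalg.eigh(B)
lam, U = lam[::-1], U[:, ::-1]
Xhat = U[:, :3] * np.sqrt(np.clip(lam[:3], 0, None))
```

Xhat agrees with X up to rotation, so it reproduces the original distances.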


Another approach: Markov Chain associated to the data

Consider data points X = {x1, . . . , xn} in a metric space (X, d).

We define a Markov chain on X that preferentially moves to nearby states via the transition kernel

K(xi, xj) = e^{−d(xi,xj)} / ∑_{k=1}^n e^{−d(xi,xk)}.

K has stationary distribution

π(xi) ∝ ∑_{k=1}^n e^{−d(xi,xk)}


and furthermore, (K,π) is reversible:

π(xi)K(xi, xj) = π(xj)K(xj, xi).

Because K is reversible, it is diagonalizable in L2(X, π) in a

real orthonormal basis of eigenfunctions f1, . . . , fn with

corresponding real eigenvalues,

1 = λ1 ≥ λ2 ≥ · · · ≥ λn > −1.

f1 ≡ 1 since K is stochastic. Having fixed an orthonormal

basis of eigenfunctions, the k-dimensional MDS is defined to

be

Γ : xi 7→ yi = (f2(xi), . . . , fk+1(xi))

We are generally interested in k ≪ n, for example, k ≤ 3.

Γ is an optimal mapping of X into Rk in the sense that it


minimizes

∑_{1≤i,j≤n} π(xi)K(xi, xj) ‖yi − yj‖²

over all Γ : X → R^k such that

1. ∑_{i=1}^n Γ^(p)(xi) Γ^(q)(xi) π(xi) = δ_{pq}, 1 ≤ p, q ≤ k

2. ∑_{i=1}^n Γ^(p)(xi) π(xi) = 0, 1 ≤ p ≤ k,

where Γ^(p)(xi) is the pth coordinate of Γ(xi) ∈ R^k. Condition 1 says that the coordinate functions of Γ are orthonormal in L²(π), and condition 2 says that they are also orthogonal to the constant functions.

Intuitively, Γ maps similar points in X (as measured via q(xi, xj) = π(xi)K(xi, xj)) to nearby points in R^k.
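A sketch of this construction (an illustration, not the talk's own code) on the uniform-grid model d(xi, xj) = |i/n − j/n| used later; symmetrizing by π is one standard way to diagonalize a reversible chain.

```python
import numpy as np

# Points x_i = i/n on a line, exponential similarity, row-normalized kernel.
n = 50
x = np.arange(n) / n
d = np.abs(x[:, None] - x[None, :])       # d(x_i, x_j) = |i/n - j/n|
S = np.exp(-d)
K = S / S.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
pi = S.sum(axis=1) / S.sum()              # stationary distribution

# Reversibility: pi_i K_ij = pi_j K_ji, so D^{1/2} K D^{-1/2} is symmetric.
Dh = np.diag(np.sqrt(pi))
Dhi = np.diag(1 / np.sqrt(pi))
A = Dh @ K @ Dhi
lam, V = np.linalg.eigh((A + A.T) / 2)    # symmetrize against round-off
lam, V = lam[::-1], V[:, ::-1]            # eigenvalues descending

F = Dhi @ V                               # eigenfunctions of K in L2(pi)
Y = F[:, 1:3]                             # 2-dimensional MDS map (skip f1 = 1)
```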


In the preceding, we started with a metric d on X and built a similarity S(xi, xj) = e^{−d(xi,xj)}, which in turn leads to a Gram matrix G. We could instead define G via alternative measures of similarity, e.g.

S(xi, xj) = sup_{xk,xl} d(xk, xl) − d(xi, xj).

More generally, we could begin with an arbitrary reversible Markov chain on X, bypassing the metric d altogether.


The Voting Data

We are going to carefully analyze the output of

multidimensional scaling applied to the 2005 U.S. House of

Representatives roll call votes. The resultant 3-dimensional

mapping of legislators shows ‘horseshoes’ that are

characteristic of a number of dimensionality reduction

techniques, including principal components analysis and

correspondence analysis.

These patterns are heuristically attributed to a latent

ordering of the data, e.g. the ranking of politicians within a

left-right spectrum.


Roll Call Data

We apply the eigendecomposition algorithm to members of

the 2005 U.S. House of Representatives with the distance

between legislators defined via roll call votes [?].

A full House consists of 435 members, and in 2005 there

were 671 roll calls. The first two roll calls were a call of the

House by States and the election of the Speaker, and so were

excluded from our analysis. Hence, the data can be arranged into a 435 × 669 matrix Y = (yij), with yij ∈ {1/2, −1/2, 0} indicating, respectively, a vote of 'yea', 'nay', or 'not voting' by Representative i on roll call j.

We further restricted our analysis to the 401 Representatives who voted on at least 90% of the roll calls (220 Republicans, 180 Democrats and 1 Independent), leading to a 401 × 669 matrix V of voting data.


The Data

    V1  V2  V3  V4  V5  V6  V7  V8  V9  V10
1   -1  -1   1  -1   0   1   1   1   1   1
2   -1  -1   1  -1   0   1   1   1   1   1
3    1   1  -1   1  -1   1   1  -1  -1  -1
4    1   1  -1   1  -1   1   1  -1  -1  -1
5    1   1  -1   1  -1   1   1  -1  -1  -1
6   -1  -1   1  -1   0   1   1   1   1   1
7   -1  -1   1  -1  -1   1   1   1   1   1
8   -1  -1   1  -1   0   1   1   1   1   1
9    1   1  -1   1  -1   1   1  -1  -1  -1
10  -1  -1   1  -1   0   1   1   0   0   0


This step removed, for example, the Speaker of the House Dennis Hastert (R-IL), who by custom votes only when his vote would be decisive, and Robert T. Matsui (D-CA), who passed away at the start of the term.

We define a distance between legislators as

d̂(li, lj) = (1/669) ∑_{k=1}^{669} |vik − vjk|.

Roughly, d̂(li, lj) is the percentage of roll calls on which legislators li and lj disagreed. This interpretation would be exact if not for the possibility of 'not voting'.
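On a toy voting matrix (invented here, votes coded 1/2, −1/2, 0 as above), this distance is a one-liner:

```python
import numpy as np

# Hypothetical votes: 3 legislators on 4 roll calls, coded 1/2, -1/2, 0.
V = np.array([
    [ 0.5,  0.5, -0.5,  0.5],
    [ 0.5, -0.5, -0.5,  0.0],
    [-0.5, -0.5,  0.5,  0.5],
])
m = V.shape[1]

# d(l_i, l_j) = (1/m) * sum_k |v_ik - v_jk| for every pair of legislators.
D = np.abs(V[:, None, :] - V[None, :, :]).sum(-1) / m
```

Legislators 1 and 3 disagree on the first three of four bills, so D[0, 2] = 3/4.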

Since we now have data points in a metric space, we can apply the MDS algorithm. The figure shows the results of a 3-dimensional MDS mapping. The most striking feature of the mapping is that the data separate into 'twin horseshoes'.

In the next figure we have added color to indicate the

political party affiliation of each Representative (blue for

Democrat, red for Republican, and green for the lone

independent–Rep. Bernie Sanders of Vermont). The output

from MDS is qualitatively similar to that obtained from other

dimensionality reduction techniques, such as principal

components analysis applied directly to the voting matrix V .

We build and analyze a model for the data in an effort to

understand and interpret these pictures. Roughly our theory

predicts that the Democrats, for example, are ordered along

the blue curve in correspondence to their political ideology,

i.e. how far they lean to the left.

We discuss connections between the theory and the data. In


particular, we explain why, in the data, legislators at the political extremes are not quite at the tips of the MDS curves but rather are positioned slightly toward the center. Briefly, this amounts to the fact that there are distinct groups of relatively liberal Republicans, which accordingly exhibit quite different voting patterns.


[Figure: 3-dimensional MDS mapping of legislators based on the 2005 U.S. House of Representatives roll call votes.]


[Figure: 3-dimensional MDS mapping of legislators based on the 2005 U.S. House of Representatives roll call votes. Color has been added to indicate the party affiliation of each representative.]


A Model for the Data

Following the standard paradigm of placing politicians within a left-right spectrum, it is natural to identify legislators li, 1 ≤ i ≤ n, with points in the interval I = [0, 1] in correspondence with their political ideologies. We define the distance between legislators to be

d(li, lj) = |li − lj|.

This assumption that legislators can be isometrically mapped into an interval is key to our analysis.

To apply MDS to the voting data, we defined a distance

between legislators via roll call votes. We now introduce a


‘cut-point model’ for voting that connects our distance d

above to the data-based roll call distance.

The Model: Each bill 1 ≤ k ≤ m on which the legislators

vote is represented as a pair

(Ck, Pk) ∈ [0, 1]× {0, 1}.

We can think of Pk as indicating whether the bill is liberal

(Pk = 0) or conservative (Pk = 1), and we can take Ck to be

the cut-point between legislators that vote ‘yea’ or ‘nay’. Let

Vik ∈ {1/2, −1/2} indicate how legislator li votes on bill k. Then, in this model,

Vik = 1/2 − Pk if li ≤ Ck, and Vik = Pk − 1/2 if li > Ck.


As described, the model has n + 2m parameters: one for each legislator and two for each bill. We reduce the number of parameters by assuming that the cut-points are independent random variables, uniform on I. Then,

P(Vik ≠ Vjk) = d(li, lj)     (1)

since legislators li and lj take opposite sides on a given bill if and only if the cut-point Ck divides them. Observe that the Pk do not affect the probability above.
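Identity (1) is easy to check by simulation; the positions li, lj and the seed below are arbitrary choices for illustration.

```python
import random

# Cut-point model: legislators at positions l_i, l_j in [0, 1]; cut-points
# C_k uniform on [0, 1]. They disagree on bill k exactly when C_k falls
# between them (the bill's polarity P_k plays no role in this probability).
random.seed(0)
li, lj, m = 0.2, 0.7, 200_000
disagree = sum(1 for _ in range(m) if li <= random.random() < lj)
phat = disagree / m   # should be close to |l_i - l_j| = 0.5
```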

Define the empirical distance between legislators li and lj by

d̂m(li, lj) = (1/m) ∑_{k=1}^m |Vik − Vjk| = (1/m) ∑_{k=1}^m 1_{Vik ≠ Vjk}.

By (1), we can estimate the distance d between legislators by the distance d̂, which is computable from the voting record. In particular,

lim_{m→∞} d̂m(li, lj) = d(li, lj) a.s.

since we assumed the cut-points are independent. More precisely, we have the following result:

Lemma. For m ≥ log(n/√ε)/ε²,

P(|d̂m(li, lj) − d(li, lj)| ≤ ε for all 1 ≤ i, j ≤ n) ≥ 1 − ε.

Proof. By the Hoeffding inequality, for fixed li and lj,

P(|d̂m(li, lj) − d(li, lj)| > ε) ≤ 2e^{−2mε²}.


Consequently,

P(⋃_{1≤i<j≤n} {|d̂m(li, lj) − d(li, lj)| > ε}) ≤ ∑_{1≤i<j≤n} P(|d̂m(li, lj) − d(li, lj)| > ε) ≤ (n choose 2) · 2e^{−2mε²} ≤ ε

for m ≥ log(n/√ε)/ε², and the result follows.

In our model we identified latent variables with points in the

interval I = [0, 1] and accordingly defined the distance

between them to be d(li, lj) = |li − lj|. This general

description seems to be reasonable in a number of

applications. We then built a simple model for the data that


facilitated empirical approximation of this distance. This

second step depends heavily on the application. In the rest

of the paper, we simply assume that the distance d can be

reasonably approximated from the data.


Analysis of the Model

In this section we analyze the MDS algorithm applied to

metric models satisfying

d(xi, xj) = |i/n− j/n|.

This corresponds to the case in which legislators are

uniformly spaced in I: li = i/n.


Similarity and Transition Matrices

Given a distance d on a state space X , there are several ways

to build a similarity S. Two standard transformations are:

1. S1(xi, xj) = e^{−d(xi,xj)}

2. S2(xi, xj) = sup_{zi,zj} d(zi, zj) − d(xi, xj)

Once we have a similarity, we can define a Gram/kernel matrix K by normalizing the rows. That is,

K(xi, xj) = S(xi, xj) / ∑_{xk} S(xi, xk).

To ease the analysis, sometimes we instead normalize the similarity matrix by the average row sum

z = (1/n) ∑_{xi} ∑_{xj} S(xi, xj).

That is, we set K(xi, xj) = S(xi, xj)/z.


Eigenvectors and Horseshoes

We find approximate eigenfunctions and eigenvalues for

models that satisfy

d(xi, xj) = |i/n− j/n|

with Gram matrices that are built with either a linear

similarity or an exponential similarity. The eigenfunctions are

found by continuizing the discrete Gram matrix, and then solving the corresponding integral equation

∫₀¹ K(x, y) f(y) dy = λ f(x).

Standard matrix perturbation theory can then be applied to


recover approximate eigenfunctions for the original, discrete

kernel.

The eigenfunctions that we derive are in agreement with

those arising from the voting data, and lend considerable

insight into our data analysis problem and also into general

features of MDS mappings.


Approximate Eigenfunctions

We now state a classical perturbation result that relates two

different notions of an approximate eigenfunction. For more

refined estimates, see Parlett [?].

Theorem. Consider an n × n symmetric matrix A with eigenvalues λ1 ≤ · · · ≤ λn. If for ε > 0,

‖Af − λf‖₂ ≤ ε

for some f, λ with ‖f‖₂ = 1, then A has an eigenvalue λk such that |λk − λ| ≤ ε.

If we further assume that s = min_{i : λi ≠ λk} |λi − λk| > ε, then A has an eigenfunction fk such that Afk = λk fk and ‖f − fk‖₂ ≤ ε/(s − ε).


Remark. The second statement of the theorem allows non-simple eigenvalues, but requires that the eigenvalues corresponding to distinct eigenspaces be well-separated.

Remark. The eigenfunction bound of the theorem is asymptotically tight in ε, as the following example illustrates. Consider the matrix

A = [ λ  0 ; 0  λ + s ]

with s > 0. For ε < s define the function

f = ( √(1 − ε²/s²), ε/s )′.

Then ‖f‖₂ = 1 and ‖Af − λf‖₂ = ε. The theorem guarantees that there is an eigenfunction fk with eigenvalue λk such that |λ − λk| ≤ ε. Since the eigenvalues of A are λ and λ + s, and since s > ε, we must have λk = λ. Let Vk = {fk : Afk = λk fk} = {ce1 : c ∈ R}, where e1 is the first standard basis vector. Then

min_{fk ∈ Vk} ‖f − fk‖₂ = ‖f − (f · e1)e1‖₂ = ε/s.

The bound of the theorem, ε/(s − ε), is only slightly larger.
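The 2 × 2 example above can be checked numerically (λ, s, ε below are arbitrary values with ε < s):

```python
import numpy as np

# The tightness example: A = diag(λ, λ+s), f = (sqrt(1 - ε²/s²), ε/s)'.
lam0, s, eps = 1.0, 0.5, 0.1
A = np.array([[lam0, 0.0], [0.0, lam0 + s]])
f = np.array([np.sqrt(1 - (eps / s) ** 2), eps / s])

# ||Af - λf||_2 should equal ε exactly.
resid = np.linalg.norm(A @ f - lam0 * f)

# Distance from f to the eigenspace {c e1} is ε/s, below the bound ε/(s-ε).
e1 = np.array([1.0, 0.0])
gap = np.linalg.norm(f - (f @ e1) * e1)
```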


Proof of the Approximate Eigenfunction Theorem

Proof. First we show that min_i |λi − λ| ≤ ε. If min_i |λi − λ| = 0 we are done; otherwise A − λI is invertible. Then,

‖f‖₂ ≤ ‖(A − λI)^{−1}‖ · ‖(A − λI)f‖₂ ≤ ε ‖(A − λI)^{−1}‖.

Since the eigenvalues of (A − λI)^{−1} are 1/(λ1 − λ), . . . , 1/(λn − λ), by symmetry

‖(A − λI)^{−1}‖ = 1 / min_i |λi − λ|.

The result now follows since ‖f‖₂ = 1.

Set λk = argmin_i |λi − λ|, and consider an orthonormal basis g1, . . . , gm of the associated eigenspace E_{λk}. Define fk to be the projection of f onto E_{λk}:

fk = ⟨f, g1⟩g1 + · · · + ⟨f, gm⟩gm.

Then fk is an eigenfunction with eigenvalue λk. Writing f = fk + (f − fk) we have

(A − λI)f = (A − λI)fk + (A − λI)(f − fk) = (λk − λ)fk + (A − λI)(f − fk).

Since f − fk ∈ E⊥_{λk}, by symmetry we have

⟨fk, A(f − fk)⟩ = ⟨Afk, f − fk⟩ = ⟨λk fk, f − fk⟩ = 0.

Consequently, ⟨fk, (A − λI)(f − fk)⟩ = 0 and by Pythagoras

‖Af − λf‖₂² = (λk − λ)² ‖fk‖₂² + ‖(A − λI)(f − fk)‖₂².

In particular, ε ≥ ‖Af − λf‖₂ ≥ ‖(A − λI)(f − fk)‖₂. For λi ≠ λk, |λi − λ| ≥ s − ε. The result now follows since for h ∈ E⊥_{λk},

‖(A − λI)h‖₂ ≥ (s − ε) ‖h‖₂.


Centering Kernel Matrices

If our kernel K is normalized so that it has row sums 1,

K1n = 1n,

then 1n is an eigenvector of K with eigenvalue 1.

As a consequence, if we recenter K by applying the centering matrix H = I − (1/n)11′, then for any eigenvector v different from 1n,

KHv = Kv − (1/n)K1n1′nv = λv

and also HKHv = λHv = λv.

So we will not bother to recenter the K matrix.


Linear Similarity

When we make a continuous version of the discrete Kernel

matrix Kn, we get the continuous kernel

K(x, y) = (3/2)[1 − |x − y|].

Once we guess that the solutions to the corresponding

integral equation are trigonometric, verifying this is

straightforward. We start with a simple integral computation.


Lemma. For a ≠ 0,

∫₀¹ cos(ax + b)[1 − |c − x|] dx = (2/a²) cos(ac + b) − (1/a²)[a sin b − ac sin b − ac sin(a + b) + cos b + cos(a + b)].

In particular:

1. For odd integers k,

∫₀¹ sin(kπ(x − 1/2))[1 − |c − x|] dx = (2/(kπ)²) sin(kπ(c − 1/2)).

2. For solutions to (a/2) tan(a/2) = 1,

∫₀¹ cos(a(x − 1/2))[1 − |c − x|] dx = (2/a²) cos(a(c − 1/2)).


Proof. The result follows from a straightforward calculation. Set

fc(x) = cos(ax + b)[1 − |c − x|].

Then

∫₀¹ fc(x) dx = (1 − c) ∫₀ᶜ cos(ax + b) dx + ∫₀ᶜ x cos(ax + b) dx + (1 + c) ∫ᶜ¹ cos(ax + b) dx − ∫ᶜ¹ x cos(ax + b) dx.

Integration by parts shows that

∫ x cos(ax + b) dx = (x/a) sin(ax + b) + (1/a²) cos(ax + b).

Substituting into the above, we have

∫₀¹ fc(x) dx = (1/a²)[a(1 − c) sin(ac + b) − a(1 − c) sin b + a(1 + c) sin(a + b) − a(1 + c) sin(ac + b) + ac sin(ac + b) + cos(ac + b) − cos b − a sin(a + b) − cos(a + b) + ac sin(ac + b) + cos(ac + b)].

At a = kπ and b = 0, for k an odd integer,

a sin b − ac sin b − ac sin(a + b) + cos b + cos(a + b) = 0,

and so

∫₀¹ cos(kπx)[1 − |c − x|] dx = (2/(kπ)²) cos(kπc).

Since for odd k

sin(kπ(x − 1/2)) = cos(kπx − π(k + 1)/2) = (−1)^{(k+1)/2} cos(kπx),

the first part of the lemma follows. At b = −a/2, where a is a solution to (a/2) tan(a/2) = 1,

a sin b − ac sin b − ac sin(a + b) + cos b + cos(a + b) = −a sin(a/2) + 2 cos(a/2) = 0.

Consequently,

∫₀¹ cos(ax − a/2)[1 − |c − x|] dx = (2/a²) cos(ac − a/2)

for a a solution to (a/2) tan(a/2) = 1.


The solutions of (a/2) tan(a/2) = 1 occur at approximately a = 2kπ for integers k. More precisely, we have the following result.

Lemma. The positive solutions of (a/2) tan(a/2) = 1 lie in the set

(0, π) ∪ ⋃_{k=1}^∞ (2kπ, 2kπ + 2/(kπ)),

with exactly one solution per interval. Furthermore, a is a solution if and only if −a is a solution.


Proof. Let f(θ) = (θ/2) tan(θ/2). Then f is an even function, so a is a solution to f(θ) = 1 if and only if −a is a solution. Since f′(θ) = (1/2) tan(θ/2) + (θ/4) sec²(θ/2), f(θ) is non-negative and increasing in the first and second quadrants, and furthermore

f(2kπ) = 0 < 1 < lim_{θ→(2k+1)π⁻} f(θ) = +∞.

The third and fourth quadrants have no solutions since f(θ) ≤ 0 in those regions. This shows that the solutions to f(θ) = 1 lie in the intervals

⋃_{k=0}^{∞} (2kπ, 2kπ + π)

with exactly one solution per interval. Recall the power series


expansion of tan θ for |θ| < π/2 is

tan θ = θ + θ³/3 + 2θ⁵/15 + 17θ⁷/315 + . . . .

In particular, for 0 ≤ θ < π/2, tan θ ≥ θ. Finally, for integers k ≥ 1,

f(2kπ + 2/(kπ)) = (kπ + 1/(kπ)) tan(kπ + 1/(kπ))
= (kπ + 1/(kπ)) tan(1/(kπ))
≥ (kπ + 1/(kπ)) (1/(kπ))
> 1,

which gives the result.


Remark. The first few positive solutions of (a/2) tan(a/2) = 1 are

1. a = 1.72066717803876 . . .

2. a = 6.85123691896346 . . .

3. a = 12.87459635834389 . . .

4. a = 19.05866881072393 . . .
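These roots are easy to reproduce from the Lemma's bracketing intervals. A minimal bisection sketch (the helper name `root_half_tan` is ours):

```python
import math

def root_half_tan(lo, hi, tol=1e-13):
    """Bisection for (a/2)*tan(a/2) = 1 on an interval where the sign changes."""
    f = lambda t: (t / 2) * math.tan(t / 2) - 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if f(lo) * f(mid) <= 0 else (mid, hi)
    return (lo + hi) / 2

# One root in (0, pi), and one in each interval (2k*pi, 2k*pi + 2/(k*pi)).
roots = [root_half_tan(1e-9, math.pi - 1e-9)]
roots += [root_half_tan(2 * k * math.pi + 1e-12, 2 * k * math.pi + 2 / (k * math.pi))
          for k in range(1, 4)]
print([round(r, 11) for r in roots])
```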


Lemma. For 1 ≤ i, j ≤ n, let

Kn(xi, xj) = (3/(2n)) (1 − |i − j|/n).

Set fn,a(xi) = cos(a(i/n − 1/2)) where a is a positive solution to (a/2) tan(a/2) = 1, and set gn,k(xi) = sin(kπ(i/n − 1/2)) for k ≥ 1 an odd integer. Then

|Kn fn,a(xi) − (3/a²) fn,a(xi)| ≤ (a + 1)/n

and

|Kn gn,k(xi) − (3/(kπ)²) gn,k(xi)| ≤ (kπ + 1)/n.

That is, fn,a and gn,k are approximate eigenfunctions of Kn with approximate eigenvalues proportional to their squared periods.


Proof. Once we guess that f and g are approximate eigenfunctions of Kn, the proof of this fact follows from the integral computation in the previous Lemma. We have

Kn fn,a(xi) = (3/(2n)) Σ_{j=1}^{n} cos(a(j/n − 1/2)) [1 − |i/n − j/n|]

= (3/2) ∫₀¹ cos(a(x − 1/2)) [1 − |i/n − x|] dx + (3/2) Rn

= (3/a²) fn,a(xi) + (3/2) Rn by the Lemma,

where the error term satisfies

|Rn| ≤ M/(2n) for M ≥ sup_{0≤x≤1} |(d/dx) cos(a(x − 1/2)) [1 − |i/n − x|]|

by the standard right-hand rule error bound. In particular, we can take M = a + 1 independent of i, from which the result for fn,a follows. The case of gn,k is analogous.
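The bound can be checked directly with a pure-Python matrix-vector product (a sketch; the sizes and the choice of a, the first root quoted in the Remark, are ours):

```python
import math

# Check the eigenfunction bound |K_n f_{n,a} - (3/a^2) f_{n,a}| <= (a+1)/n
# for a = 1.72066717803876, the first root of (a/2) tan(a/2) = 1.
n = 400
a = 1.72066717803876
f = [math.cos(a * (i / n - 0.5)) for i in range(1, n + 1)]
Kf = [sum(3 / (2 * n) * (1 - abs(i - j) / n) * f[j - 1] for j in range(1, n + 1))
      for i in range(1, n + 1)]
residual = max(abs(Kf[i] - 3 / a ** 2 * f[i]) for i in range(n))
print(residual, (a + 1) / n)  # the residual should respect the bound
```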

Lemma. For 1 ≤ i, j ≤ n set

Kn(xi, xj) = (3/(2n)) (1 − |i − j|/n)

and let λ1, . . . , λn be the eigenvalues of Kn.

1. For positive solutions a of (a/2) tan(a/2) = 1,

min_{1≤i≤n} |λi − 3/a²| ≤ 2(a + 1)/√n.

2. For odd integers k ≥ 1,

min_{1≤i≤n} |λi − 3/(kπ)²| ≤ (kπ + 1)/√n.


Remark. By the Remark above, the first few values of 3/a² for solutions to (a/2) tan(a/2) = 1 are

1. 1.01327541515878 . . .

2. 0.06391212873818 . . .

3. 0.01809897627265 . . .

4. 0.00825916473010 . . .

and the first few values of 3/(kπ)² for k ≥ 1 an odd integer are

1. 0.30396355092701 . . .

2. 0.03377372788078 . . .

3. 0.01215854203708 . . .

4. 0.00620333777402 . . .


Exponential Transformation of Similarity

The case of exponential similarity is analogous to that of linear similarity. Continuizing the discrete Gram matrix Kn, we get the kernel

K(x, y) = (e/2) e^{−|x−y|}.

Once again, we find trigonometric solutions to Kf = λf.

Lemma. For constants a, c ∈ ℝ,

∫₀¹ e^{−|x−c|} cos[a(x − 1/2)] dx = (2 cos[a(c − 1/2)])/(1 + a²) + ((e^{−c} + e^{c−1}) (a sin(a/2) − cos(a/2)))/(1 + a²)


and

∫₀¹ e^{−|x−c|} sin[a(x − 1/2)] dx = (2 sin[a(c − 1/2)])/(1 + a²) + ((e^{−c} − e^{c−1}) (a cos(a/2) + sin(a/2)))/(1 + a²).

In particular,

1. For a such that a tan(a/2) = 1,

∫₀¹ e^{−|x−c|} cos[a(x − 1/2)] dx = (2 cos[a(c − 1/2)])/(1 + a²).

2. For a such that a cot(a/2) = −1,

∫₀¹ e^{−|x−c|} sin[a(x − 1/2)] dx = (2 sin[a(c − 1/2)])/(1 + a²).


Proof. The lemma follows from a straightforward integration. First split the integral into two pieces:

∫₀¹ e^{−|x−c|} cos[a(x − 1/2)] dx = ∫₀^c e^{x−c} cos[a(x − 1/2)] dx + ∫_c^1 e^{c−x} cos[a(x − 1/2)] dx.

By integration by parts applied twice,

∫ e^{x−c} cos[a(x − 1/2)] dx = (a e^{x−c} sin(a(x − 1/2)) + e^{x−c} cos(a(x − 1/2)))/(1 + a²)

and

∫ e^{c−x} cos[a(x − 1/2)] dx = (a e^{c−x} sin(a(x − 1/2)) − e^{c−x} cos(a(x − 1/2)))/(1 + a²).

Evaluating these expressions at the appropriate limits of integration gives the first statement of the lemma. The computation of ∫₀¹ e^{−|x−c|} sin[a(x − 1/2)] dx is analogous.
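The special case in statement 1 can likewise be checked by quadrature (a midpoint-rule sketch; the values of a, c and n are ours, with a taken from the Remark below):

```python
import math

# Statement 1: for a with a*tan(a/2) = 1, the integral of e^{-|x-c|} cos[a(x-1/2)]
# over [0, 1] equals 2 cos[a(c-1/2)]/(1 + a^2).  Midpoint-rule check:
a = 1.30654237418881   # first positive root of a*tan(a/2) = 1
c = 0.3
n = 20000
h = 1.0 / n
mids = [(j + 0.5) * h for j in range(n)]
lhs = h * sum(math.exp(-abs(x - c)) * math.cos(a * (x - 0.5)) for x in mids)
rhs = 2 * math.cos(a * (c - 0.5)) / (1 + a * a)
print(abs(lhs - rhs))  # should be tiny
```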


The solutions of a tan(a/2) = 1 are approximately 2kπ for integers k, and the solutions of a cot(a/2) = −1 are approximately (2k + 1)π.

Lemma.

1. The positive solutions of a tan(a/2) = 1 lie in the set

(0, π) ∪ ⋃_{k=1}^{∞} (2kπ, 2kπ + 1/(kπ))

with exactly one solution per interval. Furthermore, a is a solution if and only if −a is a solution.

2. The positive solutions of a cot(a/2) = −1 lie in the set

⋃_{k=0}^{∞} ((2k + 1)π, (2k + 1)π + 1/(kπ + π/2))

with exactly one solution per interval. Furthermore, a is a solution if and only if −a is a solution.

Remark. The first few positive solutions of a tan(a/2) = 1 are

1. a = 1.30654237418881 . . .

2. a = 6.58462004256417 . . .

3. a = 12.72324078413133 . . .

4. a = 18.95497141084159 . . .

and the first few positive solutions of a cot(a/2) = −1 are

1. a = 3.67319440630425 . . .

2. a = 9.63168463569187 . . .

3. a = 15.83410536933241 . . .

4. a = 22.08165963594259 . . .

Lemma. For 1 ≤ i, j ≤ n, let

Kn(xi, xj) = (e/(2n)) e^{−|i−j|/n}.

Set fn,a(xi) = cos(a(i/n − 1/2)) where a is a positive solution to a tan(a/2) = 1, and set gn,a(xi) = sin(a(i/n − 1/2)) where a is a positive solution to a cot(a/2) = −1. Then

|Kn fn,a(xi) − (e/(1 + a²)) fn,a(xi)| ≤ 2(a + 1)/n

|Kn gn,a(xi) − (e/(1 + a²)) gn,a(xi)| ≤ 2(a + 1)/n.

That is, fn,a and gn,a are approximate eigenfunctions of Kn.


Lemma. For 1 ≤ i, j ≤ n set

Kn(xi, xj) = (e/(2n)) e^{−|i−j|/n}

and let λ1, . . . , λn be the eigenvalues of Kn.

1. For positive solutions a of a tan(a/2) = 1,

min_{1≤i≤n} |λi − e/(1 + a²)| ≤ 4(a + 1)/√n.

2. For positive solutions a of a cot(a/2) = −1,

min_{1≤i≤n} |λi − e/(1 + a²)| ≤ 4(a + 1)/√n.

Remark. The first few values of e/(1 + a²) for solutions to a tan(a/2) = 1 are

1. 1.00414799895293 . . .

2. 0.06128160783626 . . .

3. 0.01668877420197 . . .

4. 0.00754468546867 . . .

The first few values of e/(1 + a²) for solutions to a cot(a/2) = −1 are

1. 0.18756657740212 . . .

2. 0.02898902316936 . . .

3. 0.01079887885138 . . .

4. 0.00556341289490 . . .


Horseshoes and Twin Horseshoes

The 2-dimensional mapping is built out of the second and third eigenfunctions of the Gram matrix. Above we computed several approximate eigenfunctions and eigenvalues for the Gram matrix arising from the voting model. The linear and exponential similarity cases are analogous, and so we only consider the latter here. In this case, we have the approximate eigenfunctions

1. fn,1(xi) = cos(1.3065(i/n − 1/2)) with eigenvalue λ ≈ 1.004

2. fn,2(xi) = sin(3.6732(i/n − 1/2)) with eigenvalue λ ≈ 0.1876

3. fn,3(xi) = cos(6.5846(i/n − 1/2)) with eigenvalue λ ≈ 0.06128.
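These eigenvalues are simply e/(1 + a²) evaluated at the roots quoted in the earlier Remarks:

```python
import math

# lambda = e/(1 + a^2) at the first roots of a*tan(a/2) = 1 and a*cot(a/2) = -1.
a_values = [1.30654237418881, 3.67319440630425, 6.58462004256417]
lams = [math.e / (1 + a * a) for a in a_values]
print([round(l, 5) for l in lams])
```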


[Figure] Approximate eigenfunctions f1, f2 and f3.


[Figure] A horseshoe that results from plotting Λ : xi ↦ (f2(xi), f3(xi)).

In particular, from Λ it is possible to deduce the relative order of the representatives in the interval I. Since −f2 is also an eigenfunction, it is not in general possible to determine the absolute order knowing only that Λ comes from the eigenfunctions.

You need a crib!


Voting Data

With the voting data, we see not one, but two horseshoes.

To see how this can happen, consider the two population

state space X = {x1, . . . , xn1, y1, . . . , yn2} with distance

d(xi, xj) = |i/n1 − j/n1|, d(yi, yj) = |i/n2 − j/n2| and

d(xi, yj) = +∞. This leads to the partitioned Gram matrix

Kn1+n2 = [ Kn1   0
            0   Kn2 ].

The approximate eigenfunctions and eigenvalues that we

found above for the single population model can now be

used to build higher dimensional eigenspaces for the two

population model. In particular, we have the following


approximate eigenspaces:

Eigenspace with eigenvalue λ ≈ 1.004, containing the orthogonal functions

gn,1(xi) = √((n1 + n2)/n1) fn1,1(xi) · 1{1 ≤ i ≤ n1} + √((n1 + n2)/n2) fn2,1(xi − n1) · 1{n1 < i ≤ n1 + n2}

gn,2(xi) = √((n1 + n2)/n1) fn1,1(xi) · 1{1 ≤ i ≤ n1} − √((n1 + n2)/n2) fn2,1(xi − n1) · 1{n1 < i ≤ n1 + n2}

Eigenspace with eigenvalue λ ≈ 0.1876, containing the orthogonal functions

gn,3(xi) = a √((n1 + n2)/n1) fn1,2(xi) · 1{1 ≤ i ≤ n1} + √((n1 + n2)/n2) fn2,2(xi − n1) · 1{n1 < i ≤ n1 + n2}

gn,4(xi) = √((n1 + n2)/n1) fn1,2(xi) · 1{1 ≤ i ≤ n1} − a √((n1 + n2)/n2) fn2,2(xi − n1) · 1{n1 < i ≤ n1 + n2}

These functions are graphed for the case n1 = n2 and a = 1/5. Moreover, plotting the 3-dimensional mapping Λ : xi ↦ (g2(xi), g3(xi), g4(xi)) results in twin horseshoes.
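Because the two-population Gram matrix is block diagonal, each block acts on its own piece of a stacked vector. A small pure-Python sketch (sizes n1, n2 chosen for illustration) checks that stacking the within-block cosine eigenfunctions still gives an approximate eigenvector with the same eigenvalue:

```python
import math

a = 1.30654237418881           # first positive root of a*tan(a/2) = 1
lam = math.e / (1 + a * a)     # approximate eigenvalue, ~1.004

def apply_block(n, f):
    """Apply K_n(x_i, x_j) = (e/(2n)) e^{-|i-j|/n} to a vector f of length n."""
    return [sum(math.e / (2 * n) * math.exp(-abs(i - j) / n) * f[j - 1]
                for j in range(1, n + 1)) for i in range(1, n + 1)]

n1, n2 = 150, 100
f1 = [math.cos(a * (i / n1 - 0.5)) for i in range(1, n1 + 1)]
f2 = [math.cos(a * (i / n2 - 0.5)) for i in range(1, n2 + 1)]
# The block-diagonal K acts on the stacked vector (f1, f2) block by block,
# so the stacked vector is an approximate eigenvector with eigenvalue lam.
g = f1 + f2
Kg = apply_block(n1, f1) + apply_block(n2, f2)
residual = max(abs(Kg[i] - lam * g[i]) for i in range(n1 + n2))
print(residual, 2 * (a + 1) / min(n1, n2))
```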


[Figure] Approximate eigenfunctions g1, g2, g3 and g4 for the Gram matrix arising from the two population model.


[Figure] Twin horseshoes that result from plotting Λ : xi ↦ (g2(xi), g3(xi), g4(xi)).

The approximate eigenfunctions derived above are stable to noise; numerically, this is the case, as seen below.


The figure below was generated by adding normal N(0, 1/5) noise to the Gram matrix K200 before normalizing by the average row sum. The specific form of the noise does not noticeably affect the results.

[Figure] Approximate eigenfunctions of the noise-perturbed Gram matrix K200.


Connecting the Model to the Data

When we apply eigendecomposition to the voting data, the first few eigenvalues are:

1
0.17709857573272 . . .
0.01037622989886 . . .
0.00831940284881 . . .
0.00484075498479 . . .
0.00344207632723 . . .
0.00266158512355 . . .
0.00248175112290 . . .


[Figure] The re-indexed second, third and fourth eigenfunctions output by the MDS algorithm applied to the 2005 U.S. House of Representatives roll call votes. Colors indicate political parties.


Since legislators are not a priori ordered, the eigenfunctions

are difficult to interpret. However, our model suggests the

following ordering: Split the legislators into two groups G1

and G2 based on the sign of f2(xi); then the norm of f3 is

larger on one group, say G1, so we sort G1 based on

increasing values of f3, and similarly, sort G2 via f4.
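A minimal sketch of this ordering rule (the function name `mds_order` and the toy eigenvector values are ours):

```python
def mds_order(f2, f3, f4):
    """Split on the sign of f2; sort the group where f3 has larger norm by f3,
    and the other group by f4 (a sketch of the ordering suggested above)."""
    n = len(f2)
    G1 = [i for i in range(n) if f2[i] >= 0]
    G2 = [i for i in range(n) if f2[i] < 0]
    if sum(f3[i] ** 2 for i in G1) < sum(f3[i] ** 2 for i in G2):
        G1, G2 = G2, G1          # make G1 the group on which f3 is larger
    return sorted(G1, key=lambda i: f3[i]) + sorted(G2, key=lambda i: f4[i])

# Toy example with made-up eigenvector values:
order = mds_order([0.5, -0.4, 0.3, -0.6], [0.9, 0.0, -0.8, 0.1], [0.0, 0.7, 0.0, -0.5])
print(order)  # [2, 0, 3, 1]
```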


[Figure] The re-indexed second, third and fourth eigenfunctions output by the MDS algorithm applied to the 2004 U.S. House of Representatives roll call votes. Colors indicate political parties.


Our analysis suggests that if legislators are in fact

isometrically embedded in the interval I (relative to the roll

call distance), then the MDS rank will be consistent with the

order of legislators in the interval. This appears to be the

case in the data, as seen for instance in the following figure, which

shows a graph of d(li, ·) for selected legislators li. For

example, as we would predict, d(l1, ·) is an increasing

function and d(ln, ·) is decreasing. Moreover, the data seem

to be in rough agreement with the metric assumption of our

two population model, namely that the two groups are

well-separated and that the within group distance is given by

d(li, lj) = |i/n− j/n|.


[Figure] The empirical roll call derived distance function d(li, ·) for selected legislators li = 1, 90, 181, 182, 290, 401. The x-axis orders legislators according to their MDS rank.

Our voting model suggests that the MDS obtained ordering

of legislators should correspond to political ideology. To test

this, we compared the MDS results to the assessment of

legislators by Americans for Democratic Action (ADA). Each


year, ADA selects 20 votes it considers the most important

during that session, for example, the Patriot Act

reauthorization. Legislators are assigned a Liberal Quotient:

the percentage of those 20 votes on which the Representative

voted in accordance with what ADA considered to be the

liberal position. For example, a representative who voted the

liberal position on all 20 votes would receive an LQ of 100%.

Figure below shows a plot of LQ vs. the MDS derived rank.


[Figure] Comparison of the MDS derived rank for Representatives with the Liberal Quotient as defined by Americans for Democratic Action.

This figure results because this notion of proximity, although related, does not correspond directly to political ideology. The

MDS and ADA rankings complement one another in the


sense that together they facilitate identification of two

distinct, yet relatively liberal groups of Republicans. That is,

although these two groups are relatively liberal, they are

considered to be liberal for different reasons.


[Figure] Comparison of the MDS derived rank for Representatives with the National Journal's liberal score.


Practical Questions:

• Which transformations of distances work well for detecting gradients? √(1 − exp(−d(x, y))) works well in practice.

• Are most Toeplitz eigenvectors simple to approximate?

• Can we prove the eigenvectors are robust to noise? For instance, the physicists Bohigas, Bogomolny and Schmit show that for uniformly distributed points on a segment (the one dimensional Anderson model) the eigenstructure is the same.

• How do we extend this to a two dimensional (spatial) gradient?
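The transformation mentioned in the first question, applied entrywise to a distance matrix before MDS (a sketch; the helper name `transform` and the toy matrix are ours):

```python
import math

def transform(D):
    """Apply d -> sqrt(1 - exp(-d)) entrywise to a distance matrix."""
    return [[math.sqrt(1.0 - math.exp(-d)) for d in row] for row in D]

D = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 2.0],
     [4.0, 2.0, 0.0]]
T = transform(D)
print(T[0][1])  # sqrt(1 - e^{-1})
```

The map keeps zeros on the diagonal, is monotone in d, and compresses large distances toward 1, which damps the influence of far-apart pairs.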


A little immunology

T-lymphocyte cells (T-cells) originally derive from stem cells

of the bone marrow. At around the time of birth,

lymphocytes derived in this way leave the marrow and pass

to the thymus gland in the chest, where they multiply.


The lymphocytes are processed by the thymus gland, so that

between them they carry the genetic information necessary

to react with a multitude of possible antigens.


Biological Questions

• Do cancer patients show differential expression in any genes

expressed in T-cells?

• Are there any differences between naive, effector and memory T-cells?

• What are the steps involved in T-cell differentiation?


Differences between the three cell types?

• Linear Model [diagram: N, E, M, Apop in sequence]

• Parallel Model [diagram: N, E, M, Apop with parallel branches]


Genes differentially expressed

Using the variance stabilized data (vsn) and multtest with Westfall and Young's maxT, I ranked the genes by their adjusted p-value.

I made my collaborator choose a stopping point on the list:

156 significant genes.


MDS Analysis

Transform the data from continuous to discrete: the cutoff was decided through genes known to be expressed in some arrays and not in others (biological, not statistical, criteria).


87% of the variation in the first plane:

[Figure] MDS plot of the arrays in the first plane (axes Kt.ev$vectors[, 29:30][,1] and [,2]), labeled by cell type: EFF, MEM, NAI.


Topological Problems in Spaces of Phylogenetic Trees

Biology now requires the use of non-standard parameters, generalising work done on multivariate Euclidean spaces to spaces of parameters that are not embeddable in Euclidean structures. Visualisation of distances often provides much more information than the simple distributions.


Less symmetrical Phylogenies

Linguists use trees to map out the history of languages; but languages have an ancient form and a novel form, so these trees do not have symmetry between siblings.


Examples include :

• Comparing Phylogenetic trees from different DNA data.

• Comparing Bootstrap Trees with the tree computed from

the original data sets.

• Comparing Hierarchical clustering trees on melanoma

patients.

• Constructing confidence sets for non standard data.

• Testing for mixtures of trees (Mossel, Vigoda show how

important this can be).

• Trying to detect horizontal gene transfer.


• Output of many trees sampled from a Bayesian posterior

distribution on trees.

• Sets of trees built with different data (DNA tree, behavioral

trees, phenotypic trees).

• Confidence regions of trees from Bayesian posteriors or

Bootstrap resamples.

• Neighborhood explorations: how many neighbours? What

are the curvatures of the boundaries?