
Topological Data Analysis for detecting Hidden Patterns in Data
(statweb.stanford.edu/~susan/talks/AIMTopDA.pdf)

Susan Holmes

Statistics, Stanford, CA 94305.

Joint work with Persi Diaconis, Mehrdad Shahshahani and

Sharad Goel.

Thanks to Harold Widom, Gunnar Carlsson, John Chakarian,

Leonid Pekelis for discussions, and NSF grant DMS 0241246

for funding.


À la recherche du temps perdu: Gradients et Ordination

Many popular multivariate methods (Multidimensional Scaling, kernel PCA, correspondence analysis, metric MDS) are based on spectral decompositions of distances or transformed distances; they aim to detect hidden underlying structure of points in high dimensions.


A first type of dependence is a hidden gradient, placing points close to a curve in high-dimensional space. Ecologists and archeologists have long known to look for horseshoes or arches, which are symptomatic of such structure.


We take a political science example with data from the 2005 U.S. House of Representatives roll call votes. MDS and kernel PCA, in this case, output two 'horseshoes' that are characteristic of dimensionality reduction techniques.


PCA: Dimension Reduction

PCA seeks to replace the original (centered) matrix X by a matrix of lower rank. This can be solved by taking the singular value decomposition of X:

X = USV′, with U′DU = In, V′QV = Ip, and S diagonal

XX′ = US²U′, with U′DU = In and S² = Λ

PCA is a linear nonparametric multivariate method for dimension reduction.
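As an illustration (not part of the original slides), the SVD route to PCA can be sketched in a few lines; the data matrix Y here is synthetic.

```python
import numpy as np

# Hypothetical data: 20 observations of 5 variables.
rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 5))
X = Y - Y.mean(axis=0)                 # center the columns

# X = U S V': singular value decomposition of the centered matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Best rank-r approximation: keep only the r largest singular values.
r = 2
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# XX' = U S^2 U': the nonzero eigenvalues of XX' are the squared
# singular values of X.
top_eigs = np.linalg.eigvalsh(X @ X.T)[::-1][:5]
```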


Ordination: Finding Time (Le temps perdu...)

Early studies in archeology have aimed for seriation in time. Guttman, Kendall and Ter Braak have pointed out and studied the arch or horseshoe effect.

Here is a linguistic example where I dated the works of Plato according to their sentence endings, using a particular distance between the books called the chi-square distance.

As an example we take data analysed by Cox and Brandwood [?], who wanted to seriate Plato's works using the proportion of sentence endings in a given book with a given stress pattern. We propose the use of correspondence analysis on the table of frequencies of sentence endings; for a detailed analysis see Charnomordic and Holmes [?].


The first 10 profiles (as percentages) look as follows:

        Rep  Laws Crit Phil Pol  Soph Tim
UUUUU   1.1  2.4  3.3  2.5  1.7  2.8  2.4
-UUUU   1.6  3.8  2.0  2.8  2.5  3.6  3.9
U-UUU   1.7  1.9  2.0  2.1  3.1  3.4  6.0
UU-UU   1.9  2.6  1.3  2.6  2.6  2.6  1.8
UUU-U   2.1  3.0  6.7  4.0  3.3  2.4  3.4
UUUU-   2.0  3.8  4.0  4.8  2.9  2.5  3.5
--UUU   2.1  2.7  3.3  4.3  3.3  3.3  3.4
-U-UU   2.2  1.8  2.0  1.5  2.3  4.0  3.4
-UU-U   2.8  0.6  1.3  0.7  0.4  2.1  1.7
-UUU-   4.6  8.8  6.0  6.5  4.0  2.3  3.3
....... etc. (there are 32 rows in all)

The eigenvalues of the chi-square distance matrix (displayed in the scree plot, see [?]) show that two axes out of a possible 6 (the matrix is of rank 6) provide a summary of 85% of the departure from independence; this suggests that a planar representation will give a good visual summary of the data.

Eigenvalue  inertia   %      cumulative %
1           0.09170   68.96   68.96
2           0.02120   15.94   84.90
3           0.00911    6.86   91.76
4           0.00603    4.53   96.29
5           0.00276    2.07   98.36
6           0.00217    1.64  100.00
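The percentage and cumulative columns of this table can be recomputed from the eigenvalues alone; a quick sketch:

```python
# Eigenvalues from the table above; inertia percentages are each
# eigenvalue over the total, and the cumulative column is their running sum.
eig = [0.09170, 0.02120, 0.00911, 0.00603, 0.00276, 0.00217]
total = sum(eig)
pct = [100 * e / total for e in eig]
cum = [sum(pct[: i + 1]) for i in range(len(pct))]
```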


[Figure: Correspondence Analysis of Plato's Works; Axis 1: 69%, Axis 2: 16%; points labeled Rep, Laws, Crit, Phil, Pol, Soph, Tim.]

We can see from the plot that there is a seriation that, as in most cases, follows a parabola or arch [?], with Laws at one extreme being the latest work and the Republic being the earliest.


Examples from Ecology

The Boomlake plant data:

Biplot representing both species and locations

Blue circles with letters are species scores

Sampling locations are green circles with numbers.


Sample 1 is actually in the lake, and sample 12 is far away. Species are located close to the samples they occur in. If you looked carefully into the data matrix, you would find that species R and Q are strictly aquatic, while species F is a dryland plant (cribs). There is an arch effect.

Reference to site on ordination:

http://ordination.okstate.edu/CA.htm


Psychological Data

Color confusion data (Ekman, 1954):

    w434 w445 w465 w472 w490 w504 w537 w555 w584 w600 w610 w628 w651 w674
1   0.00 0.86 0.42 0.42 0.18 0.06 0.07 0.04 0.02 0.07 0.09 0.12 0.13 0.16
2   0.86 0.00 0.50 0.44 0.22 0.09 0.07 0.07 0.02 0.04 0.07 0.11 0.13 0.14
3   0.42 0.50 0.00 0.81 0.47 0.17 0.10 0.08 0.02 0.01 0.02 0.01 0.05 0.03
4   0.42 0.44 0.81 0.00 0.54 0.25 0.10 0.09 0.02 0.01 0.00 0.01 0.02 0.04
5   0.18 0.22 0.47 0.54 0.00 0.61 0.31 0.26 0.07 0.02 0.02 0.01 0.02 0.00
6   0.06 0.09 0.17 0.25 0.61 0.00 0.62 0.45 0.14 0.08 0.02 0.02 0.02 0.01
7   0.07 0.07 0.10 0.10 0.31 0.62 0.00 0.73 0.22 0.14 0.05 0.02 0.02 0.00
8   0.04 0.07 0.08 0.09 0.26 0.45 0.73 0.00 0.33 0.19 0.04 0.03 0.02 0.02
9   0.02 0.02 0.02 0.02 0.07 0.14 0.22 0.33 0.00 0.58 0.37 0.27 0.20 0.23
10  0.07 0.04 0.01 0.01 0.02 0.08 0.14 0.19 0.58 0.00 0.74 0.50 0.41 0.28
11  0.09 0.07 0.02 0.00 0.02 0.02 0.05 0.04 0.37 0.74 0.00 0.76 0.62 0.55
12  0.12 0.11 0.01 0.01 0.01 0.02 0.02 0.03 0.27 0.50 0.76 0.00 0.85 0.68
13  0.13 0.13 0.05 0.02 0.02 0.02 0.02 0.02 0.20 0.41 0.62 0.85 0.00 0.76
14  0.16 0.14 0.03 0.04 0.00 0.01 0.00 0.02 0.23 0.28 0.55 0.68 0.76 0.00


Results

[Figure: scree plot of the eigenvalues (class.col$eig) and the planar MDS configuration, cmdscale(colorc)[,1] versus cmdscale(colorc)[,2], with the 14 colors labeled 1 to 14.]


Metric Multidimensional Scaling

Schoenberg (1935)


Decomposition of Distances

If we start with original data Y in R^p that are not centered, apply the centering matrix:

X = HY, with H = (I − (1/n)11′) and 1′ = (1, 1, 1, . . . , 1)

Call B = XX′. If D^(2) is the matrix of squared distances between the rows of X in Euclidean coordinates, we can show that

−(1/2) HD^(2)H = B

We can go backwards from a matrix D to X by taking the eigendecomposition of B, in much the same way that PCA provides the best rank r approximation for data by taking the singular value decomposition of X, or the eigendecomposition of XX′.

X^(r) = US^(r)V′ with S^(r) = diag(s1, . . . , sr, 0, . . . , 0), i.e. only the r largest singular values are retained.
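A minimal numerical sketch of this recovery (synthetic data, not from the talk): double-center the squared distances to get B, eigendecompose, and read coordinates off the top eigenvectors.

```python
import numpy as np

# Hypothetical configuration: 10 centered points in R^3.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
X = X - X.mean(axis=0)

# Squared Euclidean distances between rows of X.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

# B = -1/2 H D^(2) H equals the Gram matrix XX' (Schoenberg).
n = len(X)
H = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * H @ D2 @ H

# Coordinates from the eigendecomposition of B, eigenvalues descending.
lam, U = np.linalg.eigh(B)
lam, U = lam[::-1], U[:, ::-1]
Xhat = U[:, :3] * np.sqrt(np.clip(lam[:3], 0, None))
```

Xhat agrees with X up to rotation, so it reproduces the original distances.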


Another approach: Markov Chain associated to the data

Consider data points X = {x1, . . . , xn} in a metric space (X, d).

We define a Markov chain on X that preferentially moves to nearby states via the transition kernel

K(xi, xj) = e^{−d(xi,xj)} / ∑_{k=1}^n e^{−d(xi,xk)}.

K has stationary distribution

π(xi) ∝ ∑_{k=1}^n e^{−d(xi,xk)}


and furthermore, (K,π) is reversible:

π(xi)K(xi, xj) = π(xj)K(xj, xi).

Because K is reversible, it is diagonalizable in L2(X, π) in a

real orthonormal basis of eigenfunctions f1, . . . , fn with

corresponding real eigenvalues,

1 = λ1 ≥ λ2 ≥ · · · ≥ λn > −1.

f1 ≡ 1 since K is stochastic. Having fixed an orthonormal

basis of eigenfunctions, the k-dimensional MDS is defined to

be

Γ : xi 7→ yi = (f2(xi), . . . , fk+1(xi))

We are generally interested in k ≪ n, for example, k ≤ 3.

Γ is an optimal mapping of X into Rk in the sense that it


minimizes

∑_{1≤i,j≤n} π(xi)K(xi, xj) ‖yi − yj‖²

over all Γ : X → R^k such that

1. ∑_{i=1}^n Γ^(p)(xi) Γ^(q)(xi) π(xi) = δ_{pq}, 1 ≤ p, q ≤ k

2. ∑_{i=1}^n Γ^(p)(xi) π(xi) = 0, 1 ≤ p ≤ k,

where Γ^(p)(xi) is the pth coordinate of Γ(xi) ∈ R^k. Condition 1 says that the coordinate functions of Γ are orthonormal in L²(π), and condition 2 says that they are also orthogonal to the constant functions.

Intuitively, Γ maps similar points in X (as measured via q(xi, xj) = π(xi)K(xi, xj)) to nearby points in R^k.
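A sketch of this construction (an illustration, not the talk's own code) on the uniform-grid model d(xi, xj) = |i/n − j/n| used later; symmetrizing by π is one standard way to diagonalize a reversible chain.

```python
import numpy as np

# Points x_i = i/n on a line, exponential similarity, row-normalized kernel.
n = 50
x = np.arange(n) / n
d = np.abs(x[:, None] - x[None, :])       # d(x_i, x_j) = |i/n - j/n|
S = np.exp(-d)
K = S / S.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
pi = S.sum(axis=1) / S.sum()              # stationary distribution

# Reversibility: pi_i K_ij = pi_j K_ji, so D^{1/2} K D^{-1/2} is symmetric.
Dh = np.diag(np.sqrt(pi))
Dhi = np.diag(1 / np.sqrt(pi))
A = Dh @ K @ Dhi
lam, V = np.linalg.eigh((A + A.T) / 2)    # symmetrize against round-off
lam, V = lam[::-1], V[:, ::-1]            # eigenvalues descending

F = Dhi @ V                               # eigenfunctions of K in L2(pi)
Y = F[:, 1:3]                             # 2-dimensional MDS map (skip f1 = 1)
```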


In the preceding, we started with a metric d on X and built a similarity S(xi, xj) = e^{−d(xi,xj)}, which in turn leads to a Gram matrix G. We could instead define G via alternative measures of similarity, e.g.

S(xi, xj) = sup_{xk,xl} d(xk, xl) − d(xi, xj).

More generally, we could begin with an arbitrary reversible Markov chain on X, bypassing the metric d altogether.


The Voting Data

We are going to carefully analyze the output of

multidimensional scaling applied to the 2005 U.S. House of

Representatives roll call votes. The resultant 3-dimensional

mapping of legislators shows ‘horseshoes’ that are

characteristic of a number of dimensionality reduction

techniques, including principal components analysis and

correspondence analysis.

These patterns are heuristically attributed to a latent

ordering of the data, e.g. the ranking of politicians within a

left-right spectrum.


Roll Call Data

We apply the eigendecomposition algorithm to members of

the 2005 U.S. House of Representatives with the distance

between legislators defined via roll call votes [?].

A full House consists of 435 members, and in 2005 there

were 671 roll calls. The first two roll calls were a call of the

House by States and the election of the Speaker, and so were

excluded from our analysis. Hence, the data can be arranged into a 435 × 669 matrix Y = (yij), with yij ∈ {1/2, −1/2, 0} indicating, respectively, a vote of 'yea', 'nay', or 'not voting' by Representative i on roll call j.

We further restricted our analysis to the 401 Representatives who voted on at least 90% of the roll calls (220 Republicans, 180 Democrats and 1 Independent), leading to a 401 × 669 matrix V of voting data.


The Data

    V1  V2  V3  V4  V5  V6  V7  V8  V9  V10
1   -1  -1   1  -1   0   1   1   1   1   1
2   -1  -1   1  -1   0   1   1   1   1   1
3    1   1  -1   1  -1   1   1  -1  -1  -1
4    1   1  -1   1  -1   1   1  -1  -1  -1
5    1   1  -1   1  -1   1   1  -1  -1  -1
6   -1  -1   1  -1   0   1   1   1   1   1
7   -1  -1   1  -1  -1   1   1   1   1   1
8   -1  -1   1  -1   0   1   1   1   1   1
9    1   1  -1   1  -1   1   1  -1  -1  -1
10  -1  -1   1  -1   0   1   1   0   0   0


This step removed, for example, the Speaker of the House Dennis Hastert (R-IL), who by custom votes only when his vote would be decisive, and Robert T. Matsui (D-CA), who passed away at the start of the term.

We define a distance between legislators as

d̂(li, lj) = (1/669) ∑_{k=1}^{669} |vik − vjk|.

Roughly, d̂(li, lj) is the percentage of roll calls on which legislators li and lj disagreed. This interpretation would be exact if not for the possibility of 'not voting'.
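On a toy voting matrix (invented here, votes coded 1/2, −1/2, 0 as above), this distance is a one-liner:

```python
import numpy as np

# Hypothetical votes: 3 legislators on 4 roll calls, coded 1/2, -1/2, 0.
V = np.array([
    [ 0.5,  0.5, -0.5,  0.5],
    [ 0.5, -0.5, -0.5,  0.0],
    [-0.5, -0.5,  0.5,  0.5],
])
m = V.shape[1]

# d(l_i, l_j) = (1/m) * sum_k |v_ik - v_jk| for every pair of legislators.
D = np.abs(V[:, None, :] - V[None, :, :]).sum(-1) / m
```

Legislators 1 and 3 disagree on the first three of four bills, so D[0, 2] = 3/4.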

Since we now have data points in a metric space, we can apply the MDS algorithm. The figure shows the results of a 3-dimensional MDS mapping. The most striking feature of the mapping is that the data separate into 'twin horseshoes'.

In the next figure we have added color to indicate the

political party affiliation of each Representative (blue for

Democrat, red for Republican, and green for the lone

independent–Rep. Bernie Sanders of Vermont). The output

from MDS is qualitatively similar to that obtained from other

dimensionality reduction techniques, such as principal

components analysis applied directly to the voting matrix V .

We build and analyze a model for the data in an effort to

understand and interpret these pictures. Roughly our theory

predicts that the Democrats, for example, are ordered along

the blue curve in correspondence to their political ideology,

i.e. how far they lean to the left.

We discuss connections between the theory and the data. In


particular, we explain why, in the data, legislators at the political extremes are not quite at the tips of the MDS curves but rather are positioned slightly toward the center. Briefly, this amounts to the fact that there are distinct groups of relatively liberal Republicans, which accordingly exhibit quite different voting patterns.


[Figure: 3-dimensional MDS mapping of legislators based on the 2005 U.S. House of Representatives roll call votes.]


[Figure: 3-dimensional MDS mapping of legislators based on the 2005 U.S. House of Representatives roll call votes. Color has been added to indicate the party affiliation of each representative.]


A Model for the Data

Following the standard paradigm of placing politicians within a left-right spectrum, it is natural to identify legislators li, 1 ≤ i ≤ n, with points in the interval I = [0, 1] in correspondence with their political ideologies. We define the distance between legislators to be

d(li, lj) = |li − lj|.

This assumption that legislators can be isometrically mapped into an interval is key to our analysis.

To apply MDS to the voting data, we defined a distance

between legislators via roll call votes. We now introduce a


‘cut-point model’ for voting that connects our distance d

above to the data-based roll call distance.

The Model: Each bill 1 ≤ k ≤ m on which the legislators

vote is represented as a pair

(Ck, Pk) ∈ [0, 1]× {0, 1}.

We can think of Pk as indicating whether the bill is liberal

(Pk = 0) or conservative (Pk = 1), and we can take Ck to be

the cut-point between legislators that vote ‘yea’ or ‘nay’. Let

Vik ∈ {1/2, −1/2} indicate how legislator li votes on bill k. Then, in this model,

Vik = 1/2 − Pk if li ≤ Ck, and Vik = Pk − 1/2 if li > Ck.


As described, the model has n + 2m parameters: one for each legislator and two for each bill. We reduce the number of parameters by assuming that the cut-points are independent random variables, uniform on I. Then,

P(Vik ≠ Vjk) = d(li, lj)     (1)

since legislators li and lj take opposite sides on a given bill if and only if the cut-point Ck divides them. Observe that the Pk do not affect the probability above.
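Identity (1) is easy to check by simulation; the positions li, lj and the seed below are arbitrary choices for illustration.

```python
import random

# Cut-point model: legislators at positions l_i, l_j in [0, 1]; cut-points
# C_k uniform on [0, 1]. They disagree on bill k exactly when C_k falls
# between them (the bill's polarity P_k plays no role in this probability).
random.seed(0)
li, lj, m = 0.2, 0.7, 200_000
disagree = sum(1 for _ in range(m) if li <= random.random() < lj)
phat = disagree / m   # should be close to |l_i - l_j| = 0.5
```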

Define the empirical distance between legislators li and lj by

d̂m(li, lj) = (1/m) ∑_{k=1}^m |Vik − Vjk| = (1/m) ∑_{k=1}^m 1_{Vik ≠ Vjk}.

By (1), we can estimate the distance d between legislators by the distance d̂, which is computable from the voting record. In particular,

lim_{m→∞} d̂m(li, lj) = d(li, lj) a.s.

since we assumed the cut-points are independent. More precisely, we have the following result:

Lemma. For m ≥ log(n/√ε)/ε²,

P(|d̂m(li, lj) − d(li, lj)| ≤ ε for all 1 ≤ i, j ≤ n) ≥ 1 − ε.

Proof. By the Hoeffding inequality, for fixed li and lj,

P(|d̂m(li, lj) − d(li, lj)| > ε) ≤ 2e^{−2mε²}.


Consequently,

P(⋃_{1≤i<j≤n} {|d̂m(li, lj) − d(li, lj)| > ε}) ≤ ∑_{1≤i<j≤n} P(|d̂m(li, lj) − d(li, lj)| > ε) ≤ (n choose 2) · 2e^{−2mε²} ≤ ε

for m ≥ log(n/√ε)/ε², and the result follows.

In our model we identified latent variables with points in the

interval I = [0, 1] and accordingly defined the distance

between them to be d(li, lj) = |li − lj|. This general

description seems to be reasonable in a number of

applications. We then built a simple model for the data that


facilitated empirical approximation of this distance. This

second step depends heavily on the application. In the rest

of the paper, we simply assume that the distance d can be

reasonably approximated from the data.


Analysis of the Model

In this section we analyze the MDS algorithm applied to

metric models satisfying

d(xi, xj) = |i/n− j/n|.

This corresponds to the case in which legislators are

uniformly spaced in I: li = i/n.


Similarity and Transition Matrices

Given a distance d on a state space X , there are several ways

to build a similarity S. Two standard transformations are:

1. S1(xi, xj) = e^{−d(xi,xj)}

2. S2(xi, xj) = sup_{zi,zj} d(zi, zj) − d(xi, xj)

Once we have a similarity, we can define a Gram/kernel matrix K by normalizing the rows. That is,

K(xi, xj) = S(xi, xj) / ∑_{xk} S(xi, xk).

To ease the analysis, sometimes we instead normalize the similarity matrix by the average row sum

z = (1/n) ∑_{xi} ∑_{xj} S(xi, xj).

That is, we set K(xi, xj) = S(xi, xj)/z.


Eigenvectors and Horseshoes

We find approximate eigenfunctions and eigenvalues for

models that satisfy

d(xi, xj) = |i/n− j/n|

with Gram matrices that are built with either a linear

similarity or an exponential similarity. The eigenfunctions are

found by continuizing the discrete Gram matrix, and then solving the corresponding integral equation

∫₀¹ K(x, y) f(y) dy = λ f(x).

Standard matrix perturbation theory can then be applied to


recover approximate eigenfunctions for the original, discrete

kernel.

The eigenfunctions that we derive are in agreement with

those arising from the voting data, and lend considerable

insight into our data analysis problem and also into general

features of MDS mappings.


Approximate Eigenfunctions

We now state a classical perturbation result that relates two

different notions of an approximate eigenfunction. For more

refined estimates, see Parlett [?].

Theorem. Consider an n × n symmetric matrix A with eigenvalues λ1 ≤ · · · ≤ λn. If for ε > 0,

‖Af − λf‖₂ ≤ ε

for some f, λ with ‖f‖₂ = 1, then A has an eigenvalue λk such that |λk − λ| ≤ ε.

If we further assume that s = min_{i : λi ≠ λk} |λi − λk| > ε, then A has an eigenfunction fk such that Afk = λk fk and ‖f − fk‖₂ ≤ ε/(s − ε).


Remark. The second statement of the theorem allows non-simple eigenvalues, but requires that the eigenvalues corresponding to distinct eigenspaces be well-separated.

Remark. The eigenfunction bound of the theorem is asymptotically tight in ε, as the following example illustrates. Consider the matrix

A = [ λ  0 ; 0  λ + s ]

with s > 0. For ε < s define the function

f = ( √(1 − ε²/s²), ε/s )′.

Then ‖f‖₂ = 1 and ‖Af − λf‖₂ = ε. The theorem guarantees that there is an eigenfunction fk with eigenvalue λk such that |λ − λk| ≤ ε. Since the eigenvalues of A are λ and λ + s, and since s > ε, we must have λk = λ. Let Vk = {fk : Afk = λk fk} = {ce1 : c ∈ R}, where e1 is the first standard basis vector. Then

min_{fk ∈ Vk} ‖f − fk‖₂ = ‖f − (f · e1)e1‖₂ = ε/s.

The bound of the theorem, ε/(s − ε), is only slightly larger.
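The 2 × 2 example above can be checked numerically (λ, s, ε below are arbitrary values with ε < s):

```python
import numpy as np

# The tightness example: A = diag(λ, λ+s), f = (sqrt(1 - ε²/s²), ε/s)'.
lam0, s, eps = 1.0, 0.5, 0.1
A = np.array([[lam0, 0.0], [0.0, lam0 + s]])
f = np.array([np.sqrt(1 - (eps / s) ** 2), eps / s])

# ||Af - λf||_2 should equal ε exactly.
resid = np.linalg.norm(A @ f - lam0 * f)

# Distance from f to the eigenspace {c e1} is ε/s, below the bound ε/(s-ε).
e1 = np.array([1.0, 0.0])
gap = np.linalg.norm(f - (f @ e1) * e1)
```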


Proof of the Approximate Eigenfunction Theorem

Proof. First we show that min_i |λi − λ| ≤ ε. If min_i |λi − λ| = 0 we are done; otherwise A − λI is invertible. Then,

‖f‖₂ ≤ ‖(A − λI)^{−1}‖ · ‖(A − λI)f‖₂ ≤ ε ‖(A − λI)^{−1}‖.

Since the eigenvalues of (A − λI)^{−1} are 1/(λ1 − λ), . . . , 1/(λn − λ), by symmetry

‖(A − λI)^{−1}‖ = 1 / min_i |λi − λ|.

The result now follows since ‖f‖₂ = 1.

Set λk = argmin_i |λi − λ|, and consider an orthonormal basis g1, . . . , gm of the associated eigenspace E_{λk}. Define fk to be the projection of f onto E_{λk}:

fk = ⟨f, g1⟩g1 + · · · + ⟨f, gm⟩gm.

Then fk is an eigenfunction with eigenvalue λk. Writing f = fk + (f − fk) we have

(A − λI)f = (A − λI)fk + (A − λI)(f − fk) = (λk − λ)fk + (A − λI)(f − fk).

Since f − fk ∈ E⊥_{λk}, by symmetry we have

⟨fk, A(f − fk)⟩ = ⟨Afk, f − fk⟩ = ⟨λk fk, f − fk⟩ = 0.

Consequently, ⟨fk, (A − λI)(f − fk)⟩ = 0 and by Pythagoras

‖Af − λf‖₂² = (λk − λ)² ‖fk‖₂² + ‖(A − λI)(f − fk)‖₂².

In particular, ε ≥ ‖Af − λf‖₂ ≥ ‖(A − λI)(f − fk)‖₂. For λi ≠ λk, |λi − λ| ≥ s − ε. The result now follows since for h ∈ E⊥_{λk},

‖(A − λI)h‖₂ ≥ (s − ε) ‖h‖₂.


Centering Kernel Matrices

If our kernel K is normalized so that it has row sums 1,

K1n = 1n,

then 1n is an eigenvector of K with eigenvalue 1.

As a consequence, if we recenter K by applying the centering matrix H = I − (1/n)11′, then for any eigenvector v different from 1n,

KHv = Kv − (1/n)K1n1′nv = λv

and also HKHv = λHv = λv.

So we will not bother to recenter the K matrix.


Linear Similarity

When we make a continuous version of the discrete Kernel

matrix Kn, we get the continuous kernel

K(x, y) = (3/2)[1 − |x − y|].

Once we guess that the solutions to the corresponding

integral equation are trigonometric, verifying this is

straightforward. We start with a simple integral computation.


Lemma. For a ≠ 0,

∫₀¹ cos(ax + b)[1 − |c − x|] dx = (2/a²) cos(ac + b) − (1/a²)[a sin b − ac sin b − ac sin(a + b) + cos b + cos(a + b)].

In particular:

1. For odd integers k,

∫₀¹ sin(kπ(x − 1/2))[1 − |c − x|] dx = (2/(kπ)²) sin(kπ(c − 1/2)).

2. For solutions to (a/2) tan(a/2) = 1,

∫₀¹ cos(a(x − 1/2))[1 − |c − x|] dx = (2/a²) cos(a(c − 1/2)).


Proof. The result follows from a straightforward calculation. Set

fc(x) = cos(ax + b)[1 − |c − x|].

Then

∫₀¹ fc(x) dx = (1 − c) ∫₀ᶜ cos(ax + b) dx + ∫₀ᶜ x cos(ax + b) dx + (1 + c) ∫ᶜ¹ cos(ax + b) dx − ∫ᶜ¹ x cos(ax + b) dx.

Integration by parts shows that

∫ x cos(ax + b) dx = (x/a) sin(ax + b) + (1/a²) cos(ax + b).

Substituting into the above, we have

∫₀¹ fc(x) dx = (1/a²)[a(1 − c) sin(ac + b) − a(1 − c) sin b + a(1 + c) sin(a + b) − a(1 + c) sin(ac + b) + ac sin(ac + b) + cos(ac + b) − cos b − a sin(a + b) − cos(a + b) + ac sin(ac + b) + cos(ac + b)].

At a = kπ and b = 0, for k an odd integer,

a sin b − ac sin b − ac sin(a + b) + cos b + cos(a + b) = 0,

and so

∫₀¹ cos(kπx)[1 − |c − x|] dx = (2/(kπ)²) cos(kπc).

Since for odd k

sin(kπ(x − 1/2)) = cos(kπx − π(k + 1)/2) = (−1)^{(k+1)/2} cos(kπx),

the first part of the lemma follows. At b = −a/2, where a is a solution to (a/2) tan(a/2) = 1,

a sin b − ac sin b − ac sin(a + b) + cos b + cos(a + b) = −a sin(a/2) + 2 cos(a/2) = 0.

Consequently,

∫₀¹ cos(ax − a/2)[1 − |c − x|] dx = (2/a²) cos(ac − a/2)

for a a solution to (a/2) tan(a/2) = 1.


The solutions of (a/2) tan(a/2) = 1 occur at approximately a = 2kπ for integers k. More precisely, we have the following result.

Lemma. The positive solutions of (a/2) tan(a/2) = 1 lie in the set

(0, π) ∪ ⋃_{k=1}^∞ (2kπ, 2kπ + 2/(kπ)),

with exactly one solution per interval. Furthermore, a is a solution if and only if −a is a solution.


Proof. Let f(θ) = (θ/2) tan(θ/2). Then f is an even function, so a is a solution to f(θ) = 1 if and only if −a is a solution. Since f′(θ) = (1/2) tan(θ/2) + (θ/4) sec²(θ/2), f(θ) is non-negative and increasing in the first and second quadrants, and furthermore

f(2kπ) = 0 < 1 < lim_{θ→(2k+1)π⁻} f(θ) = +∞.

The third and fourth quadrants have no solutions since f(θ) ≤ 0 in those regions. This shows that the solutions to f(θ) = 1 lie in the intervals

⋃_{k=0}^{∞} (2kπ, 2kπ + π)

with exactly one solution per interval. Recall the power series


expansion of tan θ for |θ| < π/2 is

tan θ = θ + θ³/3 + 2θ⁵/15 + 17θ⁷/315 + . . . .

In particular, for 0 ≤ θ < π/2, tan θ ≥ θ. Finally, for integers k ≥ 1,

f(2kπ + 2/(kπ)) = (kπ + 1/(kπ)) tan(kπ + 1/(kπ))
= (kπ + 1/(kπ)) tan(1/(kπ))
≥ (kπ + 1/(kπ)) (1/(kπ))
> 1,

which gives the result.


Remark. The first few positive solutions of (a/2) tan(a/2) = 1 are

1. a = 1.72066717803876 . . .

2. a = 6.85123691896346 . . .

3. a = 12.87459635834389 . . .

4. a = 19.05866881072393 . . .
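These roots are easy to reproduce from the Lemma's bracketing intervals. A minimal bisection sketch (the helper name `root_half_tan` is ours):

```python
import math

def root_half_tan(lo, hi, tol=1e-13):
    """Bisection for (a/2)*tan(a/2) = 1 on an interval where the sign changes."""
    f = lambda t: (t / 2) * math.tan(t / 2) - 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if f(lo) * f(mid) <= 0 else (mid, hi)
    return (lo + hi) / 2

# One root in (0, pi), and one in each interval (2k*pi, 2k*pi + 2/(k*pi)).
roots = [root_half_tan(1e-9, math.pi - 1e-9)]
roots += [root_half_tan(2 * k * math.pi + 1e-12, 2 * k * math.pi + 2 / (k * math.pi))
          for k in range(1, 4)]
print([round(r, 11) for r in roots])
```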


Lemma. For 1 ≤ i, j ≤ n, let

Kn(xi, xj) = (3/(2n)) (1 − |i − j|/n).

Set fn,a(xi) = cos(a(i/n − 1/2)) where a is a positive solution to (a/2) tan(a/2) = 1, and set gn,k(xi) = sin(kπ(i/n − 1/2)) for k ≥ 1 an odd integer. Then

|Kn fn,a(xi) − (3/a²) fn,a(xi)| ≤ (a + 1)/n

and

|Kn gn,k(xi) − (3/(kπ)²) gn,k(xi)| ≤ (kπ + 1)/n.

That is, fn,a and gn,k are approximate eigenfunctions of Kn with approximate eigenvalues proportional to their squared periods.


Proof. Once we guess that f and g are approximate eigenfunctions of Kn, the proof of this fact follows from the integral computation in the previous Lemma. We have

Kn fn,a(xi) = (3/(2n)) Σ_{j=1}^{n} cos(a(j/n − 1/2)) [1 − |i/n − j/n|]

= (3/2) ∫₀¹ cos(a(x − 1/2)) [1 − |i/n − x|] dx + (3/2) Rn

= (3/a²) fn,a(xi) + (3/2) Rn by the Lemma,

where the error term satisfies

|Rn| ≤ M/(2n) for M ≥ sup_{0≤x≤1} |(d/dx) cos(a(x − 1/2)) [1 − |i/n − x|]|

by the standard right-hand rule error bound. In particular, we can take M = a + 1 independent of i, from which the result for fn,a follows. The case of gn,k is analogous.
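The bound can be checked directly with a pure-Python matrix-vector product (a sketch; the sizes and the choice of a, the first root quoted in the Remark, are ours):

```python
import math

# Check the eigenfunction bound |K_n f_{n,a} - (3/a^2) f_{n,a}| <= (a+1)/n
# for a = 1.72066717803876, the first root of (a/2) tan(a/2) = 1.
n = 400
a = 1.72066717803876
f = [math.cos(a * (i / n - 0.5)) for i in range(1, n + 1)]
Kf = [sum(3 / (2 * n) * (1 - abs(i - j) / n) * f[j - 1] for j in range(1, n + 1))
      for i in range(1, n + 1)]
residual = max(abs(Kf[i] - 3 / a ** 2 * f[i]) for i in range(n))
print(residual, (a + 1) / n)  # the residual should respect the bound
```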

Lemma. For 1 ≤ i, j ≤ n set

Kn(xi, xj) = (3/(2n)) (1 − |i − j|/n)

and let λ1, . . . , λn be the eigenvalues of Kn.

1. For positive solutions a of (a/2) tan(a/2) = 1,

min_{1≤i≤n} |λi − 3/a²| ≤ 2(a + 1)/√n.

2. For odd integers k ≥ 1,

min_{1≤i≤n} |λi − 3/(kπ)²| ≤ (kπ + 1)/√n.


Remark. By the Remark above, the first few values of 3/a² for solutions to (a/2) tan(a/2) = 1 are

1. 1.01327541515878 . . .

2. 0.06391212873818 . . .

3. 0.01809897627265 . . .

4. 0.00825916473010 . . .

and the first few values of 3/(kπ)² for k ≥ 1 an odd integer are

1. 0.30396355092701 . . .

2. 0.03377372788078 . . .

3. 0.01215854203708 . . .

4. 0.00620333777402 . . .


Exponential Transformation of Similarity

The case of exponential similarity is analogous to that of linear similarity. Continuizing the discrete Gram matrix Kn, we get the kernel

K(x, y) = (e/2) e^{−|x−y|}.

Once again, we find trigonometric solutions to Kf = λf.

Lemma. For constants a, c ∈ ℝ,

∫₀¹ e^{−|x−c|} cos[a(x − 1/2)] dx = (2 cos[a(c − 1/2)])/(1 + a²) + ((e^{−c} + e^{c−1}) (a sin(a/2) − cos(a/2)))/(1 + a²)


and

∫₀¹ e^{−|x−c|} sin[a(x − 1/2)] dx = (2 sin[a(c − 1/2)])/(1 + a²) + ((e^{−c} − e^{c−1}) (a cos(a/2) + sin(a/2)))/(1 + a²).

In particular,

1. For a such that a tan(a/2) = 1,

∫₀¹ e^{−|x−c|} cos[a(x − 1/2)] dx = (2 cos[a(c − 1/2)])/(1 + a²).

2. For a such that a cot(a/2) = −1,

∫₀¹ e^{−|x−c|} sin[a(x − 1/2)] dx = (2 sin[a(c − 1/2)])/(1 + a²).


Proof. The lemma follows from a straightforward integration. First split the integral into two pieces:

∫₀¹ e^{−|x−c|} cos[a(x − 1/2)] dx = ∫₀^c e^{x−c} cos[a(x − 1/2)] dx + ∫_c^1 e^{c−x} cos[a(x − 1/2)] dx.

By integration by parts applied twice,

∫ e^{x−c} cos[a(x − 1/2)] dx = (a e^{x−c} sin(a(x − 1/2)) + e^{x−c} cos(a(x − 1/2)))/(1 + a²)

and

∫ e^{c−x} cos[a(x − 1/2)] dx = (a e^{c−x} sin(a(x − 1/2)) − e^{c−x} cos(a(x − 1/2)))/(1 + a²).

Evaluating these expressions at the appropriate limits of integration gives the first statement of the lemma. The computation of ∫₀¹ e^{−|x−c|} sin[a(x − 1/2)] dx is analogous.
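The special case in statement 1 can likewise be checked by quadrature (a midpoint-rule sketch; the values of a, c and n are ours, with a taken from the Remark below):

```python
import math

# Statement 1: for a with a*tan(a/2) = 1, the integral of e^{-|x-c|} cos[a(x-1/2)]
# over [0, 1] equals 2 cos[a(c-1/2)]/(1 + a^2).  Midpoint-rule check:
a = 1.30654237418881   # first positive root of a*tan(a/2) = 1
c = 0.3
n = 20000
h = 1.0 / n
mids = [(j + 0.5) * h for j in range(n)]
lhs = h * sum(math.exp(-abs(x - c)) * math.cos(a * (x - 0.5)) for x in mids)
rhs = 2 * math.cos(a * (c - 0.5)) / (1 + a * a)
print(abs(lhs - rhs))  # should be tiny
```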


The solutions of a tan(a/2) = 1 are approximately 2kπ for integers k, and the solutions of a cot(a/2) = −1 are approximately (2k + 1)π.

Lemma.

1. The positive solutions of a tan(a/2) = 1 lie in the set

(0, π) ∪ ⋃_{k=1}^{∞} (2kπ, 2kπ + 1/(kπ))

with exactly one solution per interval. Furthermore, a is a solution if and only if −a is a solution.

2. The positive solutions of a cot(a/2) = −1 lie in the set

⋃_{k=0}^{∞} ((2k + 1)π, (2k + 1)π + 1/(kπ + π/2))

with exactly one solution per interval. Furthermore, a is a solution if and only if −a is a solution.

Remark. The first few positive solutions of a tan(a/2) = 1 are

1. a = 1.30654237418881 . . .

2. a = 6.58462004256417 . . .

3. a = 12.72324078413133 . . .

4. a = 18.95497141084159 . . .

and the first few positive solutions of a cot(a/2) = −1 are

1. a = 3.67319440630425 . . .

2. a = 9.63168463569187 . . .

3. a = 15.83410536933241 . . .

4. a = 22.08165963594259 . . .

Lemma. For 1 ≤ i, j ≤ n, let

Kn(xi, xj) = (e/(2n)) e^{−|i−j|/n}.

Set fn,a(xi) = cos(a(i/n − 1/2)) where a is a positive solution to a tan(a/2) = 1, and set gn,a(xi) = sin(a(i/n − 1/2)) where a is a positive solution to a cot(a/2) = −1. Then

|Kn fn,a(xi) − (e/(1 + a²)) fn,a(xi)| ≤ 2(a + 1)/n

|Kn gn,a(xi) − (e/(1 + a²)) gn,a(xi)| ≤ 2(a + 1)/n.

That is, fn,a and gn,a are approximate eigenfunctions of Kn.


Lemma. For 1 ≤ i, j ≤ n set

Kn(xi, xj) = (e/(2n)) e^{−|i−j|/n}

and let λ1, . . . , λn be the eigenvalues of Kn.

1. For positive solutions a of a tan(a/2) = 1,

min_{1≤i≤n} |λi − e/(1 + a²)| ≤ 4(a + 1)/√n.

2. For positive solutions a of a cot(a/2) = −1,

min_{1≤i≤n} |λi − e/(1 + a²)| ≤ 4(a + 1)/√n.

Remark. The first few values of e/(1 + a²) for solutions to a tan(a/2) = 1 are

1. 1.00414799895293 . . .

2. 0.06128160783626 . . .

3. 0.01668877420197 . . .

4. 0.00754468546867 . . .

The first few values of e/(1 + a²) for solutions to a cot(a/2) = −1 are

1. 0.18756657740212 . . .

2. 0.02898902316936 . . .

3. 0.01079887885138 . . .

4. 0.00556341289490 . . .


Horseshoes and Twin Horseshoes

The 2-dimensional mapping is built out of the second and third eigenfunctions of the Gram matrix. Above we computed several approximate eigenfunctions and eigenvalues for the Gram matrix arising from the voting model. The linear and exponential similarity cases are analogous, and so we only consider the latter here. In this case, we have the approximate eigenfunctions

1. fn,1(xi) = cos(1.3065(i/n − 1/2)) with eigenvalue λ ≈ 1.004

2. fn,2(xi) = sin(3.6732(i/n − 1/2)) with eigenvalue λ ≈ 0.1876

3. fn,3(xi) = cos(6.5846(i/n − 1/2)) with eigenvalue λ ≈ 0.06128.
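These eigenvalues are simply e/(1 + a²) evaluated at the roots quoted in the earlier Remarks:

```python
import math

# lambda = e/(1 + a^2) at the first roots of a*tan(a/2) = 1 and a*cot(a/2) = -1.
a_values = [1.30654237418881, 3.67319440630425, 6.58462004256417]
lams = [math.e / (1 + a * a) for a in a_values]
print([round(l, 5) for l in lams])
```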


[Figure] Approximate eigenfunctions f1, f2 and f3.


[Figure] A horseshoe that results from plotting Λ : xi ↦ (f2(xi), f3(xi)).

In particular, from Λ it is possible to deduce the relative order of the representatives in the interval I. Since −f2 is also an eigenfunction, it is not in general possible to determine the absolute order knowing only that Λ comes from the eigenfunctions.

You need a crib!


Voting Data

With the voting data, we see not one, but two horseshoes.

To see how this can happen, consider the two population

state space X = {x1, . . . , xn1, y1, . . . , yn2} with distance

d(xi, xj) = |i/n1 − j/n1|, d(yi, yj) = |i/n2 − j/n2| and

d(xi, yj) = +∞. This leads to the partitioned Gram matrix

Kn1+n2 = [ Kn1   0
            0   Kn2 ].

The approximate eigenfunctions and eigenvalues that we

found above for the single population model can now be

used to build higher dimensional eigenspaces for the two

population model. In particular, we have the following


approximate eigenspaces:

Eigenspace with eigenvalue λ ≈ 1.004, containing the orthogonal functions

gn,1(xi) = √((n1 + n2)/n1) fn1,1(xi) · 1{1 ≤ i ≤ n1} + √((n1 + n2)/n2) fn2,1(xi − n1) · 1{n1 < i ≤ n1 + n2}

gn,2(xi) = √((n1 + n2)/n1) fn1,1(xi) · 1{1 ≤ i ≤ n1} − √((n1 + n2)/n2) fn2,1(xi − n1) · 1{n1 < i ≤ n1 + n2}

Eigenspace with eigenvalue λ ≈ 0.1876, containing the orthogonal functions

gn,3(xi) = a √((n1 + n2)/n1) fn1,2(xi) · 1{1 ≤ i ≤ n1} + √((n1 + n2)/n2) fn2,2(xi − n1) · 1{n1 < i ≤ n1 + n2}

gn,4(xi) = √((n1 + n2)/n1) fn1,2(xi) · 1{1 ≤ i ≤ n1} − a √((n1 + n2)/n2) fn2,2(xi − n1) · 1{n1 < i ≤ n1 + n2}

These functions are graphed for the case n1 = n2 and a = 1/5. Moreover, plotting the 3-dimensional mapping Λ : xi ↦ (g2(xi), g3(xi), g4(xi)) results in twin horseshoes.
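Because the two-population Gram matrix is block diagonal, each block acts on its own piece of a stacked vector. A small pure-Python sketch (sizes n1, n2 chosen for illustration) checks that stacking the within-block cosine eigenfunctions still gives an approximate eigenvector with the same eigenvalue:

```python
import math

a = 1.30654237418881           # first positive root of a*tan(a/2) = 1
lam = math.e / (1 + a * a)     # approximate eigenvalue, ~1.004

def apply_block(n, f):
    """Apply K_n(x_i, x_j) = (e/(2n)) e^{-|i-j|/n} to a vector f of length n."""
    return [sum(math.e / (2 * n) * math.exp(-abs(i - j) / n) * f[j - 1]
                for j in range(1, n + 1)) for i in range(1, n + 1)]

n1, n2 = 150, 100
f1 = [math.cos(a * (i / n1 - 0.5)) for i in range(1, n1 + 1)]
f2 = [math.cos(a * (i / n2 - 0.5)) for i in range(1, n2 + 1)]
# The block-diagonal K acts on the stacked vector (f1, f2) block by block,
# so the stacked vector is an approximate eigenvector with eigenvalue lam.
g = f1 + f2
Kg = apply_block(n1, f1) + apply_block(n2, f2)
residual = max(abs(Kg[i] - lam * g[i]) for i in range(n1 + n2))
print(residual, 2 * (a + 1) / min(n1, n2))
```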


[Figure] Approximate eigenfunctions g1, g2, g3 and g4 for the Gram matrix arising from the two population model.


[Figure] Twin horseshoes that result from plotting Λ : xi ↦ (g2(xi), g3(xi), g4(xi)).

The approximate eigenfunctions derived above are stable to noise; numerically, this is the case, as seen below.


The figure below was generated by adding normal N(0, 1/5) noise to the Gram matrix K200 before normalizing by the average row sum. The specific form of the noise does not noticeably affect the results.

[Figure] Approximate eigenfunctions of the noise-perturbed Gram matrix K200.


Connecting the Model to the Data

When we apply eigendecomposition to the voting data, the first few eigenvalues are:

1
0.17709857573272 . . .
0.01037622989886 . . .
0.00831940284881 . . .
0.00484075498479 . . .
0.00344207632723 . . .
0.00266158512355 . . .
0.00248175112290 . . .


[Figure] The re-indexed second, third and fourth eigenfunctions output by the MDS algorithm applied to the 2005 U.S. House of Representatives roll call votes. Colors indicate political parties.


Since legislators are not a priori ordered, the eigenfunctions

are difficult to interpret. However, our model suggests the

following ordering: Split the legislators into two groups G1

and G2 based on the sign of f2(xi); then the norm of f3 is

larger on one group, say G1, so we sort G1 based on

increasing values of f3, and similarly, sort G2 via f4.
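A minimal sketch of this ordering rule (the function name `mds_order` and the toy eigenvector values are ours):

```python
def mds_order(f2, f3, f4):
    """Split on the sign of f2; sort the group where f3 has larger norm by f3,
    and the other group by f4 (a sketch of the ordering suggested above)."""
    n = len(f2)
    G1 = [i for i in range(n) if f2[i] >= 0]
    G2 = [i for i in range(n) if f2[i] < 0]
    if sum(f3[i] ** 2 for i in G1) < sum(f3[i] ** 2 for i in G2):
        G1, G2 = G2, G1          # make G1 the group on which f3 is larger
    return sorted(G1, key=lambda i: f3[i]) + sorted(G2, key=lambda i: f4[i])

# Toy example with made-up eigenvector values:
order = mds_order([0.5, -0.4, 0.3, -0.6], [0.9, 0.0, -0.8, 0.1], [0.0, 0.7, 0.0, -0.5])
print(order)  # [2, 0, 3, 1]
```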


[Figure] The re-indexed second, third and fourth eigenfunctions output by the MDS algorithm applied to the 2004 U.S. House of Representatives roll call votes. Colors indicate political parties.


Our analysis suggests that if legislators are in fact

isometrically embedded in the interval I (relative to the roll

call distance), then the MDS rank will be consistent with the

order of legislators in the interval. This appears to be the

case in the data, as seen for instance in the following figure, which

shows a graph of d(li, ·) for selected legislators li. For

example, as we would predict, d(l1, ·) is an increasing

function and d(ln, ·) is decreasing. Moreover, the data seem

to be in rough agreement with the metric assumption of our

two population model, namely that the two groups are

well-separated and that the within group distance is given by

d(li, lj) = |i/n− j/n|.


[Figure] The empirical roll call derived distance function d(li, ·) for selected legislators li = 1, 90, 181, 182, 290, 401. The x-axis orders legislators according to their MDS rank.

Our voting model suggests that the MDS obtained ordering

of legislators should correspond to political ideology. To test

this, we compared the MDS results to the assessment of

legislators by Americans for Democratic Action (ADA). Each


year, ADA selects 20 votes it considers the most important

during that session, for example, the Patriot Act

reauthorization. Legislators are assigned a Liberal Quotient:

the percentage of those 20 votes on which the Representative

voted in accordance with what ADA considered to be the

liberal position. For example, a representative who voted the

liberal position on all 20 votes would receive an LQ of 100%.

Figure below shows a plot of LQ vs. the MDS derived rank.


[Figure] Comparison of the MDS derived rank for Representatives with the Liberal Quotient as defined by Americans for Democratic Action.

This figure results because this notion of proximity, although related, does not correspond directly to political ideology. The

MDS and ADA rankings complement one another in the


sense that together they facilitate identification of two

distinct, yet relatively liberal groups of Republicans. That is,

although these two groups are relatively liberal, they are

considered to be liberal for different reasons.


[Figure] Comparison of the MDS derived rank for Representatives with the National Journal's liberal score.


Practical Questions:

• Which transformations of distances work well for detecting gradients? √(1 − exp(−d(x, y))) works well in practice.

• Are most Toeplitz eigenvectors simple to approximate?

• Can we prove the eigenvectors are robust to noise? For instance, the physicists Bohigas, Bogomolny and Schmit show that for uniformly distributed points on a segment (the one dimensional Anderson model) the eigenstructure is the same.

• How do we extend this to a two dimensional (spatial) gradient?
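The transformation mentioned in the first question, applied entrywise to a distance matrix before MDS (a sketch; the helper name `transform` and the toy matrix are ours):

```python
import math

def transform(D):
    """Apply d -> sqrt(1 - exp(-d)) entrywise to a distance matrix."""
    return [[math.sqrt(1.0 - math.exp(-d)) for d in row] for row in D]

D = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 2.0],
     [4.0, 2.0, 0.0]]
T = transform(D)
print(T[0][1])  # sqrt(1 - e^{-1})
```

The map keeps zeros on the diagonal, is monotone in d, and compresses large distances toward 1, which damps the influence of far-apart pairs.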


A little immunology

T-lymphocyte cells (T-cells) originally derive from stem cells

of the bone marrow. At around the time of birth,

lymphocytes derived in this way leave the marrow and pass

to the thymus gland in the chest, where they multiply.


The lymphocytes are processed by the thymus gland, so that

between them they carry the genetic information necessary

to react with a multitude of possible antigens.


Biological Questions

• Do cancer patients show differential expression in any genes

expressed in T-cells?

• Are there any differences between naive, effector and memory T-cells?

• What are the steps involved in T-cell differentiation?


Differences between the three cell types?

• Linear Model [diagram: N, E, M, Apop in sequence]

• Parallel Model [diagram: N, E, M, Apop with parallel branches]


Genes differentially expressed

Using the variance stabilized data (vsn) and multtest with Westfall and Young's maxT, I ranked the genes by their adjusted p-value.

I made my collaborator choose a stopping point on the list:

156 significant genes.


MDS Analysis

Transform the data from continuous to discrete: the cutoff was decided through genes known to be expressed in some arrays and not in others (biological, not statistical, criteria).


87% of the variation in the first plane:

[Figure] MDS plot of the arrays in the first plane (axes Kt.ev$vectors[, 29:30][,1] and [,2]), labeled by cell type: EFF, MEM, NAI.


Topological Problems in Spaces of Phylogenetic Trees

Biology now requires the use of non-standard parameters, generalising work done on multivariate Euclidean spaces to spaces of parameters that are not embeddable in Euclidean structures. Visualisation of distances often provides much more information than the simple distributions.


Less symmetrical Phylogenies

Linguists use trees to map out the history of languages; but languages have an ancient form and a novel form, so these trees do not have symmetry between siblings.


Examples include :

• Comparing Phylogenetic trees from different DNA data.

• Comparing Bootstrap Trees with the tree computed from

the original data sets.

• Comparing Hierarchical clustering trees on melanoma

patients.

• Constructing confidence sets for non standard data.

• Testing for mixtures of trees (Mossel, Vigoda show how

important this can be).

• Trying to detect horizontal gene transfer.


• Output of many trees sampled from a Bayesian posterior

distribution on trees.

• Sets of trees built with different data (DNA tree, behavioral

trees, phenotypic trees).

• Confidence regions of trees from Bayesian posteriors or

Bootstrap resamples.

• Neighborhood explorations: how many neighbours? What

are the curvatures of the boundaries?