
Statistical techniques in topological data analysis

Andrew J. Blumberg ([email protected])

June 17th, 2014

Overview

Goal of talk

Give an overview of work on integrating statistical methodology with topological data analysis (persistent homology).

This integration is essential to using TDA for data analysis:

Provides ways to understand what the results of TDA mean.

Provides methodology for handling noise.

Provides approaches to taming computational burden of TDA.

Particularly salient when considering “big data” applications.


Recollections about TDA methodology

Pipeline

Finite metric space → Filtered space → barcode

$(X, \partial_X) \mapsto \{Z_k\} \mapsto B$.

Question

What does the barcode mean?



Reflections on what those pictures tell us

Data set might not come from an object with obvious geometric structure.

Data might have meaningful features at varying scales.

Output of TDA might be more useful as a signature or descriptor than something to explicitly extract feature information from.

Even small amounts of maliciously placed corruption can change results substantially.




















Refinement of TDA pipeline to integrate with statistics

Incorporate sampling: start with a metric measure space $(X, \partial_X, \mu_X)$. Example: Riemannian manifold.

Refined pipeline

mm-space → finite metric space → filtered complex → barcode

$(X, \partial_X, \mu_X) \mapsto (X', \partial_{X'}) \mapsto \{Z_k\} \mapsto B$

where $X' \subset X$.

This perspective leads to serious consideration of distributions of barcodes $\{B_k\}$, associated to blocks of samples $\{X'_k\}$.





Barcode space

Barcode space is a metric space:

Definition

Bottleneck distance between barcodes $B_1 = \{I_\alpha\}$ and $B_2 = \{J_\beta\}$ is

$$d_B(B_1, B_2) = \inf_{C} \sup_{(I,J) \in C} d_\infty(I, J),$$

where $C$ varies over all matchings between $B_1$ and $B_2$.

Roughly speaking, $d_B(B_1, B_2) \le \epsilon$ means that there is a matching in which the discrepancy between matched bars is $< \epsilon$ and all unmatched bars have length $< \epsilon$.

(There are also a variety of other metrics.)
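
The definition can be sketched in code for small barcodes. This is a minimal, unoptimized illustration, not a method from the talk: bars are assumed to be finite (birth, death) pairs, d∞ is taken to be the largest difference between corresponding endpoints, and an unmatched bar is assumed to cost half its length (a common convention, under which the unmatched-length bound in the rough statement above becomes 2ε). All function names are hypothetical.

```python
def _linf(p, q):
    # d_infinity between two bars viewed as (birth, death) points.
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def _diag_cost(p):
    # Assumed cost of leaving a bar unmatched: half its length.
    return (p[1] - p[0]) / 2.0


def _cost_matrix(b1, b2):
    # Rows: bars of b1, then len(b2) "unmatched" slots.
    # Columns: bars of b2, then len(b1) "unmatched" slots.
    n, m = len(b1), len(b2)
    size = n + m
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                cost[i][j] = _linf(b1[i], b2[j])
            elif i < n:
                cost[i][j] = _diag_cost(b1[i])
            elif j < m:
                cost[i][j] = _diag_cost(b2[j])
    return cost


def _has_perfect_matching(cost, eps):
    # Augmenting-path bipartite matching using only edges of cost <= eps.
    size = len(cost)
    match_of_col = [-1] * size

    def try_row(i, seen):
        for j in range(size):
            if cost[i][j] <= eps and not seen[j]:
                seen[j] = True
                if match_of_col[j] == -1 or try_row(match_of_col[j], seen):
                    match_of_col[j] = i
                    return True
        return False

    return all(try_row(i, [False] * size) for i in range(size))


def bottleneck_distance(b1, b2):
    # Smallest threshold at which a perfect matching exists,
    # found by binary search over the distinct candidate costs.
    cost = _cost_matrix(b1, b2)
    if not cost:
        return 0.0
    candidates = sorted({c for row in cost for c in row})
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if _has_perfect_matching(cost, candidates[mid]):
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]


print(bottleneck_distance([(0, 4), (1, 1.5)], [(0, 5)]))
```

In the usage line, the long bars are matched to each other and the short bar (1, 1.5) is left unmatched. Production libraries use much faster geometric matching; this sketch is only meant to make the inf/sup structure concrete.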


Properties of barcode space

What kind of metric space is barcode space?

Good news: Barcode space is amenable to probability theory. (Formally, a suitable subset is a complete metric space with a countable dense subset. [Mileyko, Mukherjee, Harer])

This in particular means that the standard metrics on distributions that metrize weak convergence are available.

Bad news: Barcode space is positively curved.

This in particular means that there are not unique shortest paths between points and so centroids are hard to define.






























Summarizing distributions of barcodes

Question

How do you summarize a distribution of barcodes? (What are the moments?)

Two approaches:

1 Although centroids don’t make sense, can define Fréchet mean. Unfortunately, it is very hard to compute, sometimes has counterintuitive properties, and is not stable. [Turner, Mileyko, Mukherjee, Harer.]

2 Produce some kind of distribution on R (or at least something nicer than barcode space) and then use traditional summary statistics for that.
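
For reference, the Fréchet mean mentioned in the first approach is usually defined as a minimizer of expected squared distance. The notation below is an assumption of this note rather than something taken from the talk: $\mu$ is a distribution on barcode space and $d$ is whichever barcode metric has been chosen.

$$\operatorname{Fr}(\mu) = \operatorname*{arg\,min}_{B} \int d(B, B')^2 \, d\mu(B')$$

A minimizer need not be unique, which is one way the counterintuitive behavior noted above can arise.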




Projections to distributions on more tractable spaces

A reasonable map from barcode space to either R or some kind of function space will allow us to turn distributions of barcodes into more tractable objects.

1 Many possible maps:

Length of kth-longest bar

Ratio of endpoints of kth-longest bar

Difference between size of kth and (k+1)th bar, or a vector consisting of these values for all k.

2 “Persistence landscapes”: idea is to define $\lambda_k(t)$ to be the largest half-length of an interval centered at $t$ that is contained in at least $k$ bars of the barcode. [Bubenik]

Taking a robust statistic (e.g., the median) of such a distribution immediately provides a robust statistic of the barcode distribution.
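
Two of these projections can be sketched as follows. This is an illustrative sketch only: barcodes are assumed to be lists of finite (birth, death) pairs, the function names are hypothetical, and the landscape is evaluated pointwise as the kth-largest "tent" value min(t − birth, death − t), which agrees with the half-length description above.

```python
from statistics import median


def kth_longest_bar(barcode, k):
    # Length of the k-th longest bar (k = 1 is the longest); 0 if there are fewer bars.
    lengths = sorted((death - birth for birth, death in barcode), reverse=True)
    return lengths[k - 1] if k <= len(lengths) else 0.0


def landscape(barcode, k, t):
    # lambda_k(t): k-th largest value of max(0, min(t - birth, death - t)).
    tents = sorted(
        (max(0.0, min(t - birth, death - t)) for birth, death in barcode),
        reverse=True,
    )
    return tents[k - 1] if k <= len(tents) else 0.0


# A toy "distribution" of barcodes (made-up values) and a robust summary of it.
barcodes = [[(0, 4), (1, 3)], [(0, 1)], [(0, 10)]]
longest = [kth_longest_bar(b, 1) for b in barcodes]
print(median(longest))
print(landscape(barcodes[0], 1, 2), landscape(barcodes[0], 2, 2))
```

The median of the longest-bar lengths is a real-valued statistic, so ordinary tools for distributions on R apply to it directly.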






























Sampling at a fixed finite rate as an invariant

Claim

The distribution of barcodes produced by the collection of all samples of a fixed size n is a good basic invariant. [Blumberg, Gal, Mandell, Pancia]

1 Easy to approximate in practice: take the empirical approximation by repeated sampling, or use resampling if getting new samples is problematic.

2 Efficient: use samples that are as large as the computation will bear, very parallelizable.

3 Robust to noise: low-density noise doesn’t affect many samples, depending on balance of sampling rate and total number of points.

4 (Related to the last point: Stable to changes in the distribution, in a precise sense.)
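
The empirical approximation in point 1 can be sketched as below. To stay self-contained, the sketch only computes degree-0 persistence, using the standard fact that the finite H0 bars of a Vietoris-Rips filtration die at the edge lengths of a minimum spanning tree (with the scale parameter taken to be the edge length, an assumed convention). The function names, sample size, and number of samples are hypothetical choices, not values from the talk.

```python
import math
import random


def h0_deaths(points):
    # Death times of the finite H0 bars (all born at 0), via Kruskal's algorithm.
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge: one bar dies here
            deaths.append(length)
    return deaths


def barcode_distribution(points, n, num_samples, seed=0):
    # Empirical distribution: one H0 barcode per random subsample of size n.
    rng = random.Random(seed)
    return [h0_deaths(rng.sample(points, n)) for _ in range(num_samples)]


points = [(float(i), 0.0) for i in range(20)]
samples = barcode_distribution(points, n=5, num_samples=8, seed=1)
print(len(samples), [max(deaths) for deaths in samples])
```

Each subsample is processed independently, which is what makes the approach easy to parallelize.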




Efficiency

Computing persistence barcodes for large data sets naively is impossible (combinatorial explosion).

Approximation methods are an active area of research, but many depend on a priori knowledge of feature scale.

Perspective that the distribution is the invariant allows us to assemble information from many small samples.

Easy to compute in parallel.

Necessary for potential applications to “big data”.
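
A rough count makes the combinatorial explosion concrete. The sketch below compares the worst-case number of simplices of dimension at most 2 on one large point set with the total over many small samples. The sizes (1000 points, 100 samples of 50 points) are arbitrary illustrative assumptions, and the count is an upper bound reached only when every simplex is present.

```python
from math import comb


def max_simplices(num_points, max_dim):
    # Worst case: every subset of size 1 .. max_dim + 1 spans a simplex.
    return sum(comb(num_points, j + 1) for j in range(max_dim + 1))


full = max_simplices(1000, 2)       # one complex on all 1000 points
per_sample = max_simplices(50, 2)   # one complex on a 50-point sample
subsampled = 100 * per_sample       # 100 independent 50-point samples

print(full, per_sample, subsampled)
```

Under these assumptions the subsampled approach handles roughly 80 times fewer simplices in total, and each sample can be processed on a separate worker.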



Hypothesis testing and confidence intervals

Can try to perform statistical inference either directly in barcode space or on distributions in R.

In barcode space:

1 Hypothesis testing is difficult:

Hard to specify null hypotheses.

Don’t have analytic formulas for asymptotics of null hypothesis distributions. (Despite excellent work of many people on homology for reasonable classes of null hypotheses. [Adler, Bobrowski, Kahle, Meckes, Weinberger])

2 Specify hypotheses in terms of sampling procedure and then use Monte Carlo sampling.

3 Confidence intervals are easier: can be obtained for “underlying” persistent homology. [Fasy, Lecci, Rinaldo, Wasserman, Balakrishnan, Singh.]
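
Point 2 can be sketched as a Monte Carlo test. Everything specific here is an assumption made for illustration: the null hypothesis is "20 points drawn uniformly from the unit square", the test statistic is the length of the longest finite H0 bar (the longest minimum-spanning-tree edge), and the function names are hypothetical.

```python
import math
import random


def longest_h0_bar(points):
    # Longest minimum-spanning-tree edge, computed with Prim's algorithm.
    n = len(points)
    best = [math.inf] * n
    in_tree = [False] * n
    best[0] = 0.0
    longest = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        longest = max(longest, best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return longest


def monte_carlo_p_value(observed, null_sampler, statistic, num_sim=200, seed=0):
    # Fraction of simulated null statistics at least as extreme as the observed
    # one, with the usual +1 correction so the p-value is never exactly 0.
    rng = random.Random(seed)
    obs = statistic(observed)
    exceed = sum(statistic(null_sampler(rng)) >= obs for _ in range(num_sim))
    return (1 + exceed) / (1 + num_sim)


def uniform_square(rng, n=20):
    # Assumed null model: n independent uniform points in the unit square.
    return [(rng.random(), rng.random()) for _ in range(n)]


# Observed data: two well-separated clusters of 10 points each.
observed = [(0.01 * i, 0.0) for i in range(10)] + [(10 + 0.01 * i, 0.0) for i in range(10)]
p_value = monte_carlo_p_value(observed, uniform_square, longest_h0_bar)
print(p_value)
```

The null hypothesis is specified purely as a sampling procedure, so no analytic null distribution is needed.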




As above, can first project to distributions on R or R^n (or the like):

1 Similar issues, although easier to specify hypotheses.

2 Confidence intervals for various statistics, in some cases analytic.

Resampling methods (e.g., the bootstrap) are very relevant (and asymptotically consistent).
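
A percentile bootstrap for a real-valued barcode summary can be sketched as follows. This is a generic bootstrap sketch rather than a procedure from the talk: the input is assumed to be one summary value per sample (for example, the length of the longest bar), the numbers are made up, and the function name is hypothetical.

```python
import random
from statistics import median


def bootstrap_ci(values, stat=median, num_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample with replacement, recompute the statistic,
    # and read off the alpha/2 and 1 - alpha/2 empirical quantiles.
    rng = random.Random(seed)
    n = len(values)
    boots = sorted(
        stat([rng.choice(values) for _ in range(n)]) for _ in range(num_boot)
    )
    lower = boots[int((alpha / 2) * num_boot)]
    upper = boots[min(num_boot - 1, int((1 - alpha / 2) * num_boot))]
    return lower, upper


# Made-up longest-bar lengths, one per subsample.
longest_bars = [0.9, 1.1, 1.0, 1.3, 0.8, 1.2, 1.05, 0.95, 1.15, 1.0]
low, high = bootstrap_ci(longest_bars)
print(low, high)
```

The same function works for any statistic of a real-valued projection; vector-valued projections would need a coordinate-wise or otherwise adapted interval.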




















Euler characteristic

Even though the field is very new, I’ve left out a lot of good stuff. One topic to mention in passing:

Observation

The Euler characteristic χ of a simplicial complex is both much more robust and much easier to compute than homology.

Theoretical foundations are more tractable here: Euler characteristics of Gaussian random fields [Adler-Taylor].

Weaker by itself, but as an ensemble: Euler characteristic transform for slices through 3D solids [Turner, Mukherjee].
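
The ease of computation is visible in code: χ is just an alternating count of simplices by dimension, with no linear algebra. The sketch below assumes a complex is given by its maximal simplices as tuples of vertex labels; the helper names are hypothetical.

```python
from itertools import combinations


def closure(top_simplices):
    # All nonempty faces of the given simplices, as a set of frozensets.
    faces = set()
    for simplex in top_simplices:
        for size in range(1, len(simplex) + 1):
            faces.update(frozenset(face) for face in combinations(simplex, size))
    return faces


def euler_characteristic(top_simplices):
    # chi = sum over simplices of (-1)^dim, where dim = number of vertices - 1.
    return sum((-1) ** (len(face) - 1) for face in closure(top_simplices))


hollow_triangle = [(0, 1), (1, 2), (0, 2)]                       # a circle
filled_triangle = [(0, 1, 2)]                                    # a disk
hollow_tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]  # a sphere

print(
    euler_characteristic(hollow_triangle),
    euler_characteristic(filled_triangle),
    euler_characteristic(hollow_tetrahedron),
)
```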





Questions for the future:

Outstanding issue

So far, limited experience using these tools on very large data sets.

(Except “Mapper”, which I’ll talk about this afternoon.)

1 What about guided sampling?

2 How can we best handle data sets that have “wild” geometric structure (or none at all)?

3 More and better “topological” summaries and hypothesis specification tools.


