127
Statistical Inference for Geometric Data Jisu KIM Carnege Mellon University Sep 19, 2018 1 / 62

StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Statistical Inference for Geometric Data

Jisu KIM

Carnege Mellon University

Sep 19, 2018

1 / 62

Page 2: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Introduction

Minimax Rates for Geometric Parameters of a ManifoldMinimax Rates for Estimating the Dimension of a ManifoldThe Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Statistical Inference For Homological FeaturesStatistical Inference for Cluster Trees

Statistical Inference and Computation for Persistent HomologyPersistent homology of KDE filtration on rips complexR Package TDA: Statistical Tools for Topological Data Analysis

Conclusion

2 / 62

Page 3: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

High dimensional data suffers from the curse ofdimensionality.

1

1[Hastie et al., 2009, Ch2, Figure 2.6]3 / 62

Page 4: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The curse of dimensionality is mitigated when there is a lowdimensional geometric structure.

2

2http://www.skybluetrades.net/blog/posts/2011/10/30/machine-learning/4 / 62

Page 5: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Geometric structures in the data provide information.

3

3http://www.mpa-garching.mpg.de/galform/virgo/millennium/poster_half.jpg5 / 62

Page 6: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Statistic Inference for Geometric Data is explored.

I Minimax Rates for Geometric Parameters of a ManifoldI Minimax Rates for Estimating the Dimension of a Manifold (Kim,

Rinaldo, Wasserman, 2016)I The Origin of the Reach: Better Understanding Regularity Through

Minimax Estimation Theory (Aamari, Kim, Chazal, Michel, Rinaldo,Wasserman, 2017)

I Statistical Inference For Homological FeaturesI Statistical Inference for Cluster Trees (Kim, Chen, Balakrishnan,

Rinaldo, Wasserman, 2016)I Statistical Inference and Computation for Persistent Homology

I Statistical inference on persistent homology of KDE filtration on ripscomplex (Shin, Kim, Rinaldo, Wasserman, 2018?)

I R Package TDA: Statistical Tools for Topological Data Analysis(Fasy, Kim, Lecci, Maria, Milman, Rouvreau, 2014a)

6 / 62

Page 7: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Introduction

Minimax Rates for Geometric Parameters of a ManifoldMinimax Rates for Estimating the Dimension of a ManifoldThe Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Statistical Inference For Homological FeaturesStatistical Inference for Cluster Trees

Statistical Inference and Computation for Persistent HomologyPersistent homology of KDE filtration on rips complexR Package TDA: Statistical Tools for Topological Data Analysis

Conclusion

7 / 62

Page 8: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

A manifold is a low dimensional geometric structure thatlocally resembles Euclidean space.

4

4http://www.skybluetrades.net/blog/posts/2011/10/30/machine-learning/8 / 62

Page 9: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The maximum risk of an estimator is its worst expectederror.

I the maximum risk of an estimator θn is the worst expected error thatthe estimator θn can make.

I

supP∈P

EP(n)

[`(θn(X ), θ(P)

)]

I X = (X1, · · · ,Xn) is drawn from a fixed distribution P, where P iscontained in set of distributions P.

I estimator θn is any function of data X .I The loss function `(·, ·) measures the error of the estimator θn.

9 / 62

Page 10: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The minimax rate describes the statistical difficulty ofestimating a parameter.

I The minimax rate Rn is the risk of an estimator that performs bestin the worst case, as a function of sample size.

I

Rn = infθn

supP∈P

EP(n)

[`(θn(X ), θ(P)

)]

I X = (X1, · · · ,Xn) is drawn from a fixed distribution P, where P iscontained in set of distributions P.

I estimator θn is any function of data X .I The loss function `(·, ·) measures the error of the estimator θn.

10 / 62

Page 11: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We measure the statistical difficulty of estimating geometricparameters of a manifold by their minimax rate.

I Minimax Rates for Estimating the Dimension of a Manifold (Kim,Rinaldo, Wasserman, 2016)

I The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory (Aamari, Kim, Chazal, Michel, Rinaldo,Wasserman, 2017)

11 / 62

Page 12: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Introduction

Minimax Rates for Geometric Parameters of a ManifoldMinimax Rates for Estimating the Dimension of a ManifoldThe Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Statistical Inference For Homological FeaturesStatistical Inference for Cluster Trees

Statistical Inference and Computation for Persistent HomologyPersistent homology of KDE filtration on rips complexR Package TDA: Statistical Tools for Topological Data Analysis

Conclusion

12 / 62

Page 13: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Manifold learning finds an underlying manifold to reducedimension.

13 / 62

Page 14: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The intrinsic dimension of a manifold needs to be estimated.

I Most manifold learning algorithms require the intrinsic dimension ofthe manifold as input.

I The intrinsic dimension is rarely known in advance and therefore hasto be estimated.

14 / 62

Page 15: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Minimax rate for estimating the dimension

I

Rn = infˆdimn

supP∈P

EP(n)

[1(

ˆdimn(X ) 6= dim(P))]

I X = (X1, · · · ,Xn) is drawn from a fixed distribution P, where P iscontained in set of distributions P.

I estimator ˆdimn is any function of data X .I 0− 1 loss function is considered, so for all x , y ∈ R,`(x , y) = 1(x 6= y).

15 / 62

Page 16: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Minimax rate for estimating the dimension

Theorem(Proposition 28 and 29)

n−2n . infˆdimn

supP∈P

EP(n)

[1(

ˆdimn 6= dim(P))]

. n−1

m−1 n.

16 / 62

Page 17: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Introduction

Minimax Rates for Geometric Parameters of a ManifoldMinimax Rates for Estimating the Dimension of a ManifoldThe Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Statistical Inference For Homological FeaturesStatistical Inference for Cluster Trees

Statistical Inference and Computation for Persistent HomologyPersistent homology of KDE filtration on rips complexR Package TDA: Statistical Tools for Topological Data Analysis

Conclusion

17 / 62

Page 18: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach is the maximum radius of a ball that can roll overthe manifold.

DefinitionThe reach of M, denoted by τ(M), can be defined as

τ(M) = infa 6=b∈M

‖b − a‖222d(b − a, TaM)

,

where TaM is the tangent space of M at a.

τ(M)

M M

τ(M)

18 / 62

Page 19: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach is a regularity parameter in many geometricalinference problem.

I The reach is a key paramter in:I Dimension estimationI Homology inferenceI Volume estimationI Manifold clusteringI Diffusion maps

19 / 62

Page 20: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Minimax rate for estimating the reach

I

Rn = infτn

supP∈P

EP(n)

[∣∣∣∣1

τ(P)− 1τn(X )

∣∣∣∣q]

I X = (X1, · · · ,Xn) is drawn from a fixed distribution P, where P iscontained in set of distributions P.

I estimator τn is any function of data X .I inverse lq loss function is considered, so for all x , y ∈ R,`(x , y) =

∣∣∣ 1x− 1

y

∣∣∣q.

20 / 62

Page 21: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

An estimator of the reach

The reach of M is

τ(M) = infa 6=b∈M

‖b − a‖222d(b − a, TaM)

.

DefinitionGiven observation X = (X1, . . . ,Xn), we estimate the reach as

τn(X ) = inf1≤i 6=j≤n

‖Xj − Xi‖222d(Xj − Xi , TXiM)

.

21 / 62

Page 22: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Minimax rate for estimating the reach

Theorem(Theorem 45 and Proposition 50)

n−qd . inf

τnsupP∈P

EP(n)

[∣∣∣∣1

τ(P)− 1τn

∣∣∣∣q]

. n−2q

3d−1 .

22 / 62

Page 23: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Introduction

Minimax Rates for Geometric Parameters of a ManifoldMinimax Rates for Estimating the Dimension of a ManifoldThe Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Statistical Inference For Homological FeaturesStatistical Inference for Cluster Trees

Statistical Inference and Computation for Persistent HomologyPersistent homology of KDE filtration on rips complexR Package TDA: Statistical Tools for Topological Data Analysis

Conclusion

23 / 62

Page 24: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Geometric holes in the data provide information.

24 / 62

Page 25: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The number of holes is used to summarize geometricalfeatures.

I Geometrical objects :I A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W,

X, Y, Z,

I , , , , .5

I The number of holes of different dimensions is considered.

1. β0 =# of connected components

2. β1 =# of loops (holes inside 1-dim sphere)

3. β2 =# of voids (holes inside 2-dim sphere) : if dim ≥ 3

5Professor Valerie Ventura25 / 62

Page 26: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Example : Objects are classified by homologies.

1. β0 =# of connected components

2. β1 =# of loops

β0 \ β1 0 1 2

C, G, I, J, L, M,

1 N, S, U, V, W, Z, A, R, D, O, , P, Q BE, F, T, Y, H, K, X

2

3 ,

26 / 62

Page 27: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Statistical inference for homological features.

I Statistical Inference for Cluster Trees (Kim, Chen, Balakrishnan,Rinaldo, Wasserman, 2016)

27 / 62

Page 28: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Introduction

Minimax Rates for Geometric Parameters of a ManifoldMinimax Rates for Estimating the Dimension of a ManifoldThe Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Statistical Inference For Homological FeaturesStatistical Inference for Cluster Trees

Statistical Inference and Computation for Persistent HomologyPersistent homology of KDE filtration on rips complexR Package TDA: Statistical Tools for Topological Data Analysis

Conclusion

28 / 62

Page 29: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The cluster tree is the hierarchy of the high density clusters.

DefinitionFor a density function p, its cluster tree Tp is a function where Tp(λ) isthe set of connected components of the upper level set x : p(x) ≥ λ.

p(x)

x

p(x)

x

29 / 62

Page 30: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

A confidence set helps denoising the empirical tree.I An asymptotic 1− α confidence set Cα is a collection of trees with

the property that

P(Tp ∈ Cα) = 1− α+ o(1).

Ring data, alpha = 0.05la

mbd

a

0.0 0.2 0.4 0.6 0.8 1.0

00.

208

0.27

20.

529

30 / 62

Page 31: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We use the bootstrap to compute 1− α confidence set Cα.

I We let Tph be the cluster tree from the kernel density estimator ph,where

ph(x) =1

nhd

n∑

i=1

K

(x − Xi

h

),

and the confidence set as the ball centered at Tph and radius tα, i.e.

Cα = T : d∞(T ,Tph) ≤ tα .

Theorem(Theorem 57) Above confidence set Cα satisfies

P(Th ∈ Cα

)= 1− α+ O

(log7 n

nhm

)1/6 .

31 / 62

Page 32: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The pruned trees according to the confidence set recover theactual cluster trees.

Ring data, alpha = 0.05

lam

bda

0.0 0.2 0.4 0.6 0.8 1.0

00.

208

0.27

20.

529

Mickey mouse data, alpha = 0.05

lam

bda

0.0 0.2 0.4 0.6 0.8 1.0

00.

255

0.29

1

Yingyang data, alpha = 0.05

lam

bda

0.0 0.2 0.4 0.6 0.8 1.00

0.03

50.

044

0.05

20.

07

32 / 62

Page 33: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Introduction

Minimax Rates for Geometric Parameters of a ManifoldMinimax Rates for Estimating the Dimension of a ManifoldThe Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Statistical Inference For Homological FeaturesStatistical Inference for Cluster Trees

Statistical Inference and Computation for Persistent HomologyPersistent homology of KDE filtration on rips complexR Package TDA: Statistical Tools for Topological Data Analysis

Conclusion

33 / 62

Page 34: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Homology of finite sample is different from homology ofunderlying manifold, hence it cannot be directly used for theinference.

I When analyzing data, we prefer robust features where features of theunderlying manifold can be inferred from features of finite samples.

I Homology is not robust:

Underlying circle: β0 = 1, β1 = 1

100 samples: β0 = 100, β1 = 0

34 / 62

Page 35: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Persistent homology computes homologies on collection ofsets, and tracks when topological features are born andwhen they die.

35 / 62

Page 36: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Persistent homology of the underlying manifold can beinferred from persistent homology of finite samples.

Circle

0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

Birth

Dea

th

200 samples

0.0 0.5 1.0 1.50.

00.

51.

01.

5Birth

Dea

th

36 / 62

Page 37: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Confidence band for persistent homology separateshomological signal from homological noise.

Circle

0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

Birth

Dea

th

200 samples

0.0 0.5 1.0 1.50.

00.

51.

01.

5Birth

Dea

th

37 / 62

Page 38: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Statistical inference for persistent homology.

I Persistent homology of KDE filtration on rips complex (Shin, Kim,Rinaldo, Wasserman, 2018?)

I R Package TDA: Statistical Tools for Topological Data Analysis(Fasy, Kim, Lecci, Maria, Milman, Rouvreau, 2014a)

38 / 62

Page 39: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Introduction

Minimax Rates for Geometric Parameters of a ManifoldMinimax Rates for Estimating the Dimension of a ManifoldThe Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Statistical Inference For Homological FeaturesStatistical Inference for Cluster Trees

Statistical Inference and Computation for Persistent HomologyPersistent homology of KDE filtration on rips complexR Package TDA: Statistical Tools for Topological Data Analysis

Conclusion

39 / 62

Page 40: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Computing a confidence band for the persistent homologyincurs computing on a grid of points, which is infeasible inhigh dimensional space.

40 / 62

Page 41: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Computing the persistent homology of density function ondata points reduces computational complexity.

41 / 62

Page 42: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

How can we compute a confidence band for the persistenthomology with computation on data points?

I (Shin, Kim, Rinaldo, Wasserman, 2018?) : extending work from Fasyet al. [2014b], Bobrowski et al. [2014], Chazal et al. [2011].

Circle

0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

Birth

Dea

th

200 samples

0.0 0.5 1.0 1.50.

00.

51.

01.

5

Birth

Dea

th

42 / 62

Page 43: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We rely on the kernel density estimator to extracttopological information of the underlying distribution.

I The kernel density estimator is

ph(x) =1

nhd

n∑

i=1

K

(x − Xi

h

).

43 / 62

Page 44: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We are considering the upper level set of the average kerneldensity estimator on the support.

I Let X1, . . . ,Xn ∼ P, then the average kernel density estimator is

ph(x) = E [ph(x)] =1hd

E[K

(x − X

h

)].

I We are considering the upper level sets of the average kernel densityestimator

DLL>0 , where DL := x ∈ supp(P) : ph(x) ≥ L .

44 / 62

Page 45: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We are considering the upper level set of the average kerneldensity estimator on the support.

I We are considering the upper level sets of the average KDE

DLL>0 , where DL := x ∈ supp(P) : ph(x) ≥ L .

45 / 62

Page 46: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We are targeting the persistent homology of the upper levelset of the average kernel density estimator on the support.

I We are considering the upper level sets of the average KDE

DLL>0 , where DL := x ∈ supp(P) : ph(x) ≥ L ,

and targeting its persistent homology PHsupp(P)∗ (ph).

46 / 62

Page 47: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We use the Rips complex to estimate the target persistenthomology.

I For X ⊂ Rm and r > 0, the Rips complex R(X , r) is defined as

R(X , r) =[Xi1 , . . . ,Xik ] : d(Xij ,Xil ) < 2r , 1 ≤ ∀j 6= l ≤ k, k = 1, . . . , n

.

47 / 62

Page 48: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We estimate the target level set by considering the Ripscomplex generated from the level set of the KDE.

I For X ⊂ Rm and r > 0, the Rips complex R(X , r) is defined as

R(X , r) =[Xi1 , . . . ,Xik ] : d(Xij ,Xil ) < 2r , 1 ≤ ∀j 6= l ≤ k, k = 1, . . . , n

.

I The KDE (kernel density estimator) is

ph(x) =1

nhd

n∑

i=1

K

(x − Xi

h

).

I For Xn = X1, . . . ,Xn, we consider the Rips complex generatedfrom the level set of the KDE

R(X ph

n,L, r)

L>0, where X ph

n,L := Xi ∈ Xn : ph(Xi ) ≥ L .

48 / 62

Page 49: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We estimate the target level set by considering the Ripscomplex generated from the level set of the KDE.

I For Xn = X1, . . . ,Xn, we estimate the target level set by the levelsets of the KDE on Rips complexes,

R(X ph

n,L, r)

L>0, where X ph

n,L := Xi ∈ Xn : ph(Xi ) ≥ L .

49 / 62

Page 50: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We estimate the target persistent homology by thepersistent homology of the KDE filtration on Ripscomplexes.

I We estimate the target persistent homology by the persistenthomology of the level sets of the KDE on Rips complexes,

R(X ph

n,L, r)

L>0, where X ph

n,L = Xi ∈ Xn : ph(Xi ) ≥ L .

and denote the persistent homology as PHR∗ (ph, r).

50 / 62

Page 51: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We estimate the target level set by Rips complexes from theKDE level sets.

I We approximate the target level set

DLL>0 , where DL := x ∈ supp(P) : ph(x) ≥ L ,by the level sets of the KDE on Rips complexes,

R(X ph

n,L, r)

L>0, where X ph

n,L = Xi ∈ Xn : ph(Xi ) ≥ L .

51 / 62

Page 52: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We estimate the target persistent homology by thepersistent homology of the KDE filtration on Ripscomplexes.

I We estimate the target persistent homology

PHsupp(P)∗ (ph),

by the persistent homology of the KDE filtration on Rips complexes,

PHR∗ (ph, r).

52 / 62

Page 53: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The persistent homology of the KDE filtration on Ripscomplexes is consistent.

Theorem(Theorem 74)

dB(PHR∗ (phn , rn),PH

supp(P)∗ (phn)

)= OP

(√log(1/hn)

nhdn+ ‖rn‖∞

).

53 / 62

Page 54: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Confidence set

I An asymptotic 1− α confidence set Cα is a random set of persistenthomologies satisfying

P(PHsupp(P)∗ (phn) ∈ Cα) ≥ 1− α+ o(1).

54 / 62

Page 55: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Confidence set for the persistent homology of the KDEfiltration.

I We let the confidence set as the ball centered at PHR∗ (phn , rn) and

radius bα, i.e.

Cα =P :, dB

(P,PHR

∗ (phn , rn))≤ bα

.

This is a valid confidence set by the following theorem.

Theorem(Theorem 78)

P(PH

supp(P)∗ (phn) ∈ Cα

)≥ 1− α+ o(1).

55 / 62

Page 56: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Introduction

Minimax Rates for Geometric Parameters of a ManifoldMinimax Rates for Estimating the Dimension of a ManifoldThe Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Statistical Inference For Homological FeaturesStatistical Inference for Cluster Trees

Statistical Inference and Computation for Persistent HomologyPersistent homology of KDE filtration on rips complexR Package TDA: Statistical Tools for Topological Data Analysis

Conclusion

56 / 62

Page 57: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

R Package TDA provides an R interface for C++ librariesfor Topological Data Analysis.

I website:https://cran.r-project.org/web/packages/TDA/index.html

I Author: Brittany Terese Fasy, Jisu Kim, Fabrizio Lecci, ClémentMaria, David Milman, and Vincent Rouvreau.

I R is a programming language for statistical computing and graphics.I R has short development time, while C/C++ has short execution

time.I R package TDA provides an R interface for C++ library

GUDHI/Dionysus/PHAT, which are for Topological Data Analysis.

57 / 62

Page 58: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

R Package TDA provides an R interface for C++ librariesfor Topological Data Analysis.

I # of downloads (2014-08-18 - 2018-09-19) : 21820

58 / 62

Page 59: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Introduction

Minimax Rates for Geometric Parameters of a ManifoldMinimax Rates for Estimating the Dimension of a ManifoldThe Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Statistical Inference For Homological FeaturesStatistical Inference for Cluster Trees

Statistical Inference and Computation for Persistent HomologyPersistent homology of KDE filtration on rips complexR Package TDA: Statistical Tools for Topological Data Analysis

Conclusion

59 / 62

Page 60: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Presented papers

I Jaehyeok Shin, Jisu Kim, Alessandro Rinaldo, Larry Wasserman,Persistent homology of KDE filtration on rips complex

I Eddie Aamari, Jisu Kim, Frédéric Chazal, Bertrand Michel,Alessandro Rinaldo, Larry Wasserman, Estimating the Reach of aManifold [2017] [ arXiv | HAL | BibTeX ]

I Jisu Kim, Yen-Chi Chen, Sivaraman Balakrishnan, AlessandroRinaldo, Larry Wasserman, Statistical Inference for Cluster Trees[2017] [ arXiv | NIPS | BibTeX ]

I Jisu Kim, Alessandro Rinaldo, Larry Wasserman, Minimax Rates forEstimating the Dimension of a Manifold [2016] [ arXiv | BibTeX ]

I Brittany T. Fasy, Jisu Kim, Fabrizio Lecci, Clément Maria, David L.Millman, Vincent Rouvreau, Introduction to the R package TDA.[2014] [ arXiv | BibTeX ]

60 / 62

Page 61: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Other papers

I Kwangho Kim, Jisu Kim, Barnabás Póczos, Distribution Regressionin Semi-supervised Learning

I Kwangho Kim, Jisu Kim, Edward H. Kennedy, Causal effects basedon distributional distances [2018] [ arXiv | BibTeX ]

I Kijung Shin, Bryan Hooi, Jisu Kim, Christos Faloutsos, Think beforeYou Discard: Accurate Triangle Counting in Graph Streams withDeletions [2018] [ paper | appendix | BibTeX ]

I Kijung Shin, Bryan Hooi, Jisu Kim, Christos Faloutsos, DenseAlert:Incremental Dense-Subtensor Detection in Tensor Streams [2017] [paper | appendix | BibTeX ]

I Kijung Shin, Bryan Hooi, Jisu Kim, Christos Faloutsos, D-Cube:Dense-Block Detection in Terabyte-Scale Tensors [2017] [ paper |appendix | BibTeX ]

61 / 62

Page 62: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Thank you!

Eddie Aamari SivaramanBalakrishnan

Frédéric Chazal

Yen-Chi Chen Jessi Cisewski Christos Faloutsos

Brittany T. Fasy

Stephan Günnemann

Giseon Heo Bryan Hooi Edward Kennedy

Kwangho Kim Jay-Yoon Lee Clément Maria

Bertrand Michel

David L. Millman

Jaeseong Oh BarnabásPóczos

Alessandro Rinaldo

Vincent Rouvreau

Jaehyeok Shin Kijung Shin Daren Wang Larry Wasserman

62 / 62

Page 63: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Stability and Statistical Inference for Persistent Homology

Minimax Rates for Estimating the Dimension of a ManifoldRegularity conditionsUpper BoundLower BoundUpper Bound and Lower Bound for General Case

The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Reach and its GeometryReach estimator and its analysisMinimax Estimates

Statistical Inference for Cluster Trees

Reference

1 / 65

Page 64: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Bottleneck distance gives a metric on the space of persistenthomology.

DefinitionLet D1, D2 be multiset of points. Bottleneck distance is defined as

dB(D1, D2) = infγ

supx∈D1

‖x − γ(x)‖∞,

where γ ranges over all bijections from D1 to D2.

0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

Birth

Dea

th

0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

Birth

Dea

th

2 / 65

Page 65: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Bottleneck distance gives a metric on the space of persistenthomology.

DefinitionLet D1, D2 be multiset of points. Bottleneck distance is defined as

dB(D1, D2) = infγ

supx∈D1

‖x − γ(x)‖∞,

where γ ranges over all bijections from D1 to D2.

0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

Birth

Dea

th

3 / 65

Page 66: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Confidence band for the persistent homology is a randomquantity containing the persistent homology with highprobability.

Let M be a compact manifold, and X = X1, · · · ,Xn be n samples. LetfM and fX be corresponding functions whose persistent homology is ofinterest. Given the significance level α ∈ (0, 1), (1− α) confidence bandcn = cn(X ) is a random variable satisfying

P (dB(Dgm(fM), Dgm(fX )) ≤ cn) ≥ 1− α.

Circle

0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

Birth

Dea

th

200 samples

0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

Birth

Dea

th

4 / 65

Page 67: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Confidence band for the persistent homology is a randomquantity containing the persistent homology with highprobability.

Let M be a compact manifold, and X = X1, · · · ,Xn be n samples. LetfM and fX be corresponding functions whose persistent homology is ofinterest. Given the significance level α ∈ (0, 1), (1− α) confidence bandcn = cn(X ) is a random variable satisfying

P (dB(Dgm(fM), Dgm(fX )) ≤ cn) ≥ 1− α.

Circle

0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

Birth

Dea

th

200 samples

0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

Birth

Dea

th

5 / 65

Page 68: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Confidence band for the persistent homology can becomputed using the bootstrap algorithm.

1. Given a sample X = x1, . . . , xn, compute the kernel densityestimator ph.

2. Draw X ∗ = x∗1 , . . . , x∗n from X = x1, . . . , xn (with replacement),and compute θ∗ =

√n||p∗h(x)− ph(x)||∞, where p∗h is the density

estimator computed using X ∗.3. Repeat the previous step B times to obtain θ∗1 , . . . , θ

∗B

4. Compute qα = infq : 1

B

∑Bj=1 I (θ

∗j ≥ q) ≤ α

5. The (1− α) confidence band for E[ph] is[ph − qα√

n, ph +

qα√n

].

6 / 65

Page 69: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Stability and Statistical Inference for Persistent Homology

Minimax Rates for Estimating the Dimension of a ManifoldRegularity conditionsUpper BoundLower BoundUpper Bound and Lower Bound for General Case

The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Reach and its GeometryReach estimator and its analysisMinimax Estimates

Statistical Inference for Cluster Trees

Reference

7 / 65

Page 70: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Stability and Statistical Inference for Persistent Homology

Minimax Rates for Estimating the Dimension of a ManifoldRegularity conditionsUpper BoundLower BoundUpper Bound and Lower Bound for General Case

The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Reach and its GeometryReach estimator and its analysisMinimax Estimates

Statistical Inference for Cluster Trees

Reference

8 / 65

Page 71: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The supporting manifold M is assumed to be bounded.

M ⊂ I := [−KI ,KI ]m ⊂ Rm with KI ∈ (0,∞)

9 / 65

Page 72: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach is assumed to be lower bounded to avoid anarbitrarily complicated manifold.

I P is a set of distributions P that is supported on a bounded manifoldM, with its reach τ(M) ≥ τg , and with other regularity assumptions.

πM (x)

x

≤ τg

M

10 / 65

Page 73: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach is assumed to be lower bounded to avoid anarbitrarily complicated manifold.

I M is of local reach ≥ τ`, if for all points p ∈ M, there exists aneighborhood Up ⊂ M such that Up is of reach ≥ τ`.

11 / 65

Page 74: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Density is bounded away from ∞ with respect to theuniform measure.

I Distribution P is absolutely continuous to induced Lebesgue measurevolM , and dP

dvolM≤ Kp for fixed Kp.

I This implies that the distribution on the manifold is of essentialdimension d .

I Pdκl ,κg ,Kp

denotes set of distributions P that is supported ond-dimensional manifold of (global) reach ≥ τg , local reach ≥ τ`, anddensity is bounded by Kp.

12 / 65

Page 75: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Stability and Statistical Inference for Persistent Homology

Minimax Rates for Estimating the Dimension of a ManifoldRegularity conditionsUpper BoundLower BoundUpper Bound and Lower Bound for General Case

The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Reach and its GeometryReach estimator and its analysisMinimax Estimates

Statistical Inference for Cluster Trees

Reference

13 / 65

Page 76: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The Maximum Risk of any chosen Estimator Provides anUpper Bound on the Minimax Rate.

Rn = infˆdimn

supP∈P

EP(n)

[`(

ˆdimn(X ), dim(P))]

≤ supP∈P

EP(n)

[`(

ˆdimn(X ), dim(P))]

︸ ︷︷ ︸the maximum risk of any chosen estimator

14 / 65

Page 77: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

TSP(Travelling Salesman Problem) Path Finds ShortestPath that Visits Each Points exactly Once.

6

6http://www.heatonresearch.com/fun/tsp/anneal15 / 65

Page 78: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Our Estimator estimates Dimension to be d2 if d1-squaredLength of TSP Generated by the Data is Long.

I When intrinsic dimesion is higher, length of TSP path is likely to belonger.

I

ˆdimn(X ) = d1 ⇐⇒

∃σ ∈ Sn s.tn−1∑

i=1

‖Xσ(i+1) − Xσ(i)‖d1Rm ≤ C ,

where C is some constant that depends only on KI , d1, and m.

16 / 65

Page 79: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Our Estimator has Maximum Risk of O(n−(

d2d1−1)n).

I Our estimator makes error with probability at most O(n−(

d2d1−1)n

)

if intrinsic dimension is d2.I Our estimator is always correct when the intrinsic dimension is d1.

17 / 65

Page 80: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Our Estimator makes Error with Probability at most

O

(n−(

d2d1−1)n)

if Intrinsic Dimension is d2.

I Based on the following lemma:

Lemma(Lemma 18) Let X1, · · · ,Xn ∼ P ∈ Pd2

κl ,κg ,Kp, then

P(n)

[n−1∑

i=1

‖Xi+1 − Xi‖d1 ≤ L

]. n−

d2d1

n.

18 / 65

Page 81: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Our Estimator is always Correct when the IntrinsicDimension is d1.

I Based on following lemma:

Lemma(Lemma 19) Let M be a d1-dimensional manifold with global reach ≥ τgand local reach ≥ τ`, and X1, · · · ,Xn ∈ M. Then there exists C whichdepends only on m, d1 and KI , and there exists σ ∈ Sn such that

n−1∑

i=1

‖Xσ(i+1) − Xσ(i)‖d1Rm ≤ C .

19 / 65

Page 82: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Our estimator is always correct when the intrinsic dimensionis d1.

n−1∑

i=1

‖Xσ(i+1) − Xσ(i)‖d1Rm ≤ C .

I When d1 = 1 so that the manifold is a curve, length of TSP path isbounded by length of curve volM(M).

Xσ(1)

Xσ(2)

Xσ(3)

Xσ(n−1)

Xσ(n)

. . .

Y1

Y2

Yn−1

∑Yi ≤ volM (M)

M

Xσ(n−2)

Yn−2

I Global reach≥ τg implies volM(M) is bounded.20 / 65

Page 83: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Our estimator is always correct when the intrinsic dimensionis d1.

n−1∑

i=1

‖Xσ(i+1) − Xσ(i)‖d1Rm ≤ C .

I When d1 > 1, Several conditions implied by regularity conditionscombined with Hölder continuity of d1-dimensional space-fillingcurve is used.

21 / 65

Page 84: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Our estimator is always correct when the intrinsic dimensionis d1.

n−1∑

i=1

‖Xσ(i+1) − Xσ(i)‖d1Rm ≤ C .

I When d1 > 1, Several conditions implied by regularity conditionscombined with Hölder continuity of d1-dimensional space-fillingcurve is used.

Lemma(Lemma 85, Space-filling curve) There exists surjective mapψd : R→ Rd which is Hölder continuous of order 1/d , i.e.

0 ≤ ∀s, t ≤ 1, ‖ψd(s)− ψd(t)‖Rd ≤ 2√d + 3|s − t|1/d .

22 / 65

Page 85: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Mimimax rate is upper bounded by O

(n−(

d2d1−1)n).

Proposition(Proposition 21) Let 1 ≤ d1 < d2 ≤ m. Then

infˆdimn

supP∈Pd1∪Pd2

EP(n)

[l(

ˆdimn, dim(P))]

. n−(d2d1−1)n.

23 / 65

Page 86: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Stability and Statistical Inference for Persistent Homology

Minimax Rates for Estimating the Dimension of a ManifoldRegularity conditionsUpper BoundLower BoundUpper Bound and Lower Bound for General Case

The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Reach and its GeometryReach estimator and its analysisMinimax Estimates

Statistical Inference for Cluster Trees

Reference

24 / 65

Page 87: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

A subset T ⊂ [−KI ,KI ]n and set of distributions Pd1

1 , Pd22

are found so that, whenever X = (X1, · · · ,Xn) ∈ T , wecannot distinguish two models.

I The lower bound measures how hard it is to tell whether the datacome from a d1 or d2 -dimensional manifold.

I T , Pd11 and Pd2

2 are linked to the lower bound by using Le Cam’slemma.

25 / 65

Page 88: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Le Cam’s Lemma provides lower bounds based on theminimum of two densities q1 ∧ q2, where q1, q2 are inconvex hull of Pd1

1 and convex hull of Pd22 , respectively.

Lemma(Lemma 22, Le Cam’s Lemma) Let P be a set of probability measures,and Pd1 ,Pd2 ⊂ P be such that for all P ∈ Pdi , θ(P) = θi for i = 1, 2.For any Qi ∈ co(Pi ), let qi be density of Qi with respect to measure ν.Then

infθsupP∈P

EP [d(θ, θ(P))] ≥d(θ1, θ2)

4sup

Qi∈co(Pdi )

∫[q1(x) ∧ q2(x)]dν(x).

26 / 65

Page 89: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

T is constructed so that for any x = (x1, · · · , xn) ∈ T ,there exists a d1-dimensional manifold that satisfiesregularity conditions and passes through x1, · · · , xn.

I Ti ’s are cylinder sets in [−KI ,KI ]d2 , and then T is constructed as

T = Snn∏

i=1Ti , where the permutation group Sn acts on

n∏i=1

Ti as a

coordinate change.

T1 T2

T4 T3

T5 T6

T8 T7

τ`

2KI

2KI

27 / 65

Page 90: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

T is constructed so that for any x = (x1, · · · , xn) ∈ T ,there exists a d1-dimensional manifold that satisfiesregularity conditions and passes through x1, · · · , xn.

I Given x1, · · · , xn ∈ T (blue points), manifold of global reach ≥ τgand local reach ≥ τ` (red line) passes through x1, · · · , xn.

T1 T2

x4

x1

x6

x2

x3

x5

x7x8

28 / 65

Page 91: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Pd11 is constructed as set of distributions that are supported

on manifolds that passes through x1, · · · , xn forx = (x1, · · · , xn) ∈ T , and Pd2

2 is a singleton set consistingof the uniform distirbution on [−KI ,KI ]

d2.

29 / 65

Page 92: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

If X ∈ T , it is hard to determine whether X is sampledfrom distribution P in either Pd1

1 or Pd22 .

I There exists Q1 ∈ co(Pd11 ) and Q2 ∈ co(Pd2

2 ) such thatq1(x) ≥ Cq2(x) for every x ∈ T with C < 1.

I Then q1(x) ∧ q2(x) ≥ Cq2(x) if x ∈ T , so C∫Tq2(x)dx can serve

as lower bound of minimax rate.I Based on following claim:

Claim(Claim 25) Let T = Sn

n∏i=1

Ti . Then for all x ∈ intT , there exists C > 0

that depends only on κl , KI , and rx > 0 such that for all r < rx ,

Q1 (B(xi , r)) ≥ CQ2 (B(xi , r)) .

30 / 65

Page 93: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Mimimax rate is lower bounded by Ω(n−2(d2−d1)n

).

I Lower bound below is now combination of Le Cam’s lemma,constructions of T , Pd1

1 , Pd22 , and claim.

Proposition(Proposition 26)

infˆdim

supP∈Pd1∪Pd2

EP(n) [l( ˆdimn, dim(P))] & n−2(d2−d1)n.

31 / 65

Page 94: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Stability and Statistical Inference for Persistent Homology

Minimax Rates for Estimating the Dimension of a ManifoldRegularity conditionsUpper BoundLower BoundUpper Bound and Lower Bound for General Case

The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Reach and its GeometryReach estimator and its analysisMinimax Estimates

Statistical Inference for Cluster Trees

Reference

32 / 65

Page 95: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Multinary Classification and 0− 1 Loss are Considered.

I

Rn = infˆdimn

supP∈P

EP(n)

[`(

ˆdimn(X ), dim(P))]

I Now the manifolds are of any dimensions between 1 and m, so

considered distribution set is P =m⋃

d=1Pd .

I 0− 1 loss function is considered, so for all x , y ∈ R,`(x , y) = I (x = y).

33 / 65

Page 96: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Mimimax Rate is Upper Bounded by O(n−

1m−1n

), and

Lower Bounded by Ω(n−2n

).

Proposition(Proposition 28 and 29)

n−2n . infˆdimn

supP∈P

EP(n)

[l(

ˆdimn, dim(P))]

. n−1

m−1 n.

34 / 65

Page 97: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Stability and Statistical Inference for Persistent Homology

Minimax Rates for Estimating the Dimension of a ManifoldRegularity conditionsUpper BoundLower BoundUpper Bound and Lower Bound for General Case

The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Reach and its GeometryReach estimator and its analysisMinimax Estimates

Statistical Inference for Cluster Trees

Reference

35 / 65

Page 98: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Stability and Statistical Inference for Persistent Homology

Minimax Rates for Estimating the Dimension of a ManifoldRegularity conditionsUpper BoundLower BoundUpper Bound and Lower Bound for General Case

The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Reach and its GeometryReach estimator and its analysisMinimax Estimates

Statistical Inference for Cluster Trees

Reference

36 / 65

Page 99: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The medial axis of a set M is the set of points that have atleast two nearest neighbors on the set M .

I

Med(M) = z ∈ Rm : there exists p 6= q ∈ M with‖p − z‖ = ‖q − z‖ = d(z ,M).

τM

Med(M)

M

37 / 65

Page 100: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach of M , denoted by τM , is the minimum distancefrom Med(M) to M .

I

τM = infx∈Med(M),y∈M

‖x − y‖ .

τM

Med(M)

M

38 / 65

Page 101: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach τM gives the maximum offset size of M on whichthe projection is well defined.

I

τM = infx∈Med(M),y∈M

‖x − y‖ .

τM

Med(M)

M

39 / 65

Page 102: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach τM gives the maximum radius of a ball that youcan roll over M .

I When M ⊂ Rm is a manifold,

τM = infq2 6=q1∈M

‖q2 − q1‖22d(q2 − q1,Tq1M)

.

M

q1 + Tq1M

d (q2 − q1, Tq1M)

‖q2 − q1‖‖q2−q1‖2

2d(q2−q1,Tq1M)

C

q2

q1

40 / 65

Page 103: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach τM gives the maximum radius of a ball that youcan roll over M .

I When M ⊂ Rm is a manifold,

τM = infq2 6=q1∈M

‖q2 − q1‖22d(q2 − q1,Tq1M)

.

τMM

41 / 65

Page 104: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The bottleneck is a geometric structure where the manifoldis nearly self-intersecting.

Definition(Definition 34) A pair of points (q1, q2) in M is said to be a bottleneck ofM if there exists z0 ∈ Med(M) such that q1, q2 ∈ B(z0, τM) and‖q1 − q2‖ = 2τM .

q1

Med(M)

B(z0, τM )

M

‖q1 − q2‖ = 2τM

q2z0

42 / 65

Page 105: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach is attained either from the bottleneck (globalcase) or the area of high curvature (local case).

Theorem(Theorem 37) At least one of the following two assertions holds:

I (Global Case) M has a bottleneck (q1, q2) ∈ M2.I (Local case) There exists q0 ∈ M and an arc-length parametrized γ0

such that γ0(0) = q0 and ‖γ′′0 (0)‖ = 1τM

.

q1

Med(M)

B(z0, τM )

M

‖q1 − q2‖ = 2τM

q2z0

q0

z0

τM

B(z0, τM )

Med(M)M

43 / 65

Page 106: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Stability and Statistical Inference for Persistent Homology

Minimax Rates for Estimating the Dimension of a ManifoldRegularity conditionsUpper BoundLower BoundUpper Bound and Lower Bound for General Case

The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Reach and its GeometryReach estimator and its analysisMinimax Estimates

Statistical Inference for Cluster Trees

Reference

44 / 65

Page 107: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach τM gives the maximum radius of a ball that youcan roll over M .

I When M ⊂ Rm is a manifold,

τM = infq 6=p∈M

‖q − p‖22d(q − p,TpM)

.

M

q1 + Tq1M

d (q2 − q1, Tq1M)

‖q2 − q1‖‖q2−q1‖2

2d(q2−q1,Tq1M)

C

q2

q1

45 / 65

Page 108: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We define the reach estimator τ as the maximum radius of aball that you can roll over the point cloud.

I Let X = x1, . . . , xn be a finite point cloud, then the reachestimator τ is a plugin estimator as

τ(X ) = infxi 6=xj∈X

‖xj − xi‖22d(xj − xi ,TxiM)

.

M

q1 + Tq1M

d (q2 − q1, Tq1M)

‖q2 − q1‖‖q2−q1‖2

2d(q2−q1,Tq1M)

C

q2

q1

46 / 65

Page 109: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The statistical efficiency of the reach estimator τ is analyzedthrough its risk.

I The risk of the estimator τ is the expected loss the estimator.

EP(n) [` (τ(X ), τM)] .

I X = X1, . . . ,Xn is drawn from a fixed distribution P with itssupport M.

I The loss function used is `(τ, τ ′) =∣∣ 1τ− 1

τ ′

∣∣p, p ≥ 1.

47 / 65

Page 110: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The risk of the reach estimator τ is analyzed.

I The risk of the estimator τ is the expected loss the estimator

EP(n)

[∣∣∣∣1τM− 1τ(X )

∣∣∣∣p].

I X = X1, . . . ,Xn is drawn from a fixed distribution P with itssupport M.

I The loss function used is `(τ, τ ′) =∣∣ 1τ− 1

τ ′

∣∣p, p ≥ 1.

48 / 65

Page 111: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach estimator has the risk of O(n−

2p3d−1

).

I The reach estimator has the risk of O(n−

pd

)for the global case.

I The reach estimator has the risk of O(n−

2p3d−1

)for the local case.

q1

Med(M)

B(z0, τM )

M

‖q1 − q2‖ = 2τM

q2z0

q0

z0

τM

B(z0, τM )

Med(M)M

49 / 65

Page 112: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach estimator has the maximum risk of O(n−

pd

)for

the global case.Proposition(Proposition 40) Assume that the support M has a bottleneck. Then,

EPn

[∣∣∣∣1τM− 1τ(X )

∣∣∣∣p]

. n−pd .

q1

Med(M)

B(z0, τM )

M

‖q1 − q2‖ = 2τM

q2z0

50 / 65

Page 113: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The reach estimator has the maximum risk of O(n−

2p3d−1

)

for the local case.Proposition(Proposition 44) Suppose there exists q0 ∈ M and a geodesic γ0 withγ0(0) = q0 and ‖γ′′0 (0)‖ = 1

τM. Then,

EPn

[∣∣∣∣1τM− 1τ(X )

∣∣∣∣p]

. n−2p

3d−1 .

q0

z0

τM

B(z0, τM )

Med(M)M

51 / 65

Page 114: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Stability and Statistical Inference for Persistent Homology

Minimax Rates for Estimating the Dimension of a ManifoldRegularity conditionsUpper BoundLower BoundUpper Bound and Lower Bound for General Case

The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Reach and its GeometryReach estimator and its analysisMinimax Estimates

Statistical Inference for Cluster Trees

Reference

52 / 65

Page 115: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The statistical difficulty of the reach estimation problem isanalyzed by the minimax rate.

I Minimax rate is the risk of an estimator that performs best in theworst case, as a function of sample size.

I

Rn = infτn

supP∈P

EPn [` (τn(X ), τM)] .

I X = X1, . . . ,Xn is drawn from a fixed distribution P with itssupport M, where P is contained in set of distributions P.

I An estimator τn is any function of data X .I The loss function used is `(τ, τ ′) =

∣∣ 1τ− 1

τ ′

∣∣p, p ≥ 1.

53 / 65

Page 116: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The statistical difficulty of the reach estimation problem isanalyzed by the minimax rate.

I Minimax rate is the risk of an estimator that performs best in theworst case, as a function of sample size.

I

Rn = infτn

supP∈P

EPn

[∣∣∣∣1τM− 1τn(X )

∣∣∣∣p].

I X = X1, . . . ,Xn is drawn from a fixed distribution P with itssupport M, where P is contained in set of distributions P.

I An estimator τn is any function of data X .I The loss function used is `(τ, τ ′) =

∣∣ 1τ− 1

τ ′

∣∣p, p ≥ 1.

54 / 65

Page 117: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

The maximum risk of our estimator provides an upperbound on the minimax rate.

Rn = infτn

supP∈P

EPn

[∣∣∣∣1τM− 1τn(X )

∣∣∣∣p]

≤ supP∈P

EPn

[∣∣∣∣1τM− 1τ(X )

∣∣∣∣p]

︸ ︷︷ ︸the maximum risk of our estimator

55 / 65

Page 118: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Minimax rate is upper bounded by O(n−

2p3d−1

).

Theorem(Theorem 45)

infτn

supP∈P

EPn

[∣∣∣∣1τM− 1τn

∣∣∣∣p]

. n−2p

3d−1 .

56 / 65

Page 119: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Le Cam’s lemma provides a lower bound based on the reachdifference and the statistical difference of two distributions.

I Total variance distance between two distributions is defined as

TV (P,P ′) = supA∈B(RD )

|P(A)− P ′(A)| .

Lemma(Lemma 46) Let P,P ′ ∈ P with respective supports M and M ′. Then

infτn

supP∈P

EPn

[∣∣∣∣1τM− 1τn

∣∣∣∣p]

&

∣∣∣∣1τM− 1τM′

∣∣∣∣p

(1− TV (P,P ′))2n.

57 / 65

Page 120: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Two distributions P , P ′ are found so that their reaches differbut they are statistically difficult to distinguish.

I

infτn

supP∈P

EPn

[∣∣∣∣1τM− 1τn

∣∣∣∣p]

&

∣∣∣∣1τM− 1τM′

∣∣∣∣p

(1− TV (P,P ′))2n.

I The lower bound measures how hard it is to tell whether the data isfrom distributions with different reaches.

I P and P ′ are found so that∣∣∣ 1τM− 1

τM′

∣∣∣p

is large while

(1− TV (P,P ′))2n is small.

58 / 65

Page 121: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

P is a distribution supported on a sphere while P ′ is adistribution supported on a bumped sphere.

M ′

M

59 / 65

Page 122: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Mimimax rate is lower bounded by Ω(n−

pd

).

Proposition(Proposition 50)

infτn

supP∈P

EPn

[∣∣∣∣1τM− 1τn

∣∣∣∣p]

& n−pd .

60 / 65

Page 123: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Stability and Statistical Inference for Persistent Homology

Minimax Rates for Estimating the Dimension of a ManifoldRegularity conditionsUpper BoundLower BoundUpper Bound and Lower Bound for General Case

The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Reach and its GeometryReach estimator and its analysisMinimax Estimates

Statistical Inference for Cluster Trees

Reference

61 / 65

Page 124: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

We can use `∞ metric to measure a distance between trees.

DefinitionThe l∞ metric between trees are defined as

d∞(Tp,Tq) = sup |p(x)− q(x)| .

62 / 65

Page 125: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Pruning finds the simpler trees that are in the confidence set.

I We propose two pruning schemes to find trees that are simpler theempirical tree Tph and are in the fconfidence set.

I Pruning only leaves: remove all leaves of length less than 2tα.I Pruning leaves and internal branches: iteratively remove all branches

of cumulative length less than 2tα.

L1

L2

L3 L4

L5 L6

E1

E2

E3

E5

E4

63 / 65

Page 126: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

Stability and Statistical Inference for Persistent Homology

Minimax Rates for Estimating the Dimension of a ManifoldRegularity conditionsUpper BoundLower BoundUpper Bound and Lower Bound for General Case

The Origin of the Reach: Better Understanding Regularity ThroughMinimax Estimation Theory

Reach and its GeometryReach estimator and its analysisMinimax Estimates

Statistical Inference for Cluster Trees

Reference

64 / 65

Page 127: StatisticalInferenceforGeometricDatastat.cmu.edu/~jisuk/files/thesis_Jisu_KIM_slides.pdf · Otherpapers. I. KwanghoKim,JisuKim,BarnabásPóczos,DistributionRegression inSemi-supervisedLearning

ReferenceEddie Aamari, Jisu Kim, Frédéric Chazal, Bertrand Michel, AlessandroRinaldo, and Larry Wasserman. Estimating the Reach of a Manifold.ArXiv e-prints, May 2017.

O. Bobrowski, S. Mukherjee, and J. E. Taylor. Topological consistencyvia kernel estimation. ArXiv e-prints, July 2014.

Frédéric Chazal, Leonidas J Guibas, Steve Y Oudot, and Primoz Skraba.Scalar field analysis over point cloud data. Discrete & ComputationalGeometry, 46(4):743–775, 2011.

Brittany T. Fasy, Jisu Kim, Fabrizio Lecci, Clément Maria, David L.Millman, and Vincent Rouvreau. Introduction to the R package TDA.CoRR, abs/1411.1830, 2014a. URLhttp://arxiv.org/abs/1411.1830.

Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, LarryWasserman, Sivaraman Balakrishnan, and Aarti Singh. Confidence setsfor persistence diagrams. Ann. Statist., 42(6):2301–2339, 12 2014b.doi: 10.1214/14-AOS1252. URLhttp://dx.doi.org/10.1214/14-AOS1252.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements ofstatistical learning: data mining, inference and prediction. Springer, 2edition, 2009. ISBN 978-0-3878-4857-0. URLhttp://statweb.stanford.edu/~tibs/ElemStatLearn/.

Jisu Kim, Yen-Chi Chen, Sivaraman Balakrishnan, Alessandro Rinaldo,and Larry Wasserman. Statistical inference for cluster trees. In D. D.Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors,Advances in Neural Information Processing Systems 29, pages1839–1847. Curran Associates, Inc., 2016. URLhttp://papers.nips.cc/paper/6508-statistical-inference-for-cluster-trees.pdf.

Jisu Kim, Alessandro Rinaldo, and Larry Wasserman. Minimax Rates forEstimating the Dimension of a Manifold. ArXiv e-prints, May 2016.

65 / 65