Statistical techniques in topological data analysis
Andrew J. Blumberg ([email protected])
June 17th, 2014
Overview
Goal of talk
Give an overview of work on integrating statistical methodology with topological data analysis (persistent homology).
This integration is essential to using TDA for data analysis:
Provides ways to understand what the results of TDA mean.
Provides methodology for handling noise.
Provides approaches to taming computational burden of TDA.
Particularly salient when considering “big data” applications.
Recollections about TDA methodology
Pipeline
Finite metric space → filtered space → barcode
$(X, \partial_X) \mapsto \{Z_k\} \mapsto B$.
Question
What does the barcode mean?
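To make the pipeline concrete, here is a minimal sketch that goes from a finite metric space to a barcode in homological degree 0 only. It assumes points in Euclidean space and the Vietoris-Rips convention in which an edge enters the filtration at a scale equal to its length; the function name `h0_barcode` is hypothetical, and higher-degree barcodes would need a persistent homology library.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Degree-0 Vietoris-Rips barcode of a finite Euclidean point set.

    Every point is born at scale 0. A bar dies at the scale of the edge
    that first merges its connected component into another one.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, processed in order of increasing scale.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    bars = []
    for scale, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            bars.append((0.0, scale))
    if n:
        # One component survives at every scale.
        bars.append((0.0, inf))
    return bars


print(h0_barcode([(0, 0), (1, 0), (5, 0)]))
```

Long finite bars in this output correspond to well-separated clusters, which is one simple reading of what a degree-0 barcode means.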
Reflections on what those pictures tell us
Data set might not come from an object with obvious geometric structure.
Data might have meaningful features at varying scales.
Output of TDA might be more useful as a signature or descriptor than something to explicitly extract feature information from.
Even small amounts of maliciously placed corruption can change results substantially.
Refinement of TDA pipeline to integrate with statistics
Incorporate sampling: start with a metric measure space $(X, \partial_X, \mu_X)$. Example: Riemannian manifold.
Refined pipeline
mm-space → finite metric space → filtered complex → barcode
$(X, \partial_X, \mu_X) \mapsto (X', \partial'_X) \mapsto \{Z_k\} \mapsto B$
where $X' \subset X$.
This perspective leads to serious consideration of distributions of barcodes $\{B_k\}$, associated to blocks of samples $\{X'_k\}$.
Barcode space
Barcode space is a metric space:
Definition
Bottleneck distance between barcodes $B_1 = \{I_\alpha\}$ and $B_2 = \{J_\beta\}$ is
$d_B(B_1, B_2) = \inf_C \sup_{(I,J) \in C} d_\infty(I, J)$,
where $C$ varies over all matchings between $B_1$ and $B_2$.
Roughly speaking, $d_B(B_1, B_2) \le \varepsilon$ means that there is a matching in which the discrepancy between matched bars is $< \varepsilon$ and all unmatched bars have length $< \varepsilon$.
(There are also a variety of other metrics.)
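As an illustration of the definition, the following brute-force sketch computes a bottleneck distance for very small barcodes by trying every matching. It assumes one common convention, not fixed by the slide: bars are (birth, death) pairs, $d_\infty$ is the larger of the endpoint differences, and an unmatched bar costs half its length. The names `bar_cost` and `bottleneck` are hypothetical, and the factorial running time makes this suitable only for toy inputs.

```python
from itertools import permutations
from math import inf


def bar_cost(p, q):
    """Cost of pairing two bars; None stands for 'left unmatched'."""
    if p is None and q is None:
        return 0.0
    if p is None:
        return (q[1] - q[0]) / 2
    if q is None:
        return (p[1] - p[0]) / 2
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def bottleneck(b1, b2):
    """Bottleneck distance between two small barcodes with finite bars."""
    # Pad each side so that any bar may remain unmatched.
    left = list(b1) + [None] * len(b2)
    right = list(b2) + [None] * len(b1)
    best = inf
    for perm in permutations(range(len(right))):
        worst = max(
            (bar_cost(left[i], right[j]) for i, j in enumerate(perm)),
            default=0.0,
        )
        best = min(best, worst)
    return best


print(bottleneck([(0.0, 4.0), (1.0, 1.5)], [(0.0, 4.0)]))
```

In the usage line, the short bar (1.0, 1.5) is left unmatched, so under the assumed convention the distance is half its length.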
Properties of barcode space
What kind of metric space is barcode space?
Good news: Barcode space is amenable to probability theory. (Formally, a suitable subset is a complete metric space with a countable dense subset. [Mileyko, Mukherjee, Harer])
In particular, the standard metrics on distributions that metrize weak convergence are available.
Bad news: Barcode space is positively curved.
In particular, shortest paths between points are not unique, and so centroids are hard to define.
Summarizing distributions of barcodes
Question
How do you summarize a distribution of barcodes? (What are the moments?)
Two approaches:
1 Although centroids don’t make sense, can define the Fréchet mean. Unfortunately, it is very hard to compute, sometimes has counterintuitive properties, and is not stable. [Turner, Mileyko, Mukherjee, Harer]
2 Produce some kind of distribution on $\mathbb{R}$ (or at least something nicer than barcode space) and then use traditional summary statistics for that.
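For reference, the general definition of a Fréchet mean on a metric space, written for a probability measure $\mu$ on barcode space with metric $d$. The symbols $\mu$, $d$, and $F$ are notation assumed for this sketch; the cited work fixes a specific metric on barcodes.

```latex
F(B) = \int d(B, B')^2 \, d\mu(B'),
\qquad
\text{Fr\'echet means of } \mu \;=\; \operatorname*{arg\,min}_{B} F(B).
```

The minimizer need not be unique.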
Projections to distributions on more tractable spaces
A reasonable map from barcode space to either $\mathbb{R}$ or some kind of function space will allow us to turn distributions of barcodes into more tractable objects.
1 Many possible maps:
   Length of kth-longest bar
   Ratio of endpoints of kth-longest bar
   Difference between size of kth and (k+1)th bar, or a vector consisting of these values for all k.
2 “Persistence landscapes”: the idea is to define $\lambda_k(t)$ to be the largest half-width of an interval centered at $t$ that is contained in at least $k$ bars of the barcode. [Bubenik]
Taking a robust statistic (e.g., the median) of such a distribution immediately provides a robust statistic of the barcode distribution.
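Two of the maps above can be sketched in a few lines. This assumes bars are finite (birth, death) pairs; the function names `kth_longest_length` and `landscape` are hypothetical. The landscape value is computed as the k-th largest of the "tent" heights max(0, min(t - birth, death - t)), which is one standard way to evaluate it.

```python
from statistics import median


def kth_longest_length(barcode, k):
    """Length of the k-th longest bar (k starts at 1); 0.0 if there are fewer bars."""
    lengths = sorted((death - birth for birth, death in barcode), reverse=True)
    return lengths[k - 1] if k <= len(lengths) else 0.0


def landscape(barcode, k, t):
    """Persistence landscape value lambda_k(t).

    Each bar contributes the half-width of the largest interval centered
    at t that it contains; lambda_k(t) is the k-th largest contribution.
    """
    heights = sorted(
        (max(0.0, min(t - birth, death - t)) for birth, death in barcode),
        reverse=True,
    )
    return heights[k - 1] if k <= len(heights) else 0.0


# A toy "distribution" of three barcodes, summarized by a median.
barcodes = [
    [(0.0, 4.0), (1.0, 3.0)],
    [(0.0, 3.5), (1.0, 2.0)],
    [(0.0, 4.5)],
]
print(median(kth_longest_length(b, 1) for b in barcodes))
print(landscape(barcodes[0], 2, 2.0))
```

Because each barcode is reduced to a real number (or a function value at fixed t), ordinary summary statistics such as the median apply directly.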
Sampling at a fixed finite rate as an invariant
Claim
The distribution of barcodes produced by the collection of all samples of a fixed size n is a good basic invariant. [Blumberg, Gal, Mandell, Pancia]
1 Easy to approximate in practice: take the empirical approximation by repeated sampling, or use resampling if getting new samples is problematic.
2 Efficient: use samples that are as large as the computation will bear, very parallelizable.
3 Robust to noise: low-density noise doesn’t affect many samples, depending on balance of sampling rate and total number of points.
4 (Related to the last point: Stable to changes in the distribution, in a precise sense.)
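A toy version of this idea, under simplifying assumptions that are not from the talk: the data live on the real line, where the longest finite degree-0 bar of a sample is simply the largest gap between neighboring points (with edges entering the filtration at their length). The helper names and the synthetic data set are hypothetical. Two clusters are contaminated by a single outlier placed between them.

```python
import random
from statistics import median


def largest_gap(xs):
    """Longest finite H0 bar for points on a line: the largest neighbor gap."""
    s = sorted(xs)
    return max((b - a for a, b in zip(s, s[1:])), default=0.0)


def sampled_statistic(data, n, num_samples, statistic, seed=0):
    """Empirical distribution of `statistic` over random samples of size n."""
    rng = random.Random(seed)
    return [statistic(rng.sample(data, n)) for _ in range(num_samples)]


cluster_a = [i * 0.01 for i in range(100)]
cluster_b = [10 + i * 0.01 for i in range(100)]
data = cluster_a + cluster_b + [5.0]  # one outlier between the clusters

values = sampled_statistic(data, n=20, num_samples=200, statistic=largest_gap)
print("full data:", largest_gap(data))
print("median over samples:", median(values))
```

On the full data the outlier splits the gap between the clusters, while most samples of size 20 miss it, so the median over samples stays close to the true separation. Each sample is processed independently, which is what makes the computation easy to parallelize.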
Efficiency
Computing persistence barcodes for large data sets naively is impossible (combinatorial explosion).
Approximation methods are an active area of research, but many depend on a priori knowledge of feature scale.
Perspective that the distribution is the invariant allows us to assemble information from many small samples.
Easy to compute in parallel.
Necessary for potential applications to “big data”.
Hypothesis testing and confidence intervals
Can try to perform statistical inference either directly in barcode space or on distributions in $\mathbb{R}$.
In barcode space:
1 Hypothesis testing is difficult:
   Hard to specify null hypotheses.
   Don’t have analytic formulas for asymptotics of null hypothesis distributions. (Despite excellent work of many people on homology for reasonable classes of null hypotheses. [Adler, Bobrowski, Kahle, Meckes, Weinberger])
2 Specify hypotheses in terms of sampling procedure and then use Monte Carlo sampling.
3 Confidence intervals are easier: can be obtained for “underlying” persistent homology. [Fasy, Lecci, Rinaldo, Wasserman, Balakrishnan, Singh]
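Point 2 can be illustrated with a generic Monte Carlo test. This is a sketch under assumptions chosen for brevity, not the method of any cited paper: the null hypothesis is "20 points drawn uniformly from the unit interval", and the test statistic is the largest neighbor gap (the longest finite degree-0 bar for points on a line). All function names are hypothetical.

```python
import random


def largest_gap(xs):
    """Largest gap between neighboring points on the real line."""
    s = sorted(xs)
    return max((b - a for a, b in zip(s, s[1:])), default=0.0)


def monte_carlo_p_value(observed, simulate_null, statistic, num_sims=999, seed=0):
    """One-sided Monte Carlo p-value.

    Counts how often a data set simulated under the null has a statistic
    at least as large as the observed one; the +1 terms count the
    observed data set itself, so the p-value is never exactly 0.
    """
    rng = random.Random(seed)
    obs = statistic(observed)
    hits = sum(statistic(simulate_null(rng)) >= obs for _ in range(num_sims))
    return (1 + hits) / (1 + num_sims)


def uniform_null(rng, n=20):
    """Null sampling procedure: n independent uniform points in [0, 1)."""
    return [rng.random() for _ in range(n)]


two_clusters = [i * 0.005 for i in range(10)] + [0.95 + i * 0.005 for i in range(10)]
evenly_spaced = [i / 19 for i in range(20)]

p_clustered = monte_carlo_p_value(two_clusters, uniform_null, largest_gap)
p_even = monte_carlo_p_value(evenly_spaced, uniform_null, largest_gap)
print(p_clustered, p_even)
```

The null hypothesis is specified only through `simulate_null`, so no analytic formula for the null distribution of the statistic is needed.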
Hypothesis testing and confidence intervals
As above, can first project to distributions on $\mathbb{R}$ or $\mathbb{R}^n$ (or the like):
1 Similar issues, although easier to specify hypotheses.
2 Confidence intervals for various statistics, in some cases analytic.
Resampling methods (e.g., the bootstrap) are very relevant (and asymptotically consistent).
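As a minimal sketch of the resampling idea, the following computes a percentile bootstrap confidence interval for the median of a real-valued barcode summary. The percentile method is one simple choice among several bootstrap variants; the function name `bootstrap_ci` and the sample values are hypothetical.

```python
import random
from statistics import median


def bootstrap_ci(values, statistic=median, num_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic` of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Recompute the statistic on resamples drawn with replacement.
    stats = sorted(
        statistic(rng.choices(values, k=n)) for _ in range(num_resamples)
    )
    lo = stats[int((alpha / 2) * num_resamples)]
    hi = stats[min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))]
    return lo, hi


# Hypothetical values of a scalar summary (e.g. longest bar length)
# computed from eight independent samples, one of them corrupted.
summaries = [9.2, 9.8, 10.1, 10.4, 9.5, 5.1, 10.0, 9.9]
lo, hi = bootstrap_ci(summaries)
print(lo, hi)
```

Passing a different `statistic` (for example a trimmed mean) reuses the same resampling loop.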
Euler characteristic
Even though the field is very new, I’ve left out a lot of good stuff.One topic to mention in passing:
Observation
The Euler characteristic χ of a simplicial complex is both much more robust and much easier to compute than homology.
Theoretical foundations are more tractable here: Euler characteristics of Gaussian random fields [Adler-Taylor].
Weaker by itself, but as an ensemble: Euler characteristic transform for slices through 3D solids [Turner, Mukherjee].
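To illustrate why χ is easy to compute: for a finite simplicial complex it is the alternating sum of the number of simplices in each dimension, so no linear algebra is needed. The sketch below assumes the complex is given by its maximal simplices as tuples of vertex labels; the function name and the example complexes are illustrative, not from the talk.

```python
from itertools import combinations


def euler_characteristic(maximal_simplices):
    """Alternating sum of simplex counts: chi = sum_k (-1)^k * (#k-simplices)."""
    faces = set()
    for simplex in maximal_simplices:
        vertices = tuple(sorted(simplex))
        # Every nonempty subset of a simplex's vertices is a face of the complex.
        for size in range(1, len(vertices) + 1):
            faces.update(combinations(vertices, size))
    # A face with `size` vertices has dimension size - 1.
    return sum((-1) ** (len(face) - 1) for face in faces)


circle = [(0, 1), (1, 2), (0, 2)]                      # hollow triangle
disk = [(0, 1, 2)]                                     # filled triangle
sphere = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]  # boundary of a tetrahedron

print(euler_characteristic(circle))  # 3 vertices - 3 edges = 0
print(euler_characteristic(disk))    # 3 - 3 + 1 = 1
print(euler_characteristic(sphere))  # 4 - 6 + 4 = 2
```

Enumerating all faces is exponential in the dimension of the largest simplex, so this form is only suitable for low-dimensional complexes.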
Questions for the future:
Outstanding issue
So far, limited experience using these tools on very large data sets.
(Except “Mapper”, which I’ll talk about this afternoon.)
1 What about guided sampling?
2 How can we best handle data sets that have “wild” geometric structure (or none at all)?
3 More and better “topological” summaries and hypothesis specification tools.