
Pruning Nearest Neighbor Cluster Trees

Samory Kpotufe
Max Planck Institute for Intelligent Systems
Tuebingen, Germany

Joint work with Ulrike von Luxburg


We’ll discuss:

• An interesting notion of “clusters” (Hartigan 1982): clusters are regions of high density of the data distribution µ.

• The richness of k-NN graphs Gn: subgraphs of Gn encode the underlying cluster structure of µ.

• How to identify false cluster structures: a simple pruning procedure with strong guarantees (a first).


General motivation

More understanding of clustering:

• Density yields an intuitive (and clean) notion of clusters.

• Clusters can take any shape =⇒ this reveals the complexity of clustering?

• Popular approaches (e.g. DBSCAN, single linkage) are density-based methods.

More understanding of k-NN graphs:

These appear everywhere, in various forms!


Outline

• Density-based clustering
• Richness of k-NN graphs
• Guaranteed removal of false clusters


Density-based clustering

Given: data from some unknown distribution.
Goal: discover “true” high-density regions.

Resolution matters!


Density-based clustering

Clusters at level λ are G(λ) ≡ the connected components (CCs) of the level set Lλ := {x : f(x) ≥ λ}.


Density-based clustering

The cluster tree of f is the infinite hierarchy {G(λ)}λ≥0.


Formal estimation problem:

Given: n i.i.d. samples X = {xi}, i ∈ [n], from a distribution with density f.

Clustering output: a hierarchy {Gn(λ)}λ≥0 of subsets of X.

We at least want consistency, i.e. for any λ > 0,

P(disjoint A, A′ ∈ G(λ) end up in disjoint empirical clusters) → 1.


A good procedure should satisfy:

Consistency: every level should be recovered for sufficiently large n.

Finite-sample behavior:

• Fast discovery of real clusters.

• No false clusters!

Page 28: Pruning Nearest Neighbor Cluster Treesskk2175/Papers/ctpPresentationLong.pdf · Pruning Nearest Neighbor Cluster Trees Samory Kpotufe Max Planck Institute for Intelligent Systems

, , , , , , , , , ,

A good procedure should satisfy:

Consistency!

Every level should be recovered for sufficiently large n.

Finite sample behavior:

• Fast discovery of real clusters.

• “No false clusters !!!”

Page 29: Pruning Nearest Neighbor Cluster Treesskk2175/Papers/ctpPresentationLong.pdf · Pruning Nearest Neighbor Cluster Trees Samory Kpotufe Max Planck Institute for Intelligent Systems

, , , , , , , , , ,

The earlier example was sampled from a bimodal mixture of Gaussians!

My visual procedure yields false clusters at low resolution.


What we’ll show:

k-NN graph guarantees:

• Finite sample: salient clusters are recovered as subgraphs.

• Consistency: all clusters are eventually recovered.

Generic pruning guarantees:

• Finite sample: no false clusters + salient clusters remain.

• Consistency: the pruned tree remains a consistent estimator.


What was known:

People you might look up:

Wasserman, Tsybakov, Wishart, Rinaldo, Nugent, Stuetzle, Rigollet, Wong, Lane, Dasgupta, Chaudhuri, Maier, von Luxburg, Steinwart ...


What was known:

Consistency:

• (fn → f) =⇒ (cluster tree of fn → cluster tree of f). But no known practical estimators.

• Various practical estimators of a single level set exist. Can these be extended to all levels at once?

• Recent: the first consistent practical estimator (Chaudhuri and Dasgupta), a generalization of single linkage (due to Wishart).


What was known:

The empirical tree contains good clusters ... but which ones?

We need pruning guarantees!


What was known:

Pruning consisted of removing small clusters.
Problem: not all false clusters are “small”!


Outline

• Ground truth: density-based clustering
• Richness of k-NN graphs
• Guaranteed removal of false clusters


Richness of k-NN graphs

k-NN density estimate: fn(x) := k / (n · vol(Bk,n(x))), where Bk,n(x) is the ball around x reaching its k-th nearest neighbor.

Procedure: remove Xi from Gn in increasing order of fn(Xi).

Level λ of the tree: Gn(λ) ≡ the subgraph on the Xi with fn(Xi) ≥ λ.
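The procedure above can be sketched directly in one dimension (my own illustrative code with hypothetical data, not the paper's implementation): estimate fn at every sample, keep the points with fn ≥ λ, and count connected components of the k-NN edges among the kept points.

```python
# One level of the k-NN cluster tree in 1-D (illustrative sketch).
# fn(x) = k / (n * vol(B_{k,n}(x))); in 1-D the ball volume is
# 2 * (distance to the k-th nearest neighbor).

def knn_cluster_level(X, k, lam):
    """Number of CCs of the k-NN graph on {X_i : fn(X_i) >= lam}."""
    n = len(X)
    fn, nbrs = [], []
    for x in X:
        order = sorted(range(n), key=lambda j: abs(x - X[j]))
        r = abs(x - X[order[k]])      # k-NN distance (order[0] is x itself)
        fn.append(k / (n * 2 * r))    # 1-D ball volume = 2r
        nbrs.append(order[1:k + 1])   # indices of the k nearest neighbors
    kept = {i for i in range(n) if fn[i] >= lam}
    # Union-find over k-NN edges restricted to the kept points.
    parent = {i: i for i in kept}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in kept:
        for j in nbrs[i]:
            if j in kept:
                parent[find(i)] = find(j)
    return len({find(i) for i in kept})

# Two well-separated clumps: two components at a low level.
X = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
print(knn_cluster_level(X, 2, 0.1))  # 2
```

Raising λ peels off low-density points first, exactly the removal order the slide describes.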


[Figure: sample from a two-mode mixture of Gaussians]


Theorem I:

Let log n ≲ k ≲ n^(1/O(d)).

[Figure: clusters A and A′ of the level set at λ, separated by a valley S with density dip ≳ 1/√k and width ≳ (k/(nλ))^(1/d).]

All such A ∩ X and A′ ∩ X belong to disjoint CCs of Gn(λ − O(1/√k)).

Assumptions: f(x) ≤ F, and ∀x, x′: |f(x) − f(x′)| ≤ L‖x − x′‖^α.


Note on the key quantities:

• 1/√k ≳ (density-estimation error at the samples Xi).

• (k/(nλ))^(1/d) ≳ (k-NN distances of the Xi in Lλ).

Consistency: both quantities → 0, so eventually An ∩ A′n = ∅.


Main technicality: showing that A ∩ X remains connected in Gn(λ − O(1/√k)).

Cover a high-density path with balls {Bt}:

• The Bt’s have to be large, so that they contain points.

• The Bt’s have to be small, so that their points are connected.

So let each Bt have mass about k/n!


Outline

• Ground truth: density-based clustering
• Richness of k-NN graphs
• Guaranteed removal of false clusters


Guaranteed removal of false clusters

[Figure: sample from a two-mode mixture of Gaussians]


What are false clusters?

Intuitively: An and A′n in X should be in one (empirical) cluster if they are in the same (true) cluster at every level containing An ∪ A′n.


Pruning intuition: key connecting points are missing!

[Figure: sample from a two-mode mixture of Gaussians]

Pruning: connect Gn(0). Re-connect An, A′n in Gn(λn) if they are connected in Gn(λn − ε̃).

How do we set ε̃?
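On a 1-D grid surrogate for Gn (made-up fn values, my own sketch rather than the talk's data), the re-connection rule reads: compute clusters at level λ and at level λ − ε̃, and merge any clusters of Gn(λ) that fall in a common component of Gn(λ − ε̃).

```python
# Pruning rule on a 1-D grid "graph" (illustrative): clusters at level
# lam are maximal runs of cells with fn >= lam; two clusters of Gn(lam)
# are re-merged whenever they lie in one component of Gn(lam - eps).
# The fn values are made up: a shallow dip at index 4 creates a false
# split that pruning removes.

def components(fn, lam):
    comps, cur = [], []
    for i, v in enumerate(fn):
        if v >= lam:
            cur.append(i)
        elif cur:
            comps.append(cur)
            cur = []
    if cur:
        comps.append(cur)
    return comps

def pruned_components(fn, lam, eps):
    fine = components(fn, lam)          # clusters of Gn(lam)
    coarse = components(fn, lam - eps)  # clusters of Gn(lam - eps)
    merged = []
    for C in coarse:
        # all fine clusters inside one coarse component get re-merged
        group = [i for comp in fine if comp[0] in C for i in comp]
        if group:
            merged.append(group)
    return merged

fn = [0.2, 0.7, 0.75, 0.7, 0.62, 0.7, 0.75, 0.7, 0.2]  # shallow dip

print(len(components(fn, 0.65)))              # 2: false split at the dip
print(len(pruned_components(fn, 0.65, 0.2)))  # 1: merged after pruning
```

A dip shallower than ε̃ (here 0.62 vs. threshold 0.65 with ε̃ = 0.2) is treated as estimation noise and merged; a genuine valley deeper than ε̃ would keep the clusters apart.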


Theorem II:

Suppose ε̃ ≳ 1/√k. Then:

• An and A′n belong to disjoint A and A′ in some G(λ).

• A ∩ X and A′ ∩ X belong to disjoint An and A′n of Gn(λ − O(1/√k)).

• (ε̃, k, n)-salient modes map one-to-one to leaves of the empirical tree.


Consistency even after pruning: we just require ε̃ → 0 as n → ∞.


Some last technical points:

[Chaudhuri and Dasgupta 2010] seem to be the first to allow any cluster shape, up to mild requirements on the envelopes of clusters.

We allow any cluster shape, up to smoothness of f, and can explicitly relate empirical clusters to true clusters!


We have thus discussed:

• Density-based clustering (Hartigan 1982).

• The richness of k-NN graphs Gn: subgraphs of Gn consistently recover the cluster tree of µ.

• Guaranteed pruning of false clusters, while discovering salient clusters and maintaining consistency!


Thank you!