Analysis of hierarchical metric-tree indexing schemes

for similarity search in high-dimensional datasets

Vladimir Pestov

[email protected]

http://aix1.uottawa.ca/~vpest283

Department of Mathematics and Statistics

University of Ottawa

Metric tree indexing schemes for similarity search Vladimir Pestov, University of Ottawa AofA 2008, Maresias, SP, Brazil – p.1/25

General setting

Workload: W = (Ω, X, 𝒬), where:

Ω is the domain,

X ⊂ Ω is a finite subset (dataset, or instance), and

𝒬 ⊆ 2^Ω is the set of queries.

Answering a query Q ∈ 𝒬 is listing all x ∈ X ∩ Q.

A (dis)similarity measure s : Ω × Ω → ℝ, e.g. a metric, or a pseudometric.

A range similarity query centred at ω ∈ Ω:

Q = {x ∈ Ω : s(ω, x) < ε}
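As a minimal sketch of the definitions above (not code from the talk), a range query can be answered by a linear scan that lists every x ∈ X with s(ω, x) < ε. The names `range_query` and `hamming` and the toy bit-string dataset are illustrative assumptions.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def range_query(dataset: Iterable[T], s: Callable[[T, T], float],
                omega: T, eps: float) -> List[T]:
    """List all x in the dataset X with s(omega, x) < eps, i.e. X ∩ Q."""
    return [x for x in dataset if s(omega, x) < eps]


# Toy workload (assumed): domain = bit strings of length 4, X = a small subset.
X = ["0000", "0011", "0111", "1111"]
print(range_query(X, hamming, "0001", 2))  # ['0000', '0011']
```

The scan costs one evaluation of s per dataset point, which is the baseline any index has to beat.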

Similarity workloads

[Figure: a ball of radius ε inside the domain Ω.]

W = (Ω, d, X, {B_ε(x)})

k-nearest neighbours (k-NN) query centred at x∗ ∈ Ω, where k ∈ ℕ.
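A k-NN query can likewise be sketched as a linear scan that keeps the k smallest distances. This is only an illustration: `knn_query`, the tie-breaking rule and the real-line example are assumptions, not part of the talk.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def knn_query(dataset: Sequence[T], d: Callable[[T, T], float],
              center: T, k: int) -> List[T]:
    """Return k points of the dataset closest to `center` under d.

    Ties are broken by position in the dataset (an arbitrary choice).
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    ranked = heapq.nsmallest(
        k, enumerate(dataset), key=lambda item: (d(center, item[1]), item[0])
    )
    return [x for _, x in ranked]


def abs_diff(a: float, b: float) -> float:
    """Usual metric on the real line."""
    return abs(a - b)


points = [0.0, 2.0, 5.0, 9.0]
print(knn_query(points, abs_diff, 4.0, 2))  # [5.0, 2.0]
```

The distance is always evaluated as d(center, x), which matters once d is not symmetric.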

Example

Ω = strings of length m = 10 from the alphabet Σ of 20 standard amino acids: Ω = Σ^10.

X = all peptide fragments of length 10 in the SwissProt database (as of 19-Oct-2002). |X| = 23,817,598.

Similarity measure given by the most common scoring matrix in sequence comparison, BLOSUM62, by s(a, b) = ∑_{i=1}^{m} s(a_i, b_i) (the ungapped score).

Converted into quasi-metric d(a, b) = s(a, a) − s(a, b), generating the same set of queries (range and k-NN).

(joint with A. Stojmirovic)
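The conversion from score to quasi-metric can be illustrated with a small sketch. The scoring table below is a made-up three-letter toy, not BLOSUM62, and the function names are assumptions. For a fixed query a the term s(a, a) is constant, so ordering by increasing d(a, ·) is the same as ordering by decreasing s(a, ·), which is why range and k-NN queries are preserved.

```python
from typing import Dict, Tuple

# Hypothetical toy scoring matrix on a 3-letter alphabet (NOT BLOSUM62).
TOY_SCORES: Dict[Tuple[str, str], int] = {
    ("A", "A"): 4, ("C", "C"): 9, ("D", "D"): 6,
    ("A", "C"): 0, ("A", "D"): -2, ("C", "D"): -3,
}


def letter_score(a: str, b: str) -> int:
    """Symmetric lookup of the per-letter score s(a_i, b_i)."""
    return TOY_SCORES[(a, b)] if (a, b) in TOY_SCORES else TOY_SCORES[(b, a)]


def ungapped_score(a: str, b: str) -> int:
    """s(a, b) = sum over positions i of s(a_i, b_i)."""
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    return sum(letter_score(x, y) for x, y in zip(a, b))


def quasi_metric(a: str, b: str) -> int:
    """d(a, b) = s(a, a) - s(a, b); not symmetric in general."""
    return ungapped_score(a, a) - ungapped_score(a, b)


print(quasi_metric("AC", "AD"))  # 13 - 1 = 12
print(quasi_metric("AD", "AC"))  # 10 - 1 = 9
```

The two printed values differ because the self-scores s(a, a) differ, which is exactly the asymmetry that makes d a quasi-metric rather than a metric.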

Inner vs outer

Inner workload if X = Ω,

Outer workload if |X| ≪ |Ω|.

Fragment example: outer,

|X|/|Ω| = 23,817,598/20^10 ≈ 0.0000023

Most points ω ∈ Ω have NN x ∈ X within ε = 25 (high biological relevance).

[Figure: µ(N_ε(X)) (vertical axis, 0 to 1) against DISTANCE (horizontal axis, 0 to 35), with one curve for the quasi-metric and one for the metric.]
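The ratio quoted for the fragment example can be checked with exact integer arithmetic. This is only a worked calculation of the figures given above; the variable names are illustrative.

```python
# Size of the domain: all strings of length 10 over a 20-letter alphabet.
domain_size = 20 ** 10          # 10_240_000_000_000
dataset_size = 23_817_598       # |X| as quoted for the SwissProt fragments

ratio = dataset_size / domain_size
print(f"{ratio:.7f}")           # 0.0000023
```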

Hierarchical tree index structures

A sequence of refining partitions of the domain:

Space O(n).

To process a range query B_ε(ω), we traverse the tree all the way down to the leaf level.

What happens in each node?

Pruning

• If B_ε(ω) ∩ B = ∅, the sub-tree descending from the node B can be pruned:

[Figure: two panels showing regions A and B, their ε-neighbourhoods, and a query point ω.]

that is, if it can be certified that ω ∉ B_ε = {x ∈ Ω : d(x, B) < ε}.

• Otherwise the search branches out.

How to “certify” that B_ε(ω) ∩ B = ∅?

Decision functions

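Let f : Ω → ℝ be a 1-Lipschitz function,

|f(x) − f(y)| ≤ d(x, y) ∀x, y ∈ Ω,

such that f|_B ≤ 0. Then f|_{B_ε} < ε,

[Figure: the function f maps Ω to the real line, sending B to values ≤ 0; the points x, y and the values 0, f(x), ε are marked.]

that is, f(ω) ≥ ε is a certificate that B_ε(ω) ∩ B = ∅.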
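The reasoning behind the certificate is short: if x lies within ε of some b ∈ B, then f(x) ≤ f(b) + d(x, b) < ε, so a value f(ω) ≥ ε rules out any point of B within ε of ω. The sketch below illustrates this with one assumed choice of decision function, f(x) = d(x, p) − r for a ball B of radius r around a pivot p. This choice, the Hamming metric and all names are illustrative assumptions.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Hamming distance between equal-length strings (a metric)."""
    return sum(ca != cb for ca, cb in zip(a, b))


def decision_value(omega: str, pivot: str, radius: int) -> int:
    """f(omega) = d(omega, pivot) - radius.

    f is 1-Lipschitz by the triangle inequality, and f <= 0 on the
    ball B = {x : d(x, pivot) <= radius}.
    """
    return hamming(omega, pivot) - radius


def can_prune(omega: str, eps: int, pivot: str, radius: int) -> bool:
    """True when f(omega) >= eps, certifying that B_eps(omega) misses B."""
    return decision_value(omega, pivot, radius) >= eps


# Assumed toy node: B is the ball of radius 1 around "00000".
pivot, radius = "00000", 1
ball: List[str] = ["00000", "10000", "00010"]

omega, eps = "11110", 2
print(can_prune(omega, eps, pivot, radius))         # True: f(omega) = 4 - 1 = 3
print(all(hamming(omega, b) >= eps for b in ball))  # True: no point of B within eps
```

Only one distance computation, d(ω, p), is needed to decide whether the whole node can be skipped.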

Metric trees

A metric tree for a metric similarity workload (Ω, ρ, X):

a binary rooted tree T,

a collection of partially defined 1-Lipschitz functions f_t : B_t → ℝ for every inner node t (decision functions),

a collection of bins B_t ⊆ Ω for every leaf node t, containing pointers to elements X ∩ B_t,

such that

B_root(T) = Ω,

∀ inner node t and child nodes t−, t+, B_t ⊆ B_{t−} ∪ B_{t+}.

When processing a range query B_ε(ω),

t− [ t+ ] is accessed ⇐⇒ f_t(ω) < ε [resp. f_t(ω) > −ε].
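The access rule above can be sketched as a small range-search routine. This is a minimal illustration under assumptions that are not in the talk: decision functions of the form f_t(x) = d(x, pivot) − radius (a vantage-point style split), a median splitting rule, and the names `build`, `range_search`, `Leaf` and `Inner`.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Union

Point = Sequence[float]
Metric = Callable[[Point, Point], float]


@dataclass
class Leaf:
    bin: List[Point]            # the points of X stored in this bin


@dataclass
class Inner:
    pivot: Point                # f_t(x) = d(x, pivot) - radius
    radius: float
    minus: "Node"               # t-: points with f_t <= 0
    plus: "Node"                # t+: points with f_t > 0


Node = Union[Leaf, Inner]


def build(points: List[Point], d: Metric, leaf_size: int = 4) -> Node:
    """Split recursively at the median distance to a pivot (an assumed rule)."""
    if len(points) <= leaf_size:
        return Leaf(points)
    pivot = points[0]
    dists = sorted(d(x, pivot) for x in points)
    radius = dists[len(dists) // 2]
    minus = [x for x in points if d(x, pivot) <= radius]
    plus = [x for x in points if d(x, pivot) > radius]
    if not plus:                # nothing lies beyond the radius: stop splitting
        return Leaf(points)
    return Inner(pivot, radius, build(minus, d, leaf_size), build(plus, d, leaf_size))


def range_search(node: Node, d: Metric, omega: Point, eps: float) -> List[Point]:
    """Collect all stored x with d(omega, x) < eps."""
    if isinstance(node, Leaf):
        return [x for x in node.bin if d(omega, x) < eps]
    f = d(omega, node.pivot) - node.radius
    found: List[Point] = []
    if f < eps:                 # t- is accessed  <=>  f_t(omega) < eps
        found += range_search(node.minus, d, omega, eps)
    if f > -eps:                # t+ is accessed  <=>  f_t(omega) > -eps
        found += range_search(node.plus, d, omega, eps)
    return found


def euclid(a: Point, b: Point) -> float:
    """Euclidean distance, used here only as an example metric."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5


rng = random.Random(0)
X = [(rng.random(), rng.random()) for _ in range(200)]
tree = build(X, euclid)
omega, eps = (0.5, 0.5), 0.1
hits = range_search(tree, euclid, omega, eps)
scan = [x for x in X if euclid(omega, x) < eps]
print(sorted(hits) == sorted(scan))  # True
```

When |f_t(ω)| < ε both children are visited, which is the branching that the concentration argument later shows to be typical in high dimension.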

What happens in practice?

The best indexing schemes for exact similarity search in high-dimensional outer datasets are often (not always!) outperformed by linear scan.

∗ ∗ ∗

The emphasis has shifted towards approximate similarity search:

given ε > 0 and ω ∈ Ω, return a point that is [with high probability] at a distance < (1 + ε) d_NN(ω) from ω.
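The approximate criterion can be made concrete with a small checker that compares a candidate against the exact nearest-neighbour distance d_NN(ω), found here by brute force. The function name and the real-line example are assumptions for illustration only.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def is_approx_nn(candidate: T, dataset: Sequence[T],
                 d: Callable[[T, T], float], omega: T, eps: float) -> bool:
    """Check d(omega, candidate) < (1 + eps) * d_NN(omega) by linear scan.

    d_NN(omega) is the distance from omega to its exact nearest neighbour.
    """
    d_nn = min(d(omega, x) for x in dataset)
    return d(omega, candidate) < (1 + eps) * d_nn


def abs_diff(a: float, b: float) -> float:
    """Usual metric on the real line."""
    return abs(a - b)


X = [1.0, 2.0, 10.0]
omega = 0.0                     # exact NN is 1.0, so d_NN(omega) = 1.0
print(is_approx_nn(2.0, X, abs_diff, omega, eps=1.5))  # True:  2.0 < 2.5
print(is_approx_nn(2.0, X, abs_diff, omega, eps=0.5))  # False: 2.0 >= 1.5
```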

The curse of dimensionality conjecture

Conjecture. Let X ⊆ {0, 1}^d be a dataset with n points, where the Hamming cube is equipped with the Hamming (ℓ1) distance:

d(x, y) = ♯{i : x_i ≠ y_i}.

Suppose d = n^{o(1)}, but d = ω(log n). Any data structure for exact nearest neighbour search in X, with d^{O(1)} query time, must use n^{ω(1)} space.

∗ ∗ ∗

The cell probe model: Ω(d/ log n) lower bound (Barkol–Rabani, 2000).

Concentration of measure

The phenomenon of concentration of measure on high-dimensional structures (“Geometric LLN”):

for a typical “high-dimensional” structure Ω, if A is a subset containing at least half of all points, then the measure of the ε-neighbourhood A_ε of A is overwhelmingly close to 1 already for small ε > 0.

[Figure: a subset A ⊆ Ω containing at least half of all points, with its ε-neighbourhood A_ε; α(Ω, ε) bounds from above the measure µ(Ω \ A_ε).]

Concentration function

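Let Ω = (Ω, d, µ) be a metric space with measure. The concentration function of Ω:

\[
\alpha(\varepsilon) =
\begin{cases}
\frac{1}{2}, & \text{if } \varepsilon = 0,\\[2pt]
1 - \min\left\{ \mu_\sharp(A_\varepsilon) : A \subseteq \Omega,\ \mu_\sharp(A) \ge \frac{1}{2} \right\}, & \text{if } \varepsilon > 0.
\end{cases}
\]

For Ω = Σ^n, the Hamming cube (normalized distance + unif. measure):

\[
\alpha_{\Sigma^n}(\varepsilon) \le e^{-2\varepsilon^2 n}.
\]

Gaussian estimates are typical (Euclidean spheres S^n, cubes I^n, ...)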

Example: the Hamming cube

[Figure: concentration function α(Σ^101, ε) versus Chernoff bound, n = 101. Two curves (concentration function, Chernoff bound); horizontal axis from 0 to 0.2, vertical axis from 0 to 1.]
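The comparison in this figure can be reproduced numerically. The sketch below is not from the talk. It assumes closed neighbourhoods A_ε = {x : d(x, A) ≤ ε} and relies on Harper's isoperimetric theorem, by which Hamming balls have the smallest neighbourhoods among sets of equal size. Function names are illustrative.

```python
from math import comb, exp

def hamming_concentration(n: int, k: int) -> float:
    """Concentration function of {0,1}^n at eps = k/n (normalized Hamming
    distance, uniform measure), for odd n.

    Assumption: by Harper's theorem the extremal set of measure 1/2 is the
    Hamming ball of radius (n-1)/2, whose closed eps-neighbourhood is the
    ball of radius (n-1)/2 + k.
    """
    if n % 2 == 0 or k < 0:
        raise ValueError("n must be odd and k non-negative")
    radius = (n - 1) // 2 + k
    # Count the points lying outside the enlarged ball.
    outside = sum(comb(n, j) for j in range(radius + 1, n + 1))
    return outside / 2**n

def chernoff_bound(n: int, eps: float) -> float:
    """The Gaussian-type upper bound exp(-2 eps^2 n) quoted in the talk."""
    return exp(-2.0 * eps * eps * n)

if __name__ == "__main__":
    n = 101
    for k in range(0, 21, 5):
        eps = k / n
        print(f"eps={eps:.3f}  alpha={hamming_concentration(n, k):.5f}  "
              f"bound={chernoff_bound(n, eps):.5f}")
```

At every ε = k/n the exact value stays below the bound, and the gap widens as ε grows.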

Effects of concentration on branching

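[Figure: a node C partitioned into A and B, with ε-neighbourhoods A_ε and B_ε, a query point ω, and two regions each labelled "< α(C, ε)".]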

For all query points ω ∈ C except a set of measure ≤ 2α(C, ε), the search algorithm branches out at the node C.
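A rough Monte Carlo check of this claim on the Hamming cube follows. The sketch assumes a node split by the 1-Lipschitz function f(ω) = normalized Hamming distance to a pivot, thresholded at its median, and counts a query as branching when |f(ω) − median| ≤ ε, so that neither child can be discarded. The pivot-based split, the parameters and all names are illustrative assumptions, not taken from the talk.

```python
import random
from statistics import median

def normalized_hamming(x: int, y: int, d: int) -> float:
    """Normalized Hamming distance between two d-bit strings stored as ints."""
    return bin(x ^ y).count("1") / d

def branching_fraction(d: int, eps: float, n_data: int = 2000,
                       n_queries: int = 2000, seed: int = 0) -> float:
    """Fraction of random queries for which a range search of radius eps
    must descend into both children of a median split."""
    rng = random.Random(seed)
    pivot = rng.getrandbits(d)
    data = [rng.getrandbits(d) for _ in range(n_data)]
    # Split threshold: median of the decision function over the data.
    threshold = median(normalized_hamming(pivot, x, d) for x in data)
    branching = 0
    for _ in range(n_queries):
        q = rng.getrandbits(d)
        # f is 1-Lipschitz, so a child can be pruned only if
        # |f(q) - threshold| > eps.
        if abs(normalized_hamming(pivot, q, d) - threshold) <= eps:
            branching += 1
    return branching / n_queries
```

With d = 400 and ε = 0.1 almost every query should branch, in line with the Gaussian estimate 2 exp(−2ε²d) for the exceptional set on the full cube.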

Search radius

ε_NN(ω) is a 1-Lipschitz function, so it concentrates near the median value, ε_M;

ε_M → E_{µ⊗µ} d(x, y) = O(1).

Example: 1000 pts ∼ [0, 1]^10, the ℓ2-ε_NN:

ε_M = 0.69419, E d(x, y) = 1.2765.
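An experiment of this kind can be sketched as follows. The sampling protocol is an assumption: queries are drawn fresh from the same uniform distribution, and the mean distance is estimated from random pairs of datapoints. The exact figures depend on the sample and on how ε_NN is measured, so the output need not match the numbers quoted above.

```python
import math
import random
from statistics import mean, median

def nn_distance_stats(n_points=1000, dim=10, n_queries=200,
                      n_pairs=2000, seed=0):
    """Return (median NN distance, mean pairwise distance) for uniform
    points in the unit cube [0, 1]^dim with the Euclidean (l2) metric."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    # Nearest-neighbour distance from fresh query points to the dataset.
    nn = []
    for _ in range(n_queries):
        q = [rng.random() for _ in range(dim)]
        nn.append(min(math.dist(q, x) for x in data))
    # Monte Carlo estimate of E d(x, y) over random pairs of distinct datapoints.
    pairs = [math.dist(*rng.sample(data, 2)) for _ in range(n_pairs)]
    return median(nn), mean(pairs)

if __name__ == "__main__":
    eps_m, mean_d = nn_distance_stats()
    print(f"median NN distance ~ {eps_m:.4f}, mean distance ~ {mean_d:.4f}")
```

The point of the comparison is that the median nearest-neighbour distance is of the same order as the typical distance between two points.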

A naive average O(n) lower bound

Suppose datapoints are distributed according to µ ∈ P(Ω)...

...as well as query points.

A balanced metric tree of depth O(log n), with O(n) bins of roughly equal size (µ-measure).

In 1/2 of the cases, ε_NN ≥ ε_M = O(1), the median NN distance.

For every element A of the level t partition,

$$
\alpha(A, \varepsilon_M) \le 2\mu(A)^{-1}\,\alpha(\Omega, \varepsilon_M/2) = O(2^t)\, e^{-O(1)\varepsilon_M^2 d}.
$$

Branching at every node occurs for all ω except a set of measure at most

$$
\sharp(\text{nodes}) \times 2 \sup_A \alpha(A, \varepsilon) = O(n^2)\, e^{-O(1) d} = o(1),
$$

because d = ω(log n), so e^{−O(1)d} decays superpolynomially in n.

What’s wrong?

A dataset X is modeled by a sequence of i.i.d. r.v. X_i ∼ µ.

Implicit assumption: empirical measure µ_n(A) = |X ∩ A|/n ≈ µ(A).

But the scheme is chosen after seeing an instance X!

How much can be said of concentration in (Ω, µ_n)?

VC dimension

Let 𝒜 be a family of subsets of Ω (a concept class). B ⊆ Ω is shattered by 𝒜 if for each C ⊆ B there is A ∈ 𝒜 such that A ∩ B = C.

[Figure: Ω containing a set B, a subset C ⊆ B, and a set A ∈ 𝒜 with A ∩ B = C.]

The Vapnik–Chervonenkis dimension VC-dim(𝒜) of 𝒜 is the largest cardinality of a set B ⊆ Ω shattered by 𝒜.

Statistical learning bounds

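Let 𝒜 ⊆ 2^Ω be a concept class of finite VC dimension, d. Then for all ε, δ > 0 and every probability measure µ on Ω, if n datapoints in X are drawn randomly and independently according to µ, then with confidence 1 − δ

$$
\forall A \in \mathcal{A}, \quad \left| \mu(A) - \frac{|X \cap A|}{n} \right| < \varepsilon,
$$

provided n is large enough:

$$
n \ge \frac{128}{\varepsilon^2}\left( d \log\left( \frac{2e^2}{\varepsilon} \log\frac{2e}{\varepsilon} \right) + \log\frac{8}{\delta} \right).
$$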
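To get a feeling for the size of this bound, it can be evaluated directly. The helper below is a sketch with a hypothetical name, and it assumes that log denotes the natural logarithm.

```python
import math

def vc_sample_size(d: int, eps: float, delta: float) -> int:
    """Smallest integer n satisfying the sufficient condition
    n >= (128 / eps^2) * (d * log((2 e^2 / eps) * log(2 e / eps)) + log(8 / delta)).

    Assumption: log is the natural logarithm.
    """
    if not (0 < eps < 1 and 0 < delta < 1 and d >= 1):
        raise ValueError("need 0 < eps, delta < 1 and d >= 1")
    inner = (2 * math.e ** 2 / eps) * math.log(2 * math.e / eps)
    bound = (128 / eps ** 2) * (d * math.log(inner) + math.log(8 / delta))
    return math.ceil(bound)

if __name__ == "__main__":
    print(vc_sample_size(d=10, eps=0.1, delta=0.05))
```

The required sample size grows linearly in the VC dimension d and roughly like 1/ε² in the accuracy, which is why the bound is only useful when d is small compared with n.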

Bin access lemma

Let δ > 0, and let γ be a collection of subsets A ⊆ Ω of measure µ(A) ≤ α(δ) ≤ 1/4 each, satisfying µ(∪γ) ≥ 1/2. Then the 2δ-neighbourhood of every point ω ∈ Ω, apart from a set of measure at most (1/2) α(δ)^{1/2}, meets at least ⌈(1/2) α(δ)^{−1/2}⌉ elements of γ.

∗ ∗ ∗

If we can now guarantee that the bins are not too large, we get a lower bound on the number of bin accesses.
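The two quantities in the lemma are easy to tabulate. The helper below is a hypothetical illustration that simply evaluates (1/2) α(δ)^{1/2} and ⌈(1/2) α(δ)^{−1/2}⌉ for a given value of α(δ).

```python
import math

def bin_access_bounds(alpha_delta: float):
    """Given alpha(delta) <= 1/4, return (exceptional_measure, min_bins):
    every query outside a set of measure <= exceptional_measure has a
    2*delta-neighbourhood meeting at least min_bins elements of gamma."""
    if not 0 < alpha_delta <= 0.25:
        raise ValueError("the lemma assumes 0 < alpha(delta) <= 1/4")
    root = math.sqrt(alpha_delta)
    return 0.5 * root, math.ceil(0.5 / root)

if __name__ == "__main__":
    for alpha in (1 / 4, 1 / 64, 1 / 4096):
        print(alpha, bin_access_bounds(alpha))
```

For α(δ) = 1/64 the exceptional set has measure at most 1/16 and every other query touches at least 4 bins; the smaller α(δ) is, the more bins must be accessed.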

Bin complexity estimates

Let ℱ be a class of 1-Lipschitz functions used for constructing a metric tree of a particular type. Let 𝒜 be the concept class of all solution sets to inequalities

f ≷ a, f ∈ ℱ, a ∈ ℝ.

Suppose p = VC-dim(𝒜) < ∞ (pseudodimension of ℱ in the sense of Vapnik). Denote by ℬ the class of all bins of all possible metric trees of depth ≤ h built using ℱ. Then

$$
\text{VC-dim}(\mathcal{B}) \le 2hp \log(hp) = O(hp \log(hp)).
$$

Rigorous lower bounds

Theorem. Let ℱ be a class of 1-Lipschitz functions on {0, 1}^d with VC dimension of the class of sets given by inequalities f ≷ a being poly(d). With probability approaching 1, every metric tree indexing scheme for a random sample X of {0, 1}^d containing n points, where d = n^{o(1)} and d = ω(log n), will have the worst-case performance d^{ω(1)}.

⊳ Can suppose every bin contains poly(d) datapoints, and the tree depth is poly(d). The VC-dim of all possible bins is poly(d) = o(n). If ε = n^{−1/2+γ}, by learning estimates the measure of each bin of the scheme is O(n^{−1/2+γ}), so there will be Ω(n^{1/4−γ}) = d^{ω(1)} bin accesses. ⊲

Example: vp-tree

The vp-tree (Yianilos) uses decision functions of the form

$$
f_t(\omega) = \tfrac{1}{2}\left( \rho(x_{t+}, \omega) - \rho(x_{t-}, \omega) \right),
$$

where

t± are two children of t and

x_{t±} are the vantage points for the node t.

If Ω = ℝ^d, VC dimension is d + 1.
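A minimal sketch of how such a decision function drives a range query, assuming the Euclidean metric for ρ. The assignment of points to the two sides by the sign of f_t is an assumption, since the slide does not fix it, and the function names are hypothetical.

```python
import math

def decision_value(x_plus, x_minus, omega, rho=math.dist):
    """f_t(omega) = (1/2) * (rho(x_t+, omega) - rho(x_t-, omega))."""
    return 0.5 * (rho(x_plus, omega) - rho(x_minus, omega))

def sides_to_visit(f_value: float, eps: float):
    """Return (visit_negative, visit_nonnegative) for a range query of radius eps.

    Assumption: one child stores the points with f_t < 0 and the other the
    points with f_t >= 0. Because f_t is 1-Lipschitz, every point x within
    eps of omega satisfies f_value - eps <= f_t(x) <= f_value + eps.
    """
    return (f_value - eps < 0, f_value + eps >= 0)

if __name__ == "__main__":
    x_plus, x_minus = (0.0, 0.0), (2.0, 0.0)
    f = decision_value(x_plus, x_minus, (0.9, 0.3))
    print(f, sides_to_visit(f, eps=0.25))
```

The query branches, visiting both sides, exactly when |f_t(ω)| is within ε of the splitting value, which is the situation the concentration argument shows to be typical in high dimensions.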

Example: M-tree

The M-tree (Ciaccia, Patella, Zezula) employs decision functions

$$
f_t(\omega) = \rho(x_t, \omega) - \sup_{\tau \in B_t} \rho(x_t, \tau),
$$

where

B_t is a block corresponding to the node t,

x_t is a datapoint chosen for each node t, and

suprema on the r.h.s. are precomputed and stored.

If Ω = ℝ^d, VC-dim is d + 1; for Ω = {0, 1}^d, it is O(d).
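The pruning rule implied by this decision function can be sketched as follows, assuming the Euclidean metric for ρ and a block stored as a plain list of points. The names are hypothetical and the snippet does not reproduce the actual M-tree data structure.

```python
import math

def covering_radius(x_t, block, rho=math.dist):
    """sup over tau in B_t of rho(x_t, tau); computed once and stored."""
    return max(rho(x_t, tau) for tau in block)

def mtree_decision(x_t, radius, omega, rho=math.dist):
    """f_t(omega) = rho(x_t, omega) - covering radius of the block B_t."""
    return rho(x_t, omega) - radius

def can_prune(f_value: float, eps: float) -> bool:
    """A range query of radius eps cannot meet B_t when f_t(omega) > eps:
    by the triangle inequality, every tau in B_t then satisfies
    rho(omega, tau) >= rho(x_t, omega) - rho(x_t, tau) > eps."""
    return f_value > eps

if __name__ == "__main__":
    block = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    x_t = block[0]
    r = covering_radius(x_t, block)
    f = mtree_decision(x_t, r, (3.0, 0.0))
    print(r, f, can_prune(f, eps=1.5))
```

Only one distance computation per node is needed at query time, because the supremum is read from storage rather than recomputed.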
