88
Distance-based Similarity Searching and its Applications in Multimedia Retrieval with Deep Learning David Novak Laboratory of Data Intensive Systems and Applications (DISA) Masaryk University, Brno, Czech Republic Fulbright scholar at CIIR, UMass, Amherst, MA, USA CIIR Meeting, 21st September 2015 http://disa.fi.muni.cz/david-novak/ David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 1 / 29

Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Distance-based Similarity Searchingand its Applications in Multimedia Retrieval with Deep Learning

David Novak

Laboratory of Data Intensive Systems and Applications (DISA)Masaryk University, Brno, Czech Republic

Fulbright scholar at CIIR, UMass, Amherst, MA, USA

CIIR Meeting, 21st September 2015

http://disa.fi.muni.cz/david-novak/

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 1 / 29

Page 2: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Where Do I Come From

Lab of Data Intensive Systems and Applications (DISA) http://disa.fi.muni.cz/

Faculty of Informatics, Masaryk University http://www.fi.muni.cz/

Brno, Czech Republic

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 2 / 29

Page 3: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Where Do I Come From

Lab of Data Intensive Systems and Applications (DISA) http://disa.fi.muni.cz/

Faculty of Informatics, Masaryk University http://www.fi.muni.cz/

Brno, Czech Republic

Head of DISA Lab: prof. Pavel Zezula

Post-doc researches:

Michal BatkoPetra BudikovaVlastislav DohnalDavid NovakJan Sedmidubsky

PhD students:

current: 6successful (over 12 years): 10

undergraduates

fields of interest: databases, Big Data processing, similarity searching,multimedia retrieval, biometrics

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 2 / 29

Page 4: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Motivation & Fundamentals Similarity Search: Effectiveness and Efficiency

Similarity Searching

The similarity is key to human cognition, learning, memory. . .

[cognitive psychology]

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 3 / 29

Page 5: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Motivation & Fundamentals Similarity Search: Effectiveness and Efficiency

Similarity Searching

The similarity is key to human cognition, learning, memory. . .

[cognitive psychology]

Everything we can see, hear, measure, observe is in digital form

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 3 / 29

Page 6: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Motivation & Fundamentals Similarity Search: Effectiveness and Efficiency

Similarity Searching

The similarity is key to human cognition, learning, memory. . .

[cognitive psychology]

Everything we can see, hear, measure, observe is in digital form

Computers should be able to search data based on similarity

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 3 / 29

Page 7: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Motivation & Fundamentals Similarity Search: Effectiveness and Efficiency

Similarity Searching

The similarity is key to human cognition, learning, memory. . .

[cognitive psychology]

Everything we can see, hear, measure, observe is in digital form

Computers should be able to search data based on similarity

The similarity search problem has two aspects

effectiveness: how to measure similarity of two “objects”

domain specific (data- and application-specific, context dependent, . . . )photos, video, X-rays, voice, music, EEG, MTR, texts, . . .

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 3 / 29

Page 8: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Motivation & Fundamentals Similarity Search: Effectiveness and Efficiency

Similarity Searching

The similarity is key to human cognition, learning, memory. . .

[cognitive psychology]

Everything we can see, hear, measure, observe is in digital form

Computers should be able to search data based on similarity

The similarity search problem has two aspects

effectiveness: how to measure similarity of two “objects”

domain specific (data- and application-specific, context dependent, . . . )photos, video, X-rays, voice, music, EEG, MTR, texts, . . .

efficiency: how to realize similarity search fast

using a given data + similarity measureon very large data collections

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 3 / 29

Page 9: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Motivation & Fundamentals Similarity Search: Effectiveness and Efficiency

Efficiency: Motivation Example

Example of data:

general images (photos)

every image processed by a deep convolutional neural network

to obtain a visual characterization of the image (feature)compared by Euclidean distance to measure generic visual similarity

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 4 / 29

Page 10: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Motivation & Fundamentals Similarity Search: Effectiveness and Efficiency

Efficiency: Motivation Example

Example of data:

general images (photos)

every image processed by a deep convolutional neural network

to obtain a visual characterization of the image (feature)compared by Euclidean distance to measure generic visual similarity

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 4 / 29

Page 11: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Motivation & Fundamentals Similarity Search: Effectiveness and Efficiency

Efficiency: Motivation Example

Example of data:

general images (photos)

every image processed by a deep convolutional neural networkto obtain a visual characterization of the image (feature)compared by Euclidean distance to measure generic visual similarity

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 4 / 29

Page 12: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Motivation & Fundamentals Similarity Search: Effectiveness and Efficiency

Efficiency: Motivation Example

Example of data:

general images (photos)

every image processed by a deep convolutional neural network

to obtain a visual characterization of the image (feature)compared by Euclidean distance to measure generic visual similarity

Efficiency problem:

20 million of images with such descriptors

each descriptor is a 4096-dimensional float vector (16 kB)

⇒ over 320 GB of data to be organized for similarity search

answer similarity queries online

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 4 / 29

Page 13: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Motivation & Fundamentals Similarity Search: Effectiveness and Efficiency

Outline of the Talk

1 Motivation & FundamentalsSimilarity Search: Effectiveness and Efficiency

2 Indexing & Searching in Metric SpacesMetric-based Model of SimilarityOverview and PrinciplesVoronoi Partitioning

3 Specific Similarity IndexesM-IndexPPP-Codes

4 Deep Convolutional Neural Networksand their applications in image recognition

5 Visual Search Demo

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 5 / 29

Page 14: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Metric-based Model of Similarity

We model the data as metric space (D, δ), where D is a domain ofobjects and δ is a total distance function δ : D ×D −→ R+

0 satisfyingthe following postulates ∀x , y , x ∈ D:

identity: δ(x , x) = 0symmetry: δ(x , y) = δ(y , x)triangle inequality: δ(x , y) ≤ δ(x , z) + δ(z , y)

Objective: organize data collection X ⊆ D, resolve query by example

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 6 / 29

Page 15: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Metric-based Model of Similarity

We model the data as metric space (D, δ), where D is a domain ofobjects and δ is a total distance function δ : D ×D −→ R+

0 satisfyingthe following postulates ∀x , y , x ∈ D:

identity: δ(x , x) = 0symmetry: δ(x , y) = δ(y , x)triangle inequality: δ(x , y) ≤ δ(x , z) + δ(z , y)

Objective: organize data collection X ⊆ D, resolve query by example

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 6 / 29

Page 16: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Metric-based Model of Similarity

We model the data as metric space (D, δ), where D is a domain ofobjects and δ is a total distance function δ : D ×D −→ R+

0 satisfyingthe following postulates ∀x , y , x ∈ D:

identity: δ(x , x) = 0symmetry: δ(x , y) = δ(y , x)triangle inequality: δ(x , y) ≤ δ(x , z) + δ(z , y)

Objective: organize data collection X ⊆ D, resolve query by example

range query R(q, r) returns allobjects x ∈ X with δ(q, x) ≤ r

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 6 / 29

Page 17: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Metric-based Model of Similarity

We model the data as metric space (D, δ), where D is a domain ofobjects and δ is a total distance function δ : D ×D −→ R+

0 satisfyingthe following postulates ∀x , y , x ∈ D:

identity: δ(x , x) = 0symmetry: δ(x , y) = δ(y , x)triangle inequality: δ(x , y) ≤ δ(x , z) + δ(z , y)

Objective: organize data collection X ⊆ D, resolve query by example

range query R(q, r) returns allobjects x ∈ X with δ(q, x) ≤ r

k-NN(q) query returns k objectsx ∈ X with the smallest δ(q, x)

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 6 / 29

Page 18: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Metric vs. Other Models

Metric model of similarity

is very generic

applicable to many data types + similarity functions

we can build one index structure and apply many times

in some cases, we can even omit the triangle inequality (see below)

On the other hand

techniques for specific similarity are often more efficient

e.g. cosine similarity used with vector space model in IRthe inverted file indexes are very convenient for sparse vectors

But the distance-based indexing is often able to capture intrinsiccomplexity of the similarity space

ignoring unnecessary external “dimensions” of the data

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 7 / 29

Page 19: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Metric vs. Other Models

Metric model of similarity

is very generic

applicable to many data types + similarity functions

we can build one index structure and apply many times

in some cases, we can even omit the triangle inequality (see below)

On the other hand

techniques for specific similarity are often more efficient

e.g. cosine similarity used with vector space model in IRthe inverted file indexes are very convenient for sparse vectors

But the distance-based indexing is often able to capture intrinsiccomplexity of the similarity space

ignoring unnecessary external “dimensions” of the data

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 7 / 29

Page 20: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Metric vs. Other Models

Metric model of similarity

is very generic

applicable to many data types + similarity functions

we can build one index structure and apply many times

in some cases, we can even omit the triangle inequality (see below)

On the other hand

techniques for specific similarity are often more efficient

e.g. cosine similarity used with vector space model in IRthe inverted file indexes are very convenient for sparse vectors

But the distance-based indexing is often able to capture intrinsiccomplexity of the similarity space

ignoring unnecessary external “dimensions” of the data

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 7 / 29

Page 21: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Examples of Metric Spaces

Vector data:

Lp metrics (Minkowski distances)

L1 – Manhattan (city block) distance L1(x , y) =n∑

i=1

|xi − yi |

L2 – Euclidean distance L2(x , y) =

√n∑

i=1

(xi − yi )2

L∞ – Chebyshev distance L∞(x , y) =n

maxi=1|xi − yi |

quadratic form (Mahalanobis) δ(x , y) =((x − y)T ·M · (x − y)

) 12

where M is a matrix n × n

cosine distance 1− cos(x , y)

does NOT fulfill triangle inequality

if required, use angular similarity 1− cos−1 cos(x,y)π

Hamming distance (on vectors over a finite field, e.g. binary vectors)

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 8 / 29

Page 22: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Examples of Metric Spaces

Vector data:

Lp metrics (Minkowski distances)

L1 – Manhattan (city block) distance L1(x , y) =n∑

i=1

|xi − yi |

L2 – Euclidean distance L2(x , y) =

√n∑

i=1

(xi − yi )2

L∞ – Chebyshev distance L∞(x , y) =n

maxi=1|xi − yi |

quadratic form (Mahalanobis) δ(x , y) =((x − y)T ·M · (x − y)

) 12

where M is a matrix n × n

cosine distance 1− cos(x , y)

does NOT fulfill triangle inequality

if required, use angular similarity 1− cos−1 cos(x,y)π

Hamming distance (on vectors over a finite field, e.g. binary vectors)

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 8 / 29

Page 23: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Examples of Metric Spaces

Vector data:

Lp metrics (Minkowski distances)

L1 – Manhattan (city block) distance L1(x , y) =n∑

i=1

|xi − yi |

L2 – Euclidean distance L2(x , y) =

√n∑

i=1

(xi − yi )2

L∞ – Chebyshev distance L∞(x , y) =n

maxi=1|xi − yi |

quadratic form (Mahalanobis) δ(x , y) =((x − y)T ·M · (x − y)

) 12

where M is a matrix n × n

cosine distance 1− cos(x , y)

does NOT fulfill triangle inequality

if required, use angular similarity 1− cos−1 cos(x,y)π

Hamming distance (on vectors over a finite field, e.g. binary vectors)

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 8 / 29

Page 24: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Examples of Metric Spaces

Vector data:

Lp metrics (Minkowski distances)

L1 – Manhattan (city block) distance L1(x , y) =n∑

i=1

|xi − yi |

L2 – Euclidean distance L2(x , y) =

√n∑

i=1

(xi − yi )2

L∞ – Chebyshev distance L∞(x , y) =n

maxi=1|xi − yi |

quadratic form (Mahalanobis) δ(x , y) =((x − y)T ·M · (x − y)

) 12

where M is a matrix n × n

cosine distance 1− cos(x , y)

does NOT fulfill triangle inequality

if required, use angular similarity 1− cos−1 cos(x,y)π

Hamming distance (on vectors over a finite field, e.g. binary vectors)

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 8 / 29

Page 25: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Examples of Metric Spaces (2)

Strings:

edit distance (Levenshtein distance)

minimum number of insertions, deletions or substitutions

Sets:

Jaccard’s coefficient δ(X ,Y ) = 1− X∩YX∪Y

Hausdorff distance

for sets with elements related by another distance

Other types of data:

Earth mover’s distance (for histograms)

Tree Edit distance

Signature Quadratic Form distance, etc.

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 9 / 29

Page 26: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Examples of Metric Spaces (2)

Strings:

edit distance (Levenshtein distance)

minimum number of insertions, deletions or substitutions

Sets:

Jaccard’s coefficient δ(X ,Y ) = 1− X∩YX∪Y

Hausdorff distance

for sets with elements related by another distance

Other types of data:

Earth mover’s distance (for histograms)

Tree Edit distance

Signature Quadratic Form distance, etc.

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 9 / 29

Page 27: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Metric-based Model of Similarity

Examples of Metric Spaces (2)

Strings:

edit distance (Levenshtein distance)

minimum number of insertions, deletions or substitutions

Sets:

Jaccard’s coefficient δ(X ,Y ) = 1− X∩YX∪Y

Hausdorff distance

for sets with elements related by another distance

Other types of data:

Earth mover’s distance (for histograms)

Tree Edit distance

Signature Quadratic Form distance, etc.

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 9 / 29

Page 28: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Overview and Principles

Problem Formulation and Overview

Objective: Preprocess and organize collection of objects X ⊆ D in such away that similarity queries are processed efficiently

collections can be very large

computation of distance function δ can be expensive

Two decades of research in this area

theoretical principles identified

static and dynamic memory structures for precise similarity search

efficient disk-oriented techniques

precise and approximate (not all objects from k-NN answer returned)

distributed processing

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 10 / 29

Page 29: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Overview and Principles

Problem Formulation and Overview

Objective: Preprocess and organize collection of objects X ⊆ D in such away that similarity queries are processed efficiently

collections can be very large

computation of distance function δ can be expensive

Two decades of research in this area

theoretical principles identified

static and dynamic memory structures for precise similarity search

efficient disk-oriented techniques

precise and approximate (not all objects from k-NN answer returned)

distributed processing

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 10 / 29

Page 30: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Overview and Principles

Problem Formulation and Overview

Objective: Preprocess and organize collection of objects X ⊆ D in such away that similarity queries are processed efficiently

collections can be very large

computation of distance function δ can be expensive

Two decades of research in this area

theoretical principles identified

static and dynamic memory structures for precise similarity search

efficient disk-oriented techniques

precise and approximate (not all objects from k-NN answer returned)

distributed processing

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 10 / 29

Page 31: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Overview and Principles

Space Partitioning

Key and fundamental task of indexing is to partition the collection

But in metric space

the objects do not have any dimensions to partition the spacethere is no “absolute” ordering of the objectsjust with respect to some object

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 11 / 29

Page 32: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Overview and Principles

Space Partitioning

Key and fundamental task of indexing is to partition the collection

But in metric space

the objects do not have any dimensions to partition the spacethere is no “absolute” ordering of the objectsjust with respect to some object

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 11 / 29

Page 33: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Overview and Principles

Space Partitioning

Key and fundamental task of indexing is to partition the collection

But in metric space

the objects do not have any dimensions to partition the spacethere is no “absolute” ordering of the objectsjust with respect to some object

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 11 / 29

Page 34: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Overview and Principles

Space Partitioning

Key and fundamental task of indexing is to partition the collection

But in metric space

the objects do not have any dimensions to partition the spacethere is no “absolute” ordering of the objectsjust with respect to some object

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 11 / 29

Page 35: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Overview and Principles

Space Partitioning

Key and fundamental task of indexing is to partition the collection

But in metric space

the objects do not have any dimensions to partition the spacethere is no “absolute” ordering of the objectsjust with respect to some object

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 11 / 29

Page 36: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Voronoi Partitioning

Voronoi Partitioning

Partitioning using a fixed set of reference objects (pivots)

Let us have a set of n pivots {p1, . . . , pn}

Voronoi cell Ci = all objects forwhich pivot pi is the closest

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 12 / 29

Page 37: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Voronoi Partitioning

Voronoi Partitioning

Partitioning using a fixed set of reference objects (pivots)

Let us have a set of n pivots {p1, . . . , pn}

Voronoi cell Ci = all objects forwhich pivot pi is the closest

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 12 / 29

Page 38: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Voronoi Partitioning

Recursive Voronoi Partitioning

Let us use the same set of n pivots p1, . . . , pn recursively

Partition each Ci using the other pivots p1, . . . , pi−1, pi+1, . . . , pnCi ,j = objects for which pi is the closest and pj the second closest

this principle can be used l-times recursively up to level l = n

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 13 / 29

Page 39: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Voronoi Partitioning

Recursive Voronoi Partitioning

Let us use the same set of n pivots p1, . . . , pn recursively

Partition each Ci using the other pivots p1, . . . , pi−1, pi+1, . . . , pn

Ci ,j = objects for which pi is the closest and pj the second closestthis principle can be used l-times recursively up to level l = n

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 13 / 29

Page 40: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Voronoi Partitioning

Recursive Voronoi Partitioning

Let us use the same set of n pivots p1, . . . , pn recursively

Partition each Ci using the other pivots p1, . . . , pi−1, pi+1, . . . , pnCi ,j = objects for which pi is the closest and pj the second closest

this principle can be used l-times recursively up to level l = n

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 13 / 29

Page 41: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Voronoi Partitioning

Recursive Voronoi Partitioning

Let us use the same set of n pivots p1, . . . , pn recursively

Partition each Ci using the other pivots p1, . . . , pi−1, pi+1, . . . , pnCi ,j = objects for which pi is the closest and pj the second closest

this principle can be used l-times recursively up to level l = n

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 13 / 29

Page 42: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Voronoi Partitioning

Pivot Permutations

A different point of view: (prefixes of) pivot permutations

Given object x ∈ X , order the pivots according to distances δ(x , pi )

0.5

3

p2

p4

p1

C3,2

C2,1

C1,2

C4,3

4,1C

x

C3,4

1,3C

C1,4

C2,3C3,1

0.2

0.3

0.25

p

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 14 / 29

Page 43: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Voronoi Partitioning

Pivot Permutations

A different point of view: (prefixes of) pivot permutations

Given object x ∈ X , order the pivots according to distances δ(x , pi )

Let Πx be a permutation on the set of pivot indexes {1, . . . , n}such that Πx(j) is index of the j-th closest pivot from x

for example, Πx(1) is index of the closest pivot from xpΠx (j) is the j-the closest pivot from x

Πx is denoted as pivot permutation (PP) with respect to x .

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 14 / 29

Page 44: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Voronoi Partitioning

Pivot Permutations

A different point of view: (prefixes of) pivot permutations

Given object x ∈ X , order the pivots according to distances δ(x , pi )

Let Πx be a permutation on the set of pivot indexes {1, . . . , n}such that Πx(j) is index of the j-th closest pivot from x

for example, Πx(1) is index of the closest pivot from xpΠx (j) is the j-the closest pivot from x

Πx is denoted as pivot permutation (PP) with respect to x .

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 14 / 29

Page 45: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Voronoi Partitioning

Correspondence between Voronoi partitioning and PPs

Recursive Voronoi partitioning to level l

Cell C〈i1,...,il 〉 contains objects x forwhich

Πx(1) = i1, Πx(2) = i2, . . . ,Πx(l) = il

l-tuple 〈i1, . . . , il〉 is an l-prefix of pivot permutation Πx

pivot permutation prefix (PPP)there is one-to-one correspondence between “Voronoi cell” and “PPP”

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 15 / 29

Page 46: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Indexing & Searching in Metric Spaces Voronoi Partitioning

Correspondence between Voronoi partitioning and PPs

Recursive Voronoi partitioning to level l

Cell C〈i1,...,il 〉 contains objects x forwhich

Πx(1) = i1, Πx(2) = i2, . . . ,Πx(l) = il

l-tuple 〈i1, . . . , il〉 is an l-prefix of pivot permutation Πx

pivot permutation prefix (PPP)there is one-to-one correspondence between “Voronoi cell” and “PPP”

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 15 / 29

Page 47: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index Indexing Structure

Specific Voronoi-based indexes: M-Index, Distributed M-Index, PPP-Codes

M-Index: basic properties

uses dynamic recursive Voronoi partitioning (see below)

it defines a (hash) mapping from the metric space to (float) numbers

data either in memory or on disk (continuous chunks)

both precise and approximate similarity search

Novak, D. and Batko, M. (2009). Metric Index: An Efficient and Scalable Solution forSimilarity Search. In Proceedings of SISAP ’09, (pp. 65–73). IEEE Comput. Soc. Press.

Novak, D., Batko, M. and Zezula, P. (2011). Metric Index: An Efficient and Scalable

Solution for Precise and Approximate Similarity Search. Inform. Syst., 36(4), 721–733.

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 16 / 29

Page 48: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index Indexing Structure

Specific Voronoi-based indexes: M-Index, Distributed M-Index, PPP-Codes

M-Index: basic properties

uses dynamic recursive Voronoi partitioning (see below)

it defines a (hash) mapping from the metric space to (float) numbers

data either in memory or on disk (continuous chunks)

both precise and approximate similarity search

Novak, D. and Batko, M. (2009). Metric Index: An Efficient and Scalable Solution forSimilarity Search. In Proceedings of SISAP ’09, (pp. 65–73). IEEE Comput. Soc. Press.

Novak, D., Batko, M. and Zezula, P. (2011). Metric Index: An Efficient and Scalable

Solution for Precise and Approximate Similarity Search. Inform. Syst., 36(4), 721–733.

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 16 / 29

Page 49: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index Indexing Structure

Specific Voronoi-based indexes: M-Index, Distributed M-Index, PPP-Codes

M-Index: basic properties

uses dynamic recursive Voronoi partitioning (see below)

it defines a (hash) mapping from the metric space to (float) numbers

data either in memory or on disk (continuous chunks)

both precise and approximate similarity search

Novak, D. and Batko, M. (2009). Metric Index: An Efficient and Scalable Solution forSimilarity Search. In Proceedings of SISAP ’09, (pp. 65–73). IEEE Comput. Soc. Press.

Novak, D., Batko, M. and Zezula, P. (2011). Metric Index: An Efficient and Scalable

Solution for Precise and Approximate Similarity Search. Inform. Syst., 36(4), 721–733.

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 16 / 29

Page 50: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index Mapping Function

16

2,3C2,1

2,4C

p2

......0 87654

C

example with n = 4 and l = 2

integral part of the keyidentification of the cell

fractional part of the key

position within the cell

distance from the closest pivot

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 17 / 29

Page 51: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index with Dynamic Level

1C

1,3C

1,2C

1,3,C n nC ,1,n−1

2C

1,3,4C1,3,2C nC ,1,2 nC ,1,3

lkey0 n3

l=1

l=2

l=3

...

... ...

......C1,n

CnC

n,1

nC nC,2 ,n−1

level

level

level

partition only those cells that exceed certain capacity

pick a maximum level 1 ≤ lmax ≤ n

a kind of dynamic hashing, similar to extensible hashing

it is a locality sensitive hashing for generic distance spaces

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 18 / 29

Page 52: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index with Dynamic Level

1C

1,3C

1,2C

1,3,C n nC ,1,n−1

2C

1,3,4C1,3,2C nC ,1,2 nC ,1,3

lkey0 n3

l=1

l=2

l=3

...

... ...

......C1,n

CnC

n,1

nC nC,2 ,n−1

level

level

level

partition only those cells that exceed certain capacity

pick a maximum level 1 ≤ lmax ≤ n

a kind of dynamic hashing, similar to extensible hashing

it is a locality sensitive hashing for generic distance spaces

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 18 / 29

Page 53: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index: Precise Query Evaluation Strategy

The precise query evaluation gives guarantees to return all results

for both range query R(q, r) and k-NN queries

search principle: filter & refine

filter out parts of index that cannot contain query relevant objectsfor objects x ∈ X that cannotbe filtered out, calculate δ(q, x)

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 19 / 29

Page 54: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index: Precise Query Evaluation Strategy

The precise query evaluation gives guarantees to return all results

for both range query R(q, r) and k-NN queries

search principle: filter & refine

filter out parts of index that cannot contain query relevant objects

for objects x ∈ X that cannotbe filtered out, calculate δ(q, x)

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 19 / 29

Page 55: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index: Precise Query Evaluation Strategy

The precise query evaluation gives guarantees to return all results

for both range query R(q, r) and k-NN queries

search principle: filter & refine

filter out parts of index that cannot contain query relevant objects

for objects x ∈ X that cannotbe filtered out, calculate δ(q, x)

2

C2

1C

2 3

...

q

r

0 1

p

C3

0

1p

p

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 19 / 29

Page 56: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index: Precise Query Evaluation Strategy

The precise query evaluation gives guarantees to return all results

for both range query R(q, r) and k-NN queries

search principle: filter & refine

filter out parts of index that cannot contain query relevant objectsfor objects x ∈ X that cannotbe filtered out, calculate δ(q, x)

2

C2

1C

2 3

...

q

r

0 1

p

C3

0

1p

p

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 19 / 29

Page 57: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index: Precise Query Evaluation Strategy

The precise query evaluation gives guarantees to return all results

for both range query R(q, r) and k-NN queries

search principle: filter & refine

filter out parts of index that cannot contain query relevant objectsfor objects x ∈ X that cannotbe filtered out, calculate δ(q, x)

M-Index employs practically all knownmetric principles of space pruning andfiltering

triangle inequality required

2

C2

1C

2 3

...

q

r

0 1

p

C3

0

1p

p

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 19 / 29

Page 58: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index: Precise Query Evaluation Strategy

The precise query evaluation gives guarantees to return all results

for both range query R(q, r) and k-NN queries

search principle: filter & refine

filter out parts of index that cannot contain query relevant objectsfor objects x ∈ X that cannotbe filtered out, calculate δ(q, x)

M-Index employs practically all knownmetric principles of space pruning andfiltering

triangle inequality required

2

C2

1C

2 3

...

q

r

0 1

p

C3

0

1p

p

the search must still access and refine about 30–50% of the data

for collections with high intrinsic dimensionality (dimensionality curse)for R(q, r) and k-NN queries with reasonable r and k

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 19 / 29

Page 59: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

Approximate Strategy for M-Index

p3

p2

p4

p1

C3,2

C2,1

C1,2

C4,3

4,1C

q

C3,4

1,3C

C1,4

C2,3C3,1

0.2

0.3

0.25

0.5

determine cells with a high chanceto contain relevant objects

use query-pivot distancesδ(q, p1), δ(q, p2), . . . , δ(q, pn)

estimate “distances” between thequery and Voronoi cells (PPPs)

=⇒ rank the Voronoi cells (data)with respect to the query

measure quality of k-NN approximate search by recall = precision

recall(A) =|A ∩ AP |

k· 100%

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 20 / 29

Page 60: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

Approximate Strategy for M-Index

p3

p2

p4

p1

C3,2

C2,1

C1,2

C4,3

4,1C

q

C3,4

1,3C

C1,4

C2,3C3,1

0.2

0.3

0.25

0.5

determine cells with a high chanceto contain relevant objects

use query-pivot distancesδ(q, p1), δ(q, p2), . . . , δ(q, pn)

estimate “distances” between thequery and Voronoi cells (PPPs)

=⇒ rank the Voronoi cells (data)with respect to the query

measure quality of k-NN approximate search by recall = precision

recall(A) =|A ∩ AP |

k· 100%

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 20 / 29

Page 61: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

Approximate Strategy for M-Index

p3

p2

p4

p1

C3,2

C2,1

C1,2

C4,3

4,1C

q

C3,4

1,3C

C1,4

C2,3C3,1

0.2

0.3

0.25

0.5

determine cells with a high chanceto contain relevant objects

use query-pivot distancesδ(q, p1), δ(q, p2), . . . , δ(q, pn)

estimate “distances” between thequery and Voronoi cells (PPPs)

=⇒ rank the Voronoi cells (data)with respect to the query

measure quality of k-NN approximate search by recall = precision

recall(A) =|A ∩ AP |

k· 100%

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 20 / 29

Page 62: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index Approximate Strategy: Brief Evaluation

3 4 5

k=10

accessed and refined objects [% of database]

k−

NN

rec

all

[%]

k=100

M−Index

0k=1000

20

40

60

80

100

0 1 2 3 4 5accessed and refined objects [% of database]

0

20

40

60

80

100

0 1 2

k=1

collection of 1M 4096-dimensional float vectors with L2 distance

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 21 / 29

Page 63: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index Approximate Strategy: Brief Evaluation

3 4 5

k=10

accessed and refined objects [% of database]

k−

NN

rec

all

[%]

k=100

M−Index

0k=1000

20

40

60

80

100

0 1 2 3 4 5accessed and refined objects [% of database]

0

20

40

60

80

100

0 1 2

k=1

collection of 1M 4096-dimensional float vectors with L2 distance

access and refine 3–5 % of the data

to achieve recall over 90 %the larger the data collection, the lower the percentage

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 21 / 29

Page 64: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes M-Index

M-Index Approximate Strategy: Brief Evaluation

3 4 5

k=10

accessed and refined objects [% of database]

k−

NN

rec

all

[%]

k=100

M−Index

0k=1000

20

40

60

80

100

0 1 2 3 4 5accessed and refined objects [% of database]

0

20

40

60

80

100

0 1 2

k=1

4 5accessed and refined objects [% of database]

disk, multi thread

M−Index: various implementations

aver

age

sear

ch t

ime

[ms] memory, single thread

0

500

memory, multi thread

1000

1500

2000

2500

0 1 2 3

disk, single thread

collection of 1M 4096-dimensional float vectors with L2 distance

access and refine 3–5 % of the data

to achieve recall over 90 %the larger the data collection, the lower the percentage

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 21 / 29

Page 65: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

Standard Similarity Search Approach

standard approach to large-scale approximate search (e.g. M-Index):

dataset X is partitioned and stored

disk storage

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 22 / 29

Page 66: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

Standard Similarity Search Approach

standard approach to large-scale approximate search (e.g. M-Index):

dataset X is partitioned and stored

given query q, the “most-promising” partitions form the candidate set

disk storage

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 22 / 29

Page 67: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

Standard Similarity Search Approach

standard approach to large-scale approximate search (e.g. M-Index):

dataset X is partitioned and stored

given query q, the “most-promising” partitions form the candidate set

the candidate set SC is refined by calculating δ(q, x), ∀x ∈ SC

disk storage

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 22 / 29

Page 68: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

Standard Similarity Search Approach

standard approach to large-scale approximate search (e.g. M-Index):

dataset X is partitioned and stored

given query q, the “most-promising” partitions form the candidate set

the candidate set SC is refined by calculating δ(q, x), ∀x ∈ SC

majority of the search costs:

reading and refinement of SC

=⇒ accuracy of the candidate setis key

disk storage

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 22 / 29

Page 69: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

PPP-Code: The Core Idea

the Voronoi cells span large areas of the space

given a query, the “close” cells contain also distant data objects

actually, far more distant objects than the close ones

having several independent “orthogonal” partitionings

the relevant objects should be in close cells of “all” partitioningsthe distant objects in close cells vary

aggregate the candidate sets from individual partitionings so thatobjects that appear “often” on top positions are ranked high

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 23 / 29

Page 70: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

PPP-Code: The Core Idea

the Voronoi cells span large areas of the space

given a query, the “close” cells contain also distant data objects

actually, far more distant objects than the close ones

having several independent “orthogonal” partitionings

the relevant objects should be in close cells of “all” partitioningsthe distant objects in close cells vary

aggregate the candidate sets from individual partitionings so thatobjects that appear “often” on top positions are ranked high

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 23 / 29

Page 71: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

PPP-Code: The Core Idea

the Voronoi cells span large areas of the space

given a query, the “close” cells contain also distant data objects

actually, far more distant objects than the close ones

having several independent “orthogonal” partitionings

the relevant objects should be in close cells of “all” partitioningsthe distant objects in close cells vary

aggregate the candidate sets from individual partitionings so thatobjects that appear “often” on top positions are ranked high

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 23 / 29

Page 72: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

PPP-Code: The Core Idea

the Voronoi cells span large areas of the space

given a query, the “close” cells contain also distant data objects

actually, far more distant objects than the close ones

having several independent “orthogonal” partitionings

the relevant objects should be in close cells of “all” partitioningsthe distant objects in close cells vary

aggregate the candidate sets from individual partitionings so thatobjects that appear “often” on top positions are ranked high

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 23 / 29

Page 73: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

PPP-Code: The Core Idea

the Voronoi cells span large areas of the space

given a query, the “close” cells contain also distant data objects

actually, far more distant objects than the close ones

having several independent “orthogonal” partitionings

the relevant objects should be in close cells of “all” partitioningsthe distant objects in close cells vary

aggregate the candidate sets from individual partitionings so thatobjects that appear “often” on top positions are ranked high

Novak, D. and Zezula, P. Rank Aggregation of Candidate Sets for Efficient Similarity Search. InProceedings of 25th Inter. Conf. on Database and Expert Systems Applications (DEXA 2014)

Best paper of DEXA 2014 Award.

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 23 / 29

Page 74: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

PPP-Codes in a Nutshell

1 data space is partitioned multiple-times by recursive Voronoi

2 based on these partitionings, objects are mapped onto memory codesa dynamic memory index is created using these codes

3 given query q, multiple ranked candidate sets are generated

4 these candidate rankings are merged using median of individual ranksthe merged candidate set is smaller and more accurate

5 the final candidate set is retrieved and refinedobject-by-object, objects stored in a ID-object store

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 24 / 29

Page 75: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

PPP-Codes in a Nutshell

1 data space is partitioned multiple-times by recursive Voronoi

2 based on these partitionings, objects are mapped onto memory codesa dynamic memory index is created using these codes

3 given query q, multiple ranked candidate sets are generated

4 these candidate rankings are merged using median of individual ranksthe merged candidate set is smaller and more accurate

5 the final candidate set is retrieved and refinedobject-by-object, objects stored in a ID-object store

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 24 / 29

Page 76: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

PPP-Codes in a Nutshell

1 data space is partitioned multiple-times by recursive Voronoi

2 based on these partitionings, objects are mapped onto memory codesa dynamic memory index is created using these codes

3 given query q, multiple ranked candidate sets are generated

4 these candidate rankings are merged using median of individual ranksthe merged candidate set is smaller and more accurate

5 the final candidate set is retrieved and refinedobject-by-object, objects stored in a ID-object store

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 24 / 29

Page 77: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

PPP-Codes in a Nutshell

1 data space is partitioned multiple-times by recursive Voronoi

2 based on these partitionings, objects are mapped onto memory codesa dynamic memory index is created using these codes

3 given query q, multiple ranked candidate sets are generated

4 these candidate rankings are merged using median of individual ranksthe merged candidate set is smaller and more accurate

5 the final candidate set is retrieved and refinedobject-by-object, objects stored in a ID-object store

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 24 / 29

Page 78: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

PPP-Codes in a Nutshell

1 data space is partitioned multiple-times by recursive Voronoi

2 based on these partitionings, objects are mapped onto memory codesa dynamic memory index is created using these codes

3 given query q, multiple ranked candidate sets are generated

4 these candidate rankings are merged using median of individual ranksthe merged candidate set is smaller and more accurate

5 the final candidate set is retrieved and refinedobject-by-object, objects stored in a ID-object store

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 24 / 29

Page 79: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

PPP-Codes in a Nutshell

1 data space is partitioned multiple-times by recursive Voronoi

2 based on these partitionings, objects are mapped onto memory codesa dynamic memory index is created using these codes

3 given query q, multiple ranked candidate sets are generated

4 these candidate rankings are merged using median of individual ranksthe merged candidate set is smaller and more accurate

5 the final candidate set is retrieved and refinedobject-by-object, objects stored in a ID-object store

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 24 / 29

Page 80: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

Candidate Aggregation Principle

Given query q, Voronoi cells from each of the λ partitioning are ranked

λ candidate rankings ψjq of object IDs to be aggregated

q ∈ D

ψq1: {x y1 y2} {y3 y4 y5} {y6} ...

ψq2: {y3 y2} {y1 y4 y6 y7} {x y8} ...

ψq3: {x} {y3 y4 y5} {y2 y6} ...

ψq4: {y1 y2} {y3 y4 y5} {y8} {y6} ...

ψq5: {y1 y2 } {y4 y5} {y3} {x y7} ...

objects with the rank '1'

rank '2'rank '3'

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 25 / 29

Page 81: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

Candidate Aggregation Principle

Given query q, Voronoi cells from each of the λ partitioning are ranked

λ candidate rankings ψjq of object IDs to be aggregated

final rank of object x is p-percentile (e.g. median) of its λ ranks

Ψp(q, x) = percentilep(ψ1q(x), ψ2

q(x), . . . , ψλq (x))

q ∈ D

ψq1: {x y1 y2} {y3 y4 y5} {y6} ...

ψq2: {y3 y2} {y1 y4 y6 y7} {x y8} ...

ψq3: {x} {y3 y4 y5} {y2 y6} ...

ψq4: {y1 y2} {y3 y4 y5} {y8} {y6} ...

ψq5: {y1 y2 } {y4 y5} {y3} {x y7} ...

objects with the rank '1'

Ψ0.5 (q, x) = percentile0.5{1, 1, 3, 4, ?} = 3

rank '2'rank '3'

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 25 / 29

Page 82: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

Evaluation 1: Accuracy of the Candidate Set

How many candidate objects are needed to achieve certain recall level

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 26 / 29

Page 83: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

Evaluation 1: Accuracy of the Candidate Set

How many candidate objects are needed to achieve certain recall level

λ=1λ=2λ=4λ=6λ=8

10

100

1000

10000

32 64 128 256

candid

ate

set

size

R 0.20.2

0.20.2

0.050.07

0.05

0.06

# of pivots n

Candidate set size R necessary to achieve 80% of 1-NN recall

Settings: 1M CoPhIR dataset, l = 8 and p = 0.75

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 26 / 29

Page 84: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

Evaluation 2: Overall Efficiency

0 0.1

search time [ms] (right)

0.2 0.3 0.4 0.5 0

50

100

150

200

250

accessed and refined objects [% of database]

collection size: 1M

sear

ch t

ime

[ms]

k−

NN

rec

all

[%]

0

20

40

60

80

100

0 0.1 0.2 0.3 0.4 0.5 0

50

100

150

200

250

accessed and refined objects [% of database]

0

1−NN recall (left) 20

40

60

10−NN recall (left)

80

100

100−NN recall (left)

4096-dimensional float vectors with L2 distance

shrinking the candidate set size by one or two orders of magnitude

while preserving the answer qualitythe larger the data collection, the lower the percentage

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 27 / 29

Page 85: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Specific Similarity Indexes PPP-Codes

Evaluation 2: Overall Efficiency

0 0.1

search time [ms] (right)

0.2 0.3 0.4 0.5 0

50

100

150

200

250

accessed and refined objects [% of database]

collection size: 1M

sear

ch t

ime

[ms]

k−

NN

rec

all

[%]

0

20

40

60

80

100

0 0.1 0.2 0.3 0.4 0.5 0

50

100

150

200

250

accessed and refined objects [% of database]

0

1−NN recall (left) 20

40

60

10−NN recall (left)

80

100

100−NN recall (left)

0.01

search time [ms] (right)

0.02 0.03 0.04 0.05 0

200

400

600

800

1000

sear

ch t

ime

[ms]

accessed and refined objects [% of database]

k−

NN

rec

all

0

20

40

60

80

100

0 0.01 0.02 0.03 0.04 0.05 0

200

400

600

800

1000

sear

ch t

ime

[ms]

accessed and refined objects [% of database]

collection size: 20M

0

1−NN recall (left) 20

40

60

10−NN recall (left)

80

100

100−NN recall (left)

0

4096-dimensional float vectors with L2 distance

shrinking the candidate set size by one or two orders of magnitude

while preserving the answer qualitythe larger the data collection, the lower the percentage

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 27 / 29

Page 86: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Deep Convolutional Neural Networks and their applications in image recognition

Deep Convolutional Neural Networks

a separate Google docs presentation

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 28 / 29

Page 87: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Visual Search Demo

Online Demonstration

Demonstration of large-scale image visual search

20 million images from a photo stock company

features from deep convolutional neural networks

4096-dimensional vectors with L2 distance

collection was recently released for research purposes

http://disa.fi.muni.cz/profiset/

PPP-Codes index (1 GB in memory, 124 GB on the SSD disk)

http://disa.fi.muni.cz/demos/profiset-decaf/

front-end temporarily running athttp://cybela12.fi.muni.cz:8888/demos/profiset-decaf/

Novak, D., Batko. M. and Zezula, P. (2015). Large-scale Image Retrieval using Neural Net

Descriptors. Presented at SIGIR 2015.

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 29 / 29

Page 88: Distance-based Similarity Searching - Masaryk Universitydisa.fi.muni.cz/wp-content/uploads/distance-based-sim...Distance-based Similarity Searching and its Applications in Multimedia

Visual Search Demo

Online Demonstration

Demonstration of large-scale image visual search

20 million images from a photo stock company

features from deep convolutional neural networks

4096-dimensional vectors with L2 distance

collection was recently released for research purposes

http://disa.fi.muni.cz/profiset/

PPP-Codes index (1 GB in memory, 124 GB on the SSD disk)

http://disa.fi.muni.cz/demos/profiset-decaf/

front-end temporarily running athttp://cybela12.fi.muni.cz:8888/demos/profiset-decaf/

Novak, D., Batko. M. and Zezula, P. (2015). Large-scale Image Retrieval using Neural Net

Descriptors. Presented at SIGIR 2015.

David Novak (MU Brno & UMass) Distance-based Similarity Searching 21st September 2015 29 / 29