Description
We will learn about modern algorithmic techniques for handling large datasets, often by using imprecise but concise representations of the data, such as a sketch or a sample. The lectures will cluster around three themes:
Nearest neighbor search (similarity search): given a set of objects (e.g., images), construct a data structure so that later, given a query object, one can efficiently find the most similar object in the database.
Streaming framework: solve a problem on a large collection of items that is streamed through once, i.e., the algorithm's memory footprint is much smaller than the dataset itself. For example, how can a router with 1 MB of memory estimate the number of distinct IPs it sees in multi-gigabyte real-time traffic?
Parallel framework: problems where neither the data nor the output fits on one machine. For example, given a set of 2D points, how can we compute their minimum spanning tree over a cluster of machines?
The focus will be on techniques such as sketching, dimensionality reduction, sampling, and hashing.
Sketching, Sampling and other Sublinear Algorithms:
Nearest Neighbor Search
Alex Andoni (MSR SVC)
Nearest Neighbor Search (NNS)
Preprocess: a set of points 𝑃
Query: given a query point 𝑞, report a point 𝑝 ∈ 𝑃 with the smallest distance to 𝑞
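The definition translates directly into the linear-scan baseline that everything below tries to beat; a minimal sketch, assuming points are rows of a NumPy array and distances are Euclidean:

```python
import numpy as np

def nearest_neighbor(points, q):
    """Return the point in `points` closest to query `q` (Euclidean).

    points: (n, d) array of dataset points; q: (d,) query point.
    Runs in O(n * d) time -- the 'no indexing' row of the table below.
    """
    dists = np.linalg.norm(points - q, axis=1)  # distance to every point
    return points[np.argmin(dists)]
```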
Motivation
Generic setup: points model objects (e.g., images); distance models a (dis)similarity measure
Application areas: machine learning (the k-NN rule), speech/image/video/music recognition, vector quantization, bioinformatics, etc.
Distance can be: Hamming, Euclidean, edit distance, Earth-mover distance, etc.
Primitive for other problems: finding the similar pairs in a set 𝐷, clustering, …
Example (two points in Hamming space):
𝑝 = 000000011100010100000100010100011111
𝑞 = 000000001100000100000100110100111111
Lecture Plan
1. Locality-Sensitive Hashing
2. LSH as a Sketch
3. Towards Embeddings
2D case
Compute the Voronoi diagram of the point set
Given a query 𝑞, perform point location in the diagram
Performance:
Space: O(𝑛)
Query time: O(log 𝑛)
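SciPy ships a Voronoi construction (scipy.spatial.Voronoi) but no point-location query on the resulting diagram; in practice the same logarithmic-time 2D query is usually obtained from a k-d tree, as in this sketch (dataset and query are made up):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))      # n points in the plane
tree = KDTree(points)               # preprocessing: O(n log n)
dist, idx = tree.query([0.5, 0.5])  # nearest neighbor of the query point
print(points[idx], dist)
```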
High-dimensional case
All exact algorithms degrade rapidly with the dimension 𝑑
In practice: when 𝑑 is "low-medium", kd-trees work reasonably; when 𝑑 is "high", the state of the art is unsatisfactory
Algorithm                 | Query time   | Space
Full indexing             | O(𝑑 · log 𝑛) | 𝑛^O(𝑑) (Voronoi diagram size)
No indexing (linear scan) | O(𝑛 · 𝑑)     | O(𝑛 · 𝑑)
Approximate NNS
c-approximate r-near neighbor: given a query point 𝑞, report a point 𝑝′ with dist(𝑝′, 𝑞) ≤ 𝑐𝑟 if there exists a point 𝑝 at distance at most 𝑟 from 𝑞
Randomized: such a point is returned with 90% probability
(figure: query 𝑞, a near neighbor 𝑝 within radius 𝑟, and the approximation radius 𝑐𝑟)
Heuristic for Exact NNS
r-near neighbor: given a query point 𝑞, report the set of all points 𝑝 with dist(𝑝, 𝑞) ≤ 𝑟 (each reported with 90% probability)
The set may also contain some c-approximate near neighbors 𝑝′ with dist(𝑝′, 𝑞) ≤ 𝑐𝑟
Can filter out these bad answers by checking distances explicitly
(figure: same picture as before, with radii 𝑟 and 𝑐𝑟 around 𝑞)
Approximation Algorithms for NNS
A vast literature:
milder dependence on dimension: [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], … [Aiger-Kaplan-Sharir'13]
little to no dependence on dimension: [Indyk-Motwani'98], [Kushilevitz-Ostrovsky-Rabani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A-Indyk'06], … [A-Indyk-Nguyen-Razenshteyn'??]
Locality-Sensitive Hashing
[Indyk-Motwani'98]
Random hash function 𝑔 such that for any points 𝑝, 𝑞:
Close: if dist(𝑝, 𝑞) ≤ 𝑟, then Pr[𝑔(𝑝) = 𝑔(𝑞)] is "high": 𝑃₁ = "not-so-small"
Far: if dist(𝑝, 𝑞) > 𝑐𝑟, then Pr[𝑔(𝑝) = 𝑔(𝑞)] is "small": 𝑃₂
Use several hash tables: 𝐿 = 𝑛^𝜌, where 𝜌 = log(1/𝑃₁) / log(1/𝑃₂)
(figure: Pr[𝑔(𝑝) = 𝑔(𝑞)] as a function of dist(𝑝, 𝑞), equal to 𝑃₁ at distance 𝑟 and dropping to 𝑃₂ at distance 𝑐𝑟)
Locality sensitive hash functions
Hash function 𝑔 is usually a concatenation of "primitive" functions:
𝑔(𝑝) = ⟨ℎ₁(𝑝), ℎ₂(𝑝), …, ℎₖ(𝑝)⟩
Example: Hamming space {0,1}^𝑑: ℎᵢ(𝑝) = 𝑝ⱼ for a random coordinate 𝑗, i.e., each ℎᵢ chooses one bit at random; 𝑔 chooses 𝑘 bits at random
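A minimal sketch of this primitive-and-concatenation scheme for Hamming space (the function name sample_g and the use of NumPy are illustrative choices, not from the slides):

```python
import numpy as np

def sample_g(d, k, rng):
    """Sample g = (h_1, ..., h_k) for {0,1}^d: each h_i reads one random bit."""
    coords = rng.integers(0, d, size=k)   # the k sampled bit positions
    return lambda p: tuple(p[coords])     # g(p) = those k bits of p

# e.g.: g = sample_g(d=36, k=8, rng=np.random.default_rng(1)); key = g(point)
```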
Formal description
Data structure is just 𝐿 hash tables:
Each hash table uses a fresh random function 𝑔ᵢ(𝑝) = ⟨ℎᵢ₁(𝑝), …, ℎᵢₖ(𝑝)⟩
Hash all dataset points into the table
Query: check for collisions in each of the 𝐿 hash tables until we encounter a point within distance 𝑐𝑟
Guarantees:
Space: O(𝑛𝐿), plus the space to store the points
Query time: O(𝐿 · (𝑘 + 𝑑)) in expectation
50% probability of success
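Putting the tables together, a hedged end-to-end sketch for Hamming space; the class and parameter names are mine, and the query verifies each candidate with an exact distance computation as the slide prescribes:

```python
import numpy as np
from collections import defaultdict

class HammingLSH:
    """L hash tables; table i is keyed by k randomly sampled bit positions."""

    def __init__(self, points, L, k, seed=0):
        self.points = points                      # (n, d) array of 0/1 ints
        rng = np.random.default_rng(seed)
        d = points.shape[1]
        # A fresh g_i per table: k coordinates sampled independently.
        self.coords = [rng.integers(0, d, size=k) for _ in range(L)]
        self.tables = []
        for c in self.coords:
            table = defaultdict(list)
            for j, p in enumerate(points):        # hash every dataset point
                table[tuple(p[c])].append(j)
            self.tables.append(table)

    def query(self, q, cr):
        """Scan the L tables; return a point within distance cr of q, or None."""
        for c, table in zip(self.coords, self.tables):
            for j in table.get(tuple(q[c]), ()):  # points colliding with q
                if np.count_nonzero(self.points[j] != q) <= cr:
                    return self.points[j]
        return None
```

For example, one might build index = HammingLSH(points, L=30, k=12) and call index.query(q, cr=8).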
Analysis of LSH Scheme
How did we pick 𝐿? For a fixed 𝑘, we have
Pr[collision of a close pair] = 𝑃₁ᵏ
Pr[collision of a far pair] = 𝑃₂ᵏ
Want to make 𝑃₂ᵏ = 1/𝑛: set 𝑘 = log 𝑛 / log(1/𝑃₂)
Then 𝑃₁ᵏ = 𝑛^(−𝜌) and 𝐿 = 1/𝑃₁ᵏ = 𝑛^𝜌, where 𝜌 = log(1/𝑃₁) / log(1/𝑃₂)
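Spelling out the middle step (using the slide's definitions of 𝑃₁, 𝑃₂, 𝜌):

$$P_2^k = \frac{1}{n} \;\Rightarrow\; k = \frac{\log n}{\log 1/P_2}, \qquad P_1^k = n^{-\frac{\log 1/P_1}{\log 1/P_2}} = n^{-\rho}, \qquad L = \frac{1}{P_1^k} = n^{\rho}$$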
Analysis: Correctness
Let 𝑝 be an 𝑟-near neighbor of 𝑞; if none exists, the algorithm may output anything
The algorithm fails when the near neighbor 𝑝 is in none of the searched buckets
Probability of failure:
probability that 𝑝 and 𝑞 do not collide in one hash table: at most 1 − 𝑃₁ᵏ
probability that they do not collide in any of the 𝐿 hash tables: at most (1 − 𝑃₁ᵏ)^𝐿 = (1 − 1/𝐿)^𝐿 ≤ 1/𝑒
Analysis: Runtime
Runtime is dominated by:
hash function evaluations: O(𝐿 · 𝑘) time
distance computations to the points in the retrieved buckets
Distance computations: we only need to worry about far points, at distance > 𝑐𝑟
In one hash table, the probability that a fixed far point collides with 𝑞 is at most 𝑃₂ᵏ = 1/𝑛
Expected number of far points in one bucket: 𝑛 · (1/𝑛) = 1
Over 𝐿 hash tables, the expected number of far points encountered is 𝐿
Total: O(𝐿 · (𝑘 + 𝑑)) in expectation
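A worked instance, assuming the Hamming family from the earlier slide (for which 𝜌 ≤ 1/𝑐 by Bernoulli's inequality):

$$P_1 = 1 - \frac{r}{d}, \quad P_2 = 1 - \frac{cr}{d} \;\Rightarrow\; \rho = \frac{\log 1/P_1}{\log 1/P_2} \le \frac{1}{c}$$

For example, with 𝑛 = 10⁶ and 𝑐 = 2: 𝐿 = 𝑛^𝜌 ≤ √𝑛 = 1000 tables, so a query inspects on the order of 1000 far points instead of scanning all 10⁶.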
LSH in the wild
If we want exact NNS, what are 𝑐 and 𝑟? In practice one can choose any parameters 𝐿 and 𝑘; the answers stay correct as long as every colliding point is verified with an exact distance computation
Performance: a trade-off between the number of tables and the number of false positives, which depends on the dataset's "quality"; 𝐿 and 𝑘 can be tuned to optimize for a given dataset
Further advantages:
point insertions/deletions are easy
natural to distribute the computation/hash tables across a cluster
(diagram: the (𝑘, 𝐿) trade-off — larger 𝑘 gives fewer false positives but needs more tables; fewer tables is cheaper but "safety not guaranteed")
LSH Zoo
Hamming distance [IM'98]: ℎ picks a random coordinate (or several)
Manhattan distance: homework
Jaccard distance between sets 𝐴, 𝐵: one minus the Jaccard coefficient 𝐽(𝐴, 𝐵) = |𝐴 ∩ 𝐵| / |𝐴 ∪ 𝐵|
ℎ: pick a random permutation 𝜋 on the universe and hash a set to its minimum element under 𝜋; then Pr[ℎ(𝐴) = ℎ(𝐵)] = 𝐽(𝐴, 𝐵)
This is min-wise hashing [Bro'97]; see the example and sketch below
Euclidean distance: next lecture
Example:
"to be or not to be" → {be, not, or, to}
"to sketch or not to sketch" → {not, or, to, sketch}
Universe: {be, to, sketch, or, not}
Each set hashes to its minimum element under a random permutation of the universe; e.g., if "not" comes first in the permutation, both sets hash to "not" and collide
Pr[collision] = |{not, or, to}| / |{be, not, or, to, sketch}| = 3/5
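A minimal min-wise hashing sketch mirroring this example (variable names are illustrative):

```python
import random

def minhash(word_set, permutation):
    """Hash a set to its minimum element under the given permutation."""
    return min(word_set, key=permutation.index)

universe = ["be", "to", "sketch", "or", "not"]
A = {"be", "not", "or", "to"}        # "to be or not to be"
B = {"not", "or", "to", "sketch"}    # "to sketch or not to sketch"

perm = universe[:]
random.shuffle(perm)                 # a random permutation pi of the universe
print(minhash(A, perm), minhash(B, perm))
# Over the choice of perm: Pr[h(A) == h(B)] = |A & B| / |A | B| = 3/5
```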