
Page 1: Summer School on Hashing’14 Dimension Reduction

Summer School on Hashing’14

Dimension Reduction

Alex Andoni (Microsoft Research)

Page 2: Summer School on Hashing’14 Dimension Reduction

Nearest Neighbor Search (NNS)

Preprocess: a set P of n points in d-dimensional space

Query: given a query point q, report a point p in P with the smallest distance to q


Page 3: Summer School on Hashing’14 Dimension Reduction

Motivation

Generic setup: points model objects (e.g. images); distance models a (dis)similarity measure

Application areas: machine learning (k-NN rule), speech/image/video/music recognition, vector quantization, bioinformatics, etc…

Distance can be: Hamming, Euclidean, edit distance, Earth-mover distance, etc…

Primitive for other problems: finding the similar pairs in a set D, clustering…

Example (Hamming distance between a query q and a point p):
000000011100010100000100010100011111
000000001100000100000100110100111111

Page 4: Summer School on Hashing’14 Dimension Reduction

2D case

Compute the Voronoi diagram of the point set
Given a query q, perform point location
Performance:
Space: O(n)
Query time: O(log n)

Page 5: Summer School on Hashing’14 Dimension Reduction

High-dimensional case

All exact algorithms degrade rapidly with the dimension

Algorithm                  | Query time   | Space
Full indexing              | O(d · log n) | n^O(d) (Voronoi diagram size)
No indexing – linear scan  | O(n · d)     | O(n · d)
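To make the "no indexing" row concrete, here is a minimal sketch of a linear-scan query in NumPy (the function name linear_scan_nns is illustrative, not from the slides); both query time and space are O(n·d).

```python
# Minimal sketch of the "no indexing" baseline: linear-scan nearest neighbor search.
import numpy as np

def linear_scan_nns(points: np.ndarray, q: np.ndarray) -> int:
    """Return the index of the point in `points` (shape (n, d)) closest to q (shape (d,))."""
    dists = np.linalg.norm(points - q, axis=1)  # n Euclidean distances, O(d) each
    return int(np.argmin(dists))

# Usage:
# P = np.random.default_rng(0).standard_normal((1000, 128))
# q = np.random.default_rng(1).standard_normal(128)
# nearest = P[linear_scan_nns(P, q)]
```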

Page 6: Summer School on Hashing’14 Dimension Reduction

Dimension Reduction

If high dimension is an issue, reduce it?!
"flatten" dimension d into a dimension k ≪ d
Not possible in general: packing bound
But can if we only need to preserve distances for a fixed subset of points
Johnson Lindenstrauss Lemma [JL’84]
Application: NNS in Euclidean space
Trivial scan: O(n·d) query time
Reduce to O(n·k) time if we preprocess (reduce the dataset in advance), plus the time to reduce the dimension of the query point

Page 7: Summer School on Hashing’14 Dimension Reduction

Johnson Lindenstrauss Lemma

There is a randomized linear map A : R^d → R^k, with k ≪ d, that preserves the distance between two fixed vectors up to a 1 ± ε factor:
‖Ax − Ay‖ = (1 ± ε) · ‖x − y‖
with probability at least 1 − exp(−C·ε²·k) (C some constant)
Preserves distances among n points for k = O(ε⁻² · log n)
Time to compute the map: O(d · k)

Page 8: Summer School on Hashing’14 Dimension Reduction

Idea: Project onto a random subspace of dimension k!

Page 9: Summer School on Hashing’14 Dimension Reduction

1D embedding

How about one dimension (k = 1)?
Map f(x) = ⟨g, x⟩ = Σᵢ gᵢxᵢ, where g₁, …, g_d are i.i.d. normal (Gaussian) random variables
Why Gaussian? Stability property: Σᵢ gᵢxᵢ is distributed as g·‖x‖, where g is also Gaussian
Equivalently: (g₁, …, g_d) is centrally (spherically) distributed, i.e., it has a random direction, and the projection of x onto a random direction depends only on the length of x
Gaussian pdf: pdf(s) = e^(−s²/2) / √(2π)

Page 10: Summer School on Hashing’14 Dimension Reduction

1D embedding

Map f(x) = ⟨g, x⟩
For any x, y, the map is linear: f(x) − f(y) = f(x − y)
Want: |f(x) − f(y)| ≈ ‖x − y‖; OK to consider just f(x) vs. ‖x‖ since the map is linear
Claim: for any x ∈ R^d, we have
Expectation: E[f(x)²] = ‖x‖²
Standard deviation: std[f(x)²] = Θ(‖x‖²)
Proof (expectation): E[f(x)²] = E[(Σᵢ gᵢxᵢ)²] = Σᵢ xᵢ²·E[gᵢ²] + Σ_{i≠j} xᵢxⱼ·E[gᵢ]·E[gⱼ] = Σᵢ xᵢ² = ‖x‖²
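As a quick sanity check of the claim (an illustrative numerical experiment, not part of the slides), one can verify that E[f(x)²] = ‖x‖² and that f(x) behaves like ‖x‖ times a standard Gaussian:

```python
# Numerical check of the 1D embedding claim: for f(x) = <g, x> with g ~ N(0, I_d),
# E[f(x)^2] = ||x||^2 and f(x) is distributed as ||x|| times a standard Gaussian.
import numpy as np

rng = np.random.default_rng(0)
d, trials = 256, 100_000
x = rng.standard_normal(d)                 # an arbitrary fixed vector

g = rng.standard_normal((trials, d))       # a fresh Gaussian vector per trial
f = g @ x                                  # f(x) = <g, x> in each trial

print(np.mean(f ** 2), np.linalg.norm(x) ** 2)  # expectation matches ||x||^2
print(np.std(f), np.linalg.norm(x))             # f(x) ~ ||x|| * N(0, 1)
```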

Page 11: Summer School on Hashing’14 Dimension Reduction

Full Dimension Reduction

Just repeat the 1D embedding k times!
f(x) = (1/√k) · G·x, where G is a k × d matrix of i.i.d. Gaussian random variables
Again, want to prove: ‖f(x)‖ = (1 ± ε)·‖x‖ for a fixed x, with probability at least 1 − exp(−Ω(ε²·k))
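A minimal sketch of this construction (the 1/√k scaling is the standard choice and is assumed here rather than read off the slides):

```python
# Dense Gaussian JL map: f(x) = (1/sqrt(k)) * G x, with G a k x d matrix of i.i.d. N(0,1) entries.
import numpy as np

def gaussian_jl_matrix(d: int, k: int, seed: int = 0) -> np.ndarray:
    """Sample a k x d Gaussian JL matrix, scaled so that E[||Ax||^2] = ||x||^2."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((k, d)) / np.sqrt(k)

# Usage: with k = O(eps^-2 * log n), all pairwise distances among n points are
# preserved up to a 1 +/- eps factor with high probability.
# X = np.random.default_rng(1).standard_normal((1000, 10_000))  # n points in d dimensions
# A = gaussian_jl_matrix(d=10_000, k=400)
# Y = X @ A.T                                                   # reduced points, n x k
```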

Page 12: Summer School on Hashing’14 Dimension Reduction

Concentration

‖f(x)‖² is distributed as (‖x‖²/k)·(g₁² + g₂² + … + g_k²), where each gᵢ is a standard Gaussian
g₁² + … + g_k² follows the chi-squared distribution with k degrees of freedom
Fact: chi-squared is very well concentrated:
equal to (1 ± ε)·k with probability at least 1 − exp(−Ω(ε²·k))
Akin to the central limit theorem

Page 13: Summer School on Hashing’14 Dimension Reduction

Johnson Lindenstrauss: wrap-up

‖f(x)‖ = (1 ± ε)·‖x‖ with high probability
Beyond Gaussians: can use random ±1 entries instead of Gaussians [AMS’96, Ach’01, TZ’04…]
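The ±1 variant differs only in the entry distribution; a hedged sketch (the exact scaling follows the Gaussian case and is an assumption here):

```python
# Sign-matrix JL variant in the spirit of [Ach'01]: i.i.d. +/-1 entries, 1/sqrt(k) scaling.
import numpy as np

def sign_jl_matrix(d: int, k: int, seed: int = 0) -> np.ndarray:
    """Sample a k x d matrix with i.i.d. uniform {-1, +1} entries, scaled by 1/sqrt(k)."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)
```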

Page 14: Summer School on Hashing’14 Dimension Reduction

Faster JL?

Time to apply the dense map: O(d·k)
Can we compute the map in roughly O(d·log d) time?
Yes! [AC’06, AL’08’11, DKS’10, KN’12…]
Will show: about O(d·log d) time, plus lower-order terms in k [Ailon-Chazelle’06]

Page 15: Summer School on Hashing’14 Dimension Reduction

Fast JL Transform

Costly because G is dense
Meta-approach: use a sparse matrix G?
Suppose we sample s entries per row of G (with an appropriate normalization constant)
Analysis of one row g: want ⟨g, x⟩² ≈ ‖x‖² (after normalization) with good probability
Expectation of ⟨g, x⟩²: correct once the normalization constant is set
What about the variance??

(Figure: a sparse k × d matrix G applied to a vector x, with a normalization constant.)

Page 16: Summer School on Hashing’14 Dimension Reduction

Fast JLT: sparse projection

The variance of ⟨g, x⟩² can be large
Bad case: x is sparse — think of x with only a couple of non-zero coordinates
Even if every coordinate of x goes somewhere, two heavy coordinates collide in the same row (bad) with probability that is far too high: we want failure probability exponentially small, so a plain sparse G would really need many entries per row
But, take away: a sparse G may work if x is “spread around”
New plan: first “spread around” x, then use a sparse G

Page 17: Summer School on Hashing’14 Dimension Reduction

FJLT: full

z = P·H·D·x, where:
D = diagonal d × d matrix with random ±1 on the diagonal
H = Hadamard matrix: H·x can be computed in O(d·log d) time, and H is composed of only ±1 entries (scaled by 1/√d)
P = sparse projection matrix as before, of size k × d

The diagonal D and the Hadamard H (a Fast-Fourier-Transform-like matrix) do the “spreading around”; P is a sparse projection.
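A hedged sketch of the z = P·H·D·x pipeline (the projection P here simply samples k coordinates with replacement and rescales; the precise sparsity pattern and constants of [AC’06] differ and are assumptions of this sketch):

```python
# Sketch of z = P H D x: random sign diagonal, fast Walsh-Hadamard transform,
# then a simplified sparse projection that samples k coordinates.
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform in O(d log d); len(x) must be a power of 2."""
    y = x.astype(float)
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(len(y))

def fjlt_sketch(x: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    d = len(x)
    D = rng.choice([-1.0, 1.0], size=d)   # D: random +/-1 diagonal
    spread = fwht(D * x)                  # H D x: "spread around" version of x
    idx = rng.integers(0, d, size=k)      # P: sample k random coordinates
    return spread[idx] * np.sqrt(d / k)   # rescale so that E[||z||^2] = ||x||^2
```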

Page 18: Summer School on Hashing’14 Dimension Reduction

Spreading around: intuition

Idea for the Hadamard/Fourier transform:
“Uncertainty principle”: if the original x is sparse, then its transform is dense!
Though it can “break” x’s that are already dense — the random diagonal D guards against this

(Recall z = P·H·D·x: D is the diagonal, H the Hadamard / Fast-Fourier-Transform-like “spreading around” step, P the sparse projection matrix.)

Page 19: Summer School on Hashing’14 Dimension Reduction

Spreading around: proof

Suppose ‖x‖ = 1
Ideal spreading around: every coordinate of H·D·x has magnitude about 1/√d
Lemma: with probability at least 1 − 1/poly(d), for each coordinate i we have |(HDx)ᵢ| ≤ O(√(log d)/√d)
Proof:
(HDx)ᵢ = ⟨u, x⟩ for some vector u whose entries are ±1/√d with independent random signs
Hence (HDx)ᵢ is approximately Gaussian with standard deviation ‖x‖/√d = 1/√d
Hence the Gaussian tail bound gives the claim

Page 20: Summer School on Hashing’14 Dimension Reduction

Why projection?

Why aren’t we done? Just choose the first few coordinates of H·D·x?
Each coordinate has the same distribution: approximately Gaussian
Issue: the coordinates are not independent
Nevertheless: ‖H·D·x‖ = ‖x‖, since H·D is a change of basis (a rotation in R^d)

(Recall z = P·H·D·x.)

Page 21: Summer School on Hashing’14 Dimension Reduction

Projection

Have: every coordinate of H·D·x is at most O(√(log d)/√d) in magnitude, with high probability
P = projection onto just k random coordinates (suitably rescaled)!
Proof: standard concentration
Chernoff: since every term is small (“spread around”), sampling a relatively small number of coordinates already gives a 1 ± ε approximation of ‖x‖²
Hence a small k suffices

(Recall z = P·H·D·x.)
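An illustrative check (not from the slides) of why sampling coordinates works once the vector is spread around: for a random unit vector, whose coordinates are all O(√(log d)/√d), sampling k coordinates and rescaling estimates the squared norm up to roughly a 1/√k relative error:

```python
# Estimating the squared norm of a "spread around" unit vector by sampling coordinates.
import numpy as np

rng = np.random.default_rng(0)
d, k = 4096, 256
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                            # spread unit vector: |x_i| = O(sqrt(log d / d))

estimates = []
for _ in range(2000):
    idx = rng.integers(0, d, size=k)              # P: k random coordinates
    estimates.append(np.sum(x[idx] ** 2) * d / k) # rescaled so the expectation is ||x||^2 = 1
print(np.mean(estimates), np.std(estimates))      # mean ~ 1, std ~ sqrt(2/k)
```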

Page 22: Summer School on Hashing’14 Dimension Reduction

FJLT: wrap-up

Obtain: ‖z‖ = (1 ± ε)·‖x‖ with probability at least 1 − δ
The dimension of z is larger than optimal; the time to compute z is O(d·log d)
Dimension not optimal: apply a regular (dense) JL map on z to reduce further to k = O(ε⁻²·log(1/δ))
Final time: O(d·log d) plus lower-order terms in k [AC’06, AL’08’11, WDLSA’09, DKS’10, KN’12, BN, NPW’14]

(Recall z = P·H·D·x.)

Page 23: Summer School on Hashing’14 Dimension Reduction

Dimension Reduction: beyond Euclidean

Johnson-Lindenstrauss: for Euclidean space (ℓ₂), dimension O(ε⁻²·log n), oblivious to the data
Other norms, such as ℓ₁? Essentially no: [CS’02, BC’03, LN’04, JN’10…]
For n points and a fixed approximation, the required dimension is polynomial in n [BC03, NR10, ANN10…], even if the map is allowed to depend on the dataset!
But can do weak dimension reduction

Page 24: Summer School on Hashing’14 Dimension Reduction

Towards dimension reduction for ℓ₁

Can we do the “analog” of Euclidean projections?
For ℓ₂, we used the stability property of the Gaussian distribution: c₁x₁ + … + c_d·x_d is distributed as c·‖x‖₂ when c, c₁, …, c_d are i.i.d. Gaussian
Is there something similar for the 1-norm? Yes: the Cauchy distribution!
1-stable: c₁x₁ + … + c_d·x_d is distributed as c·‖x‖₁ when c, c₁, …, c_d are i.i.d. Cauchy
What’s wrong then? Cauchy random variables are heavy-tailed… the distribution does not even have a finite expectation

Cauchy pdf: pdf(s) = 1 / (π(s² + 1))

Page 25: Summer School on Hashing’14 Dimension Reduction

Weak embedding [Indyk’00]

Still, we can consider the map f(x) = C·x as before, where C is a k × d matrix of i.i.d. Cauchy random variables

Each coordinate of f(x) is distributed as ‖x‖₁ times a Cauchy; it does not concentrate at all, but…

Can estimate ‖x‖₁ by: the median of the absolute values of the coordinates! It concentrates because abs(Cauchy) is in the correct range 90% of the time!

Gives a sketch OK for nearest neighbor search
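A hedged sketch of this ℓ₁ sketch (the median-of-absolute-values estimator is as described above; constants and the exact estimator used by [Indyk’00] are assumptions of this sketch):

```python
# Cauchy sketch for the 1-norm: C has i.i.d. standard Cauchy entries, and
# median(|C x|) concentrates around ||x||_1 (the median of |Cauchy| is 1).
import numpy as np

def cauchy_sketch_matrix(d: int, k: int, seed: int = 0) -> np.ndarray:
    """Sample a k x d matrix of i.i.d. standard Cauchy random variables."""
    rng = np.random.default_rng(seed)
    return rng.standard_cauchy((k, d))

def estimate_l1(sketch: np.ndarray) -> float:
    """Estimate ||x||_1 from the sketch C @ x via the median of absolute values."""
    return float(np.median(np.abs(sketch)))

# Usage:
# x = np.random.default_rng(1).standard_normal(10_000)
# C = cauchy_sketch_matrix(d=10_000, k=200)
# print(estimate_l1(C @ x), np.linalg.norm(x, 1))   # the two should be close
```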