The Post Office Problem

DESCRIPTION

A talk I presented at Nashville Hack Day 2012 on using multiple random low-dimensional projections of high-dimensional data to optimize approximate nearest neighbor search.

  • 1. The Post Office Problem: k-d trees, k-nn search, and the Johnson-Lindenstrauss lemma

  • 2. Who am I? Jeremy Holland. Senior lead developer at Centresource. Math and algorithms nerd. @awebneck, github.com/awebneck, freenode: awebneck (you get the idea).

  • 3. If you like the talk... I like scotch. Just putting it out there.

  • 4. What is the Post Office Problem? Don Knuth, professional CS badass. TAOCP, vol. 3. Otherwise known as Nearest Neighbor search. Let's say you've...

  • 5. Just moved to Denmark!

  • 6. But you need to mail a letter! Which post office do you go to?

  • 7. Finding free images of post offices is hard, so... We'll just reduce it to this: q

  • 8. Naive implementation. Calculate distance to all points, find smallest.

      min = Float::INFINITY
      P = ...   # the points to search
      K = ...   # the dimension indices
      q = ...   # the query point
      best = nil
      for p in P do
        dimDistSum = 0
        for k in K do
          dimDistSum += (q[k] - p[k])**2
        end
        dist = Math.sqrt(dimDistSum)
        if dist < min
          min = dist
          best = p
        end
      end
      return best

  • 9. With a little preprocessing... But that takes time! Can we do better? You bet! k-d tree: a binary tree (each node has at most two children). Each node represents a single point in the set to be searched.

  • 10. Each node looks like...
      Domain: the vector describing the point (i.e. [p[0], p[1], ..., p[k-1]])
      Range: some identifying characteristic (e.g. PK in a database)
      Split: a chosen dimension, 0 ≤ split < k
      Left: the left child (left.domain[split] ...

    ... q = for p in P do pq = nns

    98% accuracy when multiple nearest neighbors are selected from each projection and d is reduced from 256 to 15, with approximately 30% of the calculation (see credits). Additional experiments yielded similar results, as did my own. That's pretty darn-tootin' good.

  • 53. Stuff to watch out for. Balancing is vitally important (assuming uniform distribution of points): careful attention must be paid to the selection of nodes (pick the node with the median coordinate on the split axis). Cycle through the axes for each level of the tree: the root should split on 0, level 1 on 1, level 2 on 2, etc.

  • 54. Stuff to watch out for. Building the trees still takes some time. Building the projections is effectively matrix multiplication, time in (Strassen's algorithm). Building the (balanced) trees from the projections takes time in approximately. Solution: build the trees ahead of time and store them for later querying (i.e. index those bad boys!).

  • 55. Thanks! Credits:
      Based in large part on research conducted by Yousuf Ahmed, NYU: http://bit.ly/NZ7ZHo
      K-d trees: J. L. Bentley, Stanford U.: http://bit.ly/Mpy05p
      Dimensionality reduction: W. B. Johnson and J. Lindenstrauss: http://bit.ly/m9SGPN
      Research fuel: Ardbeg Uigeadail: http://bit.ly/fcag0E
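
The node layout from slide 10 and the balancing advice from slide 53 (median split, axes cycled by depth) can be sketched in Ruby roughly as follows. This is a minimal illustration, not the code from the talk: the names `Node`, `build_kd_tree`, `sq_dist` and `nearest` are hypothetical, the input format (hashes with `:domain` and `:range`) is an assumption, and the search is the textbook k-d tree descent with a splitting-plane check.

```ruby
# Hypothetical node layout mirroring slide 10: domain, range, split, left, right.
Node = Struct.new(:domain, :range, :split, :left, :right)

# Build a balanced tree: the node at each level is the point with the median
# coordinate on the split axis, and the axis cycles with depth (0, 1, 2, ...).
# Assumed input: an array of hashes like { domain: [x, y], range: "id" }.
def build_kd_tree(points, depth = 0)
  return nil if points.empty?

  k = points.first[:domain].length
  split = depth % k
  sorted = points.sort_by { |pt| pt[:domain][split] }
  mid = sorted.length / 2
  Node.new(
    sorted[mid][:domain],
    sorted[mid][:range],
    split,
    build_kd_tree(sorted[0...mid], depth + 1),
    build_kd_tree(sorted[(mid + 1)..-1], depth + 1)
  )
end

# Squared Euclidean distance (enough for comparisons, so no square root).
def sq_dist(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Exact nearest-neighbor search: descend toward the query first, then visit
# the other side only if the splitting plane is closer than the best so far.
def nearest(node, q, best = nil)
  return best if node.nil?

  d = sq_dist(q, node.domain)
  best = { node: node, sq_dist: d } if best.nil? || d < best[:sq_dist]

  diff = q[node.split] - node.domain[node.split]
  near, far = diff < 0 ? [node.left, node.right] : [node.right, node.left]

  best = nearest(near, q, best)
  best = nearest(far, q, best) if diff**2 < best[:sq_dist]
  best
end

# Six made-up "post offices" in the plane, each labelled with a letter.
offices = [
  { domain: [2.0, 3.0], range: "A" }, { domain: [5.0, 4.0], range: "B" },
  { domain: [9.0, 6.0], range: "C" }, { domain: [4.0, 7.0], range: "D" },
  { domain: [8.0, 1.0], range: "E" }, { domain: [7.0, 2.0], range: "F" }
]
tree = build_kd_tree(offices)
puts nearest(tree, [9.0, 2.0])[:node].range
```

Sorting at every level keeps the sketch short; a production build would more likely use a linear-time median selection or presort each axis once.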
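
The projection idea summarized in the description and in the results slide (several random low-dimensional projections, a few candidate neighbors taken from each, then a final comparison in the original space) might look something like the sketch below. Everything here is an assumption made for illustration: the function names are hypothetical, the projection matrices use Gaussian entries (one common choice for Johnson-Lindenstrauss style projections), and the per-projection search is a plain scan standing in for the k-d trees the talk builds ahead of time. It makes no claim to reproduce the accuracy figures quoted on the slides.

```ruby
# Standard normal sample via the Box-Muller transform.
def gaussian(rng)
  Math.sqrt(-2.0 * Math.log(1.0 - rng.rand)) * Math.cos(2.0 * Math::PI * rng.rand)
end

# A target_dim x source_dim matrix of Gaussian entries scaled by 1/sqrt(target_dim).
def random_projection(source_dim, target_dim, rng)
  scale = 1.0 / Math.sqrt(target_dim)
  Array.new(target_dim) { Array.new(source_dim) { gaussian(rng) * scale } }
end

# Matrix-vector product: maps a source_dim vector to a target_dim vector.
def project(matrix, v)
  matrix.map { |row| row.zip(v).sum { |a, b| a * b } }
end

def sq_dist(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# "Index" step: for each projection keep the matrix and the projected points.
# (The talk would build a k-d tree over each projected set at this point.)
def build_index(points, target_dim, num_projections, rng)
  source_dim = points.first.length
  Array.new(num_projections) do
    m = random_projection(source_dim, target_dim, rng)
    { matrix: m, projected: points.map { |pt| project(m, pt) } }
  end
end

# Query step: collect a few nearest candidates from every projection, then
# pick the candidate that is closest in the original space.
# Returns the index of the chosen point in `points`.
def approx_nearest(points, index, q, candidates_per_projection)
  candidates = index.flat_map { |proj|
    pq = project(proj[:matrix], q)
    (0...points.length).min_by(candidates_per_projection) { |i|
      sq_dist(pq, proj[:projected][i])
    }
  }.uniq
  candidates.min_by { |i| sq_dist(q, points[i]) }
end

rng = Random.new(1)
points = Array.new(100) { Array.new(256) { rng.rand } }  # made-up data, d = 256
index = build_index(points, 15, 3, rng)                  # 3 projections down to 15 dims
query = Array.new(256) { rng.rand }
puts approx_nearest(points, index, query, 5)
```

Taking several candidates per projection matters because a random projection only approximately preserves distances, so the true nearest neighbor may rank second or third in any single projection.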