Nearest Neighbor Analysis of Customer Behavior

Nearest Neighbor Customer Insight

Nearest neighbor models are conceptually just about the simplest kind of model possible. The problem is that they generally aren’t feasible to apply. Or at least, they weren’t feasible until the advent of Big Data techniques. These slides will describe some of the techniques used in the knn project to reduce thousand-year computations to a few hours. The knn project uses the Mahout math library and Hadoop to speed up these enormous computations to the point that they can be usefully applied to real problems. These same techniques can also be used to do real-time model scoring.


Page 1: Nearest Neighbor Customer Insight

Nearest Neighbor Analysis of Customer Behavior

Page 2: Nearest Neighbor Customer Insight

whoami – Chao Yuan

• SVP, Risk and Information Management, American Express

Page 3: Nearest Neighbor Customer Insight

whoami – Ted Dunning

• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation

– particularly Mahout, Zookeeper and Drill

• Contact me: [email protected], [email protected], @ted_dunning

• Get slides and more info at http://www.mapr.com/company/events/speaking/oanyc-9-27-12

Page 4: Nearest Neighbor Customer Insight

Agenda – The Business Side

• Digital Transformation

• Modeling opportunity

• Potential applications of agile modeling

• Required scale and speed of KNN

Page 5: Nearest Neighbor Customer Insight

Agenda – The Math Side

• Nearest neighbor models
  – Colored dots; need good distance metric; projection, LSH and k-means search

• K-means algorithms
  – O(k d log n) per point for Lloyd’s algorithm … not good for k = 2000, n = 10^8
  – Surrogate methods
    • fast, sloppy single pass clustering with κ = k log n
    • fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
    • fast, in-memory, high-quality clustering of κ weighted centroids
    • result consists of k high-quality centroids for the original data

• Results

Page 6: Nearest Neighbor Customer Insight

Context

• Digital transformation.

• Data helps us better serve our customers.

• Privacy is paramount.

Page 7: Nearest Neighbor Customer Insight

Our Business

• We are, and strive to remain, best-in-class.

• We have 100 million cards in circulation.

• Quick and accurate decision-making is key.
  – Marketing offers
  – Fraud prevention

Page 8: Nearest Neighbor Customer Insight

Opportunity

• Demand for modeling is increasing rapidly

• So we are testing something simpler and more agile

• Like k-nearest neighbor

Page 9: Nearest Neighbor Customer Insight

What’s that?

• Find the k nearest training examples – lookalike customers

• This is easy … but hard
  – easy because it is so conceptually simple and you don’t have knobs to turn or models to build
  – hard because of the stunning amount of math
  – also hard because we need the top 50,000 results

• Initial rapid prototype was massively too slow
  – 3K queries x 200K examples takes hours
  – needed 20M x 25M in the same time
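To see why the prototype was so slow, it helps to see what a brute-force k-nearest-neighbor query actually does. This is an illustrative Python sketch (not the Amex implementation, which was built on Mahout): every query scans every reference, so the cost per query is O(n·d).

```python
import heapq
import random

def knn_brute(query, examples, k):
    """Brute-force kNN: score every example against the query.
    Cost is O(n * d) per query, which is what makes 20M x 25M infeasible."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # heapq.nsmallest keeps only k candidates while scanning all n examples
    return heapq.nsmallest(k, examples, key=lambda e: dist2(query, e))

random.seed(0)
examples = [[random.gauss(0, 1) for _ in range(10)] for _ in range(1000)]
query = examples[42]
nearest = knn_brute(query, examples, k=5)
# the query itself is in the example set, so it is its own nearest neighbor
assert nearest[0] == query
```

At 1,000 examples this is instant; scaling the same loop to 25 million references per query is where the petaflops in the later slides come from.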

Page 10: Nearest Neighbor Customer Insight

Comparison to Other Modeling Approaches

• Logistic regression

• Tree-based methods

Page 11: Nearest Neighbor Customer Insight

K-Nearest Neighbor Example

Page 12: Nearest Neighbor Customer Insight

Required Scale and Speed and Accuracy

• Want 20 million queries against 25 million references in 10,000 s

• Should be able to search > 100 million references

• Should be linearly and horizontally scalable

• Must have > 50% overlap against reference search

• Evaluation by sub-sampling is viable, but tricky

Page 13: Nearest Neighbor Customer Insight

How Hard is That?

• 20 M x 25 M x 100 FLOP = 50 PFLOP

• 1 CPU ≈ 5 GFLOPS

• We need 10 M CPU-seconds; in a 10,000 s budget that means 1,000 CPUs

• Real-world efficiency losses may increase that by 10x

• Not good!

Page 14: Nearest Neighbor Customer Insight

How Can We Search Faster?

• First rule: don’t do it
  – If we can eliminate most candidates, we can do less work
  – Projection search and k-means search

• Second rule: don’t do it
  – We can convert big floating point math to clever bit-wise integer math
  – Locality sensitive hashing

• Third rule: reduce dimensionality
  – Projection search
  – Random projection for very high dimension

Page 15: Nearest Neighbor Customer Insight

Projection Search

java.lang.TreeSet!
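The slide’s point is that a sorted structure like java.lang.TreeSet does most of the work: project every point onto a random direction, keep the projections sorted, and only examine points whose projection lands near the query’s. A hedged Python sketch, using a sorted list with `bisect` as the TreeSet analogue:

```python
import bisect
import random

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def dist2(a, b): return sum((x - y) ** 2 for x, y in zip(a, b))

random.seed(1)
d = 10
points = [[random.gauss(0, 1) for _ in range(d)] for _ in range(2000)]
u = [random.gauss(0, 1) for _ in range(d)]   # one random projection direction

# analogue of java.lang.TreeSet: (projection, index) pairs kept in sorted order
index = sorted((dot(p, u), i) for i, p in enumerate(points))
keys = [k for k, _ in index]

def projection_search(q, width=50):
    """Examine only `width` points on each side of the query's projection."""
    pos = bisect.bisect_left(keys, dot(q, u))
    candidates = index[max(0, pos - width):pos + width]
    return min(candidates, key=lambda ki: dist2(points[ki[1]], q))[1]

q = points[7]
assert projection_search(q) == 7   # the point finds itself in the narrow window
```

One projection loses a lot of neighbors (a far point can project close); the next slide’s question, “how many projections?”, is exactly about buying back recall by repeating this with several independent directions.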

Page 16: Nearest Neighbor Customer Insight

How Many Projections?

Page 17: Nearest Neighbor Customer Insight

LSH Search

• Each random projection produces an independent sign bit
• If two vectors have the same projected sign bits, they probably point in the same direction (i.e. cos θ ≈ 1)
• Distance in L2 is closely related to cosine
• We can replace (some) vector dot products with long integer XOR
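A minimal sketch of that idea (illustrative only): pack one sign bit per random hyperplane into an integer, then compare vectors with XOR plus popcount instead of floating point dot products. The expected number of mismatched bits is proportional to the angle between the vectors.

```python
import random

def dot(a, b): return sum(x * y for x, y in zip(a, b))

def lsh_signature(v, planes):
    """Pack one sign bit per random projection into a single integer."""
    sig = 0
    for j, u in enumerate(planes):
        if dot(v, u) > 0:
            sig |= 1 << j
    return sig

# Hamming distance via XOR + popcount replaces the vector dot product
def bit_mismatch(x, y): return bin(x ^ y).count("1")

random.seed(2)
d, nbits = 20, 64
planes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(nbits)]

a = [random.gauss(0, 1) for _ in range(d)]
b = [x + 0.01 * random.gauss(0, 1) for x in a]   # nearly parallel to a
c = [-x for x in a]                              # exactly opposite direction

sa, sb, sc = (lsh_signature(v, planes) for v in (a, b, c))
assert bit_mismatch(sa, sb) < bit_mismatch(sa, sc)
# opposite vectors flip every sign bit, so all 64 bits mismatch
assert bit_mismatch(sa, sc) == nbits
```

The mismatch count is a cheap, monotone proxy for cos θ, which is what the bit-match-versus-cosine plot on the next slide illustrates.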

Page 18: Nearest Neighbor Customer Insight

LSH Bit-match Versus Cosine

Page 19: Nearest Neighbor Customer Insight

Results

Page 20: Nearest Neighbor Customer Insight

K-means Search

• First do clustering with lots (thousands) of clusters

• Then search nearest clusters to find nearest points

• We win if we find >50% overlap with “true” answer

• We lose if we can’t cluster super-fast
  – more on this later
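The search step itself is simple once clusters exist. A hedged sketch, using known cluster centers on synthetic clumpy data rather than learned ones (fast learning is the subject of the later slides): probe only the few clusters whose centroids are nearest the query.

```python
import random

def dist2(a, b): return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(4)
# clumpy data: 20 clusters of 50 points each around known centers
centers = [[rng.uniform(-10, 10), rng.uniform(-10, 10)] for _ in range(20)]
points, owner = [], []
for j, c in enumerate(centers):
    for _ in range(50):
        points.append([c[0] + rng.gauss(0, 0.3), c[1] + rng.gauss(0, 0.3)])
        owner.append(j)

def kmeans_search(q, probes=3):
    """Scan only points whose cluster centroid is among the `probes` nearest."""
    order = sorted(range(len(centers)), key=lambda j: dist2(q, centers[j]))
    near = set(order[:probes])
    candidates = [p for p, o in zip(points, owner) if o in near]
    return min(candidates, key=lambda p: dist2(q, p))

q = points[0]
assert kmeans_search(q) == q   # found while scanning only 3 of 20 clusters
```

With `probes=3` out of 20 clusters, roughly 85% of the distance computations are skipped; the overlap-versus-speedup trade comes from how many clusters you probe.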

Page 21: Nearest Neighbor Customer Insight

Lots of Clusters Are Fine

Page 22: Nearest Neighbor Customer Insight

Lots of Clusters Are Fine

Page 23: Nearest Neighbor Customer Insight

Some Details

• Clumpy data works better
  – Real data is clumpy

• Speedups of 100-200x seem practical with 50% overlap
  – Projection search and LSH can be used to accelerate that (some)

• More experiments needed

• Definitely need fast search

Page 24: Nearest Neighbor Customer Insight

Lloyd’s Algorithm

• Part of CS folk-lore
• Developed in the late 50’s for signal quantization, published in the 80’s

initialize k cluster centroids somehow
for each of many iterations:
    for each data point:
        assign point to nearest cluster
    recompute cluster centroids from points assigned to clusters

• Highly variable quality, several restarts recommended
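The pseudocode above transcribes almost line-for-line into Python. This is a minimal sketch, not production code; it also demonstrates the restart advice by taking the lowest-cost result of several random initializations.

```python
import random

def dist2(a, b): return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, k, iters=10, seed=0):
    """Lloyd's algorithm, transcribed from the pseudocode above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # initialize k centroids somehow
    for _ in range(iters):                       # for each of many iterations
        buckets = [[] for _ in range(k)]
        for p in points:                         # assign point to nearest cluster
            buckets[min(range(k), key=lambda j: dist2(p, centroids[j]))].append(p)
        # recompute cluster centroids from the points assigned to them
        centroids = [[sum(col) / len(b) for col in zip(*b)] if b else centroids[j]
                     for j, b in enumerate(buckets)]
    return centroids

def cost(points, centroids):
    return sum(min(dist2(p, c) for c in centroids) for p in points)

random.seed(5)
pts = ([[random.gauss(0, .1), random.gauss(0, .1)] for _ in range(100)] +
       [[random.gauss(5, .1), random.gauss(5, .1)] for _ in range(100)])
# quality varies with initialization, so take the best of several restarts
best = min((lloyd(pts, 2, seed=s) for s in range(5)), key=lambda c: cost(pts, c))
assert sorted(round(c[0]) for c in best) == [0, 5]
```

Note the cost structure: every point is compared against all k centroids on every iteration, which is the O(k d log n)-per-point bill the later cost slide complains about.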

Page 25: Nearest Neighbor Customer Insight

Ball k-means

• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of real clusters
• Avoids outliers in centroid computation

initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer than closest cluster
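A hedged sketch of the ball-trimming idea (the 0.25 cutoff and the farthest-point initialization are illustrative choices, not the published algorithm's exact constants): a point contributes to its centroid only if it is much closer to that centroid than to the runner-up, so stragglers between clusters get trimmed.

```python
import random

def dist2(a, b): return sum((x - y) ** 2 for x, y in zip(a, b))

def ball_kmeans(points, k, iters=3):
    # distance-maximizing initialization: start from an arbitrary point
    # (randomized in practice), then repeatedly add the point farthest
    # from every centroid chosen so far
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    for _ in range(iters):                          # a very few iterations
        buckets = [[] for _ in range(k)]
        for p in points:
            d = [dist2(p, c) for c in centroids]
            j = min(range(k), key=d.__getitem__)
            runner_up = min(d[i] for i in range(k) if i != j)
            # keep only points much closer to their own centroid than to any
            # other, so in-between stragglers don't drag the centroid around
            if d[j] < 0.25 * runner_up:
                buckets[j].append(p)
        centroids = [[sum(col) / len(b) for col in zip(*b)] if b else centroids[j]
                     for j, b in enumerate(buckets)]
    return centroids

random.seed(6)
pts = ([[random.gauss(0, .2), random.gauss(0, .2)] for _ in range(100)] +
       [[random.gauss(8, .2), random.gauss(8, .2)] for _ in range(100)] +
       [[random.gauss(4, .2), random.gauss(4, .2)] for _ in range(10)])  # stragglers
cents = ball_kmeans(pts, k=2)
# the in-between stragglers are trimmed away; centroids sit on the blob cores
assert sorted(round(c[0]) for c in cents) == [0, 8]
```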

Page 26: Nearest Neighbor Customer Insight

Surrogate Method

• Start with sloppy clustering into κ = k log n clusters
• Use these clusters as a weighted surrogate for the data
• Cluster surrogate data using ball k-means

• Results are provably high quality for highly clusterable data

• Sloppy clustering can be done on-line
• Surrogate can be kept in memory
• Ball k-means pass can be done at any time

Page 27: Nearest Neighbor Customer Insight

Algorithm Costs

• O(k d log n) per point for Lloyd’s algorithm … not so good for k = 2000, n = 10^8

• Surrogate methods
  – fast, sloppy single pass clustering with κ = k log n
  – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
  – fast, in-memory, high-quality clustering of κ weighted centroids
  – result consists of k high-quality centroids

• This is a big deal:
  – k d log n = 2000 x 10 x 26 ≈ 500,000
  – d (log k + log log n) = 10 x (11 + 5) = 160
  – roughly 3000 times faster makes the grade as a bona fide big deal
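The arithmetic on this slide checks out numerically; a one-liner sanity check with the slide’s own parameters:

```python
import math

# the slide's cost accounting: k = 2000 clusters, d = 10 dims, n = 10^8 points
k, d, n = 2000, 10, 10**8
lloyd_per_point = k * d * math.log2(n)                              # ~ 5 * 10**5
surrogate_per_point = d * (math.log2(k) + math.log2(math.log2(n)))  # ~ 160
speedup = lloyd_per_point / surrogate_per_point
assert 3000 < speedup < 3500   # "roughly 3000 times faster"
```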

Page 28: Nearest Neighbor Customer Insight

The Internals

• Mechanism for extending Mahout Vectors
  – DelegatingVector, WeightedVector, Centroid

• Searcher interface
  – ProjectionSearch, KmeansSearch, LshSearch, Brute

• Super-fast clustering
  – Kmeans, StreamingKmeans

Page 29: Nearest Neighbor Customer Insight

How It Works

• For each point
  – Find approximately nearest centroid (distance = d)
  – If d > threshold, new centroid
  – Else possibly new cluster
  – Else add to nearest centroid

• If centroids > K ~ C log N
  – Recursively cluster centroids with higher threshold

• Result is large set of centroids
  – these provide approximation of original distribution
  – we can cluster centroids to get a close approximation of clustering original
  – or we can just use the result directly
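The per-point loop above can be sketched in a few lines of Python. This is a simplified illustration: it omits the “possibly new cluster” randomization and the recursive collapse when centroids exceed K ≈ C log N, and it uses a linear scan where the real StreamingKmeans uses a fast approximate search.

```python
import random

def dist2(a, b): return sum((x - y) ** 2 for x, y in zip(a, b))

def streaming_cluster(points, threshold):
    """Single-pass sloppy clustering; centroids carry a weight (point count)."""
    centroids = []                              # list of [vector, weight]
    for p in points:
        # find nearest centroid (a fast approximate search in the real thing)
        near = min(centroids, key=lambda c: dist2(p, c[0]), default=None)
        if near is None or dist2(p, near[0]) > threshold:
            centroids.append([list(p), 1])      # start a new centroid
        else:                                   # fold point into nearest centroid
            w = near[1]
            near[0] = [(w * x + y) / (w + 1) for x, y in zip(near[0], p)]
            near[1] = w + 1
    return centroids

random.seed(7)
pts = [[random.gauss(cx, .1), random.gauss(cy, .1)]
       for cx, cy in [(0, 0), (5, 0), (0, 5)] for _ in range(200)]
random.shuffle(pts)
surrogate = streaming_cluster(pts, threshold=1.0)
# far fewer centroids than points, but the total weight equals the data size
assert len(surrogate) < len(pts) // 10
assert sum(w for _, w in surrogate) == len(pts)
```

The weighted centroids are exactly the in-memory surrogate that the final ball k-means pass consumes.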

Page 30: Nearest Neighbor Customer Insight

Parallel Speedup?

Page 31: Nearest Neighbor Customer Insight

What About Map-Reduce?

• Map-reduce implementation is nearly trivial
  – Compute surrogate on each split
  – Total surrogate is union of all partial surrogates
  – Do in-memory clustering on total surrogate

• Threaded version shows linear speedup already
  – Map-reduce speedup is likely, but not entirely guaranteed
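The reason it is nearly trivial: the total surrogate is just the union (concatenation) of per-split surrogates, so no shuffle logic is needed beyond collecting weighted centroids. A single-process illustration of that structure (not the actual Hadoop/Mahout code):

```python
import random

def dist2(a, b): return sum((x - y) ** 2 for x, y in zip(a, b))

def sloppy(points, threshold=1.0):
    """One sloppy single-pass clustering of one input split ("map")."""
    cents = []                                   # [vector, weight] pairs
    for p in points:
        near = min(cents, key=lambda c: dist2(p, c[0]), default=None)
        if near is None or dist2(p, near[0]) > threshold:
            cents.append([list(p), 1])
        else:
            w = near[1]
            near[0] = [(w * x + y) / (w + 1) for x, y in zip(near[0], p)]
            near[1] = w + 1
    return cents

random.seed(8)
pts = [[random.gauss(cx, .1), random.gauss(cx, .1)]
       for cx in (0, 5, 10) for _ in range(300)]
random.shuffle(pts)

# "map": cluster each split independently; "reduce": the total surrogate is
# just the union of the per-split surrogates, ready for in-memory clustering
splits = [pts[i::4] for i in range(4)]
surrogate = [c for split in splits for c in sloppy(split)]

assert sum(w for _, w in surrogate) == len(pts)   # no points lost in the union
assert len(surrogate) < len(pts) // 10            # and it is much smaller
```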

Page 32: Nearest Neighbor Customer Insight

How Well Does it Work?

• Theoretical guarantees for well clusterable data
  – Shindler, Wong and Meyerson, NIPS, 2011

• Evaluation on synthetic data
  – Rough clustering produces correct surrogates
  – Possible issue in ball k-means initialization (still produces good clustering on test data)

Page 33: Nearest Neighbor Customer Insight

Summary

• Nearest neighbor algorithms can be blazing fast

• But you need blazing fast clustering– Which we now have

Page 34: Nearest Neighbor Customer Insight

Contact Us!

• We’re hiring at MapR in California

• We’re hiring at Amex in Phoenix and New York

• Come get the slides at http://www.mapr.com/company/events/speaking/oanyc-9-27-12

• Contact Ted at [email protected] or @ted_dunning
• Contact Chao at [email protected]