1©MapR Technologies - Confidential

Super-Fast Clustering
Report from MapR workshop

Contact:
– [email protected]
– @ted_dunning

Twitter for this talk:
– #mapr_uk

Slides and such:
– http://info.mapr.com/ted-uk-05-2012

Company Background

MapR provides the industry’s best Hadoop Distribution
– Combines the best of the Hadoop community contributions with significant internally financed infrastructure development

Background of Team
– Deep management bench with extensive analytic, storage, virtualization, and open source experience
– Google, EMC, Cisco, VMWare, Network Appliance, IBM, Microsoft, Apache Foundation, Aster Data, Brio, ParAccel

Proven
– MapR used across industries (Financial Services, Media, Telecom, Health Care, Internet Services, Government)
– Strategic OEM relationship with EMC and Cisco
– Over 1,000 installs

We Also Do …

Open source development
– Zookeeper
– Hadoop
– Mahout
– Stuff

Partner workshops
– Machine learning
– Information architecture
– Cluster design

The Problem

A certain bank
– had lots of customers
– had lots of prospective customers
– had a non-trivial number of fraudulent customers
– had a non-trivial number of fraudulent merchants

They also
– collected data
– built models
– collected more data
– built more models

But …

These models were arduous to build

And hard to test

So people suggested something simpler

Like k-nearest neighbor

What’s that?

Find the k nearest training examples
Use the average value of the target variable from them

This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results

Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
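
The k-nearest-neighbor rule and the size of the gap can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the workshop's prototype: the function names, the toy data, and the choice of squared Euclidean distance are all assumptions made for the example. The last lines simply multiply out the query and example counts quoted above.

```python
import heapq


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(query, examples, targets, k):
    """Average the target values of the k training examples nearest to query."""
    # Brute force: every example is compared against the query once.
    nearest = heapq.nsmallest(
        k, range(len(examples)),
        key=lambda i: squared_distance(query, examples[i]))
    return sum(targets[i] for i in nearest) / len(nearest)


# Toy data (hypothetical): two examples near the origin, two far away.
examples = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
targets = [0.0, 0.0, 1.0, 1.0]
prediction = knn_predict((0.1, 0.2), examples, targets, k=2)

# Distance computations needed by brute force, using the counts above.
prototype_work = 3_000 * 200_000          # 6e8 comparisons, took hours
required_work = 20_000_000 * 25_000_000   # 5e14 comparisons, same time budget
scale_up = required_work / prototype_work
```

Under these counts the required workload is roughly 830,000 times the prototype's, which is why a faster search structure is needed rather than a faster loop.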

What We Did

Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid

Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute

Super-fast clustering
– Kmeans, StreamingKmeans

Projection Search

K-means Search

But These Require k-means!

Need a new k-means algorithm to get speed

Streaming k-means is
– One pass (through the original data)
– Very fast (20 µs per data point with threads)
– Very parallelizable

How It Works

For each point
– Find approximately nearest centroid (distance = d)
– If d > threshold, new centroid
– Else possibly new cluster
– Else add to nearest centroid

If centroids > K ~ C log N
– Recursively cluster centroids with higher threshold

Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
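
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Mahout code: the function names are hypothetical, the nearest-centroid search is exact rather than approximate, "possibly new cluster" is taken to mean "with probability d / threshold", and the threshold growth factor of 1.5 is an arbitrary choice.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def streaming_kmeans(points, max_centroids, threshold, rng, growth=1.5):
    """One pass over (point, weight) pairs; threshold must be positive.

    Returns (centroids, threshold); each centroid is [coordinates, weight].
    """
    centroids = []
    for point, weight in points:
        if centroids:
            # Exact search here; the slides use an approximate searcher.
            nearest = min(centroids, key=lambda c: distance(c[0], point))
            d = distance(nearest[0], point)
        else:
            nearest, d = None, math.inf
        # Far points always start a centroid; nearer ones do so with
        # probability d / threshold (assumed reading of "possibly").
        if d > threshold or rng.random() < d / threshold:
            centroids.append([list(point), weight])
        else:
            # Fold the point into the nearest centroid as a weighted mean.
            total = nearest[1] + weight
            nearest[0] = [(c * nearest[1] + x * weight) / total
                          for c, x in zip(nearest[0], point)]
            nearest[1] = total
        if len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster them.
            threshold *= growth
            centroids, threshold = streaming_kmeans(
                centroids, max_centroids, threshold, rng, growth)
    return centroids, threshold


# Hypothetical data: two tight blobs of 200 unit-weight points each.
rng = random.Random(0)
data = [((rng.gauss(cx, 0.1), rng.gauss(cy, 0.1)), 1.0)
        for cx, cy in [(0.0, 0.0), (10.0, 10.0)] for _ in range(200)]
rng.shuffle(data)
sketch, final_threshold = streaming_kmeans(
    data, max_centroids=20, threshold=0.5, rng=rng)
```

The returned centroids carry weights, so the total weight and the center of mass of the data are preserved. That is what lets the sketch stand in for the original points in a later clustering step.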

Parallel Speedup?

Warning, Recursive Descent

Inner loop requires finding nearest centroid

With lots of centroids, this is slow

But wait, we have classes to accelerate that!

(Let’s not use k-means searcher, though)

Thank You