Fast Single-pass K-means Clustering at Oxford

Fast Single-pass k-means Clustering

whoami – Ted Dunning

• Chief Application Architect, MapR Technologies

• Committer, member, Apache Software Foundation

– particularly Mahout, Zookeeper and Drill

• Contact me at tdunning@maprtech.com, tdunning@apache.com, ted.dunning@gmail.com, @ted_dunning

Agenda

• Rationale

• Theory

– clusterable data, k-means failure modes, sketches

• Algorithms

– ball k-means, surrogate methods

• Implementation

– searchers, vectors, clusterers

• Results

• Application

RATIONALE

Why k-means?

• Clustering allows fast search

– k-nn models allow agile modeling

– lots of data points, 10^8 typical

– lots of clusters, 10^4 typical

• Model features

– Distance to nearest centroids

– Poor man’s manifold discovery

What is Quality?

• Robust clustering not a goal

– we don’t care if the same clustering is replicated

• Generalization to unseen data critical

– number of points per cluster

– distance distributions

– target function distributions

– model performance stability

An Example

The Problem

• Spirals are a classic “counter” example for k-means

• Classic low dimensional manifold with added noise

• But clustering still makes modeling work well

An Example

The Cluster Proximity Features

• Every point can be described by the nearest cluster

– 4.3 bits per point in this case

– Significant error that can be decreased (to a point) by increasing number of clusters

• Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)

– Error is negligible

– Unwinds the data into a simple representation
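
A minimal sketch of the two-nearest-cluster description, assuming Euclidean distance. The helper name `proximity_features` and the toy centroids are hypothetical. The slides do not say how the sign bit or the proximities are encoded, so the code only returns the two nearest cluster ids and their distances. The 4.3 bits figure would be consistent with about 20 clusters, since log2(20) ≈ 4.3.

```python
import math

def proximity_features(point, centroids):
    """Return (id1, dist1, id2, dist2) for the two centroids nearest to point."""
    ranked = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = ranked[0], ranked[1]
    return i1, d1, i2, d2

# Toy centroids, purely illustrative.
centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
features = proximity_features((1.0, 0.0), centroids)

# Bits needed to name one cluster out of 20 (an assumed cluster count).
bits_per_id = math.log2(20)
```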

Diagonalized Cluster Proximity

Lots of Clusters Are Fine

The Limiting Case

• Too many clusters lead to over-fitting

• Which we mitigate by averaging over several nearby clusters

• In the limit we get k-nn modeling

– and probably use k-means to speed up search

THEORY

Intuitive Theory

• Traditionally, minimize over all distributions

– optimization is NP-complete

– that isn’t like real data

• Recently, assume well-clusterable data

• Interesting approximation bounds provable

$\sigma^2 \Delta_{k-1}^2(X) > \Delta_k^2(X)$

$1 + O(\sigma^2)$

For Example

Grouping these two clusters seriously hurts squared distance

$\Delta_{k-1}^2(X) > \frac{1}{\sigma^2} \Delta_k^2(X)$

ALGORITHMS

Lloyd’s Algorithm

• Part of CS folklore

• Developed in the late 50’s for signal quantization, published in the 80’s

initialize k cluster centroids somehow
for each of many iterations:
    for each data point:
        assign point to nearest cluster
    recompute cluster centroids from points assigned to clusters

• Highly variable quality, several restarts recommended
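
The pseudocode above can be sketched in plain Python as follows. This is only an illustration: "somehow" is taken to mean sampling k distinct data points, the distance is assumed to be Euclidean, and the names `lloyd`, `nearest` and `mean` are hypothetical.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def lloyd(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: seed with k distinct data points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = sorted(lloyd(data, k=2))
```

On this toy data with two well-separated clumps any seeding recovers both clumps. With more clusters the result depends on the seeds, which is why restarts are recommended.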

Typical k-means Failure

Selecting two seeds here cannot be fixed with Lloyd’s

Result is that these two clusters get glued together

Ball k-means

• Provably better for highly clusterable data

• Tries to find initial centroids in the “core” of each real cluster

• Avoids outliers in centroid computation

initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their own centroid than to the closest other cluster
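
A rough sketch of the trimmed centroid update, under stated assumptions: "much closer" is interpreted as lying within a ball whose radius is a fraction `trim` of the distance to the closest other centroid, and the value 1/3 is an arbitrary illustrative choice rather than something the slides specify. The function name `ball_update` is hypothetical, and the seeding step is not shown.

```python
import math

def ball_update(points, centroids, trim=1 / 3):
    """One ball k-means step: recompute each centroid from its core points only."""
    updated = []
    for i, c in enumerate(centroids):
        # Ball radius: a fraction of the distance to the closest other centroid.
        radius = trim * min(
            math.dist(c, o) for j, o in enumerate(centroids) if j != i
        )
        core = [
            p for p in points
            if math.dist(p, c) <= radius
            and all(math.dist(p, c) <= math.dist(p, o) for o in centroids)
        ]
        # Fall back to the old centroid if no point lies inside the ball.
        updated.append(
            tuple(sum(xs) / len(core) for xs in zip(*core)) if core else c
        )
    return updated

data = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0),
        (9.0, 9.0),  # outlier between the two clumps
        (20.0, 20.0), (20.0, 22.0), (22.0, 20.0), (22.0, 22.0)]
new_centroids = ball_update(data, [(2.0, 2.0), (20.0, 20.0)])
```

Here the outlier at (9, 9) falls outside the ball around the first centroid, so it does not pull that centroid away from the core of its clump. A plain Lloyd update would have included it.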

Still Not a Win

• Ball k-means is nearly guaranteed to succeed with k = 2

• Probability of successful seeding drops exponentially with k

• Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time

Surrogate Method

• Start with sloppy clustering into κ = k log n clusters

• Use this sketch as a weighted surrogate for the data

• Cluster surrogate data using ball k-means

• Results are provably good for highly clusterable data

• Sloppy clustering is on-line

• Surrogate can be kept in memory

• Ball k-means pass can be done at any time

Algorithm Costs

• O(k d log n) per point per iteration for Lloyd’s algorithm

• Number of iterations not well known

• Iterations > log n is a reasonable assumption

Algorithm Costs

• Surrogate methods

– fast, sloppy single pass clustering with κ = k log n

– fast sloppy search for nearest cluster, O(d log κ) = O(d(log k + log log n)) per point

– fast, in-memory, high-quality clustering of κ weighted centroids

O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality

O(κ d log k) or O(d log κ log k) for larger k, looser quality

– result is k high-quality centroids

• Even the sloppy clusters may suffice

Algorithm Costs

• How much faster for the sketch phase?

– take k = 2000, d = 10, n = 100,000,000

– k d log n = 2000 x 10 x 26 ≈ 500,000

– d (log k + log log n) = 10 (11 + 5) = 160

– roughly 3,000 times faster is a bona fide big deal
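
The comparison can be checked with a few lines of arithmetic. The sketch below assumes base-2 logarithms and n = 10^8, the value for which log n ≈ 26. The variable names are illustrative.

```python
import math

k, d = 2000, 10
n = 10**8  # assumption: the n for which log2(n) is about 26

lloyd_cost = k * d * math.log2(n)                           # k d log n
sketch_cost = d * (math.log2(k) + math.log2(math.log2(n)))  # d (log k + log log n)
speedup = lloyd_cost / sketch_cost

print(f"Lloyd: {lloyd_cost:,.0f}  sketch: {sketch_cost:,.0f}  speedup: {speedup:,.0f}x")
```

Without rounding the logarithms the ratio comes out somewhat above 3,000, so the order of magnitude on the slide holds.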

Pragmatics

• But this requires a fast search internally

• Have to cluster on the fly for sketch

• Have to guarantee sketch quality

• Previous methods had very high complexity

How It Works

• For each point

– Find approximately nearest centroid (distance = d)

– If (d > threshold) new centroid

– Else if (u < d/threshold), with u drawn uniformly from [0, 1), new cluster

– Else add to nearest centroid

• If number of centroids > κ ≈ C log N

– Recursively cluster centroids with higher threshold

• Result is large set of centroids

– these provide approximation of original distribution

– we can cluster centroids to get a close approximation of clustering original

– or we can just use the result directly
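
A simplified sketch of this single-pass procedure, not the Mahout implementation. Assumptions: a brute-force scan stands in for the approximate nearest-centroid search, a new centroid is opened with probability d/threshold (so distant points are more likely to start a cluster), and the threshold grows by a fixed factor `growth` whenever the sketch is too large. The names `add_point`, `streaming_sketch` and `growth` are hypothetical.

```python
import math
import random

def add_point(sketch, p, weight, threshold, rng):
    """Add one weighted point to the sketch, a list of [centroid, weight] pairs."""
    if sketch:
        # Brute-force search stands in for the approximate nearest-centroid search.
        entry = min(sketch, key=lambda e: math.dist(p, e[0]))
        dist = math.dist(p, entry[0])
    # rng.random() < 1, so dist > threshold always opens a new centroid;
    # closer points open one with probability dist / threshold.
    if not sketch or rng.random() < dist / threshold:
        sketch.append([p, weight])
    else:
        c, w = entry
        total = w + weight
        # Merge into the nearest centroid as a weighted mean.
        entry[0] = tuple((w * ci + weight * pi) / total for ci, pi in zip(c, p))
        entry[1] = total

def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    rng = random.Random(seed)
    sketch = []
    for p in points:
        add_point(sketch, tuple(p), 1.0, threshold, rng)
        while len(sketch) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster the sketch itself.
            threshold *= growth
            old, sketch = sketch, []
            for c, w in old:
                add_point(sketch, c, w, threshold, rng)
    return sketch

gen = random.Random(1)
data = [(gen.gauss(cx, 0.1), gen.gauss(cy, 0.1))
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(100)]
sketch = streaming_sketch(data, max_centroids=20)
```

The weights in the returned sketch sum to the number of input points, so the sketch can be fed to a weighted clustering step as a surrogate for the data.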

IMPLEMENTATION

How Can We Search Faster?

• First rule: don’t do it

– If we can eliminate most candidates, we can do less work

– Projection search and k-means search

• Second rule: don’t do it

– We can convert big floating point math to clever bit-wise integer math

– Locality sensitive hashing

• Third rule: reduce dimensionality

– Projection search

– Random projection for very high dimension

Projection Search

total ordering!

How Many Projections?

LSH Search

• Each random projection produces independent sign bit

• If two vectors have the same projected sign bits, they probably point in the same direction (i.e. cos θ ≈ 1)

• Distance in L2 is closely related to cosine

• We can replace (some) vector dot products with long integer XOR

$$\|x - y\|^2 = \|x\|^2 - 2\,(x \cdot y) + \|y\|^2 = \|x\|^2 - 2\,\|x\|\,\|y\| \cos\theta + \|y\|^2$$
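
A minimal sketch of sign-bit hashing with random projections, assuming Gaussian random hyperplanes. For such hyperplanes the probability that two vectors get different sign bits is θ/π, so the count of differing bits (one XOR plus a bit count) gives an estimate of the cosine. The function names are hypothetical and this is not the code from the knn repository.

```python
import math
import random

def make_planes(dim, bits, seed=0):
    """One Gaussian random direction per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

def sign_hash(v, planes):
    """Pack one sign bit per random projection into a Python int."""
    h = 0
    for plane in planes:
        dot = sum(a * b for a, b in zip(plane, v))
        h = (h << 1) | (dot >= 0)
    return h

def estimated_cosine(h1, h2, bits):
    """Estimate cos(theta) from the fraction of differing sign bits."""
    differing = bin(h1 ^ h2).count("1")
    return math.cos(math.pi * differing / bits)

planes = make_planes(dim=10, bits=64)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
same_direction = estimated_cosine(
    sign_hash(x, planes), sign_hash([2 * a for a in x], planes), 64)
opposite = estimated_cosine(
    sign_hash(x, planes), sign_hash([-a for a in x], planes), 64)
```

The estimate only replaces the cosine term. Recovering an L2 distance from it still needs the two vector norms, which can be stored alongside the hash.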

LSH Bit-match Versus Cosine

[Figure: cosine plotted against the number of matching LSH bits (x-axis, 0 to 64)]

Results with 32 Bits

The Internals

• Mechanism for extending Mahout Vectors

– DelegatingVector, WeightedVector, Centroid

• Searcher interface

– ProjectionSearch, KmeansSearch, LshSearch, Brute

• Super-fast clustering

– Kmeans, StreamingKmeans

Parallel Speedup?

[Figure: time (μs) versus number of threads, comparing the threaded version, the non-threaded version, and perfect scaling]

What About Map-Reduce?

• Map-reduce implementation is nearly trivial

– Compute surrogate on each split

– Total surrogate is union of all partial surrogates

– Do in-memory clustering on total surrogate

• Threaded version shows linear speedup already

– Map-reduce speedup is likely, not entirely guaranteed

How Well Does it Work?

• Theoretical guarantees for well-clusterable data

– Shindler, Wong and Meyerson, NIPS, 2011

• Evaluation on synthetic data

– Rough clustering produces correct surrogates

– Ball k-means strategy 1 performance is very good with large k

APPLICATION

The Business Case

• Our customer has 100 million cards in circulation

• Quick and accurate decision-making is key.

– Marketing offers

– Fraud prevention

Opportunity

• Demand for modeling is increasing rapidly

• So they are testing something simpler and more agile

• Like k-nearest neighbor

What’s that?

• Find the k nearest training examples – lookalike customers

• This is easy … but hard

– easy because it is so conceptually simple and you don’t have knobs to turn or models to build

– hard because of the stunning amount of math

– also hard because we need top 50,000 results

• Initial rapid prototype was massively too slow

– 3K queries x 200K examples takes hours

– needed 20M x 25M in the same time

K-Nearest Neighbor Example

Required Scale and Speed and Accuracy

• Want 20 million queries against 25 million references in 10,000 s

• Should be able to search > 100 million references

• Should be linearly and horizontally scalable

• Must have >50% overlap against reference search

How Hard is That?

• 20 M x 25 M x 100 Flop = 50 P Flop

• 1 CPU = 5 Gflops

• We need 10 M CPU seconds => 1,000 CPUs to finish in 10,000 s

• Real-world efficiency losses may increase that by 10x

• Not good!
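
The estimate can be reproduced directly from the numbers on the slides. The sketch below takes the 10,000 s time budget from the requirements slide. With that budget the ideal CPU count works out to about 1,000, and to about 10,000 once the 10x efficiency loss is included. Variable names are illustrative.

```python
queries = 20e6
references = 25e6
flop_per_comparison = 100   # from the slide
cpu_flops = 5e9             # 1 CPU = 5 Gflops
time_budget_s = 10_000      # from the requirements slide

total_flop = queries * references * flop_per_comparison  # 5e16 = 50 PFlop
cpu_seconds = total_flop / cpu_flops                      # 1e7 = 10 M CPU seconds
cpus_ideal = cpu_seconds / time_budget_s                  # CPUs at perfect efficiency
cpus_realistic = 10 * cpus_ideal                          # with a 10x efficiency loss

print(f"{total_flop:.1e} flop, {cpu_seconds:.1e} CPU s, "
      f"{cpus_ideal:,.0f} to {cpus_realistic:,.0f} CPUs")
```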

K-means Search

• First do clustering with lots (thousands) of clusters

• Then search nearest clusters to find nearest points

• We win if we find >50% overlap with “true” answer

• We lose if we can’t cluster super-fast

– more on this later
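
A toy sketch of cluster-pruned search, assuming the centroids are already available and using brute force inside the probed clusters. The function names and the `n_probe` parameter are hypothetical. `overlap` measures agreement with the exact answer in the same way as the 50% criterion above.

```python
import math

def build_index(points, centroids):
    """Group points by their nearest centroid."""
    index = {i: [] for i in range(len(centroids))}
    for p in points:
        i = min(index, key=lambda j: math.dist(p, centroids[j]))
        index[i].append(p)
    return index

def kmeans_search(query, centroids, index, n_results, n_probe=2):
    """Approximate nearest neighbours: scan only the n_probe closest clusters."""
    probe = sorted(index, key=lambda j: math.dist(query, centroids[j]))[:n_probe]
    candidates = [p for j in probe for p in index[j]]
    return sorted(candidates, key=lambda p: math.dist(query, p))[:n_results]

def overlap(approx, exact):
    """Fraction of the exact answer that the approximate search recovered."""
    return len(set(approx) & set(exact)) / len(exact)

# Toy data: a 10 x 10 grid with four hand-placed centroids.
points = [(float(x), float(y)) for x in range(10) for y in range(10)]
centroids = [(2.0, 2.0), (2.0, 7.0), (7.0, 2.0), (7.0, 7.0)]
index = build_index(points, centroids)

query = (4.6, 4.3)
exact = sorted(points, key=lambda p: math.dist(query, p))[:5]
approx = kmeans_search(query, centroids, index, n_results=5, n_probe=2)
score = overlap(approx, exact)
full = overlap(kmeans_search(query, centroids, index, 5, n_probe=4), exact)
```

The query sits near a cluster boundary, which is the hard case: probing half of the clusters finds only part of the exact answer, while probing all of them reproduces it at the full brute-force cost.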

Lots of Clusters Are Fine

Some Details

• Clumpy data works better

– Real data is clumpy

• Speedups of 100-200x seem practical with 50% overlap

– Projection search and LSH give additional 100x

• More experiments needed

Summary

• Nearest neighbor algorithms can be blazing fast

• But you need blazing fast clustering

– Which we now have

Contact Me!

• We’re hiring at MapR in US and Europe

• MapR software available for research use

• Come get the slides at http://www.slideshare.net/tdunning/oxford-05oct2012

• Get the code at https://github.com/tdunning/knn

• Contact me at tdunning@maprtech.com or @ted_dunning
