
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson


Page 1: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Erik Erlandson

Sketching Data With T-Digest in Apache Spark

Red Hat, Inc.

Page 2: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Introduction

Erik Erlandson
Software Engineer at Red Hat, Inc.
Emerging Technologies Group

Internal Data Science
Insightful Applications

Page 3: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Why Sketching?

● Faster
● Smaller
● Essential Features

Page 4: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

We All Sketch Data

3.4
6.0
2.5
⋮

Mean = 3.97
Variance = 3.30

3.4, 5.0, 9.0
6.0, 2.1, 7.7
2.5, 4.4, 3.2
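A mean and a variance are already a tiny sketch: a fixed-size summary that can be updated one value at a time. The following is a minimal illustration of that idea, not code from the talk; it uses Welford's running update, and the names `MomentSketch` and `Moments` are hypothetical.

```scala
object MomentSketch {
  // Fixed-size summary: count, running mean, and sum of squared deviations.
  final case class Moments(n: Long, mean: Double, m2: Double) {
    // Welford's update: fold one more value into the summary.
    def update(x: Double): Moments = {
      val n1 = n + 1
      val delta = x - mean
      val mean1 = mean + delta / n1
      Moments(n1, mean1, m2 + delta * (x - mean1))
    }

    // Sample variance (divides by n - 1); NaN for fewer than two values.
    def variance: Double = if (n < 2) Double.NaN else m2 / (n - 1)
  }

  val empty: Moments = Moments(0L, 0.0, 0.0)

  // Sketch a whole data set by folding the update over it.
  def sketch(data: Seq[Double]): Moments =
    data.foldLeft(empty)((s, x) => s.update(x))
}
```

Applied to the values 3.4, 6.0, 2.5, `MomentSketch.sketch` gives a mean of about 3.97 and a sample variance of about 3.30, matching the figures on the slide.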

Page 5: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

T-Digest

• Computing Extremely Accurate Quantiles Using t-Digests
• Ted Dunning & Otmar Ertl
• https://github.com/tdunning/t-digest
• Implementations in Java, Python, R, JS, C++ and Scala

Page 6: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

What is T-Digest Sketching?

3.4
6.0
2.5
⋮

or

(3.4, 3)
(6.0, 2)
(2.5, 8)

[Figure: Sketch of CDF. Vertical axis: P(X <= x). Horizontal axis: X (Data Domain).]

Page 7: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Incremental Updates

Current T-Digest + (x, w) = Updated T-Digest

Large or Streaming Data

Compact “Running” Sketch

Page 8: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

The Payoff

REST Service

Query Latencies

What does my latency distribution look like?

I want to simulate my latencies!

Are 90% of my latencies under 1 second?

Page 9: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Representation

Distribution CDF

clusters: (location, mass) = (x, m)
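To make the link between clusters and the CDF concrete, here is a simplified sketch. It treats every cluster as a point mass, so the estimated CDF is a step function; an actual t-digest implementation would typically interpolate between neighboring clusters. `ClusterCdf`, `Cluster`, `cdf` and `quantile` are hypothetical names chosen for illustration.

```scala
object ClusterCdf {
  // One cluster of the sketch: a location x carrying mass m.
  final case class Cluster(x: Double, m: Double)

  // Estimate P(X <= x) as the fraction of total mass at locations <= x.
  def cdf(clusters: Seq[Cluster], x: Double): Double = {
    require(clusters.nonEmpty, "need at least one cluster")
    val total = clusters.map(_.m).sum
    clusters.filter(_.x <= x).map(_.m).sum / total
  }

  // Estimate the q-quantile as the first location whose cumulative
  // mass fraction reaches q (no interpolation in this simplified version).
  def quantile(clusters: Seq[Cluster], q: Double): Double = {
    require(clusters.nonEmpty, "need at least one cluster")
    val sorted = clusters.sortBy(_.x)
    val total = sorted.map(_.m).sum
    val cumulative = sorted.scanLeft(0.0)((acc, c) => acc + c.m).tail
    sorted.zip(cumulative)
      .collectFirst { case (c, cm) if cm / total >= q => c.x }
      .getOrElse(sorted.last.x)
  }
}
```

A question such as "are 90% of my values under some threshold t?" then becomes the check `ClusterCdf.cdf(clusters, t) >= 0.9`.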

Page 10: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Update

(x, m)

Nearest Cluster

Update location

Increment Mass
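The update step can be sketched as follows, assuming that "update location" means moving the cluster to the mass-weighted mean of its old location and the incoming point. This version ignores any limit on cluster mass and uses a linear scan for the nearest cluster; `ClusterUpdate` and its members are hypothetical names.

```scala
object ClusterUpdate {
  final case class Cluster(x: Double, m: Double)

  // Merge the point (x, m) into its nearest cluster.
  // With no clusters yet, the point becomes the first cluster.
  def update(clusters: Vector[Cluster], x: Double, m: Double): Vector[Cluster] =
    if (clusters.isEmpty) Vector(Cluster(x, m))
    else {
      // Nearest cluster by distance between locations (linear scan for clarity).
      val i = clusters.indices.minBy(j => math.abs(clusters(j).x - x))
      val c = clusters(i)
      // Increment mass, then move the location to the weighted mean.
      val mass = c.m + m
      val location = (c.x * c.m + x * m) / mass
      clusters.updated(i, Cluster(location, mass))
    }
}
```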

Page 11: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Cluster Mass Bounds

B(x) = C∙M∙q(x)∙(1 - q(x))

C = compression
M = Σ(masses)
q(x) = quantile of x

[Figure: bound B(x) plotted over quantiles q(x) from q=0 to q=1, peaking at C∙M/4 where q = 1/2]
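The bound is easy to evaluate directly. The snippet below is a small illustration of the formula B = C∙M∙q∙(1 - q); the parameter values used with it are arbitrary and are not taken from the talk, and `MassBound.bound` is a hypothetical name.

```scala
object MassBound {
  // Upper bound on the mass of a cluster sitting at quantile q,
  // given compression C and total mass M: B = C * M * q * (1 - q).
  def bound(compression: Double, totalMass: Double, q: Double): Double =
    compression * totalMass * q * (1.0 - q)
}
```

Because q∙(1 - q) is largest at q = 1/2, the bound peaks at C∙M/4 in the middle of the distribution and falls to zero at q = 0 and q = 1, so clusters near the tails are forced to stay small.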

Page 12: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Bounds Force New Clusters

Incoming (x, m), nearest cluster (xc, mc): mc + m ?

mc + m > B(xc) !

(xc, mc) becomes (xu, B(xc))

Remainder becomes a new cluster (x, (mc + m) - B(xc))
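One way to express this overflow rule in code is shown below. It assumes that the nearest cluster absorbs only as much mass as its bound allows, and that whatever is left over, (mc + m) - B(xc), starts a new cluster at x, so total mass is conserved. `BoundedUpdate.absorb` is a hypothetical helper, and the bound is passed in as a plain number.

```scala
object BoundedUpdate {
  final case class Cluster(x: Double, m: Double)

  // Merge (x, m) into cluster c without letting its mass exceed `bound`.
  // Returns the updated cluster and, if mass overflowed, a new cluster at x
  // holding the remainder.
  def absorb(c: Cluster, x: Double, m: Double, bound: Double): (Cluster, Option[Cluster]) =
    if (c.m + m <= bound) {
      // Everything fits: ordinary weighted-mean update.
      val mass = c.m + m
      (Cluster((c.x * c.m + x * m) / mass, mass), None)
    } else {
      // Only `taken` fits; the rest becomes a separate cluster.
      val taken = math.max(bound - c.m, 0.0)
      val mass = c.m + taken
      val updated = Cluster((c.x * c.m + x * taken) / mass, mass)
      (updated, Some(Cluster(x, m - taken)))
    }
}
```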

Page 13: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Resolution

[Figure: clusters along the quantile axis from q=0 to q=1. More small clusters toward the tails, fewer large clusters toward the middle.]

Page 14: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

T-Digests are Monoidal

D1 ≡ C1
D2 ≡ C2

C1 ∪ C2 ⟹ D1 |+| D2

Page 15: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Monoidal => Map-Reduce

[Figure: Data in Spark, partitions P1, P2, ..., Pn → Map → t-digests → |+| → result]
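The map-reduce pattern only needs an empty sketch and an associative |+|. The sketch below shows the shape of the computation on plain Scala collections, using a deliberately trivial stand-in sketch (count and sum) in place of a t-digest; `Monoid`, `MeanSketch` and `sketchAll` are hypothetical names. Spark's `RDD.aggregate(zeroValue)(seqOp, combOp)` has the same structure: `seqOp` updates a sketch within a partition and `combOp` plays the role of |+|.

```scala
object MonoidMapReduce {
  // A monoid: an identity element plus an associative combine operation.
  trait Monoid[D] {
    def empty: D
    def plus(a: D, b: D): D
  }

  // Stand-in sketch (count and sum), used here in place of a t-digest.
  final case class MeanSketch(n: Long, sum: Double) {
    def update(x: Double): MeanSketch = MeanSketch(n + 1, sum + x)
    def mean: Double = sum / n
  }

  val meanMonoid: Monoid[MeanSketch] = new Monoid[MeanSketch] {
    val empty: MeanSketch = MeanSketch(0L, 0.0)
    def plus(a: MeanSketch, b: MeanSketch): MeanSketch =
      MeanSketch(a.n + b.n, a.sum + b.sum)
  }

  // Map: sketch each partition independently.
  // Reduce: combine the per-partition sketches with plus.
  def sketchAll(partitions: Seq[Seq[Double]], monoid: Monoid[MeanSketch]): MeanSketch =
    partitions
      .map(p => p.foldLeft(monoid.empty)((d, x) => d.update(x)))
      .foldLeft(monoid.empty)(monoid.plus)
}
```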

Page 16: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

|+| - Randomized Order

[Figure: clusters of D1 and D2, labeled 1 to 11, and the resulting D1 |+| D2]

Page 17: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

|+| - Merged Order

[Figure: clusters of D1 and D2, labeled 1 to 11, and the resulting D1 |+| D2]

Page 18: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

|+| - Large to Small

[Figure: clusters of D1 and D2, labeled 1 to 11, and the resulting D1 |+| D2]
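The three |+| variants can be read as three orders in which the clusters of both digests are reinserted. That reading is an assumption based on the slide titles, and the code below is only a sketch of it: it produces the three orderings and folds them through a caller-supplied update function. `CombineOrders` and its members are hypothetical names.

```scala
import scala.util.Random

object CombineOrders {
  final case class Cluster(x: Double, m: Double)

  // Randomized order: shuffle the clusters of both digests together.
  def randomized(c1: Seq[Cluster], c2: Seq[Cluster], seed: Long): Seq[Cluster] =
    new Random(seed).shuffle(c1 ++ c2)

  // Merged order: all clusters sorted by location.
  def mergedOrder(c1: Seq[Cluster], c2: Seq[Cluster]): Seq[Cluster] =
    (c1 ++ c2).sortBy(_.x)

  // Large to small: all clusters sorted by decreasing mass.
  def largeToSmall(c1: Seq[Cluster], c2: Seq[Cluster]): Seq[Cluster] =
    (c1 ++ c2).sortBy(c => -c.m)

  // Insert the ordered clusters one at a time into a digest of type D,
  // using whatever update rule the caller supplies.
  def combine[D](zero: D, ordered: Seq[Cluster])(update: (D, Cluster) => D): D =
    ordered.foldLeft(zero)(update)
}
```

Whichever order is used, the total mass of the combined result should equal the sum of the masses of D1 and D2; only the placement and size of the resulting clusters can differ.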

Page 19: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Comparing |+| Definitions

Page 20: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Algorithmic Considerations

• Clusters maintained in sorted order by location
• Clusters frequently inserted / deleted / updated
• Query the cluster nearest to an incoming (x,m)
• Given (x,m), query the prefix-sum of cluster mass
  – Σ(m’), over all clusters (x’,m’) where x’ <= x
• Do it all in logarithmic time!
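A common way to get prefix sums in time proportional to tree height is to cache the total mass of every subtree in its root node. The sketch below shows that idea on a plain immutable binary search tree. Balancing is omitted for brevity, so the logarithmic guarantee would additionally require a balanced tree such as a red/black tree; `PrefixSumTree` and its members are hypothetical names.

```scala
object PrefixSumTree {
  sealed trait Tree { def mass: Double }

  case object Leaf extends Tree { val mass: Double = 0.0 }

  final case class Node(left: Tree, x: Double, m: Double, right: Tree) extends Tree {
    // Cached at construction: total mass of this whole subtree.
    val mass: Double = left.mass + m + right.mass
  }

  // Insert (x, m), adding m to an existing cluster at the same location.
  def insert(t: Tree, x: Double, m: Double): Tree = t match {
    case Leaf => Node(Leaf, x, m, Leaf)
    case Node(l, k, km, r) =>
      if (x < k) Node(insert(l, x, m), k, km, r)
      else if (x > k) Node(l, k, km, insert(r, x, m))
      else Node(l, k, km + m, r)
  }

  // Sum of m' over all clusters (x', m') with x' <= x.
  // Each step descends one level and reuses the cached left-subtree mass.
  def prefixSum(t: Tree, x: Double): Double = t match {
    case Leaf => 0.0
    case Node(l, k, km, r) =>
      if (x < k) prefixSum(l, x)
      else l.mass + km + prefixSum(r, x)
  }
}
```

Dividing `prefixSum(t, x)` by `t.mass` gives the quantile q(x) that the cluster mass bound depends on.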

Page 21: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Backed By Balanced Tree

Page 22: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Scala Considerations

• Immutable Red/Black Tree
• Extends Map and MapLike
• Capabilities are Mixable Traits
  – Red/Black
  – Ordered
  – Incrementable-Values
  – Nearest-Neighbor
  – Prefix-Sum
• Interface to Algebird Monoids & Aggregators

Page 23: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Discrete Distributions

if (tdigest.clusters.size <= max_discrete) {
  // increment by m (or insert new)
  tdigest.clusters.increment(x, m)
} else {
  // do full t-digest cluster updating algorithm
  tdigest.update(x, m)
}

Experimental

Page 25: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Demo

Page 26: Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Thank You

[email protected]
@manyangled
https://github.com/isarn/isarn-sketches