Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Erik Erlandson

Sketching Data With T-Digest in Apache Spark

Red Hat, Inc.

Introduction

Erik Erlandson
Software Engineer at Red Hat, Inc.
Emerging Technologies Group

Internal Data Science
Insightful Applications

Why Sketching?

● Faster
● Smaller
● Essential Features

We All Sketch Data

3.4, 6.0, 2.5, ⋮

Mean = 3.97
Variance = 3.30

3.4, 5.0, 9.0
6.0, 2.1, 7.7
2.5, 4.4, 3.2
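A mean and a variance are already a small sketch: a few numbers that summarize a column and can be updated one value at a time. The following is a minimal illustration of that idea, assuming Welford's running update and the sample variance (dividing by n − 1), which reproduces the slide's figures for the column 3.4, 6.0, 2.5. The names `MomentsSketch`, `Moments`, `add` and `summarize` are hypothetical and do not come from the talk.

```scala
object MomentsSketch {
  // Running summary: count, mean, and sum of squared deviations from the mean.
  final case class Moments(n: Long, mean: Double, m2: Double) {
    // Fold in one more value (Welford's update); the raw data is never stored.
    def add(x: Double): Moments = {
      val n1 = n + 1
      val delta = x - mean
      val mean1 = mean + delta / n1
      Moments(n1, mean1, m2 + delta * (x - mean1))
    }

    // Sample variance; NaN until at least two values have been seen.
    def variance: Double = if (n > 1) m2 / (n - 1) else Double.NaN
  }

  val empty: Moments = Moments(0L, 0.0, 0.0)

  // Summarize a whole column by folding the update over it.
  def summarize(xs: Seq[Double]): Moments =
    xs.foldLeft(empty)((acc, x) => acc.add(x))
}
```

Calling `MomentsSketch.summarize(Seq(3.4, 6.0, 2.5))` gives a mean of about 3.97 and a variance of about 3.30.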

T-Digest

• Computing Extremely Accurate Quantiles Using t-Digests
• Ted Dunning & Otmar Ertl
• https://github.com/tdunning/t-digest
• Implementations in Java, Python, R, JS, C++ and Scala

What is T-Digest Sketching?

3.4, 6.0, 2.5, ⋮

(3.4, 3), (6.0, 2), (2.5, 8)

[Figure: sketch of the CDF; vertical axis P(X <= x), horizontal axis data domain]

Incremental Updates

Current T-Digest + (x, w) = Updated T-Digest

Large or Streaming Data

Compact “Running” Sketch

The Payoff

REST Service

Query Latencies

What does my latency distribution look like?

I want to simulate my latencies!

Are 90% of my latencies under 1 second?

Representation

clusters

Distribution CDF

(location, mass) = (x, m)

Update

(x, m)

Nearest Cluster

Update location

Increment Mass
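The update step above can be sketched as follows. This is a minimal illustration that ignores mass bounds, assuming the location moves to the mass-weighted mean of the cluster and the incoming point; the talk names the steps but not the formula. `ClusterUpdate`, `Cluster` and `absorb` are hypothetical names, and the nearest cluster is found by a plain linear scan.

```scala
object ClusterUpdate {
  final case class Cluster(x: Double, m: Double)

  // Absorb an incoming (x, m) into the nearest cluster:
  // shift its location toward x and increase its mass by m.
  def absorb(clusters: Vector[Cluster], x: Double, m: Double): Vector[Cluster] =
    if (clusters.isEmpty) Vector(Cluster(x, m))
    else {
      // Index of the cluster whose location is closest to x (linear scan).
      val i = clusters.indices.minBy(j => math.abs(clusters(j).x - x))
      val c = clusters(i)
      val mass = c.m + m
      // Mass-weighted mean of the old location and the incoming point.
      val loc = c.x + m * (x - c.x) / mass
      clusters.updated(i, Cluster(loc, mass))
    }
}
```

With the clusters (3.4, 3), (6.0, 2), (2.5, 8) and an incoming (5.0, 1), the nearest cluster is (6.0, 2), which becomes roughly (5.67, 3).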

Cluster Mass Bounds

[Figure: cluster mass bound B(x) over the quantiles q(x), from q=0 to q=1, with maximum C∙M/4]

M = Σ(masses)

B(x) = C∙M∙q(x)∙(1 − q(x))

C = compression
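The bound is a one-line function of the quantile. The snippet below is only a direct transcription of B(x) = C∙M∙q(x)∙(1 − q(x)); the object and parameter names are hypothetical, and the values C = 0.5 and M = 100 used in the note after it are made up for illustration.

```scala
object MassBound {
  // Upper bound on the mass of a cluster sitting at quantile q,
  // for compression c and total mass totalMass (the sum of all cluster masses).
  def bound(c: Double, totalMass: Double, q: Double): Double =
    c * totalMass * q * (1.0 - q)
}
```

For example, `MassBound.bound(0.5, 100.0, 0.5)` is 12.5, which is C∙M/4, while the bound shrinks to 0 as q approaches 0 or 1.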

Bounds Force New Clusters

(xc, mc): mc + m ?

mc + m > B(xc) !

(xc, mc) → (xu, B(xc))

New cluster: (x, (mc + m) − B(xc))
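One way to read this slide: the nearest cluster absorbs only as much mass as its bound allows, and whatever is left over becomes a new cluster at the incoming location x. The sketch below follows that reading under stated assumptions; it takes the bound B(xc) as a plain argument, uses a mass-weighted location update, and all names (`BoundedUpdate`, `Cluster`, `absorbBounded`) are hypothetical rather than taken from any library.

```scala
object BoundedUpdate {
  final case class Cluster(x: Double, m: Double)

  // Returns the updated nearest cluster and, if the bound was hit,
  // a new cluster at x that carries the overflow mass.
  def absorbBounded(c: Cluster, x: Double, m: Double, bound: Double): (Cluster, Option[Cluster]) = {
    // Mass the cluster can still take without exceeding its bound.
    val dm = math.min(m, math.max(0.0, bound - c.m))
    // Overflow: (mc + m) - B(xc) when the bound is exceeded, else 0.
    val rm = m - dm
    val updated =
      if (dm > 0.0) Cluster(c.x + dm * (x - c.x) / (c.m + dm), c.m + dm)
      else c
    val overflow = if (rm > 0.0) Some(Cluster(x, rm)) else None
    (updated, overflow)
  }
}
```

With cluster (6.0, 2), incoming (5.0, 3) and a bound of 4, the cluster fills up to (5.5, 4) and the remaining mass 1 forms a new cluster (5.0, 1).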

Resolution

[Figure: cluster sizes along the quantile axis, from q=0 to q=1]

More small clusters (toward q=0 and q=1)

Fewer large clusters (toward the middle)

T-Digests are Monoidal

D1 ≡ C1
D2 ≡ C2

C1 ∪ C2 ⟹ D1 |+| D2

Monoidal => Map-Reduce

Data in Spark → t-digests → result
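The map-reduce shape can be shown without Spark. In the sketch below each partition is folded into its own sketch and the per-partition sketches are then combined with an associative operation that has an empty identity. To keep it short, the "sketch" here is an exact table of location → mass, not a real t-digest, and the names `MonoidSketch`, `add`, `combine` and `sketchAll` are hypothetical.

```scala
object MonoidSketch {
  // A stand-in sketch: exact mass per distinct location.
  type Sketch = Map[Double, Double]

  // Identity element of the monoid.
  val empty: Sketch = Map.empty

  // Update a sketch with one data point of unit mass.
  def add(s: Sketch, x: Double): Sketch =
    s.updated(x, s.getOrElse(x, 0.0) + 1.0)

  // The |+| operation: merge two sketches by summing masses per location.
  def combine(a: Sketch, b: Sketch): Sketch =
    b.foldLeft(a) { case (acc, (x, m)) => acc.updated(x, acc.getOrElse(x, 0.0) + m) }

  // "Map" each partition to a sketch, then "reduce" the sketches with combine.
  def sketchAll(partitions: Seq[Seq[Double]]): Sketch =
    partitions.map(_.foldLeft(empty)(add)).foldLeft(empty)(combine)
}
```

Because `combine` is associative with `empty` as its identity, sketching the partitions separately gives the same result as sketching all the data in one pass. On a Spark RDD the same two functions would fit the `seqOp` and `combOp` arguments of `aggregate`.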

|+| - Randomized Order

[Figure: D1 |+| D2 ⟸ clusters of D1 and D2, inserted in randomized order]

|+| - Merged Order

[Figure: D1 |+| D2 ⟸ clusters of D1 and D2, inserted in merged order]

|+| - Large to Small

[Figure: D1 |+| D2 ⟸ clusters of D1 and D2, inserted from largest to smallest]

Comparing |+| Definitions

Algorithmic Considerations

• Clusters maintained in sorted order by location
• Clusters frequently inserted / deleted / updated
• Query the cluster nearest to an incoming (x,m)
• Given (x,m), query the prefix-sum of cluster mass
– Σ(m’), over all clusters (x’,m’) where x’ <= x
• Do it all in logarithmic time!
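The two queries in the list above can be pinned down with a naive reference version. Note that this sketch runs in linear time over a plain sequence, so it only shows what the queries return; it does not implement the logarithmic-time tree the slides call for. `ClusterQueries`, `nearest` and `prefixSum` are hypothetical names.

```scala
object ClusterQueries {
  // A cluster is a (location, mass) pair.
  type Cluster = (Double, Double)

  // The cluster whose location is closest to x, if there is any cluster at all.
  def nearest(clusters: Seq[Cluster], x: Double): Option[Cluster] =
    if (clusters.isEmpty) None
    else Some(clusters.minBy { case (loc, _) => math.abs(loc - x) })

  // Sum of the masses m' over all clusters (x', m') with x' <= x.
  def prefixSum(clusters: Seq[Cluster], x: Double): Double =
    clusters.collect { case (loc, mass) if loc <= x => mass }.sum
}
```

For the clusters (2.5, 8), (3.4, 3), (6.0, 2), the nearest cluster to 5.0 is (6.0, 2) and the prefix-sum at 3.4 is 11.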

Backed By Balanced Tree

Scala Considerations

• Immutable Red/Black Tree
• Extends Map and MapLike
• Capabilities are Mixable Traits
– Red/Black
– Ordered
– Incrementable-Values
– Nearest-Neighbor
– Prefix-Sum
• Interface to Algebird Monoids & Aggregators

Discrete Distributions

if (tdigest.clusters.size <= max_discrete) {
  // few distinct values so far: increment the mass at x by m (or insert a new cluster)
  tdigest.clusters.increment(x, m)
} else {
  // too many distinct values: do the full t-digest cluster updating algorithm
  tdigest.update(x, m)
}

Experimental

Applications

• Quantile Estimation
• Feature Data Characterization
• Building CoDecs
• Value-At-Risk Modeling
• Generative Data Models

Thank You

eje@redhat.com
@manyangled
https://github.com/isarn/isarn-sketches
