User Defined Aggregation In Apache Spark A Love Story Erik Erlandson Principal Software Engineer

User Defined Aggregation In Apache Spark A Love Story · 2020. 7. 15. · Sketching Data with T-Digest In Apache Spark Smart Scalable Feature Reduction With Random Forests One-Pass


User Defined Aggregation In Apache Spark A Love Story

Erik Erlandson

Principal Software Engineer


All Love Stories Are The Same

Hero Meets Aggregators

Hero Files Spark JIRA

Hero Merges Spark PR


Establish The Plot

Spark’s Scale-Out World

Logical data: 2 3 2 5 3 5 2 3 5

Physical partitions:

2 3 2

5 3 5

2 3 5

Scale-Out Sum

Update, within the partition 2 3 5:

s = 0
s = s + 2   (2)
s = s + 3   (5)
s = s + 5   (10)

Partial sums, one per partition:

2 3 5   10
5 3 5   13
2 3 2   7

Merge:

13 + 7 = 20
10 + 20 = 30

Spark Aggregators

Operation  Data     Accumulator   Zero    Update                Merge               Present
Sum        Numbers  Number        0       a + x                 a1 + a2
Max        Numbers  Number        -∞      max(a, x)             max(a1, a2)
Average    Numbers  (sum, count)  (0, 0)  (sum + x, count + 1)  (s1 + s2, c1 + c2)  sum / count
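
The columns of the aggregator table map onto a small interface. Below is a minimal sketch in plain Scala with no Spark dependency. `SimpleAggregator` and its method names are hypothetical, chosen to mirror the table, and are not Spark's API. Each physical partition from the Scale-Out slides is folded with `update`, and the partial accumulators are then combined with `merge`. The slides list no Present step for Sum and Max, so the identity is assumed here.

```scala
// Hypothetical trait mirroring the table's columns; this is not Spark's Aggregator API.
trait SimpleAggregator[D, A, R] {
  def zero: A                   // Zero: the empty accumulator
  def update(a: A, x: D): A     // Update: fold one data element into the accumulator
  def merge(a1: A, a2: A): A    // Merge: combine two partial accumulators
  def present(a: A): R          // Present: turn the final accumulator into a result

  // Fold every partition independently, then merge the partial accumulators.
  def run(partitions: Seq[Seq[D]]): R = {
    val partials = partitions.map(p => p.foldLeft(zero)(update))
    present(partials.foldLeft(zero)(merge))
  }
}

object AggregatorSketch {
  object Sum extends SimpleAggregator[Double, Double, Double] {
    def zero: Double = 0.0
    def update(a: Double, x: Double): Double = a + x
    def merge(a1: Double, a2: Double): Double = a1 + a2
    def present(a: Double): Double = a   // assumed: identity
  }

  object Max extends SimpleAggregator[Double, Double, Double] {
    def zero: Double = Double.NegativeInfinity
    def update(a: Double, x: Double): Double = math.max(a, x)
    def merge(a1: Double, a2: Double): Double = math.max(a1, a2)
    def present(a: Double): Double = a   // assumed: identity
  }

  object Average extends SimpleAggregator[Double, (Double, Long), Double] {
    def zero: (Double, Long) = (0.0, 0L)
    def update(a: (Double, Long), x: Double): (Double, Long) = (a._1 + x, a._2 + 1)
    def merge(a1: (Double, Long), a2: (Double, Long)): (Double, Long) =
      (a1._1 + a2._1, a1._2 + a2._2)
    def present(a: (Double, Long)): Double = a._1 / a._2
  }

  // The three physical partitions used in the Scale-Out slides.
  val partitions: Seq[Seq[Double]] =
    Seq(Seq(2.0, 3.0, 2.0), Seq(5.0, 3.0, 5.0), Seq(2.0, 3.0, 5.0))

  def main(args: Array[String]): Unit = {
    println(s"sum     = ${Sum.run(partitions)}")
    println(s"max     = ${Max.run(partitions)}")
    println(s"average = ${Average.run(partitions)}")
  }
}
```

Average is the case that needs a Present step: the accumulator `(sum, count)` can be merged across partitions, while the final ratio cannot.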


Love Interest

Data Sketching: T-Digest

[Figure: a CDF curve on a vertical axis from 0 to 1. The point (x, q) on the curve with q = 0.9 marks x as the 90th percentile.]

Is T-Digest an Aggregator?

Data Type          Numeric
Accumulator Type   T-Digest Sketch
Zero               Empty T-Digest
Update             tdigest + x
Merge              tdigest1 + tdigest2
Present            tdigest.cdfInverse(quantile)
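
To make that mapping concrete without a real T-Digest, the sketch below uses a deliberately naive stand-in. `ExactQuantiles` is a hypothetical class that keeps every value in a sorted vector, so it gives exact answers and grows without bound, which a T-Digest does not. Only the aggregator shape carries over: an empty zero, `+` to update, `++` to merge (the table writes both as `+`), and `cdfInverse` to present.

```scala
// Naive stand-in for a quantile sketch: stores all values, sorted.
// NOT a T-Digest (no compression, exact answers); it only shares the interface shape.
final case class ExactQuantiles(sorted: Vector[Double]) {
  // Update: insert one value, keeping the vector sorted.
  def +(x: Double): ExactQuantiles = {
    val (lo, hi) = sorted.span(_ <= x)
    ExactQuantiles((lo :+ x) ++ hi)
  }

  // Merge: combine two sketches into one.
  def ++(that: ExactQuantiles): ExactQuantiles =
    ExactQuantiles((sorted ++ that.sorted).sorted)

  // Present: smallest stored value whose empirical CDF is at least q (nearest-rank rule).
  def cdfInverse(q: Double): Double = {
    require(sorted.nonEmpty && q > 0.0 && q <= 1.0, "need data and 0 < q <= 1")
    val rank = math.ceil(q * sorted.length).toInt
    sorted(rank - 1)
  }
}

object ExactQuantiles {
  // Zero: the empty sketch.
  val empty: ExactQuantiles = ExactQuantiles(Vector.empty)
}

object QuantileSketchDemo {
  // Same three partitions as the Scale-Out slides.
  val partitions: Seq[Seq[Double]] =
    Seq(Seq(2.0, 3.0, 2.0), Seq(5.0, 3.0, 5.0), Seq(2.0, 3.0, 5.0))

  // One sketch per partition (update), then a single merged sketch (merge).
  val partials: Seq[ExactQuantiles] =
    partitions.map(p => p.foldLeft(ExactQuantiles.empty)((s, x) => s + x))
  val merged: ExactQuantiles =
    partials.foldLeft(ExactQuantiles.empty)((a, b) => a ++ b)

  def main(args: Array[String]): Unit = {
    println(s"median = ${merged.cdfInverse(0.5)}")
    println(s"p90    = ${merged.cdfInverse(0.9)}")
  }
}
```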


Romantic Chemistry

val sketchCDF = tdigestUDAF[Double]

spark.udf.register("p50", (c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5))

spark.udf.register("p90", (c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))

Romantic Chemistry

val query = records
  .writeStream //...

+---------+
|wordcount|
+---------+
|       12|
|        5|
|        9|
|       18|
|       12|
+---------+

val r = records.withColumn("time", current_timestamp())
  .groupBy(window($"time", "30 seconds"))
  .agg(sketchCDF($"wordcount").alias("CDF"))
  .select(callUDF("p50", $"CDF").alias("p50"), callUDF("p90", $"CDF").alias("p90"))

val query = r.writeStream //...

+----+----+
| p50| p90|
+----+----+
|15.6|31.0|
|16.0|30.8|
|15.8|30.0|
|15.7|31.0|
|16.0|31.0|
+----+----+


Romantic Montage

Sketching Data with T-Digest In Apache Spark

Smart Scalable Feature Reduction With Random Forests

One-Pass Data Science In Apache Spark With Generative T-Digests

Apache Spark for Library Developers

Extending Structured Streaming Made Easy with Algebra


Conflict!

UDAF Anatomy

class TDigestUDAF(deltaV: Double, maxDiscreteV: Int) extends UserDefinedAggregateFunction {

  def initialize(buf: MutableAggregationBuffer): Unit =
    buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))

  def update(buf: MutableAggregationBuffer, input: Row): Unit =
    buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))

  def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit =
    buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++ buf2.getAs[TDigestSQL](0).tdigest)

  def bufferSchema: StructType = StructType(StructField("tdigest", TDigestUDT) :: Nil)

  // yada yada yada ...
}

User Defined Type Anatomy

class TDigestUDT extends UserDefinedType[TDigestSQL] {

  def sqlType: DataType = StructType(
    StructField("delta", DoubleType, false) ::
    StructField("maxDiscrete", IntegerType, false) ::
    StructField("nclusters", IntegerType, false) ::
    StructField("clustX", ArrayType(DoubleType, false), false) ::
    StructField("clustM", ArrayType(DoubleType, false), false) ::
    Nil)

  def serialize(tdsql: TDigestSQL): Any = { /* pack T-Digest */ }

  def deserialize(datum: Any): TDigestSQL = { /* unpack T-Digest */ }

  // yada yada yada ...
}

Expensive

What Could Go Wrong?

class TDigestUDT extends UserDefinedType[TDigestSQL] {

  def serialize(tdsql: TDigestSQL): Any = {
    print("In serialize")
    // ...
  }

  def deserialize(datum: Any): TDigestSQL = {
    print("In deserialize")
    // ...
  }

  // yada yada yada ...
}

What Could Go Wrong?

2 3 2   Init, Updates, Serialize
5 3 5   Init, Updates, Serialize
2 3 5   Init, Updates, Serialize

Merge

Wait What?

val sketchCDF = tdigestUDAF[Double]

val data = /* data frame with 1000 rows of data */

val sketch = data.agg(sketchCDF($"column").alias("sketch")).first

In deserialize
In serialize
In deserialize
In serialize
… 997 more times!
In deserialize
In serialize

Oh No

def update(buf: MutableAggregationBuffer, input: Row): Unit =
  buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))

// is equivalent to ...

def update(buf: MutableAggregationBuffer, input: Row): Unit = {
  val tdigest = buf.getAs[TDigestSQL](0).tdigest  // deserialize
  val updated = tdigest + input.getDouble(0)      // do the actual update
  buf(0) = TDigestSQL(updated)                    // re-serialize
}
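
The cost of that pattern can be counted without Spark. What follows is a plain-Scala simulation, not Spark's internals: `Serde`, `perRowUpdate`, and `perPartitionUpdate` are hypothetical names, and a `Vector[Double]` stands in for the sketch. It contrasts a buffer held in serialized form, which round-trips on every row, with an accumulator that stays a live object and is serialized once when the partition is done.

```scala
object SerdeCostSketch {
  // Counts how often the (simulated) expensive conversions run.
  final class Serde {
    var serializeCalls = 0
    var deserializeCalls = 0

    def serialize(acc: Vector[Double]): Array[Double] = {
      serializeCalls += 1
      acc.toArray
    }

    def deserialize(packed: Array[Double]): Vector[Double] = {
      deserializeCalls += 1
      packed.toVector
    }
  }

  // UDAF-style: the buffer holds the serialized form, so each row pays a full round trip.
  def perRowUpdate(rows: Seq[Double], serde: Serde): Array[Double] = {
    var buf: Array[Double] = Array.empty[Double]  // empty buffer, already in packed form
    for (x <- rows) {
      val acc = serde.deserialize(buf)            // deserialize
      val updated = acc :+ x                      // do the actual update
      buf = serde.serialize(updated)              // re-serialize
    }
    buf
  }

  // Aggregator-style: update a live object, serialize once at the end of the partition.
  def perPartitionUpdate(rows: Seq[Double], serde: Serde): Array[Double] = {
    val acc = rows.foldLeft(Vector.empty[Double])(_ :+ _)
    serde.serialize(acc)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq.tabulate(1000)(_.toDouble)

    val perRow = new Serde
    perRowUpdate(rows, perRow)
    println(s"per-row:       ${perRow.deserializeCalls} deserialize, ${perRow.serializeCalls} serialize")

    val perPartition = new Serde
    perPartitionUpdate(rows, perPartition)
    println(s"per-partition: ${perPartition.deserializeCalls} deserialize, ${perPartition.serializeCalls} serialize")
  }
}
```

For 1000 rows the per-row variant performs 1000 deserialize and 1000 serialize calls, the same count as the printed trace in the talk, while the per-partition variant serializes once.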


SPARK-27296


Resolution


#25024

Aggregator Anatomy

class TDigestAggregator(deltaV: Double, maxDiscreteV: Int) extends Aggregator[Double, TDigestSQL, TDigestSQL] {

  def zero: TDigestSQL = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))

  def reduce(b: TDigestSQL, a: Double): TDigestSQL = TDigestSQL(b.tdigest + a)

  def merge(b1: TDigestSQL, b2: TDigestSQL): TDigestSQL =
    TDigestSQL(b1.tdigest ++ b2.tdigest)

  def finish(b: TDigestSQL): TDigestSQL = b

  val serde = ExpressionEncoder[TDigestSQL]()

  def bufferEncoder: Encoder[TDigestSQL] = serde

  def outputEncoder: Encoder[TDigestSQL] = serde
}

Intuitive Serialization

2 3 2   Init, Updates, Serialize
5 3 5   Init, Updates, Serialize
2 3 5   Init, Updates, Serialize

Merge

Custom Aggregation in Spark 3.0

import org.apache.spark.sql.functions.udaf

val sketchAgg = TDigestAggregator(0.5, 0)

val sketchCDF: UserDefinedFunction = udaf(sketchAgg)

val sketch = data.agg(sketchCDF($"column")).first

Performance

scala> val sketchOld = TDigestUDAF(0.5, 0)
sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...

scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)

scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...

scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)

70x Faster
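
The 70x figure can be checked against the two sample arrays in the benchmark session. A short check in plain Scala, assuming the speedup is the ratio of mean timings (the slide does not say which statistic was used); `SpeedupCheck` is a hypothetical name:

```scala
object SpeedupCheck {
  // Timings copied from the two Benchmark.sample(5) results in the session.
  val oldTimes: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)
  val newTimes: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)

  def mean(xs: Array[Double]): Double = xs.sum / xs.length

  // Ratio of mean timings: how many times faster the new path ran (assumed definition).
  val speedup: Double = mean(oldTimes) / mean(newTimes)

  def main(args: Array[String]): Unit =
    println(f"speedup = $speedup%.2fx")
}
```

Under that assumption the ratio comes out just under 70, consistent with the rounded figure on the slide.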


Epilogue


Don’t Give Up


Patience


Respect


Erik Erlandson

Principal Software Engineer

[email protected]

@ManyAngled