User Defined Aggregation In Apache Spark: A Love Story
Erik Erlandson, Principal Software Engineer
All Love Stories Are The Same
Hero Meets Aggregators
Hero Files Spark JIRA
Hero Merges Spark PR
Establish The Plot
Spark’s Scale-Out World

[diagram: one logical sequence of values (2 3 2 5 3 5 2 3 5) is stored physically as three partitions: (2 3 2), (5 3 5), (2 3 5)]
Scale-Out Sum

[diagram: within partition (2 3 5), an accumulator starts at s = 0 and updates per element: s + 2 (2), s + 3 (5), s + 5 (10)]

[diagram: each partition produces a partial sum: (2 3 5) → 10, (5 3 5) → 13, (2 3 2) → 7; the partials then merge: 13 + 7 = 20, then 10 + 20 = 30]
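The scale-out sum above can be sketched in plain Scala (no Spark needed): each partition folds its own partial sum from the zero value, and the partials are then merged.

```scala
// Three "partitions" of the logical dataset from the diagram
val partitions = Seq(Seq(2, 3, 5), Seq(5, 3, 5), Seq(2, 3, 2))

// Update: fold each partition independently, starting from the zero value 0
val partials = partitions.map(_.foldLeft(0)(_ + _))   // 10, 13, 7

// Merge: combine the per-partition partial sums into the final result
val total = partials.reduce(_ + _)                    // 30
```

The same two-phase shape (per-partition update, then cross-partition merge) is what every Spark aggregator follows.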
Spark Aggregators

Operation | Data    | Accumulator  | Zero   | Update               | Merge              | Present
Sum       | Numbers | Number       | 0      | a + x                | a1 + a2            |
Max       | Numbers | Number       | -∞     | max(a, x)            | max(a1, a2)        |
Average   | Numbers | (sum, count) | (0, 0) | (sum + x, count + 1) | (s1 + s2, c1 + c2) | sum / count
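The table above can be captured as a small algebra in plain Scala; the trait and names here are illustrative, not Spark's API.

```scala
// Zero / Update / Merge / Present, as in the aggregator table
trait Agg[X, A, R] {
  def zero: A                    // Zero
  def update(a: A, x: X): A      // Update
  def merge(a1: A, a2: A): A     // Merge
  def present(a: A): R           // Present
}

// The "Average" row: accumulator is (sum, count); present divides them
object AvgAgg extends Agg[Double, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)
  def update(a: (Double, Long), x: Double): (Double, Long) = (a._1 + x, a._2 + 1)
  def merge(a1: (Double, Long), a2: (Double, Long)): (Double, Long) =
    (a1._1 + a2._1, a1._2 + a2._2)
  def present(a: (Double, Long)): Double = a._1 / a._2
}

// Scale-out shape: update within each partition, merge across partitions
val partitions = Seq(Seq(2.0, 3.0, 5.0), Seq(5.0, 3.0, 5.0), Seq(2.0, 3.0, 2.0))
val partials = partitions.map(_.foldLeft(AvgAgg.zero)(AvgAgg.update))
val result = AvgAgg.present(partials.reduce(AvgAgg.merge))  // 30.0 / 9
```

Note that Average needs a Present step precisely because its accumulator (sum, count) is not itself the answer.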
Love Interest
Data Sketching: T-Digest

[diagram: a CDF curve mapping values to quantiles on [0, 1]; reading off the point (x, q) at q = 0.9, x is the 90th %-ile]
Is T-Digest an Aggregator?

Data Type        | Numeric
Accumulator Type | T-Digest Sketch
Zero             | Empty T-Digest
Update           | tdigest + x
Merge            | tdigest1 + tdigest2
Present          | tdigest.cdfInverse(quantile)
Romantic Chemistry

val sketchCDF = tdigestUDAF[Double]

spark.udf.register("p50",
  (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5))
spark.udf.register("p90",
  (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
Romantic Chemistry

val query = records.writeStream // ...

+---------+
|wordcount|
+---------+
|       12|
|        5|
|        9|
|       18|
|       12|
+---------+

val r = records.withColumn("time", current_timestamp())
  .groupBy(window($"time", "30 seconds"))
  .agg(sketchCDF($"wordcount").alias("CDF"))
  .select(callUDF("p50", $"CDF").alias("p50"),
          callUDF("p90", $"CDF").alias("p90"))
val query = r.writeStream // ...

+----+----+
| p50| p90|
+----+----+
|15.6|31.0|
|16.0|30.8|
|15.8|30.0|
|15.7|31.0|
|16.0|31.0|
+----+----+
Romantic Montage
Sketching Data with T-Digest In Apache Spark
Smart Scalable Feature Reduction With Random Forests
One-Pass Data Science In Apache Spark With Generative T-Digests
Apache Spark for Library Developers
Extending Structured Streaming Made Easy with Algebra
Conflict!
UDAF Anatomy

class TDigestUDAF(deltaV: Double, maxDiscreteV: Int) extends
    UserDefinedAggregateFunction {

  def initialize(buf: MutableAggregationBuffer): Unit =
    buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))

  def update(buf: MutableAggregationBuffer, input: Row): Unit =
    buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))

  def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit =
    buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++
                         buf2.getAs[TDigestSQL](0).tdigest)

  def bufferSchema: StructType = StructType(StructField("tdigest", TDigestUDT) :: Nil)

  // yada yada yada ...
}
User Defined Type Anatomy

class TDigestUDT extends UserDefinedType[TDigestSQL] {
  def sqlType: DataType = StructType(
    StructField("delta", DoubleType, false) ::
    StructField("maxDiscrete", IntegerType, false) ::
    StructField("nclusters", IntegerType, false) ::
    StructField("clustX", ArrayType(DoubleType, false), false) ::
    StructField("clustM", ArrayType(DoubleType, false), false) ::
    Nil)

  def serialize(tdsql: TDigestSQL): Any = { /* pack T-Digest */ }      // Expensive
  def deserialize(datum: Any): TDigestSQL = { /* unpack T-Digest */ }  // Expensive

  // yada yada yada ...
}
What Could Go Wrong?

class TDigestUDT extends UserDefinedType[TDigestSQL] {
  def serialize(tdsql: TDigestSQL): Any = {
    print("In serialize") // ...
  }
  def deserialize(datum: Any): TDigestSQL = {
    print("In deserialize") // ...
  }
  // yada yada yada ...
}
What Could Go Wrong?

[diagram: the picture you'd expect — each partition (2 3 2), (5 3 5), (2 3 5) runs Init → Updates → Serialize once, then the serialized results Merge]
Wait What?

val sketchCDF = tdigestUDAF[Double]
val data = /* data frame with 1000 rows of data */
val sketch = data.agg(sketchCDF($"column").alias("sketch")).first

In deserialize
In serialize
In deserialize
In serialize
... 997 more times !
In deserialize
In serialize
Oh No

def update(buf: MutableAggregationBuffer, input: Row): Unit =
  buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))

// is equivalent to ...

def update(buf: MutableAggregationBuffer, input: Row): Unit = {
  val tdigest = buf.getAs[TDigestSQL](0).tdigest  // deserialize
  val updated = tdigest + input.getDouble(0)      // do the actual update
  buf(0) = TDigestSQL(updated)                    // re-serialize
}
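The cost model above can be demonstrated with a toy buffer in plain Scala: every read through the buffer deserializes the accumulator and every write re-serializes it, so a per-row update pays one full round trip per input row. The Buffer class and counters here are illustrative, not Spark's internals.

```scala
// Toy model of a UDAF aggregation buffer whose contents live in serialized form
var serializes = 0
var deserializes = 0

class Buffer(private var packed: Vector[Double]) {  // "packed" stands in for the serialized bytes
  def get: Vector[Double] = { deserializes += 1; packed }   // read: deserialize
  def set(v: Vector[Double]): Unit = { serializes += 1; packed = v }  // write: re-serialize
}

val buf = new Buffer(Vector.empty)
val rows = (1 to 1000).map(_.toDouble)

// One update per row, exactly like the UDAF's update method
rows.foreach { x => buf.set(buf.get :+ x) }

// serializes == 1000, deserializes == 1000: a ser/de round trip per row
```

That quadratic-feeling overhead, not the t-digest arithmetic itself, is what dominated the UDAF's runtime.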
SPARK-27296
Resolution
#25024
Aggregator Anatomy
class TDigestAggregator(deltaV: Double, maxDiscreteV: Int) extends
Aggregator[Double, TDigestSQL, TDigestSQL] {
def zero: TDigestSQL = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
def reduce(b: TDigestSQL, a: Double): TDigestSQL = TDigestSQL(b.tdigest + a)
def merge(b1: TDigestSQL, b2: TDigestSQL): TDigestSQL =
TDigestSQL(b1.tdigest ++ b2.tdigest)
def finish(b: TDigestSQL): TDigestSQL = b
val serde = ExpressionEncoder[TDigestSQL]()
def bufferEncoder: Encoder[TDigestSQL] = serde
def outputEncoder: Encoder[TDigestSQL] = serde
}
Intuitive Serialization

[diagram: with Aggregator, each partition (2 3 2), (5 3 5), (2 3 5) runs Init → Updates → Serialize once, then Merge]
Custom Aggregation in Spark 3.0

import org.apache.spark.sql.functions.udaf

val sketchAgg = TDigestAggregator(0.5, 0)
val sketchCDF: UserDefinedFunction = udaf(sketchAgg)
val sketch = data.agg(sketchCDF($"column")).first
Performance

scala> val sketchOld = TDigestUDAF(0.5, 0)
sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...

scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)

scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...

scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)
70x Faster
Epilogue
Don’t Give Up
Patience
Respect
Erik Erlandson, Principal Software Engineer
Erik Erlandson
[email protected]
@ManyAngled