After Dark 2.0: End-to-End, Real-time, Advanced Analytics and ML
Big Data Reference Pipeline
Atlanta Spark User Group, Sept 22, 2016
Thanks, Emory Continuing Education!
Chris Fregly, Research Scientist @ PipelineIO
We're Hiring - Only Nice People!
pipeline.io | advancedspark.com
Who Am I?
Research Scientist @ PipelineIO: github.com/fluxcapacitor/pipeline
Meetup Founder: Advanced Spark and Tensorflow Meetup
Book Author: Advanced …
Who Was I?
Streaming Data Engineer, Netflix Open Source Committer
Data Solutions Engineer, Apache Contributor
Principal Data Solutions Engineer, IBM Technology Center
Advanced Spark and Tensorflow Meetup: Meetup Metrics
Top 5 Most Active Spark Meetup!
4000+ Members in just 1 year!!
6000+ Downloads of Docker Image with many Meetup Demos!!!
@ advancedspark.com

Meetup Goals
Code dive deep into Spark and related open source code bases
Study integrations with Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.
Surface and share patterns and idioms of well-designed, distributed, big data processing systems
Atlanta Hadoop User Meetup Last Night
http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016
MLconf ATL Tomorrow
Current PipelineIO Research
Model Deploying and Testing
Model Scaling and Serving
Online Model Training
Dynamic Model Optimizing
PipelineIO Deliverables: 100% Open Source!!
Github: https://github.com/fluxcapacitor/
DockerHub: https://hub.docker.com/r/fluxcapacitor
Workshop: http://pipeline.io
Topics of This Talk (20-30 mins each)
① Spark Streaming and Spark ML: Generating Real-time Recommendations
② Spark Core: Tuning and Profiling
③ Spark SQL: Tuning and Customizing
Live, Interactive Demo! Kafka, Cassandra, ElasticSearch, Redis, Spark ML
Audience Participation Required!
You -> Audience Instructions
① Navigate to demo.pipeline.io
② Swipe software used in Production Only!
Data -> Scientist
This is Totally Anonymous!!
Topics of This Talk (15-20 mins each)
① Spark Streaming and Spark ML: Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core: Tuning and Profiling
③ Spark SQL: Tuning and Customizing
Mechanical Sympathy
"Hardware and software working together." - Martin Thompson, http://mechanical-sympathy.blogspot.com
"Whatever your data structure, my array will win." - Scott Meyers, every C++ book, basically
Spark and Mechanical Sympathy
100TB GraySort Challenge (Spark 1.1-1.2): Saturate Network I/O, Saturate Disk I/O
Project Tungsten (Spark 1.4-1.6+): Minimize Memory & GC, Maximize CPU Cache
CPU Cache Refresher
[Diagram: CPU cache hierarchy on my laptop; the last-level cache, aka "LLC"]
CPU Cache Sympathy (AlphaSort paper)
① Pointer sort: Pointer (4 bytes) = 4 bytes. Must dereference to compare key.
② Key + Ptr sort: Key (10 bytes) + Pointer (4 bytes) = 14 bytes. Pre-process & pull key from record.
③ Key + Pad + Ptr sort: Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes. CPU cache-line friendly!
④ Key-Prefix + Ptr sort: Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes. 2x CPU cache-line friendly! Dereference the full key only to resolve prefix duplicates (sketch below).
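To make the key-prefix trick concrete, here is a minimal Scala sketch (mine, not from the talk; the record layout and 4-byte prefix extraction are illustrative assumptions). Sort compact 8-byte (prefix, pointer) entries and touch the full key only on prefix ties:

import java.lang.Integer.compareUnsigned

// Hypothetical record: a 10-byte key; the first 4 bytes form the sort prefix.
case class Record(key: Array[Byte])

object KeyPrefixSort {
  // Pack the first 4 key bytes into an Int prefix (big-endian, unsigned order).
  def prefixOf(r: Record): Int =
    ((r.key(0) & 0xFF) << 24) | ((r.key(1) & 0xFF) << 16) |
    ((r.key(2) & 0xFF) << 8) | (r.key(3) & 0xFF)

  // Full lexicographic compare, needed only to resolve prefix duplicates.
  private def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    var i = 0
    while (i < a.length && i < b.length) {
      val d = (a(i) & 0xFF) - (b(i) & 0xFF)
      if (d != 0) return d
      i += 1
    }
    a.length - b.length
  }

  def sort(records: Array[Record]): Array[Record] = {
    // 8 bytes per entry: the hot sort loop stays inside the CPU cache.
    val entries = records.indices.map(i => (prefixOf(records(i)), i)).toArray
    val sorted = entries.sortWith { (e1, e2) =>
      if (e1._1 != e2._1) compareUnsigned(e1._1, e2._1) < 0        // no dereference
      else compareKeys(records(e1._2).key, records(e2._2).key) < 0 // prefix tie
    }
    sorted.map(e => records(e._2))
  }
}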
Sort Performance Comparison
Sequential vs Random Cache Misses
Demo! Sorting
Instrumenting and Monitoring CPU
Use the Linux perf command!
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
Results of Random vs. Sequential Sort
Naïve Random-Access Pointer Sort (Ptr -> Key: must dereference to compare key) vs. Cache-Friendly Sequential Key/Pointer Sort (Key + Ptr: pre-process & pull key from record)

% Change across the measured counters: -35%, -90%, -68%, -26%, -55% (all reductions)

perf stat -e \
  L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses
Demo! Matrix Multiplication
CPU Cache Naïve Matrix Multiplication

// Dot product of each row & column vector
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matB(k)(j)

Bad: the inner loop walks matB down a column (stride-n access), not using the full CPU cache line; ineffective pre-fetching.
CPU Cache Friendly Matrix Multiplication

// Transpose B
for (i <- 0 until numRowsB)
  for (j <- 0 until numColsB)
    matBT(i)(j) = matB(j)(i)

// Modify the algo for the transposed B
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matBT(j)(k)   // OLD: matB(k)(j)

Good: Full CPU Cache Line, Effective Prefetching (runnable comparison below)
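For local experimentation, a runnable, self-contained Scala sketch of both traversals (the matrix size and timing harness are mine, not from the talk):

object MatMulBench {
  def main(args: Array[String]): Unit = {
    val n = 512
    val matA = Array.fill(n, n)(math.random)
    val matB = Array.fill(n, n)(math.random)

    def time[T](label: String)(f: => T): T = {
      val t0 = System.nanoTime
      val result = f
      println(f"$label: ${(System.nanoTime - t0) / 1e6}%.1f ms")
      result
    }

    // Naive: matB(k)(j) strides a whole row length per k step (cache-hostile)
    time("naive") {
      val res = Array.ofDim[Double](n, n)
      for (i <- 0 until n; j <- 0 until n; k <- 0 until n)
        res(i)(j) += matA(i)(k) * matB(k)(j)
      res
    }

    // Cache-friendly: transpose B once, then stream both operands row-wise
    time("transposed") {
      val matBT = Array.ofDim[Double](n, n)
      for (i <- 0 until n; j <- 0 until n)
        matBT(i)(j) = matB(j)(i)
      val res = Array.ofDim[Double](n, n)
      for (i <- 0 until n; j <- 0 until n; k <- 0 until n)
        res(i)(j) += matA(i)(k) * matBT(j)(k)
      res
    }
  }
}

Run the JVM under perf stat to reproduce counter deltas like those on the next slide.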
Results of Matrix Multiplication
Naïve Matrix Multiply vs. Cache-Friendly Matrix Multiply

perf stat -e \
  L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, \
  LLC-prefetch-misses,cache-misses,stalled-cycles-frontend

% Change across the measured counters: -96%, -93%, -93%, -70%, -63%, -53%, plus one outlier of +8543%(?)
Demo! Thread Synchronization
Thread and Context Switch Sympathy
Problem: Atomically increment 2 counters (each by a different increment) from 1000's of simultaneous threads (benchmark harness sketch below)
Possible Solutions:
① Synchronized Immutable
② Synchronized Mutable
③ AtomicReference CAS
④ Volatile?
Context Switches are Expen$ive!!
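A minimal load-generating harness for this benchmark (mine, not from the talk), assuming each candidate below exposes an increment(Int, Int) method:

import java.util.concurrent.{CountDownLatch, Executors}

object CounterBench {
  // Hammer one increment(left, right) implementation from many threads.
  def hammer(increment: (Int, Int) => Unit,
             threads: Int = 1000, itersPerThread: Int = 10000): Unit = {
    val pool = Executors.newCachedThreadPool()
    val done = new CountDownLatch(threads)
    for (_ <- 1 to threads) pool.execute(new Runnable {
      def run(): Unit = {
        var i = 0
        while (i < itersPerThread) { increment(1, 2); i += 1 }
        done.countDown()
      }
    })
    done.await()
    pool.shutdown()
  }
}

// e.g. CounterBench.hammer(SynchronizedImmutableCounters.increment),
//      run under `perf stat` to compare context switches & cache misses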
Synchronized Immutable Counters

case class Counters(left: Int, right: Int)

object SynchronizedImmutableCounters {
  var counters = new Counters(0, 0)
  def getCounters(): Counters = {
    this.synchronized { counters }
  }
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    this.synchronized {
      counters = new Counters(counters.left + leftIncrement,
                              counters.right + rightIncrement)
    }
  }
}

Locks the whole outer object!!
Synchronized Mutable Counters

class MutableCounters(left: Int, right: Int) {
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    this.synchronized { … }
  }
  def getCountersTuple(): (Int, Int) = {
    this.synchronized { (counters.left, counters.right) }
  }
}

object SynchronizedMutableCounters {
  val counters = new MutableCounters(0, 0)
  …
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    counters.increment(leftIncrement, rightIncrement)
  }
}

Locks just the MutableCounters instance
Lock-Free AtomicReference Counters

case class Counters(left: Int, right: Int)

object LockFreeAtomicReferenceCounters {
  val counters = new AtomicReference[Counters](new Counters(0, 0))
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var originalCounters: Counters = null
    var updatedCounters: Counters = null
    do {
      originalCounters = counters.get()
      updatedCounters = new Counters(originalCounters.left + leftIncrement,
                                     originalCounters.right + rightIncrement)
    } // Retry the lock-free, optimistic compareAndSet() until the AtomicReference updates
    while (!counters.compareAndSet(originalCounters, updatedCounters))
  }
}

Lock Free!!
Lock-Free AtomicLong Counters

object LockFreeAtomicLongCounters {
  // a single Long (64 bits) will maintain 2 separate Ints (32 bits each)
  val counters = new AtomicLong()
  …
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var originalCounters = 0L
    var updatedCounters = 0L
    do {
      originalCounters = counters.get()
      …
      // Store two 32-bit Ints in one 64-bit Long
      // Use >>> 32 and << 32 to set and retrieve each Int from the Long (packing sketch below)
    }
    // Retry the lock-free, optimistic compareAndSet() until the AtomicLong updates
    while (!counters.compareAndSet(originalCounters, updatedCounters))
  }
}

Lock Free!!

Q: Why not use a @volatile long?
A: The JVM does not guarantee atomic updates of 64-bit longs and doubles.
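The slide elides the actual bit twiddling; one way to fill it in (a sketch assuming the left counter lives in the high 32 bits):

import java.util.concurrent.atomic.AtomicLong

object PackedAtomicLongCounters {
  val counters = new AtomicLong()

  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var original = 0L
    var updated = 0L
    do {
      original = counters.get()
      val left  = (original >>> 32).toInt   // high 32 bits
      val right = original.toInt            // low 32 bits
      updated = ((left + leftIncrement).toLong << 32) |
                ((right + rightIncrement).toLong & 0xFFFFFFFFL)
    } while (!counters.compareAndSet(original, updated))
  }

  def getCounters(): (Int, Int) = {
    val snapshot = counters.get()           // one atomic read yields both Ints
    ((snapshot >>> 32).toInt, snapshot.toInt)
  }
}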
Results of Thread Synchronization
Synchronized Immutable Case Class vs. Lock-Free AtomicLong

perf stat -e \
  context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses, \
  LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend

% Change across the measured counters, moving to the lock-free AtomicLong: -64%, -46%, -33%, -33%, -32%, -31%, -27%, -17% (all reductions)

Immutable Case Class:
case class Counters(left: Int, right: Int)
...
this.synchronized {
  counters = new Counters(counters.left + leftIncrement,
                          counters.right + rightIncrement)
}

Lock-Free AtomicLong:
val counters = new AtomicLong()
…
do { … }
while (!counters.compareAndSet(originalCounters, updatedCounters))
Profile Visualizations: Flame Graphs
Example: Spark Word Count
Java Stack Traces are Good! JDK 1.8+ (-XX:-Inline -XX:+PreserveFramePointer)
Plateaus are Bad! I/O stalls, heavy CPU serialization, etc.
Project Tungsten: CPU and Memory
Create Custom Data Structures & Algorithms
  Operate on serialized and compressed ByteArrays!
Minimize Garbage Collection
  Reuse ByteArrays
  In-place updates for aggregations
Maximize CPU Cache Effectiveness
  8-byte alignment
  AlphaSort-based Key-Prefix
Utilize Catalyst Dynamic Code Generation
  Dynamic optimizations using the entire query plan
  Developer implements genCode() to create Scala source code (String)
  Project Janino compiles the source code into JVM bytecode
Why is CPU the Bottleneck?
CPU is needed for serialization, hashing, & compression
Spark 1.2 updates saturated Network and Disk I/O
10x increase in I/O throughput relative to CPU
More partition, pruning, and pushdown support
Newer columnar file formats help reduce I/O
Custom Data Structs & Algos: Aggs
UnsafeFixedWidthAggregationMap
  Uses BytesToBytesMap internally
  In-place updates of serialized aggregations
  No object creation on the hot-path
TungstenAggregate & TungstenAggregationIterator
  Operate directly on serialized, binary UnsafeRows
  2 steps to avoid single-key OOMs:
  ① Hash-based (grouping) agg spills to disk if needed
  ② Sort-based agg performs an external merge sort on the spills
Custom Data Structures & Algorithms
o.a.s.util.collection.unsafe.sort:
  UnsafeSortDataFormat: SortDataFormat<RecordPointerAndKeyPrefix, Long[ ]>
    (Note: Mixing multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining.)
  UnsafeExternalSorter: In-place external sorting of spilled BytesToBytes data
  UnsafeShuffleWriter: Supports merging compressed records (if the compression CODEC supports it, ie. LZF)
  UnsafeInMemorySorter: In-place sorting of BytesToBytesMap data
  RecordPointerAndKeyPrefix: AlphaSort-based, 8-byte aligned sort key (Ptr + Key-Prefix: 2x CPU cache-line friendly!)
Code Generation
Problem:
  Boxing creates excessive objects
  Expression tree evaluations are costly
  The JVM can't inline polymorphic impls
  Lack of polymorphism == poor code design
Solution:
  Code generation enables inlining
  Rewrite and optimize code using the overall plan, 8-byte align
  Defer source code generation to each operator, UDF, UDAF
  Use Janino to compile the generated source code into bytecode (more IDE-friendly than Scala quasiquotes)
Autoscaling Spark Workers (Spark 1.5+)
Scaling up is easy :)
  SparkContext.addExecutors() until the max is reached
Scaling down is hard :(
  SparkContext.removeExecutors()
  Lose the RDD cache inside the Executor JVM
  Must rebuild active RDD partitions in another Executor JVM
Uses the External Shuffle Service from Spark 1.1-1.2
  If an Executor JVM dies/restarts, the shuffle keeps shufflin'! (config sketch below)
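For reference, this behavior is driven by Spark's dynamic allocation settings; a hedged spark-defaults sketch (the min/max values are illustrative):

spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true (the External Shuffle Service above)
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.maxExecutors=20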
"Hidden" Spark Submit REST API (the snitch)
http://arturmkrtchyan.com/apache-spark-hidden-rest-api

Submit Spark Job:
curl -X POST http://127.0.0.1:6066/v1/submissions/create \
  --header "Content-Type:application/json;charset=UTF-8" \
  --data '{
    "action" : "CreateSubmissionRequest",
    "mainClass" : "org.apache.spark.examples.SparkPi",
    "sparkProperties" : {
      "spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar",
      "spark.app.name" : "SparkPi",
      …
    }
  }'

Get Spark Job Status:
curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request>

Kill Spark Job:
curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request>
Outline
① Spark Streaming and Spark ML: Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core: Tuning and Profiling
③ Spark SQL: Tuning and Customizing
Parquet Columnar File Format
Based on the Google Dremel paper (~2010)
A collaboration with Twitter and Cloudera
Columnar storage format for fast columnar aggs
Supports evolving schemas
Supports pushdowns
Supports nested partitions
Tight compression
Min/max heuristics enable file and chunk skipping
Partitions
Partition Based on Data Access Patterns:
/genders.parquet/gender=M/…
                /gender=F/…   <-- Use Case: Access Users by Gender
                /gender=U/…

Dynamic Partition Creation (Write)
Dynamically create partitions on write, based on a column (ie. gender)
SQL: INSERT TABLE genders PARTITION (gender) SELECT …
DF:  gendersDF.write.format("parquet").partitionBy("gender")
       .save("/genders.parquet")

Partition Discovery (Read)
Dynamically infer partitions on read, based on paths (ie. /gender=F/…)
SQL: SELECT id FROM genders WHERE gender='F'
DF:  sqlContext.read.format("parquet").load("/genders.parquet/")
       .select($"id").where("gender='F'")
Pruning
Partition Pruning: filter out rows by partition
  SELECT id, gender FROM genders WHERE gender = 'F'
Column Pruning: filter out columns by column filter
  Extremely useful for columnar storage formats (Parquet)
  Skips entire blocks of columns
  SELECT id, gender FROM genders
Pushdowns
aka. Predicate or Filter Pushdowns
A predicate returns true or false for a given function
Filters rows deep inside the data source, reducing the number of rows returned
The Data Source must implement PrunedFilteredScan:

def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
Demo! File Formats, Partitions, Pushdowns, and Joins
Predicate Pushdowns & Filter Collapsing
No pushdown or combining: 2 extra passes through the data after retrieval
Filter combining: only 1 extra pass
Filter pushdown: no extra pass
Join Between Partitioned & Unpartitioned
Join Between Partitioned & Partitioned
Broadcast Join vs. Normal Shuffle Join
Cartesian Join vs. Inner Join
Visualizing the Query Plan
Effectiveness of Filter
Cost-based Join Optimization
Map-side Join & Distributed Cache (similar to MapReduce)
Peak Memory for Joins and Aggs: UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()
Data Source API
Relations (o.a.s.sql.sources.interfaces.scala)
  BaseRelation (abstract class): Provides the schema of the data
  TableScan (impl): Read all data from the source
  PrunedFilteredScan (impl): Column pruning & predicate pushdowns
  InsertableRelation (impl): Insert/overwrite data based on SaveMode
  RelationProvider (trait/interface): Handles options, BaseRelation factory
Filters (o.a.s.sql.sources.filters.scala)
  Filter (abstract class): Handles all filters supported by this source
  EqualTo (impl)
  GreaterThan (impl)
  StringStartsWith (impl)
Native Spark SQL Data Sources
JSON Data Source
DataFrame:
val ratingsDF = sqlContext.read.format("json")
  .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or --
val ratingsDF = sqlContext.read.json(        // json() convenience method
  "file:/root/pipeline/datasets/dating/ratings.json.bz2")

SQL Code:
CREATE TABLE genders USING json
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
Parquet Data Source
Configuration:
spark.sql.parquet.filterPushdown=true
spark.sql.parquet.mergeSchema=false (unless your schema is evolving)
spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable())
spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]

DataFrames:
val gendersDF = sqlContext.read.format("parquet")
  .load("file:/root/pipeline/datasets/dating/genders.parquet")
gendersDF.write.format("parquet").partitionBy("gender")
  .save("file:/root/pipeline/datasets/dating/genders.parquet")

SQL:
CREATE TABLE genders USING parquet
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
ElasticSearch Data Source
Github: https://github.com/elastic/elasticsearch-hadoop
Maven: org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code:
val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>")
df.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Overwrite)
  .options(esConfig)
  .save("<index>/<document-type>")
Cassandra Data Source
Github: https://github.com/datastax/spark-cassandra-connector
Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code:
ratingsDF.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "<keyspace>", "table" -> "<table>"))
  .save(…)
Tips for Cassandra Analytics
By-pass the Cassandra CQL "front door"
  CQL is optimized for transactions
  Bulk read and write directly against the SSTables
  Check out the Netflix OSS project "Aegisthus"
Cassandra becomes a first-class analytics option
  A replicated analytics cluster is no longer needed
Creating a Custom Data Source
① Study existing implementations
  o.a.s.sql.execution.datasources.jdbc.JDBCRelation
② Extend the base traits & implement the required methods (minimal sketch below)
  o.a.s.sql.sources.{BaseRelation, PrunedFilteredScan}

Spark JDBC (o.a.s.sql.execution.datasources.jdbc):
class JDBCRelation extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation

DataStax Cassandra (o.a.s.sql.cassandra):
class CassandraSourceRelation extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation
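To make the recipe concrete, a minimal sketch against the Spark 1.x data source API (DemoProvider, DemoRelation, and the tiny in-memory "source" are made up for illustration):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

class DemoProvider extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new DemoRelation(sqlContext)
}

class DemoRelation(val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType), StructField("gender", StringType)))

  // Spark hands us the required columns and the filters it can push down
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val data = Seq(Row(1, "F"), Row(2, "M"))  // stand-in for a real source
    val kept = data.filter(row => filters.forall {
      case EqualTo("gender", value) => row.getString(1) == value
      case _                        => true   // Spark re-applies filters we skip
    })
    val indexes = requiredColumns.map(schema.fieldNames.indexOf)  // column pruning
    sqlContext.sparkContext.parallelize(
      kept.map(row => Row.fromSeq(indexes.toSeq.map(row.get))))
  }
}

Load it with sqlContext.read.format(classOf[DemoProvider].getName).load().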
Demo! Create a Custom Data Source
Publishing Custom Data Sources
spark-packages.org
Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in SPARK-8159, SPARK-9571
Every UDF must use Expressions and implement Expression.genCode() to participate in the fun
Lambdas (RDD or Dataset API) and sqlContext.udf.registerFunction() are not enough!!
Creating a Custom UDF with Code Gen
① Study existing implementations
  o.a.s.sql.catalyst.expressions.Substring
② Extend and implement the base trait (hedged sketch below)
  o.a.s.sql.catalyst.expressions.Expression.genCode
③ Don't forget about Python!
  python.pyspark.sql.functions.py
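A sketch only, against the Spark 1.5/1.6-era internal API (this hook was reworked into doGenCode()/ExprCode in 2.0); DoubleUp is a made-up expression, and the exact signatures are assumptions to verify against your Spark version:

import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodeGenContext, GeneratedExpressionCode}
import org.apache.spark.sql.types.{AbstractDataType, DataType, IntegerType}

// Hypothetical expression: doubles an Int column, participating in codegen.
case class DoubleUp(child: Expression)
  extends UnaryExpression with ImplicitCastInputTypes {

  override def dataType: DataType = IntegerType
  override def inputTypes: Seq[AbstractDataType] = Seq(IntegerType)

  // Interpreted fallback path
  override def nullSafeEval(input: Any): Any = input.asInstanceOf[Int] * 2

  // Codegen path: emit Java source that Janino compiles into bytecode
  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String =
    defineCodeGen(ctx, ev, c => s"($c * 2)")
}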
Demo! Creating a Custom UDF participating in Code Generation
Spark 1.6 and 2.0 Improvements: Adaptiveness, Metrics, Datasets, and Streaming State
Adaptive Query Execution
Adapt query execution using data from previous stages
Dynamically choose spark.sql.shuffle.partitions (default 200)
Adaptive Hybrid Join: Broadcast Join for popular keys, Shuffle Join for not-so-popular keys
Adaptive Memory Management
Spark <1.6
  Manually configure the 2 memory regions:
  Spark execution engine (shuffles, joins, sorts, aggs): spark.shuffle.memoryFraction
  RDD Data Cache: spark.storage.memoryFraction
Spark 1.6+
  Unified memory regions
  Dynamically expand/contract the memory regions
  Supports a minimum for RDD storage (LRU Cache)
Metrics
Shows exact memory usage per operator & node
Helps debugging and identifying skew
Spark SQL API
Datasets: a type-safe API (similar to RDDs) utilizing Tungsten

val ds = sqlContext.read.text("ratings.csv").as[String]
val df = ds.flatMap(_.split(",")).filter(_ != "").toDF()  // RDD-like API, convert to DF
val agg = df.groupBy($"rating").agg(count("*") as "ct").orderBy($"ct" desc)

Typed Aggregators used alongside UDFs and UDAFs:

val simpleSum = new Aggregator[Int, Int, Int] with Serializable {
  def zero: Int = 0
  def reduce(b: Int, a: Int) = b + a
  def merge(b1: Int, b2: Int) = b1 + b2
  def finish(b: Int) = b
}.toColumn
val sum = Seq(1, 2, 3, 4).toDS().select(simpleSum)

Query files directly, without registerTempTable():
%sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json`
Spark Streaming State Management
New trackStateByKey() (sketch below)
  Store deltas, compact later
  More efficient per-key state updates
  Session TTL
  Integrated A/B Testing (?!)
Show Failed Output in the Admin UI
  Better debugging
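For the record, the 1.6 preview's trackStateByKey() shipped as mapWithState() in the Spark 1.6 GA release; a sketch of a per-key running count (keyedDStream and the types are illustrative):

import org.apache.spark.streaming.{Seconds, State, StateSpec}

// Running count per key, emitted every batch
val mappingFunc = (key: String, value: Option[Int], state: State[Long]) => {
  val sum = state.getOption.getOrElse(0L) + value.getOrElse(0)
  state.update(sum)
  (key, sum)
}

// keyedDStream: DStream[(String, Int)], e.g. from a Kafka source
val countsDStream = keyedDStream.mapWithState(
  StateSpec.function(mappingFunc)
           .timeout(Seconds(3600)))  // the "Session TTL" above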
Thank You!!
Chris Fregly, Research Scientist @ PipelineIO (http://pipeline.io)
San Francisco, California, USA
advancedspark.com
Sign up for the Meetup and the Book
Contribute on Github!
Run All Demos in Docker (~6000 Docker Downloads!!)
Find me on LinkedIn, Twitter, Github, Email, Fax