Sparklife - Life In The Trenches With Spark


Life in the Trenches with Apache Spark

Who? What?

- I work for Kogentix, Inc.
- Big Data projects, often with Spark (Core, Streaming, ML)
- @carsondial on Twitter


Who is the talk for?

- People using Apache Spark
- People deciding whether to use Apache Spark
- People who love hearing about Apache Spark
- Will it get technical? A bit! (but not too much)


Our itinerary:

- Good Things!
- Bad Things!
- Tips! Tricks! (hopefully the useful section!)
- Other Frameworks Are Available


Good Things!

- Developer-friendly API for batch and streaming
- So much faster than MapReduce (in general!)
- Fits in well with the Hadoop ecosystem
- Batteries included (ML / GraphX / etc.)
- Tons of enterprise support


Bad Things!

- Spark is not magic pixie dust!
- Distributed systems are hard!
- And…well, let's talk about the elephant…


The (Apparent) Motto of Apache Spark Development:


In Memoriam

This slide is dedicated to all those who created a production system using MLlib just as Spark added ML Pipelines.


(and also…)

(btw, if you haven't upgraded to Spark 2.0 yet, be aware that the change to MurmurHash3 for hashing means you'll have to retrain all your models that use HashingTF…)


Keeping Up With The Joneses

The Spark development team moves pretty fast, especially for a project in the Hadoop ecosystem.

- Sometimes Catalyst will fail on a query / job that worked perfectly fine in the previous version
- New features (e.g. KafkaDirectStream / Datasets) can be pushed before they're entirely ready
- Sometimes you get big surprises (Hello, Structured Streaming!)
- Hadoop distributions can lag (WHERE'S MY SHINY? IT SHOULD HAVE THE SHINY!?)


(cont…)

- Not backporting fixes to earlier versions of Spark
  - e.g. Spark < 2.0 not having support for Kafka 0.9+ clients, despite it being a rather useful enterprise-y thing to have [SPARK-12177]
- Reading the release notes is essential. Yes, I know you do it, but still…


(and yet more…)

- API stability is a big issue
- "I want to create a Spark Streaming solution with Kafka and Spark 2.0! How should I do it?" "Er…"
- Hold off on Structured Streaming for now…


ENOUGH NEGATIVITY!

- The Spark maintainers are acutely aware of these issues
- Recent ongoing discussions have focused on improving things
- Hurrah!
- Now, on to tips and tricks


Developers!

- Use Scala if you can
- Java 8 will preserve your sanity if you have to use Java


Why Not Python?

- You will likely take a performance hit with PySpark
- Python interpreter memory overhead on top of the JVM
- Pickling RDDs and serializing them to Python and back to the JVM
- Often lags behind Scala / Java in new features


More Tips

- I'm legally mandated to mention: don't use collect() or groupByKey() in your code (see the sketch below)
- Be careful with your closures!
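
A minimal sketch of the usual alternatives: reduceByKey() combines values on each partition before the shuffle (unlike groupByKey(), which ships every value across the network), and a bounded action like take() avoids dragging the whole dataset onto the driver the way collect() does. The path and word-count logic are hypothetical.

    import org.apache.spark.SparkContext

    def topWords(sc: SparkContext, path: String): Array[(String, Int)] = {
      val words = sc.textFile(path)
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))

      // reduceByKey: map-side combine first, then shuffle only partial sums
      val counts = words.reduceByKey(_ + _)

      // A bounded action (or saveAsTextFile) instead of collect()
      counts.take(20)
    }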


Which data structure to use?

- Datasets (Catalyst and fancy Encoders!)
- DataFrames (Catalyst!)
- RDDs (you're on your own…)
- Use Kryo serialization when you can if you're not using Encoders (see the sketch below)
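
A minimal sketch of turning Kryo on for RDD-based jobs; MyRecord is a hypothetical stand-in for your own classes.

    import org.apache.spark.SparkConf

    // Hypothetical application class you shuffle a lot of
    case class MyRecord(id: Long, name: String)

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes stops Kryo writing the full class name with every object
      .registerKryoClasses(Array(classOf[MyRecord]))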


How many partitions should I have?

- Goldilocks Problem
- Too few - not enough parallelism
- Too many - too much parallelism and lose time in scheduling
- Remember repartition() is not a free operation


Partition Rule of Thumb

- 3x cores in your cluster as a baseline (see the sketch below)
- Experiment until performance suffers (increase by a factor of <2 each time)
- gzipped files may not be your friend for partitions
- Partitions in streaming often set by streaming source (e.g. Kafka)
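
A minimal sketch of that baseline, assuming a SparkContext sc is in scope; the cluster size and path are hypothetical, and note that gzipped inputs are not splittable, so the partition hint won't help them.

    // Hypothetical cluster: 10 executors x 5 cores
    val totalCores = 10 * 5
    val basePartitions = 3 * totalCores

    // minPartitions is only a hint for splittable inputs
    val logs = sc.textFile("hdfs:///data/logs/*", minPartitions = basePartitions)

    // repartition() is a full shuffle: do it once, deliberately
    val rebalanced = logs.repartition(basePartitions)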


Map vs. MapPartitions

- .map() works on an element-by-element basis
- .mapPartitions() works on a partition basis
- Can be very helpful when working with heavy objects / connections / etc.
- Don't accidentally consume the iterator! (e.g. converting to a list, using size()/count(), etc.) - see the sketch below


One more thing

- Beware Java!
- mapPartitions() in Java may bring the entire partition into memory
- Embrace Scala, JOIIIIIN USSSSSSS


Streaming Tips

- You need to get your processing done within the batch duration
- Backpressure!
- Prefer mapWithState() over updateStateByKey(), as it allows timeouts and modifying state without having to iterate over the entire state space (see the sketch below)
- See last year's ATO talk for more streaming tips!
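
A minimal sketch of mapWithState() with a timeout, plus the one-line backpressure setting; the socket source and the running-count logic are hypothetical stand-ins for your real stream.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext}

    val conf = new SparkConf().setAppName("counts")
      .set("spark.streaming.backpressure.enabled", "true")  // throttle input when batches fall behind
    val ssc = new StreamingContext(conf, Seconds(10))        // the batch duration you must stay inside

    // Hypothetical keyed source; in practice this would come from Kafka etc.
    val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

    def updateCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
      val newTotal = state.getOption.getOrElse(0L) + value.getOrElse(0)
      if (!state.isTimingOut()) state.update(newTotal)       // can't update a state that is timing out
      (key, newTotal)
    }

    // Only keys seen in this batch are touched; idle keys time out after 30 minutes
    val spec = StateSpec.function(updateCount _).timeout(Minutes(30))
    val totals = pairs.mapWithState(spec)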


Streaming Sources

- Use Apache Kafka if you have an Ops team that can support it
- Kinesis if you don't, or you can live with the restrictions of Kinesis over Kafka (and you're in AWS)


SparkSQL

- Don't rely on JSON schema reflection if you can help it (see the sketch below)
- Large JSON schemas may break Hive (or at least require you to alter things deep in the Hive DDL)
- Try to push filters down into the source layer when possible
- Parquet ALL THE THINGS
- Custom UDFs are (currently) opaque to Catalyst (non-JVM languages are even worse here!)
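
A minimal sketch of supplying a schema up front and letting a filter push down into a Parquet scan; the paths and field names are hypothetical.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("events").getOrCreate()

    // Explicit schema: no reflection pass over the JSON, no surprise inferred types
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("country", StringType),
      StructField("amount", DoubleType)
    ))

    val events = spark.read.schema(schema).json("hdfs:///data/events.json")

    // Stored as Parquet, the country filter can be pushed down to the scan
    events.write.parquet("hdfs:///data/events.parquet")
    spark.read.parquet("hdfs:///data/events.parquet")
      .filter(col("country") === "US")
      .show()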


Testing

- It's hard! But do it anyway
- spark-testing-base (see the sketch below)
- Maintained by Holden Karau
- Great set of tools for testing RDDs, DataFrames, Datasets, and streaming operations
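
A minimal sketch of the kind of test spark-testing-base makes easy, assuming ScalaTest and the library's SharedSparkContext trait are on your test classpath; the word-count logic is a hypothetical stand-in for your own job.

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    class WordCountSpec extends FunSuite with SharedSparkContext {
      test("reduceByKey sums counts per word") {
        // sc is created (and torn down) for you by SharedSparkContext
        val counts = sc.parallelize(Seq("spark", "flink", "spark"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .collectAsMap()

        assert(counts("spark") === 2)
      }
    }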


Test Strategies

- Is it correct? (spark-testing-base provides approximate matchers too!)
- Is it correct under production-level loads?
- Consider a shadow cluster for streaming


Operations

- The key to a successful Spark solution
- Don't ignore Ops
- So many knobs to fiddle with!


Deploying Spark Jobs

- Don't rely on spark-submit for too long (do you really want users to have to log in to a production server to kick off a new job?)
- Use Livy or Spark-Job-Submit as soon as possible to solve this with another layer of indirection!


Upgrading Spark Streaming Applications

- Yay! I've turned checkpointing on and I'm super-resilient!
- Now I'm going to upgrade my app!
- Why has everything exploded?
- Checkpointing only works with the same code. Change anything…and boom.


THIS IS FINE.

- Delete the checkpoint and it'll work
- But…offsets for streaming?
- Store them in ZooKeeper and obtain them on start (see the sketch below)
- (do you actually need checkpointing in that case? Possibly not!)
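
A minimal sketch of the commit side of that pattern for a Kafka direct stream: the HasOffsetRanges cast is the real Spark API (shown here from the 0.8 integration package; the kafka010 package has the same classes), while saveOffsetsToZk is a hypothetical helper backed by whatever ZooKeeper client you already use, and stream is assumed to come from KafkaUtils.createDirectStream.

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    // Hypothetical helper: persist (topic, partition, untilOffset) triples to ZooKeeper
    def saveOffsetsToZk(ranges: Array[OffsetRange]): Unit = ???

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Do the real work for this batch first...
      rdd.foreachPartition(_ => ())   // placeholder for your processing

      // ...then record where we got to, so a restarted job can resume from here
      saveOffsetsToZk(offsetRanges)
    }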


Scheduling Jobs.

- OR: We need to talk about Apache Oozie
- It can do anything you can throw at it
- Providing "anything" is in a Turing-complete DSL embedded in XML
- Which may or may not validate, even if written correctly
- And a web UI that sometimes rises to the level of 'tolerable'


Oozie. Poor Oozie.

- But! Less hate!
- It is not a sexy part of the Hadoop ecosystem
- It really can handle almost any scenario
- Also, what if you didn't have to write the XML?


Arbiter

- Write your Oozie workflows in YAML
- 100% more hipster-compliant
- Seriously, up to about 20% less typing, and handles dependencies for you
- Try it, and maybe you won't hate Oozie so much


Monitoring

- WebUI is great, but perhaps it would be better fed into your existing monitoring solution?
- CODA HALE TO THE RESCUE!
- Send metrics to CSV, JMX, slf4j, or Graphite


Graphite monitoring

    import org.apache.spark.{SparkConf, SparkContext}

    // graphiteHostName is assumed to be supplied by your own configuration
    val sparkConf = new SparkConf()
      .set("spark.metrics.conf.*.sink.graphite.class",
           "org.apache.spark.metrics.sink.GraphiteSink")
      .set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)

    val sc = new SparkContext(sparkConf)


Monitoring

- Also, you're directing all your logs/metrics from your executors and drivers to a central logging system, aren't you? And Kafka?
- Splunk / Datadog / ELK (Elasticsearch, Logstash, Kibana) are your friends
- Include OS metrics too!


Debugging Issues

- Spark WebUI is a good place to start for drilling down into tasks
- The OS is still important (e.g. memory, OOM killer, Xen hypervisor networking, etc.)
- Distributed systems are hard!


flaaaaaamegraphs

- Invented by Brendan Gregg (Netflix, Joyent, Sun)
- Most common type is the on-CPU flamegraph
- Width of a stack sample is how often that stack sample is on CPU


Spark-flame

- Simple and dirty Ansible playbook
- Attaches to a Spark cluster running on YARN
- Generates perf data and pulls it back down for flamegraphs
- One flamegraph per executor; can be combined into one graph
- https://github.com/falloutdurham/spark-flame


[Flame graph rendered here: executor stacks dominated by JVM Interpreter and libjvm.so frames, with scala/collection Iterator and Growable foreach calls visible]


Performance!

- How Many Executors Should We Use?
- How Much Memory Do We Need?
- What About Garbage Collection?


Executors

- Hello Goldilocks again!
- Small numbers of large executors = long GC pauses, low parallelism
- Large number of small executors = frequent GC, memory errors, other Bad Things


So How Many?

- Stay between 3-5 cores per executor
- 64GB is a decent upper memory limit per executor
- Remember the driver needs memory and cores too!
- Experiment, experiment, experiment (a starting point is sketched below)
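
A minimal sketch of one starting point, expressed as SparkConf settings for a YARN cluster; the actual numbers are hypothetical and exactly the thing to experiment with.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.instances", "10")  // how many executors YARN should start
      .set("spark.executor.cores", "5")       // stay in the 3-5 cores sweet spot
      .set("spark.executor.memory", "20g")    // well under the ~64GB ceiling
      .set("spark.driver.memory", "4g")       // the driver needs resources too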


GC

- Use G1GC!
- Use the UI to spot time spent in GC and then turn GC logging on (or have it on anyway!)
- Too many major GCs? Increase spark.memory.fraction or executor memory
- Lots of minor GCs? Increase Eden space
- Try other approaches before digging into the GC weeds (a sketch of the JVM flags is below)
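
A minimal sketch of wiring G1GC and GC logging into the executors via extraJavaOptions; the logging flags are the Java 8 era ones, and the memory fraction value is just an example to experiment with.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
      // Worth nudging upward if major GCs dominate; 0.6 is the Spark 2.0 default
      .set("spark.memory.fraction", "0.7")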


Other Frameworks Are Available

- Apache Storm / Heron
- Apache Flink
- Apache Apex
- Apache Beam
- Apache Kafkaaaaaaaaa?


Apache Storm

- Low-latency at a level Spark Streaming can't (currently) touch
- Lower-level API (build the DAG yourself, peasant!)
- Deploying and HA story has not been wonderful in the past, but is getting much better!
- Mature and battle-tested at Twitter, Yahoo!, etc.
- 1.0.x series is very solid - lower memory use, much faster
- Has slightly undeserved reputation as 'old man' of stream processing


Heron

- Built by Twitter to work around issues with Storm
- Storm-compatible API
- Works with Apache Mesos (YARN support is coming)
- Looks very promising as a next-gen Storm
- (but Apache Storm has also solved a lot of the issues Twitter did, so shrug)


Apache Flink

- Higher-level API like Spark
- Based around streaming rather than Spark's 'batch' focus
- Getting traction in places like Uber, Netflix
- Definitely worth investigating


Apache Apex

- Dark Horse of the DAG processing engines
- Low-level API like Storm
- FAST.
- Comes with an amazing array of Lego bricks to assemble your pipelines (want to pipe data from FTP and Splunk into HBase? Easy with Malhar!)
- Documentation sometimes lacking
- Used by Capital One


Apache Beam

- One API to bring them all and in the darkness bind them.
- Initiative from Google - write your code using the Apache Beam API and then you can run that code on:
  - Google Cloud Dataflow
  - Apache Spark
  - Apache Flink
  - (more to come)


Apache Beam

- In theory, this is great! But…
- The favoured API is, obviously, Google Cloud Dataflow
- Last time I checked, the Apache Spark runner operated only in terms of RDDs, thus bypassing Catalyst/Datasets and all the performance boosts associated with them
- I'd recommend Beam if you're shopping around for a framework!


Kafkaaaaaaaaaaaa?

- Wait, wait, what?
- Kafka Connect as an alternative to Spark Streaming for ETL
- Kafka Streams for stream processing
- HA by using Kafka itself!
- Streams is very new
- Should consider Connect for ETL rather than a Kafka / Spark solution


Finally…

- Apache Spark is great!
- But can require stepping outside the box at scale
- Lots of tuning!
- Test things!
- Monitor things!
- Go do great things!


Zine

- Inspired by Julia Evans (@b0rk), I've made a zine!
- 30 copies!
- Includes bonus material!
- PDF: http://snappishproductions.com/files/sparklife.pdf