Spark Tips & Tricks

1© Cloudera, Inc. All rights reserved.

Tips + TricksBest Practices + Recommendations for Apache Spark

Jason Hubbard | Systems Engineer @ Cloudera


Quick Spark OverviewPromise!


Overview: Spark ComponentsSpark is a fast, general purpose cluster computing platform.Spark takes advantage of parallelism, by distributing processing across a cluster of nodes, in order to provide fast processing of data.

Each Spark application gets its own executor processes which stay up for the duration of the application. The executor runs tasks in multiple threads.

The driver program coordinates tasks and handles resource requests to the cluster manager. The driver distributes application code to each executor. Normally, each task takes 1 core.

Example: When Spark is run interactively via pyspark or spark-shell executors are assigned to the shell (driver) until the shell is exited. Each time the user invokes an action on the data the driver invokes that action on each executor as a task.


Spark on YarnYARN ( Yet Another Resource Negotiator) is a resource manager which can be used by Spark as the cluster manager (and is recommend for use with CDH).

Resource Manager Node Manager

Container

Executor

Node Manager

Container

Executor

Appl

icati

on

Mst

r.

Spark Driver• The Spark Driver submits initial

request to Resource Manager • Spark Application Master is launched • Spark Application Master coordinates

with Resource Manager and Node Managers to launch containers and Executors

• Spark Driver Coordinates execution of the application with the executors


What About pyspark?•Spark operates on JVM

•Objects serialized for performance•Python =/= Java•Spark uses Py4J to move data from JVM to python

•Extra serialization cost (Pickle)•Note: DataFrames create a query plan out of pyspark and execute in JVM

•But only if there are no UDFs•UDFs kick back to double serialization cost

•Apache Arrow aims to solve some of these issues


Resource ManagementHow do I figure out # of executors, cores, memory?


What are we doing here?•num-executors, executor-cores, executor-memory

•obviously pretty important•How do we configure these?

•Tip: use executor-cores to drive the rest. •i.e. X=total core, Y=executor-cores, then X/Y = num executors•i.e. Z=total mem, Z/(X/Y) = executor-memory

•Good rule of thumb – try ~5 executor-cores•Notes!

•Too few cores doesn’t take advantage of multiple tasks running in a executor (ex: sharing broadcast variables)•Too many tasks can create bad HDFS I/O (problems with concurrent threads)•You can’t give all your resources to Spark

•At the very least, YARN AM needs a container, OS needs memory/core, HDFS needs memory/core, YARN needs memory/core, offheap memory, other services..


Quick Aside on Memory Usage

• yarn.nodemanager.resource.memory-mb – what yarn is working with• spark.executor.memory – memory per executor process (heap memory)• Spark.yarn.executor.memoryOverhead – offheap memory

• Default is max (384 mb, 0.1 * spark.executor.memory)• Memory is pretty key – controls how much data you can process, group, join,

cache, shuffle, etc,


Unified Memory Management Spark

• Storage\Execution 1.6• Evicts Storage, not execution• Specify minimum unevictable amount (not

reservation)• Tasks 1.0

• Static vs dynamic (slots determined dynamically)

• Fair and starvation free, static simpler, dynamic better for stragglers

• Off by default in CDH (performance regressions)

• Toggle with spark.memory.useLegacyMode


Worked Example

16 Core

16 Core

16 Core

16 Core

64 Total Cores in Cluster512 GB RAM

C 1 Core for OS4 GB RAM

111C 12 Cores105 RAM for Executors

Core/RAM Allocation (per Host) GO AND TEST!Also, keep executor mem < 64 GB [GC delays]

1 Executor

4 Cores48 Cores 12 Executors with

4 Cores, 35 GB RAM Each

x

Allocate Resources Try differ executor/core ratios

1 Executor

5 Cores48 Cores 9 Executors with 5

Cores, 46 GB RAM Each

x

(Leaves cores un-utilized)

Determine the optimal resource allocation for the Spark job

128 GB

128 GB

128 GB

128 GB

Worker Nodes

C 1 Core for CM agent1 GB RAM

C 1 Core for NM1 GB RAM

C 1 Core for DN1 GB RAM

12 x 4 = 48 total cores105 x 4 = 420 GB RAM

1 Executor

6 Cores48 Cores 8 Executors with 6

Cores, 52 GB RAM Each

xC 12 GB RAM for overhead


This could be easier: Dynamic Resource Allocation Yarn will handle the sizing and distribution of the requested executors. CDH 5.5+

This configuration will allow the dynamic allocation of between 1 and 20 executors. Spark will initially attempt to run the job with 5 executors, this helps to speed up jobs which you know will require a certain number of executors ahead of time.

Configuration settings should be placed in the spark configuration file. However they can also be submitted with the job.

spark-submit --class com.cloudera.example.YarnExample \ --master yarn-cluster \ --conf "spark.dynamicAllocation.enabled=true" \ --conf "spark.dynamicAllocation.minExecutors=1" \ --conf "spark.dynamicAllocation.maxExecutors=20" \ --conf "spark.dynamicAllocation.initialExecutors=5" \ lib/yarn-example.jar \ 10

Dynamic Allocation of Resources in Yarn only handles allocation of Executors.The number of cores per executor is handled via Spark.conf they are not dynamically sized.

It is still important to understand the sizing limitations of your cluster in order to properly set the Min & Max executor settings as well as the executor- cores setting.Spark also lets you control the timeout of executors when they are not being used.


Warning: Shuffle Block Size•Spark imposes a 2 GB limit on shuffle block size•Anything larger will cause application errors•How do we fix this?•Create more partitions•Spark core: rdd.repartition, rdd.coalesce. •Spark SQL: spark.sql.shuffle.partitions•Default is 200!•Note: If partitions > 2000, Spark uses HighlyCompressedMapStatus•Rule of Thumb: 128 MB/partition•Avoid Data Skew


Data Skew

•Avoid This! •Skew slows down jobs/queries•May even causes errors/break Spark (partition > 2GB, etc)

Good Not Good


How to Avoid Skew• If things are running slowly, always inspect the data/partition

sizes• If you notice skew, try adding a salt to keys• With salt, do two stage operation, one on salted keys, then one on unsalted

results• There are more transformations, but we spread the work around better so job

should be faster• If there is a small number of of skewed keys, you can try isolated salting• Note: Less of a problem in SparkSQL• More efficient columnar cached representation• Able to push some operations down to the data store • Optimizer is able to look inside our operations (partition pruning, predicate

pushdowns, etc)


Finding Skew


DAGnabit

•ReduceByKey over GroupByKey•GroupByKey is unbounded and dependent on data

•Tree(Reduce\Aggregate) over Reduce\Aggregate•Push more work to workers and return less data to driver

•Less Spark•mapPartitions - reuse resources (JDBC)•Group or Reduce by then process in memory

•Submit multiple Jobs concurrently•Java concurrency


Serialization

•In general, Spark stores data in memory as deserialized java objects and on disk/through network as serialized binary•Serialize stuff before you send it to executors, or don’t send stuff that don’t need to be serialized

•This is bad:val myObj=<something>myRDD.filter(x => x == myObj.value)•This is good:val myObj=<something>val myVal=myObj.valuemyRDD.filter(x => x == myValue)

•Kyro is better than Java serialization. Register custom classes•Cut fat off of data objects. Only use what you need.


Labeling

•This seems trivial but getting in good habits can save a lot of time•Name your jobs

•sc.setJobGroup•Name your RDDs

•rdd.setName•Which is easier to understand?


Check Your Dependencies•Incorrectly built jars and version mismatch are common issues•Common errors:

•Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster•java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.MRJobConfig•java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream•ImportError: No module named qt_returns_summary_wrapper•NoSuchMethodException

•Double check and verify your artifacts•Use versions of components provided by your distro – versions will work together


Spark StreamingIt’s all about the microbatches


Spark Streaming•Incoming data represented as DStreams (Discretized Streams)

•Data commonly read from streaming data channels like Kafka or Flume

•A spark-streaming application is a DAG of Transformations and Actions on DStreams (and RDDs)


Discretized Stream•Incoming data stream is broken down into micro-batches

•Micro-batch size is user defined, usually 0.3 to 1 second •Micro-batches are disjoint

•Each micro-batch is an RDD •Effectively, a DStream is a sequence of RDDs, one per micro-batch•Spark Streaming known for high throughput


Windowed DStreams• Defined by specifying a window size and a step size

• Both are multiples of micro-batch size• Operations invoked on each window’s data


Fault Tolerance• Handle Driver Failures

• Process Control System• Submit with deploy-mode=cluster, runs in Application Master

• Driver failure causes Application Master Failure• Set yarn.resourcemanager.am.max-attempts higher (default 2), try 4

• Reset Failure Counts• spark.yarn.am.attemptFailuresValidityInterval (default none), try 1h• spark.yarn.max.executor.failures (default max(2 * num executors, 3)), try {8 *

num_executors} • spark.yarn.executor.failuresValidityInterval (default none), try 1h• spark.task.maxFailures (default 4), try 8


Graceful Shutdown• Design for failure, but may need to finish batches• yarn application -kill [applicationId] (may stop in the middle)• Shutdown hook too late• spark.streaming.stopGracefullyOnShutdown doesn’t work on Yarn• Marker file or http endpoint


Prevent Data Loss• Receiver

• Enable Checkpoint• Enable WAL• Upgrades won’t work, delete checkpoint dir!

• Direct• Checkpoint offsets (upgrades won’t work)• Save checkpoints manually in ZK, HDFS, Hbase, RDBMS, etc


Performance• Prevent starvation, create dedicated Pool

• --queue realtime_queue• Protect against stragglers, enable speculation

• --conf spark.speculation=true• Single receiver most execute same task and node as receiver

• Increase replication or lower spark.locality.wait (default 10 ms)• Batch time, Inverse function• mapWithState instead of updateStateByKey (size of batch instead of state)


Parallelism/Partitions• Repartition (may cause shuffle)• Batch Interval & Block Interval determine # tasks (may cause shuffle)

• Lower block interval to increase tasks (min 50 ms)• Batch Interval / Block Interval should = # executors

• Multiple receivers w/ union (avoids shuffle)• Kafka direct, increase kafka partitions• For receivers, don’t forger receiver consumes a long running task


Security• Submit principal and keytab for secured cluster via submit

• --principal user/hostname@domain --keytab keytabfile• Disable HDFS Cache with HA Namenode (HDFS-9276, SPARK-11182)

• --conf spark.hadoop.fs.hdfs.impl.disable.cache=true

https://issues.apache.org/jira/browse/HDFS-9276

https://issues.apache.org/jira/browse/SPARK-11182


Backpressure• Find optimal records for processing time• May hide lag, monitor• Smooth startup

• Spark 2 • spark.streaming.backpressure.initialRate

• Spark 1 • Receiver: spark.streaming.backpressure.initialRate• Direct: spark.streaming.kafka.maxRatePerPartition


Logging• Enable YARN rolling aggregator

• yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds

• Configure Spark rolling strategy- or -• Custom log4j appender

• --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties• --conf spark.executor.extraJavaOptions=-

Dlog4j.configuration=file:log4j.properties• --files /path/to/log4j.properties


Cloud


S3• Treats S3 as a filesystem, S3A is preferred over S3N and S3• Eventually consistent listing

• Consider writing to HDFS first then copy to S3• DirectParquetOutputCommitter removed from Spark 2• If writing directly to S3, use version 2 commit algorithm and turn off speculation

• spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2• spark.speculation=false


Parquet on S3• When reading Parquet enable random I/O

• fs.s3a.experimental.input.fadvise=random (Filesystem Level when created)• spark.hadoop.parquet.enable.summary-metadata=false• spark.sql.parquet.mergeSchema=false• spark.sql.parquet.filterPushdown=true• spark.sql.hive.metastorePartitionPruning=true


S3 Performance• fs.s3a.block.size• Tune fs.s3a.multipart.threshold fs.s3a.multipart.size • spark.hadoop.fs.s3a.readahead.range 67108864• Expiremental fs.s3a.fast.upload (buffer in memory) fs.s3a.fast.buffer.size• fs.s3a.threads.max active uploads fs.s3a.max.total.tasks queued


YARN• yarn.scheduler.fair.locality.threshold.node = -1• yarn.scheduler.fair.locality.threshold.rack = -1• spark.locality.wait.rack=0


[email protected] You

Technology

Spark Tips & Tricks