
IBM Spark Meetup - RDD & Spark Basics


Page 1: IBM Spark Meetup - RDD & Spark Basics

© 2015 IBM Corporation

RDD Deep Dive
• RDD Basics
  • How to create
  • RDD Operations
• Lineage
• Partitions
• Shuffle
• Type of RDDs
• Extending RDD
• Caching in RDD

Page 2: IBM Spark Meetup - RDD & Spark Basics


RDD Basics
• RDD (Resilient Distributed Dataset)
  • Distributed collection of objects
  • Resilient – ability to re-compute missing partitions (node failure)
  • Distributed – split across multiple partitions
  • Dataset – can contain any type: Python/Java/Scala objects or user-defined objects
• Fundamental unit of data in Spark

Page 3: IBM Spark Meetup - RDD & Spark Basics


RDD Basics – How to create
Two ways:
• Loading external datasets
  • Spark supports a wide range of sources
  • Access HDFS data through Hadoop's InputFormat & OutputFormat; custom input/output formats are also supported
  • SparkContext.wholeTextFiles returns (filename, content) pairs (sketch below)

  val lineRDD = sc.textFile("hdfs:///path/to/Readme.md")
  // textFile("/my/directory/*") or textFile("/my/directory/*.gz") also work

• Parallelizing a collection in the driver program

  val listRDD = sc.parallelize(List("spark", "meetup", "deepdive"))
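A minimal sketch of the wholeTextFiles form mentioned above (the path is illustrative):

  val filesRDD = sc.wholeTextFiles("hdfs:///my/directory")   // RDD[(String, String)] of (filename, content)
  filesRDD.keys.collect()                                     // list the file names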

Page 4: IBM Spark Meetup - RDD & Spark Basics


RDD Operations
Two types of operations:
• Transformation
• Action

Transformations are lazy – nothing actually happens until an action is called.
An action triggers the computation and returns values to the driver or writes data to external storage (see the sketch below).
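A minimal sketch of the transformation/action split (the numbers are illustrative):

  val nums = sc.parallelize(1 to 10)     // no computation yet
  val doubled = nums.map(_ * 2)          // transformation – still lazy
  val total = doubled.reduce(_ + _)      // action – triggers the job and returns 110 to the driver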

Page 5: IBM Spark Meetup - RDD & Spark Basics


Lazy Evaluation
• Transformations on an RDD don't get performed immediately
• Spark internally records metadata to track the operations
• Loading data into an RDD is also lazily evaluated
• Lazy evaluation reduces the number of passes over the data by grouping operations
• In MapReduce the burden is on the developer to merge operations, leading to complex map functions
• Failing to persist an RDD means its complete lineage is re-computed every time it is reused

Page 6: IBM Spark Meetup - RDD & Spark Basics


RDD In Action

sc.textFile("hdfs://file.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()

Input lines:
  "I scream you scream lets all scream for icecream!"
  "I wish I were what I was when I wished I were what I am."

After flatMap (first line): I, scream, you, scream, lets, all, scream, for, icecream
After map: (I,1) (scream,1) (you,1) (scream,1) (lets,1) (all,1) (scream,1) (icecream,1)
After reduceByKey: (scream,3) (I,1) (you,1) (lets,1) (all,1) (icecream,1)

Page 7: IBM Spark Meetup - RDD & Spark Basics


Lineage Demo
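A minimal sketch of inspecting lineage (the path is illustrative):

  val counts = sc.textFile("hdfs://file.txt")
    .flatMap(_.split(" "))
    .map((_, 1))
    .reduceByKey(_ + _)

  println(counts.toDebugString)   // prints the lineage, e.g. ShuffledRDD <- MapPartitionsRDD <- ... <- HadoopRDD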

Page 8: IBM Spark Meetup - RDD & Spark Basics


RDD Partition
Partition definition:
• Fragments of an RDD
• Fragmentation allows Spark to execute in parallel
• Partitions are distributed across the cluster (Spark workers)

Partitioning:
• Impacts parallelism
• Impacts performance

Page 9: IBM Spark Meetup - RDD & Spark Basics


Importance of Partition Tuning
Too few partitions:
• Less concurrency, unused cores
• More susceptible to data skew
• Increased memory pressure for groupBy, reduceByKey, sortByKey, etc.

Too many partitions:
• Framework overhead (more scheduling latency than the time needed for the actual task)
• Many CPU context switches

Need a "reasonable number" of partitions (a sketch of checking and adjusting the count follows):
• Commonly between 100 and 10,000 partitions
• Lower bound: at least ~2x the number of cores in the cluster
• Upper bound: ensure tasks take at least 100ms
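A minimal sketch of checking and adjusting the partition count (the path and the target of 200 are illustrative):

  val rdd = sc.textFile("hdfs://big/dataset")
  println(rdd.partitions.size)        // partitions chosen from the input splits
  val tuned = rdd.repartition(200)    // e.g. roughly 2-3x the total cores in the cluster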

Page 10: IBM Spark Meetup - RDD & Spark Basics


How Spark Partitions Data
• Input data partitioning
• Shuffle transformations
• Custom Partitioner

Page 11: IBM Spark Meetup - RDD & Spark Basics


Partition - Input Data
• Spark uses the same classes as Hadoop to perform input/output
• sc.textFile("hdfs://…") invokes Hadoop's TextInputFormat
• Knobs that define the number of partitions:
  • dfs.block.size – default 128MB (Hadoop 2.0)
  • numPartitions – can be used to increase the number of partitions; default is 0, which means 1 partition
  • mapreduce.input.fileinputformat.split.minsize – default 1KB

Partition size = max(minSize, min(goalSize, blockSize)), where goalSize = totalInputSize / numPartitions

Worked examples as (blockSize, numPartitions, minSize) with 640MB total input (a sketch of this rule follows):
• 32MB, 0, 1KB – defaults: max(1KB, min(640MB, 32MB)) = 32MB → 20 partitions
• 32MB, 30, 1KB – want more partitions: max(1KB, min(~21MB, 32MB)) ≈ 21MB → ~30 partitions
• 32MB, 5, 1KB – asking for fewer, bigger partitions: max(1KB, min(128MB, 32MB)) = 32MB → still 20 partitions (capped by block size)
• 32MB, 0, 64MB – bigger partitions: max(64MB, min(640MB, 32MB)) = 64MB → 10 partitions
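A minimal Scala sketch of the split-size rule above (an illustrative helper, not a Spark API):

  // Mirrors the split sizing formula on this slide – illustrative only
  def splitSize(totalSize: Long, numPartitions: Int, minSize: Long, blockSize: Long): Long = {
    val goalSize = totalSize / math.max(numPartitions, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  val mb = 1024L * 1024L
  splitSize(640 * mb, 0, 1024, 32 * mb) / mb   // 32 (MB) -> 640MB / 32MB = 20 partitions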

Page 12: IBM Spark Meetup - RDD & Spark Basics


Partition - Shuffle Transformations
• All shuffle transformations provide a parameter for the desired number of partitions (see the sketch after this list)
• Default behavior – Spark uses HashPartitioner
  • If spark.default.parallelism is set, it is taken as the number of partitions
  • If spark.default.parallelism is not set, the largest upstream RDD's number of partitions is used – this reduces the chance of out-of-memory errors

Shuffle transformations:
1. groupByKey
2. reduceByKey
3. aggregateByKey
4. sortByKey
5. join
6. cogroup
7. cartesian
8. coalesce
9. repartition
10. repartitionAndSortWithinPartitions
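A minimal sketch of passing an explicit partition count to a shuffle transformation (the count of 64 is illustrative):

  val pairs = sc.textFile("hdfs://file.txt").flatMap(_.split(" ")).map((_, 1))
  val counts = pairs.reduceByKey(_ + _, 64)   // 64 output partitions instead of the default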

Page 13: IBM Spark Meetup - RDD & Spark Basics


Partition - Repartitioning
RDD provides two operators (a short sketch follows):
• repartition(numPartitions)
  • Can increase/decrease the number of partitions
  • Internally does a shuffle – expensive
  • For decreasing partitions, use coalesce
• coalesce(numPartitions, shuffle: [true/false])
  • Decreases partitions
  • Uses narrow dependencies and avoids a shuffle
  • In case of a drastic reduction it may trigger a shuffle
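A minimal sketch of the two operators (partition counts are illustrative):

  val rdd = sc.textFile("hdfs://big/dataset")          // say ~1000 input partitions
  val wider = rdd.repartition(2000)                    // full shuffle
  val fewer = rdd.coalesce(100)                        // narrow dependency, no shuffle
  val drastic = rdd.coalesce(10, shuffle = true)       // drastic reduction – shuffle keeps upstream parallelism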

Page 14: IBM Spark Meetup - RDD & Spark Basics


Custom Partitioner
• Partition the data according to the use case & data structure
• Custom partitioning allows control over the number of partitions and the distribution of data
• Extend the Partitioner class; need to implement getPartition & numPartitions (a sketch follows)
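A minimal sketch of a custom Partitioner (the routing logic is illustrative):

  import org.apache.spark.Partitioner

  // Illustrative: route keys by their first character so related keys land together
  class FirstLetterPartitioner(parts: Int) extends Partitioner {
    override def numPartitions: Int = parts
    override def getPartition(key: Any): Int =
      math.abs(key.toString.headOption.getOrElse(' ').toInt) % parts
  }

  // usage on a hypothetical pair RDD:
  // pairs.partitionBy(new FirstLetterPartitioner(8))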

Page 15: IBM Spark Meetup - RDD & Spark Basics


Partitioning Demo

Page 16: IBM Spark Meetup - RDD & Spark Basics


Shuffle - GroupByKey Vs ReduceByKey

// assumes rdd is a pair RDD of (word, 1) pairs (wordPairsRDD on the next slide)
val wordCountsWithGroup = rdd
  .groupByKey()
  .map(t => (t._1, t._2.sum))
  .collect()

Page 17: IBM Spark Meetup - RDD & Spark Basics


Shuffle - GroupByKey Vs ReduceByKey

val wordPairsRDD = rdd.map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD
  .reduceByKey(_ + _)
  .collect()

Page 18: IBM Spark Meetup - RDD & Spark Basics


The Shuffle
• Redistribution of data among partitions between stages
• Most of the performance, reliability and scalability issues in Spark occur within the shuffle
• Like MapReduce, the Spark shuffle uses a pull model
• Has consistently evolved and is still an area of active work in Spark

Page 19: IBM Spark Meetup - RDD & Spark Basics


Shuffle Overview
• Spark runs a job stage by stage
• Stages are built up by the DAGScheduler according to the RDD's ShuffleDependency
  • e.g. ShuffledRDD / CoGroupedRDD will have a ShuffleDependency
• Many operators create ShuffledRDD / CoGroupedRDD under the hood
  • repartition / combineByKey / groupBy / reduceByKey / cogroup
• Many other operators further call into the above operators
  • e.g. the various join operators call cogroup
• Each ShuffleDependency maps to one stage in a Spark job and will lead to a shuffle

Page 20: IBM Spark Meetup - RDD & Spark Basics


You have seen this: the familiar Spark DAG diagram, where RDDs A–G are connected by map, union, groupBy and join operations and the scheduler cuts the graph into Stage 1, Stage 2 and Stage 3 at the shuffle boundaries.

Page 21: IBM Spark Meetup - RDD & Spark Basics


Shuffle is Expensive
• When doing a shuffle, data no longer stays in memory only – it gets written to disk
• For Spark, the shuffle process might involve:
  • Data partitioning: which might involve very expensive data sorting etc.
  • Data ser/deser: to enable data to be transferred through the network or across processes
  • Data compression: to reduce IO bandwidth etc.
  • Disk IO: probably multiple times on a single data block
    • e.g. shuffle spill, merge combine

Page 22: IBM Spark Meetup - RDD & Spark Basics


Shuffle History
The shuffle module in Spark has evolved over time:
• Spark 0.6-0.7 – same code path as RDD's persist method; MEMORY_ONLY and DISK_ONLY options available
• Spark 0.8-0.9 – separate code for shuffle: ShuffleBlockManager & BlockObjectWriter for shuffle only; shuffle optimization – consolidated shuffle write
• Spark 1.0 – introduced a pluggable shuffle framework
• Spark 1.1 – sort-based shuffle implementation
• Spark 1.2 – Netty transfer implementation; sort-based shuffle is the default now
• Spark 1.2+ – external shuffle service etc.

Page 23: IBM Spark Meetup - RDD & Spark Basics


Understanding Shuffle
• Input aggregation
• Types of shuffle
  • Hash based
    • Basic hash shuffle
    • Consolidated hash shuffle
  • Sort based shuffle

Page 24: IBM Spark Meetup - RDD & Spark Basics


Input Aggregation
• Like MapReduce, Spark aggregates (combines) on the map side. Aggregation is done in ShuffleMapTask using:
  • AppendOnlyMap (in-memory hash table combiner)
    • Keys are never removed, values get updated
  • ExternalAppendOnlyMap (in-memory and on-disk hash table combiner)
    • A hash map that can spill to disk
    • An append-only map that spills data to disk if memory is insufficient
• Shuffle file in-memory buffer – the shuffle writes to an in-memory buffer before writing to a shuffle file

Page 25: IBM Spark Meetup - RDD & Spark Basics


Shuffle Types – Basic Hash Shuffle
• Hash-based shuffle (spark.shuffle.manager) hash-partitions the data for the reducers
• Each map task writes each bucket to a file
• #Map tasks = M, #Reduce tasks = R
• #Shuffle files = M*R, #In-memory buffers = M*R

Page 26: IBM Spark Meetup - RDD & Spark Basics


Shuffle Types – Basic Hash Shuffle Problem
• Let's use 100KB as the buffer size
• We have 10,000 reducers and 10 mapper tasks per executor
• Per-executor in-memory buffer need = 100KB * 10,000 * 10 = 10GB per executor
• This huge amount of buffer is not acceptable, and this implementation can't support 10,000 reducers

Page 27: IBM Spark Meetup - RDD & Spark Basics


Shuffle Types – Consolidated Hash Shuffle
• Solution to decrease the in-memory buffer size and the number of files
• Within an executor, map tasks write each bucket to a segment of a shared file
• #Shuffle files per executor = #Reducers, #In-memory buffers per executor = R (#Reducers)

Page 28: IBM Spark Meetup - RDD & Spark Basics


Shuffle Types – Sort Based Shuffle
• Consolidated hash shuffle still needs one file for each reducer
  • Total of C*R intermediate files, where C = number of executors running map tasks
  • Still too many files (e.g. ~10k reducers), significant memory needed for compression & serialization buffers, too-many-open-files issues
• Sort-based shuffle is similar to the map-side shuffle from MapReduce
• Introduced in Spark 1.1, now the default shuffle

Page 29: IBM Spark Meetup - RDD & Spark Basics


Shuffle Types – Sort Based Shuffle
• Map output records from each task are kept in memory as long as they fit; once full, the data is sorted by partition and spilled to a single file
• Each map task generates one data file and one index file
• Utilizes an external sorter to do the sort work
• If a map-side combiner is required, data is sorted by key and partition; otherwise only by partition
• If #reducers <= 200, no sorting: uses the hash approach, generates a file per reducer and merges them into a single file

Page 30: IBM Spark Meetup - RDD & Spark Basics


Shuffle Reader
• On the reader side, both sort & hash shuffle use the hash shuffle reader
• On the reducer side a set of threads fetch remote map output blocks; once a block arrives, its records are de-serialized and passed into a result queue
• Records are passed to ExternalAppendOnlyMap; for ordering operations like sortByKey, records are passed to ExternalSorter

(Diagram: map-side buckets feed the shuffle; each reduce task fetches its buckets and runs an aggregator over them.)

Page 31: IBM Spark Meetup - RDD & Spark Basics


Type of RDDs - RDD Interface
Base for all RDDs (RDD.scala), consists of:
• A set of partitions ("splits" in Hadoop)
• A list of dependencies on parent RDDs
• A function to compute a partition from its parents
• Optional preferred locations for each partition
• A Partitioner that defines the partitioning strategy (hash/range)
• Basic operations like map, filter, persist etc.

(Diagram: partitions, dependencies, compute, preferredLocations and partitioner describe the lineage and enable optimized execution; map, filter, persist are the operations built on top.)

Page 32: IBM Spark Meetup - RDD & Spark Basics


Example: HadoopRDD
• partitions = one per HDFS block
• dependencies = none
• compute(partition) = read the corresponding block
• preferredLocations(part) = HDFS block locations
• partitioner = none

Page 33: IBM Spark Meetup - RDD & Spark Basics


Example: MapPartitionsRDD
• partitions = same as the parent's partitions
• dependencies = "one-to-one" on the parent RDD
• compute(partition) = apply map on the parent's partition
• preferredLocations(part) = none (ask parent)
• partitioner = none

Page 34: IBM Spark Meetup - RDD & Spark Basics


Example: CoGroupedRDD
• partitions = one per reduce task
• dependencies = could be narrow or wide
• compute(partition) = read and join the shuffled data
• preferredLocations(part) = none
• partitioner = HashPartitioner(numTasks)

Page 35: IBM Spark Meetup - RDD & Spark Basics


Extending RDDs
Extend RDDs to:
• Add domain-specific transformations/actions
  • Allows developers to express domain-specific calculations in a cleaner way
  • Improves code readability
  • Easy to maintain
• Create a domain-specific RDD
  • Better way to express domain-specific data
  • Better control over partitioning and distribution
  • A way to add a new input data source

Page 36: IBM Spark Meetup - RDD & Spark Basics


How to Extend
• Add custom operators to RDD
  • Use Scala implicits
  • Feels and works like a built-in operator
  • You can add an operator to a specific RDD type or to all RDDs
• Custom RDD
  • Extend the RDD API to create our own RDD
  • Implement the compute & getPartitions abstract methods

Page 37: IBM Spark Meetup - RDD & Spark Basics


Implicit Class
• Creates extension methods on an existing type
• Introduced in Scala 2.10
• Implicits are compile-time checked
• An implicit class gets resolved into a class definition with an implicit conversion
• We will use an implicit class to add a new method to RDD

Page 38: IBM Spark Meetup - RDD & Spark Basics


Adding a New Operator to RDD
• We will use the Scala implicit feature to add a new operator to an existing RDD (a sketch follows below)
• This operator will show up only on our RDD type
• Implicit conversions are handled by Scala
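A minimal sketch of an implicit class adding an operator to RDD[String] (the operator name and logic are illustrative):

  import org.apache.spark.rdd.RDD

  object RDDExtensions {
    // Illustrative extension: adds a wordCount operator to any RDD[String]
    implicit class WordOps(val rdd: RDD[String]) extends AnyVal {
      def wordCount(): RDD[(String, Int)] =
        rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    }
  }

  // usage:
  // import RDDExtensions._
  // sc.textFile("hdfs://file.txt").wordCount().collect()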

Page 39: IBM Spark Meetup - RDD & Spark Basics


Custom RDD Implementation
• Extending RDD allows you to create your own custom RDD structure (see the sketch below)
• A custom RDD allows control over computation and lets you change partitioning & locality information
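A minimal sketch of a custom RDD over an in-memory sequence (the class and partition names are illustrative, not from the deck):

  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // Illustrative partition: remembers which slice of the data it owns
  class SlicePartition(val index: Int, val values: Seq[String]) extends Partition

  class SeqRDD(sc: SparkContext, data: Seq[String], numSlices: Int)
    extends RDD[String](sc, Nil) {                      // Nil = no parent dependencies

    override protected def getPartitions: Array[Partition] = {
      val chunk = math.max(1, (data.size + numSlices - 1) / numSlices)
      data.grouped(chunk).zipWithIndex
        .map { case (slice, i) => new SlicePartition(i, slice): Partition }
        .toArray
    }

    override def compute(split: Partition, context: TaskContext): Iterator[String] =
      split.asInstanceOf[SlicePartition].values.iterator
  }

  // usage: new SeqRDD(sc, Seq("spark", "meetup", "deepdive"), 2).collect()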

Page 40: IBM Spark Meetup - RDD & Spark Basics


Caching in RDD
• Spark allows caching/persisting an entire dataset in memory
• Persisting an RDD in cache (a sketch follows)
  • The first time it is computed it will be kept in memory
  • The cached partitions are reused in the next set of operations
  • Fault-tolerant – recomputed in case of failure
• Caching is a key tool for interactive and iterative algorithms
• Persist supports different storage levels
  • Storage level – in memory, disk or both, Tachyon
  • Serialized vs deserialized
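A minimal sketch of caching and persisting (the path and storage level choice are illustrative):

  import org.apache.spark.storage.StorageLevel

  val logs = sc.textFile("hdfs://logs/*.gz")
  val errors = logs.filter(_.contains("ERROR")).cache()     // MEMORY_ONLY by default
  // or: logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_AND_DISK_SER)

  errors.count()     // first action computes and caches the partitions
  errors.take(10)    // reuses the cached partitions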

Page 41: IBM Spark Meetup - RDD & Spark Basics


Caching in RDD
• SparkContext tracks persistent RDDs
• The Block Manager puts partitions in memory when first evaluated
• Caching is lazy – no caching happens without an action
• The shuffle also keeps its data around after shuffle operations, but we still need to cache shuffled RDDs explicitly

Page 42: IBM Spark Meetup - RDD & Spark Basics


Caching Demo