A Comparative Performance Evaluation of Flink
Dongwon Kim
POSTECH
About Me
• Postdoctoral researcher @ POSTECH
• Research interests
  • Design and implementation of distributed systems
  • Performance optimization of big data processing engines
• Doctoral thesis
  • MR2: Fault Tolerant MapReduce with the Push Model
• Personal blog
  • http://eastcirclek.blogspot.kr
• Why I’m here
Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
• What else for better performance?
• Conclusion
TeraSort
• Hadoop MapReduce program for the annual terabyte sort competition
• TeraSort is essentially distributed sort (DS)
[Figure: typical DS phases — read → local sort → shuffling → local sort → write. Records a1…a4 and b1…b4 are read from disk on Node 1 and Node 2, locally sorted, shuffled so that all a-records land on Node 1 and all b-records on Node 2, locally sorted again, and written to disk, yielding a total order.]
TeraSort for MapReduce
• Included in Hadoop distributions
  • with TeraGen & TeraValidate
• Identity map & reduce functions
• Range partitioner built on sampling
  • To guarantee a total order & to prevent partition skew
  • Sampling computes the boundary points within a few seconds (see the sketch below)
[Figure: the record range is split into Partition 1, Partition 2, …, Partition r at the sampled boundary points. A map task runs read → map → sort and a reduce task runs shuffling → sort → reduce → write, lining up with the DS phases read → local sort → shuffling → local sort → write.]
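Illustration only — a minimal, framework-agnostic sketch of the sampling-based range partitioning idea, in Scala with String keys for brevity (the class and method names are made up; the real TeraSort compares raw 10-byte keys):

```scala
// Hypothetical sketch of a sampling-based range partitioner.
class SampledRangePartitioner(sample: Seq[String], numPartitions: Int)
    extends Serializable {

  // r-1 boundary points taken at regular positions in the sorted sample
  private val boundaries: Array[String] = {
    val sorted = sample.sorted.toArray
    Array.tabulate(numPartitions - 1) { i =>
      sorted((i + 1) * sorted.length / numPartitions)
    }
  }

  // Partition i holds keys in [boundaries(i-1), boundaries(i)), so
  // concatenating partitions 0..r-1 yields a total order, and a sample
  // that reflects the key distribution prevents partition skew.
  def partitionOf(key: String): Int = {
    var i = 0
    while (i < boundaries.length && key >= boundaries(i)) i += 1
    i
  }
}
```

Because only a small sample is sorted, computing the boundary points takes just a few seconds before the job starts.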
TeraSort for Tez
• Tez can execute TeraSort for MapReduce w/o any modification
  • mapreduce.framework.name = yarn-tez
• Tez DAG plan of TeraSort for MapReduce
[Figure: a DAG with an initialmap vertex (map task: read → map → sort) feeding a finalreduce vertex (reduce task: shuffling → sort → reduce → write), reading input data and producing output data.]
TeraSort for Spark & Flink
• My source code in GitHub:
  • https://github.com/eastcirclek/terasort
• Sampling-based range partitioner from TeraSort for MapReduce
  • Visit my personal blog for a detailed explanation: http://eastcirclek.blogspot.kr
TeraSort for Spark
• Code: two RDDs (a minimal sketch follows below)
[Figure: Stage 0 runs a Shuffle-Map Task (for newAPIHadoopFile): read → sort; Stage 1 runs a Result Task (for repartitionAndSortWithinPartitions): shuffling → sort → write; together they line up with the DS phases read → local sort → shuffling → local sort → write. RDD1 is a new RDD that reads from HDFS with # partitions = # blocks; RDD2 repartitions the parent RDD based on the user-specified partitioner and writes output to HDFS.]
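Below is a minimal sketch of this two-RDD structure, assuming Hadoop's TeraInputFormat/TeraOutputFormat are on the classpath; the driver-side sampling and the TeraPartitioner class are simplified stand-ins for the actual code in the GitHub repository:

```scala
import org.apache.hadoop.examples.terasort.{TeraInputFormat, TeraOutputFormat}
import org.apache.hadoop.io.Text
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical range partitioner over sampled boundary points
class TeraPartitioner(boundaries: Array[String]) extends Partitioner {
  def numPartitions: Int = boundaries.length + 1
  def getPartition(key: Any): Int = {
    val k = key.toString
    var i = 0
    while (i < boundaries.length && k >= boundaries(i)) i += 1
    i
  }
}

object SparkTeraSort {
  def main(args: Array[String]): Unit = {
    val Array(input, output, partitions) = args
    val sc = new SparkContext(new SparkConf().setAppName("TeraSort"))

    // RDD1: one partition per HDFS block
    val rdd1 = sc.newAPIHadoopFile[Text, Text, TeraInputFormat](input)

    // Boundary points from a small driver-side sample (simplified)
    val r = partitions.toInt
    val sorted = rdd1.map(_._1.toString).takeSample(false, 100000).sorted
    val boundaries = Array.tabulate(r - 1)(i => sorted((i + 1) * sorted.length / r))

    // Ordering for Hadoop Text keys, required by the sort
    implicit val ord: Ordering[Text] = new Ordering[Text] {
      def compare(a: Text, b: Text): Int = a.compareTo(b)
    }

    // RDD2: shuffle with the range partitioner, then sort each partition
    rdd1.repartitionAndSortWithinPartitions(new TeraPartitioner(boundaries))
        .saveAsNewAPIHadoopFile[TeraOutputFormat](output)
  }
}
```

The repartitionAndSortWithinPartitions call is what introduces the stage boundary: Stage 0 writes shuffle files, and Stage 1 fetches, sorts, and writes to HDFS.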
TeraSort for Flink
• Code: a pipeline consisting of four operators (a minimal sketch follows below)
[Figure: DataSource → Partition → SortPartition → DataSink, lining up with the DS phases read → shuffling → local sort → write. DataSource creates a dataset to read tuples from HDFS; Partition partitions the tuples; SortPartition sorts the tuples of each partition; DataSink writes output to HDFS. There is no map-side sorting due to pipelined execution.]
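Below is a minimal sketch of the four-operator pipeline using the Flink 0.9 DataSet API; reading each line as a (10-byte key, value) pair of Strings and the inlined boundary points are simplifying assumptions, not the actual repository code:

```scala
import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala._

object FlinkTeraSort {
  def main(args: Array[String]): Unit = {
    val Array(input, output) = args
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Hypothetical boundary points; the real job samples them
    // as in TeraSort for MapReduce
    val boundaries = Array("3", "7", "b")
    val rangePartitioner = new Partitioner[String] {
      override def partition(key: String, numPartitions: Int): Int = {
        var i = 0
        while (i < boundaries.length && key >= boundaries(i)) i += 1
        i
      }
    }

    env.readTextFile(input)                         // 1 DataSource (read)
      .map(line => (line.take(10), line.drop(10)))  //   split into key/value
      .partitionCustom(rangePartitioner, 0)         // 2 Partition (pipelined shuffle)
      .sortPartition(0, Order.ASCENDING)            // 3 SortPartition (local sort)
      .map(kv => kv._1 + kv._2)                     //   back to one output line
      .writeAsText(output)                          // 4 DataSink (write)
    env.execute("TeraSort")
  }
}
```

Because the shuffle is pipelined, records flow from DataSource through Partition into SortPartition as they are produced; the only blocking step is the per-partition sort before the sink, which is exactly why there is no map-side sorting.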
Importance of TeraSort
• Suitable for measuring the pure performance of big data engines
  • No data transformation (like map, filter) with user-defined logic
  • Basic facilities of each engine are used
• “Winning the sort benchmark” is a great means of PR
Outline
• TeraSort for various engines
• Experimental setup
• Machine specification
• Node configuration
• Results & analysis
• What else for better performance?
• Conclusion
Machine specification (42 identical machines)
DELL PowerEdge R610
• CPU: two X5650 processors (total 12 cores)
• Memory: total 24 GB
• Disk: 6 disks * 500 GB/disk
• Network: 10 Gigabit Ethernet

            My machine                     Spark team
Processor   Intel Xeon X5650 (Q1, 2010)    Intel Xeon E5-2670 (Q1, 2012)
Cores       6 * 2 processors               8 * 4 processors
Memory      24 GB                          244 GB
Disks       6 HDDs                         8 SSDs

Results can be different on newer machines.
Node configuration (24 GB on each node)
Total 2 GB for daemons; at most 12 simultaneous tasks per node.
• MapReduce-2.7.1: NodeManager (1 GB) with ShuffleService, DataNode (1 GB), and up to 12 concurrent MapTasks/ReduceTasks (1 GB each) — 12 GB for tasks
• Tez-0.7.0: NodeManager (1 GB) with ShuffleService, DataNode (1 GB), and concurrent MapTasks/ReduceTasks (1 GB each) — 13 GB for tasks
• Spark-1.5.1: NodeManager (1 GB), DataNode (1 GB), and a single Executor (12 GB) containing the internal memory layout, various managers, and a thread pool with 12 task slots; plus a Driver (1 GB)
• Flink-0.9.1: NodeManager (1 GB), DataNode (1 GB), and a single TaskManager (12 GB) containing the internal memory layout, various managers, and 12 task slots running task threads; plus a JobManager (1 GB)
Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
• Flink is faster than other engines due to its pipelined execution
• What else for better performance?
• Conclusion
How to read a swimlane graph & throughput graphs
[Swimlane graph: one line per task, showing each task’s duration against time since job start (seconds); different patterns mark different stages. In the example, 6 waves (1st–6th) of 1st-stage tasks are followed by 1 wave of 2nd-stage tasks, and the two stages hardly overlap.]
[Throughput graphs: cluster network throughput (in/out) and cluster disk throughput (disk read/disk write) over the same time axis; note there is no network traffic during the 1st stage.]
Result of sorting 80 GB/node (3.2 TB)

Engine                      Time (seconds)
MapReduce in Hadoop-2.7.1   2157
Tez-0.7.0                   1887 *
Spark-1.5.1                 2171 *
Flink-0.9.1                 1480

* Map output compression turned on for Spark and Tez

• Flink is the fastest due to its pipelined execution (1 DataSource → 2 Partition → 3 SortPartition → 4 DataSink)
• Tez and Spark do not overlap 1st and 2nd stages
• MapReduce is slow despite overlapping stages
[Swimlane graphs: 1st- and 2nd-stage tasks of each engine over time, annotated with the total times above]
Tez and Spark do not overlap 1st and 2nd stages
[Throughput graphs for Tez and Spark: cluster network throughput (in/out) and cluster disk throughput (disk read/disk write). The disks are idle between the stages; (1) the 2nd stage starts, (2) the output of the 1st stage is sent over the network, and (3) the disk write to HDFS occurs only after shuffling is done.]
[Throughput graphs for Flink (1 DataSource → 2 Partition → 3 SortPartition → 4 DataSink): (1) network traffic occurs from the start, and (2) the write to HDFS occurs right after shuffling is done.]
Tez does not overlap 1st and 2nd stages
• Tez has parameters to control the degree of overlap
  • tez.shuffle-vertex-manager.min-src-fraction : 0.2
  • tez.shuffle-vertex-manager.max-src-fraction : 0.4
• However, the 2nd stage is scheduled early but launched late
[Swimlane graph: 2nd-stage tasks are scheduled early but launched late]
Spark does not overlap 1st and 2nd stages
• Spark cannot execute multiple stages simultaneously
  • also mentioned in the following VLDB paper (2015):
“Spark doesn’t support the overlap between shuffle write and read stages. … Spark may want to support this overlap in the future to improve performance.”
Experimental results of that paper:
• Spark is faster than MapReduce for WordCount, K-means, and PageRank.
• MapReduce is faster than Spark for Sort.
MapReduce is slow despite overlapping stages
• mapreduce.job.reduce.slowstart.completedMaps : [0.0, 1.0]
  • 0.05 (overlapping, default): 2157 sec
  • 0.95 (no overlapping): 2385 sec
  • overlapping brings only a 10% improvement
• Wang likewise proposes to overlap Spark stages to achieve better utilization
• So why do Spark & MapReduce improve by just 10%?
Data transfer between tasks of different stages
• Traditional pull model (used in MapReduce, Spark, Tez)
  • The producer task (1) writes its output file (partitions P1, P2, …, Pn) to disk; consumer tasks 1…n (2) request their partitions from a shuffle server, which (3) sends them.
  • Extra disk access & simultaneous disk access
  • Shuffling affects the performance of producers — which is why overlapping leads to only a 10% improvement
• Pipelined data transfer (used in Flink)
  • Data are transferred from memory to memory
  • Flink causes fewer disk accesses during shuffling
Flink causes fewer disk accesses during shuffling

                        MapReduce   Flink   diff.
Total disk write (TB)      9.9       6.5     3.4
Total disk read (TB)       8.1       6.9     1.2

• The difference comes from shuffling
• Shuffled data are sometimes read from the page cache
[Cluster disk throughput graphs (disk read/disk write) for MapReduce and Flink: the total amount of disk read/write equals the area of the blue/green region]
Result of TeraSort with various data sizes

Time (seconds):
node data size (GB)     Flink   Spark   MapReduce   Tez
10                        157     387        259     277
20                        350     652        555     729
40                        741    1135       1085    1709
80 (what we’ve seen)     1480    2171       2157    1887
160                      3127    4927       4796    3950

[Chart: the same numbers on a log-scale time axis vs. node data size (GB)]
* Map output compression turned on for Spark and Tez
Result of HashJoin
• 10 slave nodes
• org.apache.tez.examples.JoinDataGen
  • Small dataset: 256 MB
  • Large dataset: 240 GB (24 GB/node)
• Result: Flink is ~2x faster than Tez and ~4x faster than Spark (visit my blog for details)

Engine        Time (seconds)
Tez-0.7.0       770
Spark-1.5.1    1538
Flink-0.9.1     378

* No map output compression for both Spark and Tez, unlike in TeraSort
Result of HashJoin with swimlane & throughput graphs
[Swimlane graphs: Tez and Spark show idle periods between stages, whereas Flink’s operators (1 DataSource, 2 DataSource, 3 Join, 4 DataSink) overlap — the 2nd and 3rd in particular.]
[Throughput graphs: cluster network throughput (in/out) and cluster disk throughput (disk read/disk write) per engine, annotated with total volumes of 0.24 TB, 0.41 TB, 0.60 TB, 0.84 TB, 0.68 TB, and 0.74 TB.]
Flink’s shortcomings
• No support for map output compression
  • Small data blocks are pipelined between operators
• Job-level fault tolerance
  • Shuffle data are not materialized
• Low disk throughput during the post-shuffling phase
Low disk throughput during the post-shuffling phase
• Possible reason: sorting records from small files
  • Concurrent disk access to small files → too many disk seeks → low disk throughput
  • Other engines merge records from larger files than Flink
  • “Eager pipelining moves some of the sorting work from the mapper to the reducer”
    • from MapReduce Online (NSDI 2010)
[Throughput graphs for Flink, Tez, and MapReduce during the post-shuffling phase]
Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
• What else for better performance?
• Conclusion
MR2 – another MapReduce engine
• PhD thesis
  • MR2: Fault Tolerant MapReduce with the Push Model
  • developed for 3 years
• Provides the user interface of Hadoop MapReduce
  • No DAG support
  • No in-memory computation
  • No iterative computation
• Characteristics
  • Push model + fault tolerance
  • Techniques to boost HDD throughput: prefetching for mappers, preloading for reducers
MR2 pipeline
• 7 types of components with memory buffers
  1. Mappers & reducers: to apply user-defined functions
  2. Prefetcher & preloader: to eliminate concurrent disk access
  3. Sender & receiver & merger: to implement MR2’s push model
• Various buffers: to pass data between components w/o disk I/Os
• Minimum disk access (2 disk reads & 2 disk writes: R1, W1, R2, W2)
  • +1 disk write (W3) for fault tolerance
[Pipeline diagram: the components in sequence, with reads/writes R1, W1, R2, W2, and W3 marked]
Prefetcher & mappers
• Prefetcher loads data for multiple mappers
• Mappers do not read input from disks
[Diagram comparing <Hadoop MapReduce> and <MR2> with 2 mappers on a node: in Hadoop MapReduce, Mapper1 and Mapper2 read Blk1, Blk2, … themselves; in MR2, the prefetcher loads Blk1, Blk2, Blk3, Blk4, … (as halves Blk11/Blk12, Blk21/Blk22, …) ahead of the mappers. The disk throughput and CPU utilization timelines show MR2 sustaining higher utilization.]
Push model in MR2
• Node-to-node network connection for pushing data
  • To reduce the number of network connections
• Data transfer from memory buffer (similar to Flink’s pipelined execution)
  • Mappers store spills in a send buffer
  • Spills are pushed to the reducer side by the sender
• MR2 does local sorting before pushing data (similar to Spark)
• Fault tolerance (can be turned on/off)
  • Input ranges of each spill are known to the master so they can be reproduced
  • Spills are stored on disk for fast recovery (extra disk write)
Receiver & merger & preloader & reducer
• Merger produces a file from different partitions’ data
  • sorts each partition’s data
  • and then does interleaving
• Preloader preloads each group into the reduce buffer
• Reducers do not read data directly from disks
  • MR2 can eliminate concurrent disk reads from reducers thanks to the preloader
[Diagram: in the receiver’s managed memory, partition data P1–P4 from several spills are sorted and interleaved into groups; the preloader loads each group with 1 disk access for 4 partitions]
Result of sorting 80 GB/node (3.2 TB) with MR2

Engine                      Time (sec)   MR2 speedup
MapReduce in Hadoop-2.7.1     2157          2.42
Tez-0.7.0                     1887          2.12
Spark-1.5.1                   2171          2.44
Flink-0.9.1                   1480          1.66
MR2                            890           -
Disk & network throughput (Flink vs. MR2)
1. DataSource / Mapping: the prefetcher is effective — MR2 shows higher disk throughput
2. Partition / Shuffling: records to shuffle are generated faster in MR2
3. DataSink / Reducing: the preloader is effective — almost 2x throughput
[Cluster disk throughput (disk read/disk write) and cluster network throughput (in/out) graphs for Flink and MR2, with phases 1–3 marked]
PUMA (PUrdue MApreduce benchmarks suite)
• Experimental results using 10 nodes
Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
• What else for better performance?
• Conclusion
Conclusion
• Flink uses pipelined execution for both batch and stream processing
• It is even better than other batch processing engines for TeraSort & HashJoin
• Shortcomings due to pipelined execution
  • No fine-grained fault tolerance
  • No map output compression
  • Low disk throughput during the post-shuffling phase
Thank you!
Any questions?