1 Challenge the future Storage in Big Data Systems And the roles Flash can play Tom Hubregtsen

Storage in Big Data Systems
And the roles Flash can play

Tom Hubregtsen

Table of contents

• Evolution of Big Data Systems

• Research questions

• Background information and experiments

• Conclusion and future work

• Discussion

Evolution of Big Data Systems

High Performance Computing
• Scalable: Yes
• Resilient: Yes
• Easy to use: No

Big Data Systems are:
• Scalable
• Resilient
• Easy to use

Generation 1: MapReduce
• Workload: Batch/Unstructured
• Resiliency (Hadoop): through data replication
• Key parameter: Disk bandwidth

Generation 2
• Workload: Interactive/Iterative
• Resiliency (Spark): through in-memory re-computation
• Key parameter: Memory capacity

How could Flash fit in?

Difference

              DRAM        Flash   HDD      Unit
Type          DDR3 1600   SATA    SATA
Bandwidth/$   1           0.1     0.01     Gb/s/$
IOPS/$        1,000,000   1,000   1        IOPS/$
Capacity/$    100         1,000   10,000   MB/$

Research questions

Can Flash be used to further optimize Big Data Systems?

• How does Spark relate to Hadoop for an iterative algorithm?

• How does Spark perform when constraining available memory?

• Can we improve Spark by using Flash connected as file storage?

• Can we improve Spark by using Flash connected as secondary object store?

Single Source Shortest Path

[Figure: a graph explored outward from the source node in four steps. The source is labeled 0, then three nodes are labeled 1, three more are labeled 2, and a final node is labeled 3.]

[Logos: Apache Spark, Apache Hadoop]
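The labeling in the Single Source Shortest Path slides (0 at the source, then 1, 2, 3) matches a breadth-first search on an unweighted graph. The following is a minimal single-machine sketch of that idea, not the implementation used in the experiments. The function name `sssp_unweighted` and the eight-node `example_graph` are hypothetical and chosen only to reproduce the label counts shown on the slides.

```python
from collections import deque


def sssp_unweighted(graph, source):
    """Return hop distances from `source` for a graph given as adjacency lists."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in dist:          # first visit is the shortest path
                dist[neighbor] = dist[node] + 1
                frontier.append(neighbor)
    return dist


# Hypothetical graph: one source, three nodes at distance 1,
# three at distance 2 and one at distance 3.
example_graph = {
    "s": ["a", "b", "c"],
    "a": ["s", "d"],
    "b": ["s", "e"],
    "c": ["s", "f"],
    "d": ["a", "g"],
    "e": ["b"],
    "f": ["c"],
    "g": ["d"],
}
distances = sssp_unweighted(example_graph, "s")
```

Each pass over the current frontier corresponds to one iterative step in the Hadoop and Spark versions discussed next.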

Single Source Shortest Path - Generation 1: Apache Hadoop

Hadoop: Initialization step

Hadoop: Iterative step
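The diagrams for the Hadoop steps are not preserved in this transcript. As a rough illustration only, one common way to express an iterative shortest-path step in the MapReduce model is sketched below in plain Python (no Hadoop API is used). All names (`map_step`, `reduce_step`, `iterate`, `run`) and the record layout are assumptions. Note that the adjacency list is re-emitted on every round, which is the "static data in the data flow" that a MapReduce job has to carry.

```python
from collections import defaultdict

INF = float("inf")


def map_step(node, record):
    """Emit the node's own record plus a candidate distance for each neighbor."""
    dist, neighbors = record
    yield node, ("NODE", dist, neighbors)   # static adjacency list travels along
    if dist != INF:
        for neighbor in neighbors:
            yield neighbor, ("DIST", dist + 1, None)


def reduce_step(node, values):
    """Keep the adjacency list and the smallest distance seen for this node."""
    best, neighbors = INF, []
    for kind, dist, adjacency in values:
        if kind == "NODE":
            neighbors = adjacency
        best = min(best, dist)
    return node, (best, neighbors)


def iterate(records):
    """One round: map, group by key (stands in for shuffle and sort), reduce."""
    grouped = defaultdict(list)
    for node, record in records.items():
        for key, value in map_step(node, record):
            grouped[key].append(value)
    return dict(reduce_step(key, values) for key, values in grouped.items())


def run(records):
    """Repeat rounds until nothing changes; on Hadoop each round is a separate job."""
    rounds = 0
    while True:
        updated = iterate(records)
        rounds += 1
        if updated == records:
            return updated, rounds
        records = updated


# Hypothetical input: the initialization step has set the source to 0, the rest to infinity.
initial = {"s": (0, ["a", "b"]), "a": (INF, ["c"]), "b": (INF, []), "c": (INF, [])}
final, rounds = run(initial)
```

In a real Hadoop job the output of each round would be written to HDFS and read back by the next one, which is where disk bandwidth becomes the key parameter.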

Single Source Shortest Path - Generation 2: Apache Spark

Spark: Initialization step

Spark: Iterative step

Difference

• Main difference: In-memory computation

• Effects:
  - No use of HDFS on HDD other than input and output
  - No need to keep static data in data flow

Experiment 1a: Spark vs Hadoop - Overview

• Research question: How does Apache Spark relate to Apache Hadoop for an iterative algorithm?

• Limitation: Under normal conditions

• Expectations:
  - Initialization step: Apache Spark 2x faster
  - Iterative step: Apache Spark 20x-100x faster

Experiment 1a: Spark vs Hadoop - Setup

• Algorithm: Six degrees of separation from Kevin Bacon

• Input set: 10,000 movies, 1-101 actors per movie

• Hardware: IBM Power System S822L
  - two 12-core 3.02 GHz Power8 processor cards
  - 512 GB DRAM
  - Single Hard Disk Drive

• Software:
  - Ubuntu 14.04 Little Endian
  - Java 7.1
  - Apache Hadoop 2.2.0
  - Apache Spark 1.1.0

Experiment 1a: Spark vs Hadoop

[Figure: execution time of Apache Hadoop and Apache Spark, shown in three builds. Annotations: ~2x, ~30x.]

Experiment 1a: Spark vs Hadoop - Iterative step per phase

[Figure: bar chart of time in seconds on a logarithmic axis (1 to 1000) for Apache Hadoop and Apache Spark, grouped by map phase, sort phase*, reduce phase and total. Annotated speedups: ~90x, ~105x, ~10x.]

* sort + overhead

Experiment 1a: Spark vs Hadoop - Conclusion

• Research question: How does Apache Spark relate to Apache Hadoop for an iterative algorithm?

• Expectations:
  - Initialization step: Apache Spark 2x faster
  - Iterative step: Apache Spark 20x-100x faster

• Results:
  - Initialization step: Apache Spark 2x faster
  - Iterative step: Apache Spark 30x-100x faster

• Conclusion: Apache Spark performs equal to or better than Apache Hadoop under normal conditions

Spark RDDs

Definition: Read-only, partitioned collection of records

RDDs can only be created from
• Data in stable storage
• Other RDDs

Consist of 5 pieces of information:
• Set of partitions
• Set of dependencies on parent RDD
• Function to transform data from the parent RDD
• Metadata about its partitioning scheme
• Metadata about its data placement
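To make the five pieces of information concrete, here is a toy Python stand-in for an RDD. It is only a sketch: the class `SketchRDD` and all of its field names are hypothetical and are not Spark's API, it handles a single parent with a one-to-one partition mapping only, and it caches results in place for brevity even though real RDDs are read-only. It shows how a missing partition can be rebuilt from the parent by re-applying the stored function.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SketchRDD:
    """Toy stand-in for an RDD; names are illustrative, not Spark's API."""
    partitions: List[Optional[list]]                              # set of partitions (None = not in memory)
    parents: List["SketchRDD"] = field(default_factory=list)      # dependencies on parent RDDs
    transform: Optional[Callable[[list], list]] = None            # function applied to a parent partition
    partitioner: Optional[str] = None                             # metadata: partitioning scheme
    preferred_locations: List[str] = field(default_factory=list)  # metadata: data placement

    def compute(self, index: int) -> list:
        """Return partition `index`, rebuilding it from the parent if it is missing."""
        if self.partitions[index] is None:
            if not self.parents or self.transform is None:
                raise RuntimeError("partition lost and nothing to recompute it from")
            parent_data = self.parents[0].compute(index)
            self.partitions[index] = self.transform(parent_data)
        return self.partitions[index]


# An RDD created from "data in stable storage", and one derived from another RDD.
base = SketchRDD(partitions=[[1, 2], [3, 4]])
doubled = SketchRDD(
    partitions=[None, None],
    parents=[base],
    transform=lambda part: [x * 2 for x in part],
)
first = doubled.compute(0)
doubled.partitions[1] = None        # simulate a partition that was never cached or was lost
recovered = doubled.compute(1)
```

Because nothing is computed until `compute` is called, the sketch also hints at lazy evaluation: building `doubled` does no work by itself.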

Spark RDDs: Lineage

[Figure: the RDD slide again, with the label "Lineage" marking the information used to recompute an RDD from its parents.]

Spark RDDs: Dependencies

Spark: Memory management

• General: 20%
• Shuffle: 20%
• RDD: 60%
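As a small worked example of the 20/20/60 split, the sketch below divides a given heap size into the three regions. The function name `spark_memory_regions` is hypothetical. To the best of my knowledge the defaults correspond to the Spark 1.x settings `spark.storage.memoryFraction` (0.6) and `spark.shuffle.memoryFraction` (0.2), but treat that mapping as an assumption. The sketch ignores any safety margins Spark may apply within each region.

```python
def spark_memory_regions(heap_gb, rdd_fraction=0.6, shuffle_fraction=0.2):
    """Split a heap into RDD, shuffle and general regions (sizes in GB)."""
    if rdd_fraction + shuffle_fraction > 1:
        raise ValueError("fractions must not exceed 1 in total")
    rdd = heap_gb * rdd_fraction
    shuffle = heap_gb * shuffle_fraction
    return {"rdd": rdd, "shuffle": shuffle, "general": heap_gb - rdd - shuffle}


# With 10 GB available, roughly 6 GB is for cached RDDs, 2 GB for shuffle and 2 GB for everything else.
regions = spark_memory_regions(10)
```

This is why constraining the available memory shrinks the RDD region first in absolute terms: it takes the largest share.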

Experiment 1b: Constrain memory - Overview

• Research question: How does Apache Spark perform when constraining available memory?

• Expectations: Degrade gracefully to the performance of Apache Hadoop

Experiment 1b: Constrain memory - Setup

• Algorithm: Six degrees of separation from Kevin Bacon

• Input set: 10,000 movies, 1-101 actors per movie

• Hardware: IBM Power System S822L
  - two 12-core 3.02 GHz Power8 processor cards
  - 512 GB DRAM
  - Single Hard Disk Drive

• Software:
  - Ubuntu 14.04 Little Endian
  - Java 7.1
  - Apache Spark 1.1.0 with varying memory sizes
  - Apache Hadoop 2.2.0 with no memory constraints

Experiment 1b: Constrain memory - No explicit cache

[Figure: execution time in seconds (0 to 1200) against available memory in gigabytes (1 to 20, and 150). Series: Spark no cache.]

Spark: RDD caching

Shuffle region

RDD region

Experiment 1b: Constrain memory - Cache the iterative RDD

[Figure: execution time in seconds (0 to 1200) against available memory in gigabytes (1 to 20, and 150). Series: Spark no cache, Spark cache iterative.]

Experiment 1b: Constrain memory - Cache all RDDs

[Figure: execution time in seconds (0 to 1200) against available memory in gigabytes (1 to 20, and 150). Series: Spark no cache, Spark cache iterative, Spark cache all.]

Experiment 1b: Constrain memory - Hadoop vs Spark constrained

[Figure: execution time in seconds (0 to 900) against available memory in gigabytes (1 to 20, and 150). Series: Spark cache iterative, Hadoop.]

[Figure: the same Hadoop vs Spark constrained chart, annotated "Room for improvement!"]

Experiment 1b: Constrain memory - Conclusion

• Research question: How does Apache Spark perform when constraining available memory?

• Expectations: Degrade gracefully to the performance of Apache Hadoop

• Conclusion: Performance degrades gracefully, but to a level worse than Apache Hadoop

Data storage: General ways to store

                                 Serialization   OS involvement
Serialized in the file system    yes             yes
Key-value store in OS            semi            yes
Key-value store in user space    semi            no
User space object store          no              no

Data storage: CAPI interface

Data storage: Data in Apache Spark

Experiment 2a: Flash with a file system - Overview

• Research question: Can we improve Spark by using Flash connected as file storage?

• Expectations: Speedup when loading/storing I/O, and when spilling

• Sanity check: Ram-disk before Flash as File System

Experiment 2a: Flash with a file system - Setup

• Algorithm: Six degrees of separation from Kevin Bacon

• Input set: 10,000 movies, 1-101 actors per movie

• Hardware: IBM Power System S822L
  - two 12-core 3.52 GHz Power8 processor cards
  - 256 GB DRAM
  - Single Hard Disk Drive

• Software:
  - Ubuntu 14.04 Little Endian
  - Java 7.1
  - Apache Spark 1.1.0 with varying memory sizes

Experiment 2a: Flash with a file system - Sanity check: ram-disk

[Figure: execution time in milliseconds on a logarithmic scale (100 to 10000) against available memory in gigabytes (1 to 49). Series: Spark on HDD, Spark on ramdisk, Baseline. Annotation: 1.01x.]

Experiment 2a: Flash with a file system - Discussion

+ Faster writing speeds

- Data aggregation

- OS involvement

Experiment 2a: Flash with a file system - Conclusion

• Research question: Can we improve Spark by using Flash connected as file storage?

• Expectations: Speedup when loading/storing I/O, and when spilling

• Sanity check: Ram-disk before Flash as File System

• Results: No noticeable speedup

• Conclusion: No, as it did not show a noticeable speedup

Experiment 2b: Flash as object store - Overview

• Research question: Can we improve Spark by using Flash connected as secondary object store?

• Expectations: Noticeable speedup due to lack of Operating System involvement and faster writing speeds

Experiment 2b: Flash as object store - Setup

• Algorithm: Six degrees of separation from Kevin Bacon

• Input set: 10,000 movies, 1-101 actors per movie

• Server: IBM Power System S822L
  - two 12-core 3.52 GHz Power8 processor cards
  - 256 GB DRAM
  - Single Hard Disk Drive

• Flash storage: IBM FlashSystem 840 with CAPI

• Software:
  - Ubuntu 14.04 Little Endian
  - Java 7.1
  - Apache Spark 1.1.0 with 3 GB of memory

Experiment 2b: Flash as object store - Results

Execution mode                 Execution time in seconds   Overhead in seconds
Normal execution               208                         -
Constrained memory             262                         54
Constrained using CAPI Flash   225                         17

Constrained memory versus constrained using CAPI Flash: ~70% reduction in overhead, 1.16x speedup
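The two headline figures follow directly from the three measured execution times. The short calculation below reproduces them; the variable names are my own, and the definitions of "overhead reduction" and "speedup" are the ones implied by the table (overhead relative to normal execution, speedup of CAPI Flash over plain constrained memory).

```python
normal_s = 208        # normal execution
constrained_s = 262   # constrained memory
capi_flash_s = 225    # constrained memory, using CAPI Flash

overhead_constrained = constrained_s - normal_s   # 54 seconds
overhead_capi = capi_flash_s - normal_s           # 17 seconds

# Share of the constrained-memory overhead that CAPI Flash removes.
overhead_reduction = 1 - overhead_capi / overhead_constrained
# Speedup of the CAPI Flash run over the plain constrained run.
speedup = constrained_s / capi_flash_s

print(f"overhead reduction: {overhead_reduction:.1%}")   # prints "overhead reduction: 68.5%"
print(f"speedup: {speedup:.2f}x")                         # prints "speedup: 1.16x"
```

The 68.5% is what the slides round to "~70% reduction".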

Experiment 2b: Flash as object store - Discussion

+ Faster writing speeds

+ No OS involvement

- Data aggregation (future work)

Experiment 2b: Flash as object store - Conclusion

• Research question: Can we improve Spark by using Flash connected as secondary object store?

• Expectations: Noticeable speedup due to lack of Operating System involvement and faster writing speeds

• Results: 70% reduction in overhead, 1.16x speedup

• Conclusion: Yes, as it showed a noticeable speedup

Conclusion

Can Flash be used to further optimize Big Data Systems?

• How does Spark relate to Hadoop for an iterative algorithm?
  - Equal or better

• How does Spark perform when constraining available memory?
  - Degrades gracefully, to a performance worse than Apache Hadoop

• Can we improve Spark by using Flash connected as file storage?
  - No, as it did not show a noticeable speedup

• Can we improve Spark by using Flash connected as secondary object store?
  - Yes, as it showed a noticeable speedup

• Our measured noticeable speedup gives a strong indication that Big Data Systems can be further optimized with CAPI Flash

Future work

• Remove overhead

• Flash as primary object store

Discussion

Contact details

Tom Hubregtsen

• Email: [email protected]

• Linkedin: www.linkedin.com/in/thubregtsen

Backup slides

Data storage: Flash in the Power8

Writing speeds in µs

Experiment 1: Spark vs Hadoop

[Figure annotation: ~30x]

Experiment 1: Staged timing - Spark iterative (log scale)

[Figure: bar chart on a logarithmic axis (ticks at 5, 50, 500) with bars for Initialisation + Stage 1, Stage 2, Stage 3, Stage 4, Stage 5, Stage 6 and Shutdown. Annotations: 18s, 12s.]

• Mapper: ~2.0s => 180/2 = 90x
• Reducer: ~1.3s => 137/1.3 = 105x
• Sorter: 15 - 3.3 - overhead => ?
• Overhead: 15 - 3.3 - sorter => ?

Spark: Execution

rdd1.join(rdd2).groupBy(…).filter(…)

• RDD Objects: build operator DAG
• DAGScheduler: split graph into stages of tasks; submit each stage as ready; agnostic to operators!
• TaskScheduler: launch tasks via cluster manager; retry failed or straggling tasks; doesn't know about stages
• Worker: execute tasks; store and serve blocks (Threads, Block manager)

Flow: RDD Objects → DAG → DAGScheduler → TaskSet → TaskScheduler → Task (via Cluster manager) → Worker; a failed stage is reported back to the DAGScheduler.

Source: Matei Zaharia, Spark

Hadoop - Execution

Hadoop - Scalable

Hadoop - Resilient

Hadoop - Ease of use

Spark - Execution

Spark RDDs: Resiliency and lazy evaluation

Characteristics of different storage

              DRAM                 Flash                 HDD
Type          DDR3 1600            SATA                  SATA
Bandwidth     102.4 Gb/s           12 Gb/s               1 Gb/s*
Bandwidth/$   1.5 × 10^0 Gb/s/$    9.2 × 10^-2 Gb/s/$    7.1 × 10^-3 Gb/s/$
IOPS          100,000,000          100,000               100
IOPS/$        1.4 × 10^6 IOPS/$    7.7 × 10^2 IOPS/$     7.1 × 10^-1 IOPS/$
Capacity      8 GB                 240 GB                4,000 GB
Capacity/$    1.1 × 10^2 MB/$      1.8 × 10^3 MB/$       2.8 × 10^4 MB/$
Cost          $70                  $130                  $140

*: Actual writing speed
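The per-dollar rows can be re-derived from the raw bandwidth, IOPS, capacity and cost figures in the table. The sketch below does that division; the names `devices` and `per_dollar` are hypothetical, and capacity per dollar is expressed in MB per dollar under the assumption that 1 GB = 1000 MB.

```python
devices = {
    # name: (bandwidth in Gb/s, IOPS, capacity in GB, cost in $), as listed in the table
    "DRAM": (102.4, 100_000_000, 8, 70),
    "Flash": (12, 100_000, 240, 130),
    "HDD": (1, 100, 4_000, 140),
}


def per_dollar(bandwidth_gbps, iops, capacity_gb, cost):
    """Divide each raw characteristic by the device cost."""
    return {
        "Gb/s/$": bandwidth_gbps / cost,
        "IOPS/$": iops / cost,
        "MB/$": capacity_gb * 1000 / cost,   # assumes 1 GB = 1000 MB
    }


ratios = {name: per_dollar(*spec) for name, spec in devices.items()}
for name, row in ratios.items():
    print(name, {unit: f"{value:.1e}" for unit, value in row.items()})
```

Read across a row, the orders of magnitude show the trade-off the talk builds on: DRAM leads on bandwidth and IOPS per dollar, HDD on capacity per dollar, and Flash sits between the two on every metric.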