31
The State of Spark And Where We’re Going Next Matei Zaharia

The State of Spark And Where Were Going Next Matei Zaharia

Embed Size (px)

Citation preview

Page 1: The State of Spark And Where Were Going Next Matei Zaharia

The State of SparkAnd Where We’re Going Next

Matei Zaharia

Page 2: The State of Spark And Where Were Going Next Matei Zaharia

Community Growth

Page 3: The State of Spark And Where Were Going Next Matei Zaharia

Project History

Spark started as research project in 2009

Open sourced in 2010»1st version was 1600 LOC, could run Wikipedia

demo

Growing community since

Entered Apache Incubator in June 2013

Page 4: The State of Spark And Where Were Going Next Matei Zaharia

Development CommunityWith over 100 developers and 25 companies, one of the most active communities in big data

Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)

Past 6 months: more active devs than Hadoop MapReduce!

Page 5: The State of Spark And Where Were Going Next Matei Zaharia

Development CommunityHealthy across the whole ecosystem

Page 6: The State of Spark And Where Were Going Next Matei Zaharia

Release Growth

Spark 0.6:- Java API, Maven, standalone mode

- 17 contributors

Sept ‘13Feb ‘13Oct ‘12

Spark 0.7:- Python API, Spark Streaming

- 31 contributors

Spark 0.8:- YARN, MLlib, monitoring UI

- 67 contributors

Page 7: The State of Spark And Where Were Going Next Matei Zaharia

YARN support (Yahoo!)

Columnar compression in Shark (Yahoo!)

Fair scheduling (Intel)

Metrics reporting (Intel, Quantifind)

New RDD operators (Bizo, ClearStory)

Scala 2.10 support (Imaginea)

Some Community Contributions

Page 8: The State of Spark And Where Were Going Next Matei Zaharia

Conferences

AMP Camp 1 (Aug 2012)

AMP Camp 2 (Aug 2013)

Spark Summit (Nov 2013)

0

100

200

300

400

500

Att

en

dees

Page 9: The State of Spark And Where Were Going Next Matei Zaharia

Projects Built on Spark

Spark

Spark Streamin

g(real-time)

GraphX(graph)

Shark(SQL)

MLbase(machine learning)

BlinkDB

Page 10: The State of Spark And Where Were Going Next Matei Zaharia

What’s Next?

Page 11: The State of Spark And Where Were Going Next Matei Zaharia

Our View

While big data tools have advanced a lot, they are still far too difficult to tune and use

Goal: design big data systems that are as powerful & seamless as those for small data

Page 12: The State of Spark And Where Were Going Next Matei Zaharia

Current Priorities

Standard libraries

Deployment

Out-of-the-box usability

Enterprise use +

Page 13: The State of Spark And Where Were Going Next Matei Zaharia

Standard Libraries

While writing K-means in 30 lines is great, it’s even better to call it from a library!

Spark’s MLlib and GraphX will be standard libraries supported by core developers

»MLlib in Spark 0.8 with 7 algorithms»GraphX coming soon»Both operate directly on RDDs

Page 14: The State of Spark And Where Were Going Next Matei Zaharia

Standard Libraries

val rdd: RDD[Array[Double]] = ...val model = KMeans.train(rdd, k = 10)

val graph = Graph(vertexRDD, edgeRDD)val ranks = PageRank.run(graph, iters = 10)

Page 15: The State of Spark And Where Were Going Next Matei Zaharia

Standard Libraries

Beyond these libraries, Databricks is investing heavily in higher-level projects

Spark Streaming:easier 24/7 operation and optimizations coming in 0.9

Shark:calling Spark libs (e.g. MLlib), optimizer, Hive 0.11 & 0.12Goal: a complete and interoperable stack

Page 16: The State of Spark And Where Were Going Next Matei Zaharia

Deployment

Want Spark to easily run anywhere

Spark 0.8: much improved YARN, EC2 support

Spark 0.8.1: support for YARN 2.2

SIMR: launch Spark in MapReduce clusters as a Hadoop job (no installation needed!)

»For experimenting; see talk by Ahir

Page 17: The State of Spark And Where Were Going Next Matei Zaharia

Monitoring and metrics (0.8)

Better support for large # of tasks (0.8.1)

High availability for standalone mode (0.8.1)

External hashing & sorting (0.9)

Ease of Use

Long-term: remove need to tune beyond defaults

Page 18: The State of Spark And Where Were Going Next Matei Zaharia

Next Releases

Spark 0.8.1 (this month)»YARN 2.2, standalone mode HA, optimized

shuffle, broadcast & result fetching

Spark 0.9 (Jan 2014)»Scala 2.10 support, configuration system,

Spark Streaming improvements

Page 19: The State of Spark And Where Were Going Next Matei Zaharia

What Makes Spark Unique?

Page 20: The State of Spark And Where Were Going Next Matei Zaharia

Big Data Systems Today

MapReduce

Pregel

Dremel

GraphLab

Storm

Giraph

Drill Tez

Impala

S4 …

Specialized systems(iterative, interactive and

streaming apps)

General batchprocessing

Page 21: The State of Spark And Where Were Going Next Matei Zaharia

Spark’s Approach

Instead of specializing, generalize MapReduceto support new apps in same engine

Two changes (general task DAG & data sharing) are enough to express previous models!

Unification has big benefits»For the engine»For users

Spark

Str

eam

ing

Gra

phX

Shark

MLb

ase

Page 22: The State of Spark And Where Were Going Next Matei Zaharia

Code Size

Hadoo

p Map

Reduc

e

Stor

m (S

tream

ing)

Impa

la (S

QL)

Giraph

(Gra

ph)

Spar

k0

20000400006000080000

100000120000140000

non-test, non-example source lines

Page 23: The State of Spark And Where Were Going Next Matei Zaharia

Code Size

Hadoo

p Map

Reduc

e

Stor

m (S

tream

ing)

Impa

la (S

QL)

Giraph

(Gra

ph)

Spar

k0

20000400006000080000

100000120000140000

non-test, non-example source lines

Streaming

Page 24: The State of Spark And Where Were Going Next Matei Zaharia

Code Size

Hadoo

p Map

Reduc

e

Stor

m (S

tream

ing)

Impa

la (S

QL)

Giraph

(Gra

ph)

Spar

k0

20000400006000080000

100000120000140000

non-test, non-example source lines

StreamingShark*

* also calls into Hive

Page 25: The State of Spark And Where Were Going Next Matei Zaharia

Code Size

Hadoo

p Map

Reduc

e

Stor

m (S

tream

ing)

Impa

la (S

QL)

Giraph

(Gra

ph)

Spar

k0

20000400006000080000

100000120000140000

non-test, non-example source lines

Streaming

GraphX

Shark*

* also calls into Hive

Page 26: The State of Spark And Where Were Going Next Matei Zaharia

Performance

0

5

10

15

20

25

30

35

40

45

Hiv

eIm

pala

(d

isk)

Imp

ala

(m

em

)S

hark

(d

isk)

Sh

ark

(m

em

)

Re

sp

on

se

Tim

e (

s)

SQL0

5

10

15

20

25

30

Had

oop

Gir

ap

h

Gra

ph

X

Re

sp

on

se

Tim

e (

min

)Graph

0

5

10

15

20

25

30

35

Sto

rm

Sp

ark

Th

rou

gh

pu

t (M

B/s

/no

de

)

Streaming

Page 27: The State of Spark And Where Were Going Next Matei Zaharia

What it Means for UsersSeparate frameworks:

…HDFS read

HDFS writeE

TL HDFS

readHDFS writetr

ai

n HDFS read

HDFS writeq

ue

ry

HDFS

HDFS read E

TL

trai

nq

ue

ry

Spark: Interactiveanalysis

Page 28: The State of Spark And Where Were Going Next Matei Zaharia

Combining Processing Types

val points = sc.runSql[Double, Double]( “select latitude, longitude from historic_tweets”)

val model = KMeans.train(points, 10)

sc.twitterStream(...) .map(t => (model.closestCenter(t.location), 1)) .reduceByWindow(“5s”, _ + _)

From Scala:

Page 29: The State of Spark And Where Were Going Next Matei Zaharia

Combining Processing Types

GENERATE KMeans(tweet_locations) AS TABLE tweet_clusters

// Scala table generating function (TGF):object KMeans { @Schema(spec = “x double, y double, cluster int”) def apply(points: RDD[(Double, Double)]) = { ... }}

From SQL (in Shark 0.8.1):

Page 30: The State of Spark And Where Were Going Next Matei Zaharia

Conclusion

Next challenge in big data will be complex and low-latency applications

Spark offers a unified engine to tackle and combine these apps

Best strength is the community: enjoy Spark Summit!

Page 31: The State of Spark And Where Were Going Next Matei Zaharia

ContributorsAaron DavidsonAlexander PivovarovAli GhodsiAmeet TalwalkarAndre ShumacherAndrew AshAndrew PsaltisAndrew XiaAndrey KouznetsovAndy FengAndy KonwinskiAnkur DaveAntonio LupherBenjamin HindmanBill ZhaoCharles ReissChris MattmannChristoph GrothausChristopher NguyenChu TongCliff EngleCody KoeningerDavid McCauleyDenny BritzDmitriy LyubimovEdison TungEric ZhangErik van Oosten

Ethan JewettEvan ChanEvan SparksEwen Cheslack-PostavaFabrizio MiloFernand PajotFrank DaiGavin LiGinger SmithGiovanni DelussuGrace HuangHaitao YaoHaoyuan LiHarold LimHarvey FengHenry MilnerHenry SaputraHiral PatelHolden KarauIan BussImran RashidIsmael JumaJames PhillpottsJason DaiJerry ShaoJey KottalamJoseph E. GonzalezJosh Rosen

Justin MaKalpit ShahKaren FengKarthik TungaKay OusterhoutKody KoenigerKonstantin BoudnikLee Moon SooLian ChengLiang-Chi HsiehMarc MercerMarek KolodziejMark HamstraMatei ZahariaMatthew TaylorMichael HeuerMike PottsMikhail BautinMingfei ShiMosharaf ChowdhuryMridul MuralidharanNathan HowellNeal WigginsNick PentreathOlivier GriselPatrick WendellPaul CavallaroPaul Ruan

Peter SankauskasPierre BorckmansPrabeesh K.Prashant SharmaRam SriharshaRavi PandyaRay RacineReynold XinRichard BenkovskyRichard McKinleyRohit RaiRoman TkalenkoRyan LeCompteS. KumarSean McNamaraShane HuangShivaram VenkataramanStephen HabermanTathagata DasThomas DudziakThomas GravesTimothy HunterTyson HamiltonVadim ChekanWu ZemingXinghao Pan