23
Matei Zaharia, Mosharaf Chowdhury,Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive, LanguageIntegrated Cluster Computing UC Berkeley

Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Matei  Zaharia,  Mosharaf  Chowdhury,  Tathagata  Das,  Ankur  Dave,  Justin  Ma,  Murphy  McCauley,  Michael  Franklin,  Scott  Shenker,  Ion  Stoica  

Spark  Fast,  Interactive,  Language-­‐Integrated  Cluster  Computing  

UC  Berkeley  

Page 2: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Background  MapReduce  and  its  variants  greatly  simplified  big  data  analytics  by  hiding  scaling  and  faults  

However,  these  systems  provide  a  restricted  programming  model  

Can  we  design  similarly  powerful  abstractions  for  a  broader  class  of  applications?  

Page 3: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Motivation  Most  current  cluster  programming  models  are  based  on  acyclic  data  flow  from  stable  storage  to  stable  storage  

Map  

Map  

Map  

Reduce  

Reduce  

Input   Output  

Page 4: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Motivation  

Map  

Map  

Map  

Reduce  

Reduce  

Input   Output  

Benefits  of  data  flow:  runtime  can  decide  where  to  run  tasks  and  can  automatically  

recover  from  failures  

Most  current  cluster  programming  models  are  based  on  acyclic  data  flow  from  stable  storage  to  stable  storage  

Page 5: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Motivation  Acyclic  data  flow  is  inefficient  for  applications  that  repeatedly  reuse  a  working  set  of  data:  » Iterative  algorithms  (machine  learning,  graphs)  » Interactive  data  mining  tools  (R,  Excel,  Python)  

With  current  frameworks,  apps  reload  data  from  stable  storage  on  each  query  

Page 6: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Spark  Goal  Efficiently  support  apps  with  working  sets  » Let  them  keep  data  in  memory  

Retain  the  attractive  properties  of  MapReduce:  » Fault  tolerance  (for  crashes  &  stragglers)  » Data  locality  » Scalability  

Solution:  extend  data  flow  model  with  “resilient  distributed  datasets”  (RDDs)  

Page 7: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Outline  Spark  programming  model  

Applications  

Implementation  

Demo  

Page 8: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Programming  Model  Resilient  distributed  datasets  (RDDs)  » Immutable,  partitioned  collections  of  objects  » Created  through  parallel  transformations  (map,  filter,  groupBy,  join,  …)  on  data  in  stable  storage  » Can  be  cached  for  efficient  reuse  

Actions  on  RDDs  » Count,  reduce,  collect,  save,  …  

Page 9: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Example:  Log  Mining  Load  error  messages  from  a  log  into  memory,  then  interactively  search  for  various  patterns  

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(_.startsWith(“ERROR”))

messages = errors.map(_.split(‘\t’)(2))

cachedMsgs = messages.cache()

Block  1  

Block  2  

Block  3  

Worker  

Worker  

Worker  

Driver  

cachedMsgs.filter(_.contains(“foo”)).count

cachedMsgs.filter(_.contains(“bar”)).count

. . .

tasks  

results  

Cache  1  

Cache  2  

Cache  3  

Base  RDD  Transformed  RDD  

Action  

Result:  full-­‐text  search  of  Wikipedia  in  <1  sec  (vs  20  sec  for  on-­‐disk  data)  Result:  scaled  to  1  TB  data  in  5-­‐7  sec  

(vs  170  sec  for  on-­‐disk  data)  

Page 10: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

RDD  Fault  Tolerance  RDDs  maintain  lineage  information  that  can  be  used  to  reconstruct  lost  partitions  

Ex:  

 

 

cachedMsgs = textFile(...).filter(_.contains(“error”)) .map(_.split(‘\t’)(2)) .cache()

HdfsRDD  path:  hdfs://…  

FilteredRDD  func:  contains(...)  

MappedRDD  func:  split(…)   CachedRDD  

Page 11: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Example:  Logistic  Regression  

Goal:  find  best  line  separating  two  sets  of  points  

+

+ + +

+

+

+ + +

– – –

– – –

+

target  

random  initial  line  

Page 12: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Example:  Logistic  Regression  val data = spark.textFile(...).map(readPoint).cache() var w = Vector.random(D) for (i <- 1 to ITERATIONS) { val gradient = data.map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient } println("Final w: " + w)

Page 13: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Logistic  Regression  Performance  

0  500  

1000  1500  2000  2500  3000  3500  4000  4500  

1   5   10   20   30  

Run

ning

 Tim

e  (s)  

Number  of  Iterations  

Hadoop  

Spark  

127  s  /  iteration  

first  iteration  174  s  further  iterations  6  s  

Page 14: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Spark  Applications  Twitter  spam  classification  (Monarch)  

In-­‐memory  analytics  on  Hive  data  (Conviva)  

Traffic  prediction  using  EM  (Mobile  Millennium)  

K-­‐means  clustering  

Alternating  Least  Squares  matrix  factorization  

Network  simulation  

Page 15: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Conviva  GeoReport  

Aggregations  on  many  group  keys  w/  same  WHERE  clause  

Gains  come  from:  » Not  re-­‐reading  unused  columns  » Not  re-­‐reading  filtered  records  » Not  repeated  decompression  »  In-­‐memory  storage  of  deserialized  objects  

0.5  

20  

0   5   10   15   20  

Spark  

Hive  

Time  (hours)  

Page 16: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Generality  of  RDDs  RDDs  can  efficiently  express  many  proposed  cluster  programming  models  » MapReduce  =>  map  and  reduceByKey  operations  » Dryad  =>  Spark  runs  general  DAGs  of  tasks  » Pregel  iterative  graph  processing  =>  Bagel  » SQL  =>  Hive  on  Spark  (Hark?)  

Can  also  express  apps  that  neither  of  these  can  

Page 17: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Implementation  Spark  runs  on  the  Mesos  cluster  manager,  letting  it  share  resources  with  Hadoop  &  other  apps  

Can  read  from  any  Hadoop  input  source  (e.g.  HDFS)  

Spark   Hadoop   MPI  

Mesos  

Node   Node   Node   Node  

…  

~7000  lines  of  code;  no  changes  to  Scala  compiler  

Page 18: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Language  Integration  Scala  closures  are  Serializable  Java  objects  » Serialize  on  master,  load  &  run  on  workers  

Not  quite  enough  » Nested  closures  may  reference  entire  outer  scope  » May  pull  in  non-­‐Serializable  variables  not  used  inside  » Solution:  bytecode  analysis  +  reflection  

Other  tricks  using  custom  serialized  forms  (e.g.  “accumulators”  as  syntactic  sugar  for  counters)  

Page 19: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Interactive  Spark  Modified  Scala  interpreter  to  allow  Spark  to  be  used  interactively  from  the  command  line  

Required  two  changes:  » Modified  wrapper  code  generation  so  that  each  line  typed  has  references  to  objects  for  its  dependencies  » Distribute  generated  classes  over  the  network  

Enables  in-­‐memory  exploration  of  big  data  

Page 20: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Demo  

Page 21: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Conclusion  Spark’s  resilient  distributed  datasets  are  a  simple  programming  model  for  a  wide  range  of  apps  

Download  our  open  source  release  at  

www.spark-­‐project.org  

[email protected]  

Page 22: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Related  Work  DryadLINQ  » Build  queries  through  language-­‐integrated  SQL  operations  on  lazy  datasets  » Cannot  have  a  dataset  persist  across  queries  

Relational  databases  » Lineage/provenance,  logical  logging,  materialized  views  

Piccolo  » Parallel  programs  with  shared  distributed  hash  tables;  similar  to  distributed  shared  memory  

Iterative  MapReduce  (Twister  and  HaLoop)  » Cannot  define  multiple  distributed  datasets,  run  different  map/reduce  pairs  on  them,  or  query  data  interactively  

Page 23: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&

Related  Work  Distributed  shared  memory  (DSM)  » Very  general  model  allowing  random  reads/writes,  but  hard  to  implement  efficiently  (needs  logging  or  checkpointing)  

RAMCloud  » In-­‐memory  storage  system  for  web  applications  » Allows  random  reads/writes  and  uses  logging  like  DSM  

Nectar  » Caching  system  for  DryadLINQ  programs  that  can  reuse  intermediate  results  across  jobs  » Does  not  provide  caching  in  memory,  explicit  support  over  which  data  is  cached,  or  control  over  partitioning  

SMR  (functional  Scala  API  for  Hadoop)