Deploying Spark Streaming with Kafka: Gotchas and Performance Analysis
Nishkam Ravi, Cloudera
© Cloudera, Inc. All rights reserved.



Page 1:

1 © Cloudera, Inc. All rights reserved.

Deploying Spark Streaming with Kafka: Gotchas and Performance Analysis

Nishkam Ravi, Cloudera

Page 2:

Uniting Spark and Hadoop: The One Platform Initiative

• Management: Leverage Hadoop-native resource management.
• Security: Full support for Hadoop security and beyond.
• Scale: Enable 10k-node clusters.
• Streaming: Support for 80% of common stream processing workloads.

Page 3:

Cloudera Customer Use Cases

Core Spark
• Financial Services: Portfolio Risk Analysis, ETL Pipeline Speed-Up, 20+ years of stock data
• Health: Identify disease-causing genes in the full human genome; calculate Jaccard scores on health care data sets
• ERP: Optical Character Recognition and Bill Classification
• Data Services (1010): Trend analysis, Document classification (LDA), Fraud analytics

Spark Streaming
• Financial Services: Online Fraud Detection
• Health: Incident Prediction for Sepsis
• Retail: Online Recommendation Systems, Real-Time Inventory Management
• Ad Tech: Real-Time Ad Performance Analysis

Page 4:

• In-memory cluster compute system
• Exports intuitive one-liners for data processing
  • map, reduceByKey, filter, distinct, union, sortBy, ...
• General task graphs
  • Transformations pipelined where possible
• Great performance
  • Orders of magnitude faster than MapReduce
• Fault tolerant
• Unified multi-purpose stack
  • Procedural, ML, Graph, SQL, Streaming, etc.

[Diagram: a Driver coordinating Worker nodes; each Worker runs Executors that execute Tasks]

Page 5:

Spark Streaming

• Extension of Spark Core
• Real-time data processing
• Use cases
  • Sensor data aggregation
  • Filtered storage
  • Tweet analysis
  • Web stats
  • Fraud detection
  • ...

Volume, Variety, Velocity: 50B connected devices by 2020

Page 6:

Spark Streaming Examples

val dataRDD = sparkContext.textFile("hdfs://...")
dataRDD.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()

val dataDStream = KafkaUtils.createStream(..)
dataDStream.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

val dataDStream = TwitterUtils.createStream(..)
dataDStream.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

val dataDStream = streamingContext.actorStream(..)
dataDStream.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

Page 7:

Spark Streaming Abstractions

• DStreams and Receivers
  • Micro-batches of the input data stream
• Micro-batch → RDD
• Numerous transformations supported
  • map, flatMap, filter, join, reduceByKey, ...
  • updateStateByKey, reduceByKeyAndWindow
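The windowed transformations above (reduceByKeyAndWindow and friends) combine several micro-batches into one result. A minimal plain-Scala sketch of those semantics, with no Spark dependency; the names `batches` and `countsOverWindow` are invented for illustration:

```scala
// Plain-Scala sketch (no Spark; names invented) of windowed-operation
// semantics: a window covers the last `windowLength` micro-batches and the
// reduce function is applied per key across all of them, mimicking
// reduceByKeyAndWindow(_ + _, windowDuration, slideDuration).
object WindowSketch {
  // Each element is one micro-batch of (word, count) pairs.
  val batches: Vector[Seq[(String, Int)]] = Vector(
    Seq(("a", 1), ("b", 1)),
    Seq(("a", 2)),
    Seq(("b", 3), ("c", 1))
  )

  // Sum counts per key over the most recent `windowLength` micro-batches.
  def countsOverWindow(windowLength: Int): Map[String, Int] =
    batches.takeRight(windowLength).flatten
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    println(countsOverWindow(2)) // only the last two micro-batches contribute
    println(countsOverWindow(3)) // everything seen so far
  }
}
```

In Spark the same idea is expressed declaratively; the framework keeps the window of RDDs for you and can use an inverse reduce function to update it incrementally.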

 

Page 8:

Windowed Operations and Reliability

• Checkpointing
  • Restore state after driver failure
  • streamingContext.checkpoint(<hdfs path>)
• Write Ahead Logs
  • Recover data blocks after failure
  • spark.streaming.receiver.writeAheadLog.enable = true

Page 9:

Kafka

• Publish-subscribe system

[Diagram: Producers publishing to Kafka brokers]

Page 10:

Kafka Topics and Partitions

kafka-topics.sh --zookeeper <ip:port> --create --topic <name> --partitions <num> --replication-factor <k>

Page 11:

Spark Streaming + Kafka

[Diagram: Producers publishing to Kafka brokers, consumed by Spark Streaming]

Page 12:

Spark Streaming + Kafka

// Create streaming context with batchSize
val sparkConf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(sparkConf, Duration(batchSize))

// Set up receivers and DStreams, with numReceivers = numPartitions
val messages: Array[ReceiverInputDStream[(String, String)]] =
  new Array[ReceiverInputDStream[(String, String)]](numPartitions)
for (i <- 0 until numPartitions) {
  messages(i) = KafkaUtils.createStream(ssc, zookeepers, group, topicMap, StorageLevel.MEMORY_ONLY)
}
val lines = ssc.union(messages)

// Process data (map(_._2) extracts the message payload from the (key, message) pairs)
lines.map(_._2).flatMap(_.split("\\s+")).map(x => (x, 1)).reduceByKey(_ + _).print()

• Receiver-based
  • Each DStream is associated with a receiver
  • Multiple receivers can receive data simultaneously
  • Receivers are scheduled in a round-robin fashion (starting Spark 1.5)

Page 13:

Spark Streaming + Kafka

• Receiver-less (Direct API)
  • Introduced in Spark 1.3
  • No receivers are launched
  • Data is received directly from Kafka using the low-level consumer API
• Pros
  • Simplified parallelism and code
    • No need for multiple DStreams
    • One-to-one mapping: num RDD partitions = num Kafka partitions
  • Does not require WAL
    • Leverages Kafka for data recovery
  • Effectively exactly-once read semantics
• Cons
  • ZooKeeper not updated with consumed offsets
  • Re-partitioning needed for sufficient parallelism

Page 14:

Spark Streaming + Kafka (Direct)

// Create context with batchSize
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCountNew")
val ssc = new StreamingContext(sparkConf, Duration(batchSize))

// Set up the direct Kafka stream
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)

// Process data (map(_._2) extracts the message payload from the (key, message) pairs)
lines.map(_._2).repartition(numWorkers * 16).flatMap(_.split("\\s+")).map(x => (x, 1)).reduceByKey(_ + _).print()

Page 15:

Performance Analysis

• Requests from customers
  • To benchmark clusters
  • Understand hardware requirements
  • What to expect in terms of performance
  • Configure/tune streaming jobs
• Goals
  • Measure throughput of Spark Streaming + Kafka
  • Measure latency for small batch sizes
  • Determine if sub-second latencies are achievable
  • Explore the configuration space
  • Compare different APIs
  • Identify (and fix) issues
• Hardware
  • Intel Xeon L5630 2.13GHz
  • 16 vcores, 48GB per node
• Setup
  • Kafka brokers on 1-2 nodes
  • Kafka producers on 1-2 nodes
  • Both standalone and YARN
  • Different configuration options
  • Receiver-based API
  • Receiver-less (direct) API
  • Complex event processing
• Workload: WordCount (CPU-bound)

Page 16:

Issues/observations

• Poor receiver scheduling
  • Fixed to round robin
  • Improved throughput
• Kafka partitions
  • Non-uniform distribution
  • Use one producer per partition
• Direct API
  • High initial processing time
  • Stable throughput
  • Repartitioning needed
• Receiver-based API
  • Stable processing time
  • High variation in throughput

Page 17:

Common pitfalls

• Executor (mis)configuration
  • Number, size
• Insufficient parallelism
  • Receive, compute
  • Number of Kafka partitions
• YARN containers fail
  • Fetch failures
  • Timeout values, ulimit
• Caching and serialization
• Use of collect() and such
• Use of groupByKey(); prefer reduceByKey()
• Use of rdd.foreach(); prefer rdd.foreachPartition()
• GC tuning
  • Use smaller executors
  • Higher timeout values
  • Tungsten
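The groupByKey vs. reduceByKey pitfall above comes down to shuffle volume: reduceByKey pre-aggregates on the map side, so far fewer records cross the network. A plain-Scala sketch of that effect (no Spark; names like `mapPartition` and the record counts are invented for illustration):

```scala
// Plain-Scala sketch (no Spark; names invented) of why reduceByKey beats
// groupByKey: with a map-side combine, only one record per distinct key
// leaves each map partition, instead of every raw record.
object CombineSketch {
  val mapPartition: Seq[(String, Int)] =
    Seq(("a", 1), ("a", 1), ("b", 1), ("a", 1), ("b", 1))

  // groupByKey-style: every raw record is shuffled as-is.
  def recordsShuffledWithoutCombine: Int = mapPartition.size

  // reduceByKey-style: pre-aggregate per key before the shuffle.
  def recordsShuffledWithCombine: Int =
    mapPartition
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2).sum }
      .size

  def main(args: Array[String]): Unit = {
    println(recordsShuffledWithoutCombine) // 5 records would cross the network
    println(recordsShuffledWithCombine)    // only 2 (one per distinct key)
  }
}
```

The foreach vs. foreachPartition point is analogous: per-partition setup (e.g. opening a connection) is paid once per partition instead of once per record.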

Page 18:

Configuration and tuning

• Executors
  • Executor misconfiguration accounts for ~20% of support tickets
  • Not too big, not too many: 2-6 executors per node
    • spark.executor.instances
  • Medium-sized (8-16GB per executor)
    • spark.executor.memory
    • spark.executor.cores
    • spark.yarn.executor.memoryOverhead
• Storage level
  • MEMORY_ONLY or MEMORY_ONLY_SER (for performance)
• Serialization
  • spark.serializer <KryoSerializer>
  • spark.kryo.classesToRegister <classes>
• Dynamic allocation
  • spark.dynamicAllocation.minExecutors 1
  • spark.dynamicAllocation.maxExecutors spark.executor.instances
  • spark.scheduler.maxRegisteredResourcesWaitingTime
• Others
  • spark.default.parallelism > 4 x num_cores
  • spark.shuffle.consolidateFiles true
  • spark.driver.maxResultSize 0
  • spark.shuffle.memoryFraction 0.8 (if no RDDs cached)
  • yarn.nodemanager.resource.memory-mb <node memory>
  • yarn.nodemanager.resource.cpu-vcores <node num-cores>

Page 19:

Configuration and tuning

• Micro-batch size
  • Application dependent
  • Large enough so the queue remains bounded
  • Latency = processing time + wait_in_queue + f(batch size)
  • Processing time should be < batch size
• Level of parallelism
  • spark.streaming.blockInterval
  • Num tasks = batch size / block interval
  • Block interval should be small enough to permit sufficient parallelism
  • High data-receive parallelism
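The task-count rule above is simple arithmetic worth making concrete. In the sketch below the batch and interval values are hypothetical; 200 ms is Spark's default spark.streaming.blockInterval:

```scala
// Arithmetic from the slide: with the receiver-based API, each block interval
// produces one block (hence one task) per batch. Values are hypothetical.
object BlockIntervalSketch {
  def numTasksPerBatch(batchMs: Long, blockIntervalMs: Long): Long =
    batchMs / blockIntervalMs

  def main(args: Array[String]): Unit = {
    // A 2-second batch with the default 200 ms block interval gives
    // 10 tasks per batch; fewer tasks than cores leaves CPUs idle.
    println(numTasksPerBatch(batchMs = 2000, blockIntervalMs = 200))
  }
}
```

If 10 tasks cannot keep the cluster busy, either lower the block interval or repartition the stream before the heavy transformations.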

Page 20:

Configuration and tuning

• For the receiver-based API
  • num_receivers = K * num_nodes
    • Typically 2 <= K <= 4; K = 1 to (cores_per_node / 2)
  • num_kafka_partitions = num_receivers
  • spark.streaming.receiver.maxRate
• For the direct API
  • num_kafka_partitions = total_cores to (total_cores * 4)
  • Alternatively, num_kafka_partitions = K * num_nodes
  • With repartitioning: num_RDD_partitions = N * total_cores (1 <= N <= 2)
  • spark.streaming.kafka.maxRatePerPartition
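The sizing rules on this slide also reduce to arithmetic. A hedged sketch with invented helper names; the constants come straight from the bullets, and the concrete cluster figures are hypothetical:

```scala
// Sizing rules of thumb from the slide as executable arithmetic.
// Helper names are invented; constants come from the bullets above.
object SizingSketch {
  // Receiver-based API: K receivers per node (typically 2 <= K <= 4),
  // and one Kafka partition per receiver.
  def numReceivers(numNodes: Int, k: Int): Int = numNodes * k

  // Direct API with repartitioning: N * total_cores RDD partitions.
  def numRddPartitions(totalCores: Int, n: Int): Int = {
    require(n >= 1 && n <= 2, "slide suggests 1 <= N <= 2")
    n * totalCores
  }

  def main(args: Array[String]): Unit = {
    println(numReceivers(numNodes = 5, k = 2))        // 10 receivers -> 10 Kafka partitions
    println(numRddPartitions(totalCores = 80, n = 2)) // 160 RDD partitions
  }
}
```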

Page 21:

Results for Spark Streaming + Kafka: Direct API

• Throughput (MB/sec/node) as a function of batch size
• Processing time ~= batch size
• Scales nicely
• Higher throughput for 1 worker
• Throughput falls significantly for smaller batch sizes

[Chart: throughput (MB/sec/node), 0-35, vs. batch size (2, 1, 0.5, 0.4, 0.3 sec) for 1, 3, and 5 worker nodes]

Page 22:

Results for Spark Streaming + Kafka: Receiver-based API

• Throughput (MB/sec/node) as a function of batch size, without WAL
• Processing time ~= batch size
• Scales nicely
• Higher throughput for 1 worker
• Throughput falls significantly for smaller batch sizes

[Chart: throughput (MB/sec/node), 0-40, vs. batch size (2, 1, 0.5, 0.4, 0.3 sec) for 1, 3, and 5 worker nodes]

Page 23:

Conclusions

• High throughput (25-30 MB/sec/node)
  • For complex events
• Sub-second latencies achievable
  • Smallest batch size: 300 msec
• Similar performance for YARN and standalone
• Direct API recommended
  • No receivers needed
  • Easier to code and reason about parallelism
  • Leverages Kafka's data recovery
  • Similar performance as the receiver-based API, despite repartitioning
  • Stable throughput

Please note that the results depend on hardware specs, software configuration, and the nature of the workload.

Page 24:

Thank You

Nishkam Ravi
[email protected]