26
Spark on YARN: The Road Ahead Marcelo Vanzin

Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Embed Size (px)

Citation preview

Page 1: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Spark on YARN: The Road Ahead Marcelo Vanzin

Page 2: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

What to expect... -  What is YARN? -  Why YARN for Spark? -  Current state and future direction for:

-  Resource Management -  Security

Page 3: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

What is YARN? Resource management and negotiation -  Manages vcores and memory -  Multiple users -  Multiple apps -  Advanced features

Page 4: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Why YARN for Spark? -  Advanced resource management features -  Dynamic allocation for Spark executors -  Security

Page 5: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Resource Management Lots of knobs to turn, lots of trial and error.

yarn.nodemanager.resource.memory-­‐mb  

Executor  Container  

spark.yarn.executor.memoryOverhead  (10%)  

spark.executor.memory  

spark.shuffle.memoryFracBon  (0.4)   spark.storage.memoryFracBon  (0.6)  

Page 6: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

RM (Cont’d) Dynamic allocation

-  No need to specify number of executors! -  Application grows and shrinks based on

outstanding task count -  Still need to specify everything else…

Page 7: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

SparkContext

Job 1

Task Task

Executor 1

Task

Task Executor 2

Task

Dynamic Allocation

Page 8: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Dynamic Allocation (Cont’d) How to make dynamic allocation better?

-  Reduce allocation latency -  Data locality hints -  Handle cached RDDs

Page 9: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Data Locality -  Allocate executors

close to data -  More predictable

performance

-  SPARK-4352

Page 10: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Cached RDDs -  Short term: avoid discarding cached data

-  Keep executors around. -  Long term:

-  Cache rebalance -  Container resizing -  Off-heap caching

Page 11: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Memory Sizing How to make application sizing easier? -  Soft container limits (YARN-3119) -  Overhead memory -  Simplified parameters

Page 12: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Overhead Memory Everything that is not the Java heap. Very opaque, hard to size correctly. -  Better metrics to understand usage -  Better heuristics for automatic sizing

Page 13: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Future: Simplified Allocation Single parameter: “task size”. Spark figures out the rest (vcores, heap, overhead). Container sizes can change to match available resources, data locality, etc.

task-size: s Container request: vcores = x memory = y

Page 14: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

RM Summary

“Dynamic allocation is not for everybody. It can only meet the needs of 90% of

applications.”

Page 15: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Security: Kerberos -  User authentication via Kerberos -  Supported in all services (YARN, HDFS,

HBase, Hive, etc) -  Uses per-service delegation tokens

Page 16: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Security: Delegation Tokens Delegation tokens allow for authentication without running into KDC limitations.

Page 17: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Tokens (Cont’d) Tokens have a TTL. This is a problem for long-lived applications -  Spark Streaming -  Thrift Server Solution: re-generate tokens periodically.

Page 18: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Tokens (Cont’d) Generating tokens requires a kerberos ticket -  Spark driver handles KDC login -  Need access to user credentials (keytab) Secure if coupled with encryption.

Page 19: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Security: Encryption -  HDFS supports wire encryption -  YARN supports wire encryption -  Spark not so much

Page 20: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Encryption (Cont’d) Where does Spark need encryption? -  Control plane -  File distribution -  Block Manager -  User UI / REST API -  Data-at-rest (shuffle files)

Page 21: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Encryption (Cont’d) Goal: easy to configure encryption. -  SSL for HTTP endpoints -  SASL everywhere else In 1.4: SASL available for shuffle data, SSL available for Akka / file distribution.

Page 22: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Spark UI Support SSL for HTTP UI / API -  SPARK-2750 -  Outstanding issues:

-  Certificate distribution (cluster mode) -  Shared certificates

Page 23: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Spark RPC -  Control plane encryption

-  Current: Akka-over-SSL -  Future: RPC-over-SASL -  SPARK-6028, SPARK-6230

-  File distribution -  Replace HTTP channel with RPC

Page 24: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Data-at-Rest Local shuffle files, mostly. -  Now: configure “local dirs” on encrypted

volumes -  Later: Spark itself encrypts shuffle files

(SPARK-5682).

Page 25: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Summary -  YARN provides lots of useful features -  Not always easy to use -  Focus on:

-  Usability -  Security -  Performance

Page 26: Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Questions?