Data-aware scheduling Spark on Kubernetes with HDFS Johannes M. Scheuermann Karlsruhe, 18.10.2017

Data-aware scheduling

Spark on Kubernetes with HDFS

Johannes M. Scheuermann

Karlsruhe, 18.10.2017

Johannes M. Scheuermann

IT Engineering & Operations @ inovex

〉 Software-Defined Datacenters

〉 Infrastructure as Code

〉 Cloud technologies

〉 High Availability & Scalability

〉 Wants an IBM Z-Frame

〉 @johscheuer

Data-aware scheduling

Agenda

• Why data-aware scheduling

• Data-aware scheduling for non-Big-Data applications

• Data-aware scheduler

• Big Data on Kubernetes

  • Spark on Kubernetes

  • HDFS on Kubernetes

Spoiler(s) :)

• I’m not a scheduling expert

• Concept is/was a PoC

• Share learnings/ideas

• Get feedback from the community

Data-locality

Why data-locality?

Data-aware scheduling for non Big-Data

• Databases

• (large) image processing

• Video encoding

• (Web)-Cache

Quobyte – What is Quobyte

• Distributed (parallel) POSIX file system

• Any workload with high performance (incl. throughput, databases, small files)

• Can be deployed in containers, on kubelet hosts

• Linearly scalable performance

• Fully fault-tolerant, split-brain safe

Quobyte - Architecture

Quobyte - Placement

• Metadata servers make placement decisions against policies

  • on file level

  • tiering, isolation, …

• keep stripes of files on disks of same machine => enable local read

• allow preferring writes to local storage servers => enable local write

• Locality information can be retrieved per file

  • that’s where the scheduler hooks in
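
As a rough illustration of how a scheduler could consume per-file locality information, the sketch below ranks storage hosts by how many bytes of a file they hold. The shape of the locality data (a list of host/byte-count pairs) and every name in it are hypothetical; the slides do not show how Quobyte actually reports placement.

```python
from collections import Counter


def rank_hosts_by_locality(segments):
    """Return hosts ordered by the number of bytes of the file they store.

    `segments` is a hypothetical list of (host, byte_count) pairs, one per
    stripe or replica of the file. It is NOT Quobyte's real API output.
    """
    bytes_per_host = Counter()
    for host, byte_count in segments:
        bytes_per_host[host] += byte_count
    # Most bytes first; host name breaks ties so the order is stable.
    return sorted(bytes_per_host, key=lambda h: (-bytes_per_host[h], h))


# Made-up placement data for a single file.
segments = [
    ("storage-a", 64),
    ("storage-b", 32),
    ("storage-a", 64),
    ("storage-c", 32),
]
ranking = rank_hosts_by_locality(segments)
print(ranking)
```

A scheduler could then try the hosts in this order and place the pod on the first one that has free capacity.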

Running multiple schedulers
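
When several schedulers run side by side, a pod selects one through the `spec.schedulerName` field of its manifest. The sketch below builds such a manifest as a plain Python dictionary; the pod name, image, and scheduler name are placeholders chosen for illustration, not values from the talk.

```python
import json


def pod_manifest(name, image, scheduler_name):
    """Build a minimal Pod manifest that requests a specific scheduler."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pods that omit this field are handled by the default scheduler.
            "schedulerName": scheduler_name,
            "containers": [{"name": name, "image": image}],
        },
    }


# All three arguments are hypothetical example values.
manifest = pod_manifest(
    "image-worker", "example/image-worker:latest", "data-aware-scheduler"
)
print(json.dumps(manifest, indent=2))
```

Pods that name a scheduler which is not running simply stay pending, so a custom scheduler can be rolled out without affecting workloads that rely on the default one.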

Scheduling data-aware (file-based)

• Specify wanted data

• Look up data placement

• Remap if storage runs in containers

• Schedule pod
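
The four steps can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the PoC's implementation: the placement table, the storage-pod-to-node mapping, and the fallback rule are all hypothetical. The only Kubernetes-specific part is the `Binding` object, which is what a scheduler submits to the pod's `binding` subresource to assign it to a node.

```python
def choose_node(wanted_file, placement, storage_pod_to_node, schedulable_nodes):
    """Pick a node for a pod that wants `wanted_file`.

    placement:           file -> storage endpoints holding it, best first
    storage_pod_to_node: storage pod name -> Kubernetes node it runs on
    schedulable_nodes:   set of nodes that can currently accept the pod
    All three inputs are hypothetical stand-ins for real lookups.
    """
    for endpoint in placement.get(wanted_file, []):
        # Remapping: a containerized storage server is known by its pod
        # name, so translate it to the node that hosts that pod.
        node = storage_pod_to_node.get(endpoint, endpoint)
        if node in schedulable_nodes:
            return node
    # No data-local node is available: fall back to any schedulable node.
    return sorted(schedulable_nodes)[0] if schedulable_nodes else None


def binding(pod_name, node_name):
    """Build the Binding object that assigns a pod to a node."""
    return {
        "apiVersion": "v1",
        "kind": "Binding",
        "metadata": {"name": pod_name},
        "target": {"apiVersion": "v1", "kind": "Node", "name": node_name},
    }


# Made-up example data.
placement = {"/videos/raw.mp4": ["storage-pod-1", "storage-pod-2"]}
storage_pod_to_node = {"storage-pod-1": "node-3", "storage-pod-2": "node-1"}
schedulable = {"node-1", "node-2"}

node = choose_node("/videos/raw.mp4", placement, storage_pod_to_node, schedulable)
print(binding("encoder-pod", node))
```

In this example the best replica lives on `node-3`, which is not schedulable, so the pod lands on `node-1`, where the second replica is stored.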

Scheduler Architecture (4000ft)

Scheduler Architecture (1000ft)

Scheduler Architecture (containerized)

Benchmarks

(Spark) Big-Data on Kubernetes

Spark on Kubernetes

• https://github.com/apache-spark-on-k8s/spark

• Not the “faked” Spark on Kubernetes

• Still in development

• Still not in the official Apache Spark project

• Current: v2.2.0-0.4.0

  • Alpha/Beta?!

Spark on Kubernetes

Spark Core

Cluster managers: Mesos, YARN, Standalone, Kubernetes

Libraries: Streaming, MLlib, Spark SQL, GraphX

Spark on Kubernetes

• Integrates with Kubernetes

  • RBAC

  • Resource Quotas

  • Audit logging

  • Etc.

• Only cluster-mode

Spark on Kubernetes (cluster-mode)
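
To make cluster-mode more concrete, the sketch below assembles a `spark-submit` command line that targets a Kubernetes API server through a `k8s://` master URL. The API server address, main class, and jar path are placeholders, and a real submission with the fork typically needs further settings (for example container images) that are deliberately left out here; check the project documentation for the exact options of the version in use.

```python
import shlex


def spark_submit_args(api_server, main_class, app_jar, executors):
    """Assemble a cluster-mode spark-submit invocation for Kubernetes."""
    return [
        "bin/spark-submit",
        # The driver itself runs inside the cluster as a pod.
        "--deploy-mode", "cluster",
        # The k8s:// prefix tells Spark to talk to a Kubernetes API server.
        "--master", f"k8s://{api_server}",
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        app_jar,
    ]


# Placeholder values, not taken from the talk.
args = spark_submit_args(
    api_server="https://kubernetes.example.com:6443",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
    executors=2,
)
print(shlex.join(args))
```

The `local://` scheme indicates that the jar is already present inside the container image rather than on the submitting machine.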

Spark + HDFS on Kubernetes

[Diagram with the labels: Driver 1, Driver 2, Executor 1, Executor 2.1, Executor 2.2, HDFS NN, HDFS DN 1, HDFS DN 2, Pod Network, Host Network]

Demo

Missing pieces

• Rack-locality

• Node preferences

• Priority-based scheduling (K8s 1.8 alpha)

• NameNode HA

• Kerberos support

Conclusions

• Good starting point

• Good integration

• Still some points open

• Work for better integration (more general)

• Play with it!


We are hiring!

www.inovexperts.com

Q&A

Further reading

• https://issues.apache.org/jira/browse/SPARK-18278

• https://www.youtube.com/watch?v=0xRHONrWwvU&feature=youtu.be

• https://www.youtube.com/watch?v=DxCDxi08HWo&feature=youtu.be

Johannes M. Scheuermann

inovex GmbH

CC BY-NC-ND

inovex.de

+JohannesScheuermann

github.com/johscheuer

@johscheuer

youtube.com/inovexGmbH