meetup 18 10 17 - inovex · Spark on Kubernetes with HDFS Johannes M. Scheuermann Karlsruhe, 18.10.2017. ... • Still not in the official Apache Spark project • Current: v2.2.0-0.4.0

Data-aware schedulingSpark on Kubernetes with HDFS

Johannes M. Scheuermann

Karlsruhe, 18.10.2017

Johannes M. ScheuermannIT Engineering & Operations @ inovex

〉 Software-Defined Datacenters

〉 Infrastructure as Code

〉 Cloud technologies

〉 High Availability & Scalability

〉 Want‘s an IBM Z-Frame

〉 @johscheuer

2

Data-aware scheduling

• Why data-aware scheduling

• Data-ware for non Big-Data application

• Data-ware scheduler

• Big Data on Kubernetes• Spark on Kubernetes

• HDFS on Kubernetes

Agenda

• I’m not a scheduling expert

• Concept is/was a PoC

• Share learnings/ideas

• Get feedback from the community

Spoiler(s) J

Data-locality

Why data-locality?

Data-aware scheduling for non Big-Data

• Databases

• (large) image processing

• Video encoding

• (Web)-Cache

• Distributed (parallel) POSIX file system• Any workload with high performance (incl. throughput,

databases, small files)

• Can be deployed in containers, on kubelet hosts.• Linearly scalable performance.

• Fully fault-tolerant, split-brain safe

Quobyte – What is Quobyte

Quobyte - Architecture

• Metadata servers make placement decisions against

policies• on file level

• tiering, isolation, …

• keep stripes of files on disks of same machine => enable local read

• allow preferring writes to local storage servers => enable local

write

• Locality information can be retrieved per file• that’s where the scheduler hooks in

Quobyte - Placement

Running multiple schedulers

• Specify wanted Data

• Lookup Data Placement

• Remapping if Storage runs in Containers

• Schedule Pod

Scheduling data-aware (file-based)

Scheduler Architecture (4000ft)

Scheduler Architecture (1000ft)

Scheduler Architecture (containerized)

Benchmarks

(Spark) Big-Data on Kubernetes

• https://github.com/apache-spark-on-k8s/spark

• Not the “faked” Spark on Kubernetes

• Still in development

• Still not in the official Apache Spark project

• Current: v2.2.0-0.4.0 • Alpha/Beta ?!

Spark on Kubernetes

Spark on Kubernetes

Spark Core

MesosYARNStandaloneKubernetes

StreamingMLlibSparkSQLGraphX

• Integrates with Kubernetes• RBAC

• Resource Quotas

• Audit logging

• Etc.

• Only cluster-mode

Spark on Kubernetes

Spark on Kubernetes (cluster-mode)

Spark + HDFS on Kubernetes

Driver 1

Driver 2

Executor 1

Executor 2.1

HDFS NN

HDFS DN 1 HDFS DN 2

PodNetwork

Host Network

Executor 2.2

Demo

• Rack-locality

• Node preferences

• Priority-based scheduling (K8s 1.8 alpha)

• NameNode HA

• Kerberos support

Missing pieces

Conclusions

• Good starting point

• Good integration

• Still some points open

• Work for better integration (more general)

• Play with it!

Conclusions

28

We are hiring!

www.inovexperts.com

Q&A

• https://issues.apache.org/jira/browse/SPARK-18278

• https://www.youtube.com/watch?v=0xRHONrWwvU&

feature=youtu.be

• https://www.youtube.com/watch?v=DxCDxi08HWo&f

eature=youtu.be

Further reading

Johannes M. Scheuermanninovex GmbH

[email protected]

CC BY-NC-ND inovex.de +JohannesScheuermann

github.com/johscheuer

@johscheuer youtube.com/inovexGmbH

Documents

meetup 18 10 17 - inovex · Spark on Kubernetes with HDFS Johannes M. Scheuermann Karlsruhe, 18.10.2017. ... • Still not in the official Apache Spark project • Current: v2.2.0-0.4.0