Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Data-aware schedulingSpark on Kubernetes with HDFS
Johannes M. Scheuermann
Karlsruhe, 18.10.2017
Johannes M. ScheuermannIT Engineering & Operations @ inovex
〉 Software-Defined Datacenters
〉 Infrastructure as Code
〉 Cloud technologies
〉 High Availability & Scalability
〉 Want‘s an IBM Z-Frame
〉 @johscheuer
2
Data-aware scheduling
• Why data-aware scheduling
• Data-ware for non Big-Data application
• Data-ware scheduler
• Big Data on Kubernetes• Spark on Kubernetes
• HDFS on Kubernetes
Agenda
• I’m not a scheduling expert
• Concept is/was a PoC
• Share learnings/ideas
• Get feedback from the community
Spoiler(s) J
Data-locality
Why data-locality?
Data-aware scheduling for non Big-Data
• Databases
• (large) image processing
• Video encoding
• (Web)-Cache
• Distributed (parallel) POSIX file system• Any workload with high performance (incl. throughput,
databases, small files)
• Can be deployed in containers, on kubelet hosts.• Linearly scalable performance.
• Fully fault-tolerant, split-brain safe
Quobyte – What is Quobyte
Quobyte - Architecture
• Metadata servers make placement decisions against
policies• on file level
• tiering, isolation, …
• keep stripes of files on disks of same machine => enable local read
• allow preferring writes to local storage servers => enable local
write
• Locality information can be retrieved per file• that’s where the scheduler hooks in
Quobyte - Placement
Running multiple schedulers
• Specify wanted Data
• Lookup Data Placement
• Remapping if Storage runs in Containers
• Schedule Pod
Scheduling data-aware (file-based)
Scheduler Architecture (4000ft)
Scheduler Architecture (1000ft)
Scheduler Architecture (containerized)
Benchmarks
(Spark) Big-Data on Kubernetes
• https://github.com/apache-spark-on-k8s/spark
• Not the “faked” Spark on Kubernetes
• Still in development
• Still not in the official Apache Spark project
• Current: v2.2.0-0.4.0 • Alpha/Beta ?!
Spark on Kubernetes
Spark on Kubernetes
Spark Core
MesosYARNStandaloneKubernetes
StreamingMLlibSparkSQLGraphX
• Integrates with Kubernetes• RBAC
• Resource Quotas
• Audit logging
• Etc.
• Only cluster-mode
Spark on Kubernetes
Spark on Kubernetes (cluster-mode)
Spark + HDFS on Kubernetes
Driver 1
Driver 2
Executor 1
Executor 2.1
HDFS NN
HDFS DN 1 HDFS DN 2
PodNetwork
Host Network
Executor 2.2
Demo
• Rack-locality
• Node preferences
• Priority-based scheduling (K8s 1.8 alpha)
• NameNode HA
• Kerberos support
Missing pieces
Conclusions
• Good starting point
• Good integration
• Still some points open
• Work for better integration (more general)
• Play with it!
Conclusions
28
We are hiring!
www.inovexperts.com
Q&A
• https://issues.apache.org/jira/browse/SPARK-18278
• https://www.youtube.com/watch?v=0xRHONrWwvU&
feature=youtu.be
• https://www.youtube.com/watch?v=DxCDxi08HWo&f
eature=youtu.be
Further reading
Johannes M. Scheuermanninovex GmbH
CC BY-NC-ND inovex.de +JohannesScheuermann
github.com/johscheuer
@johscheuer youtube.com/inovexGmbH