meetup 18 10 17 - inovex · Spark on Kubernetes with HDFS Johannes M. Scheuermann Karlsruhe,...

Data-aware schedulingSpark on Kubernetes with HDFS

Johannes M. Scheuermann

Karlsruhe, 18.10.2017

Johannes M. ScheuermannIT Engineering & Operations @ inovex

〉 Software-Defined Datacenters

〉 Infrastructure as Code

〉 Cloud technologies

〉 High Availability & Scalability

〉 Want‘s an IBM Z-Frame

〉 @johscheuer

Data-aware scheduling

• Why data-aware scheduling

• Data-ware for non Big-Data application

• Data-ware scheduler

• Big Data on Kubernetes• Spark on Kubernetes

• HDFS on Kubernetes

Agenda

• I’m not a scheduling expert

• Concept is/was a PoC

• Share learnings/ideas

• Get feedback from the community

Spoiler(s) J

Data-locality

Why data-locality?

Data-aware scheduling for non Big-Data

• Databases

• (large) image processing

• Video encoding

• (Web)-Cache

• Distributed (parallel) POSIX file system• Any workload with high performance (incl. throughput,

databases, small files)

• Can be deployed in containers, on kubelet hosts.• Linearly scalable performance.

• Fully fault-tolerant, split-brain safe

Quobyte – What is Quobyte

Quobyte - Architecture

• Metadata servers make placement decisions against

policies• on file level

• tiering, isolation, …

• keep stripes of files on disks of same machine => enable local read

• allow preferring writes to local storage servers => enable local

• Locality information can be retrieved per file• that’s where the scheduler hooks in

Quobyte - Placement

Running multiple schedulers

• Specify wanted Data

• Lookup Data Placement

• Remapping if Storage runs in Containers

• Schedule Pod

Scheduling data-aware (file-based)

Scheduler Architecture (4000ft)

Scheduler Architecture (1000ft)

Scheduler Architecture (containerized)

Benchmarks

(Spark) Big-Data on Kubernetes

• https://github.com/apache-spark-on-k8s/spark

• Not the “faked” Spark on Kubernetes

• Still in development

• Still not in the official Apache Spark project

• Current: v2.2.0-0.4.0 • Alpha/Beta ?!

Spark on Kubernetes

Spark Core

MesosYARNStandaloneKubernetes

StreamingMLlibSparkSQLGraphX

• Integrates with Kubernetes• RBAC

• Resource Quotas

• Audit logging

• Etc.

• Only cluster-mode

Spark on Kubernetes

Spark on Kubernetes (cluster-mode)

Spark + HDFS on Kubernetes

Driver 1

Driver 2

Executor 1

Executor 2.1

HDFS NN

HDFS DN 1 HDFS DN 2

PodNetwork

Host Network

Executor 2.2

• Rack-locality

• Node preferences

• Priority-based scheduling (K8s 1.8 alpha)

• NameNode HA

• Kerberos support

Missing pieces

Conclusions

• Good starting point

• Good integration

• Still some points open

• Work for better integration (more general)

• Play with it!

Conclusions

We are hiring!

www.inovexperts.com

• https://issues.apache.org/jira/browse/SPARK-18278

• https://www.youtube.com/watch?v=0xRHONrWwvU&

feature=youtu.be

• https://www.youtube.com/watch?v=DxCDxi08HWo&f

eature=youtu.be

meetup 18 10 17 - inovex · Spark on Kubernetes with HDFS Johannes M. Scheuermann Karlsruhe,...

Documents

Spark and Spark SQL - Amir H. Payberah · Spark and Spark SQL Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 1 / 71

McDonough Spark Tutorial Spark Summit 2013

Learning spark ch10 - Spark Streaming

S U M M I T - Amazon Web Services... · Task2/Slide1 Task Dispatcher Spark Driver Spark Worker Spark Worker Spark Worker - Spark Driver provisioning - Task parameters - Spark Workers

Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel

Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Release 0.4.0 William H. Guss, Brandon Houghton

Budapest Spark Meetup - Apache Spark @enbrite.ly

PhalconEye Documentation - Read the Docsmedia.readthedocs.org/pdf/phalconeye/latest/phalconeye.pdf · •WordPress 3.9.1 •Joomla 3.3.0 •PhalconEye 0.4.0 5. PhalconEye Documentation,

Release 0.4.0 ibus-bogo Development Team

Spark Plug Thread Repair Spark Plug Spark Plug Sockets for Ford

Spark Concepts - Spark SQL, Graphx, Streaming

spectral - media.readthedocs.org filespectral , 0.4.0 The spectral-cube package provides an easy way to read, manipulate, analyze, and

Release 0.4.0 ibus-bogo Development Team · ibus-bogo Documentation, Release 0.4.0 ibus-bogo là mt engine x lý gõ ting Vit choIBus, mt phn mm qun lý các b gõ trong GNU/Linux

Spark Infrastructure - Australian Securities Exchange · Spark Infrastructure represents Spark Infrastructure Trust and its consolidated entities. Spark Infrastructure RE Limited

Project EBSF 2 Stuttgart, 18.10.2017 Dr. Mario Bareiß ... 5_EvoBus_public.pdf · Project EBSF_2 Stuttgart, 18.10.2017 Dr. Mario Bareiß, Johannes Klampfl EvoBus GmbH

django-html5-appcache Documentation · 2019. 4. 2. · django-html5-appcache Documentation, Release 0.4.0 This document refers to version 0.4.0 Application to manageHTML5 Appcache

Spark SQL | Apache Spark

Spark SQL and DataFrames Spark GraphX Spark Mlib Spark ...Spark GraphX! Spark Mlib! Spark Streaming Lightning-fast cluster computing. Chaining transformations 2. ... Covert RDD to

Industry Agenda Infrastructure and Urban Development ... · 18.10.2017 · Dakhil Community Lead Infrastructure and Urban Development Industries Carla Sequeira Community Specialist