Introduction to Kubeflow - Oliver Wyman · Introduction to Kubeflow aronchick@ Machine Learning is...

Preview:

Citation preview

Introduction to Kubeflowaronchick@

Machine Learning is a way of solving problems without explicitly knowing how to create the solution

Google DC Ops

PUE == Power Usage Effectiveness

PUE == Power Usage Effectiveness

PUE == Power Usage Effectiveness

PUE == Power Usage Effectiveness

But...

Most FolksMagical

AIGoodness

LOTS OFPAIN

Why the Gap?

Composability

Portability

Scalability

Buildinga

Model

Logging

DataIngestion

DataAnalysis

DataTransform

-ation

DataValidation

Data Splitting

Trainer ModelValidation

TrainingAt Scale

Roll-out Serving Monitoring

Composability

Portability

Each ML Stage is an Independent System

System 6System 5

System 4

TrainingAt Scale

System 3System 1

DataIngestion

DataAnalysis

DataTransform

-ation

DataValidation

System 2

Buildinga

Model

ModelValidation

Serving LoggingMonitoringRoll-out

Data Splitting

Trainer

Portability

Portability

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

HW

Model

Laptop

Portability

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

HW

Model

Laptop

Portability

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

HW HW

Model Model

Laptop Training Rig

Portability

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

HW HW HW

Model Model Model

Laptop Training Rig Cloud

Scalability● Machine specific HW (GPU)● Limited (or unlimited) compute● Network & storage constraints

○ Rack, Server Locality○ Bandwidth constraints

● Heterogeneous hardware● HW & SW lifecycle management● Scale isn’t JUST about adding new

machines!○ Intern vs Researcher○ Scale to 1000s of experiments

You Know What’s Really Good at Composability,

Portability, and Scalability?

Containers and Kubernetes

Kubernetes

NFSCeph

CassandraMySQL

SparkAirflow

TensorflowCaffe

TF-ServingFlask+Scikit

Operating system (Linux, Windows)

CPU Memory DiskSSD GPU FPGA ASIC NIC

Jupyter

Quota

RBACMonitoring

Logging

GCP AWS Azure On-prem

Namespace

Kubernetes for ML

Kubernetes for ML● Supports accelerators in an extensible manner

○ GPUs already in progress○ Support for FPGAs, high perf NICs under discussion

● Existing Controllers simplify devops challenges○ K8S Jobs for Training○ K8S Deployments for Serving

● Handles 1000s of nodes● Container base images for ML workloads

But Wait, There’s More!● Kubernetes native scaling objects

○ Autoscaling cluster based on workload metrics○ Priority eviction for removal of low priority jobs○ Scaled to large number of pods (experiments)

● Passes through cluster specs for specific needs○ Scheduling jobs where the data needed to run them is○ Node labels for Heterogeneous HW (more in the future)○ Manage SW drivers and HW health via addons

But...

Oh, you want to use ML on K8s?

Before that, can you become an expert in:● Containers● Packaging● Kubernetes service endpoints● Persistent volumes● Scaling● Immutable deployments● GPUs, Drivers & the GPL● Cloud APIs● DevOps● ...

Kubeflow

Make it Easy for Everyone to Learn, Deploy and Manage Portable, Distributed ML

on Kubernetes(Everywhere)

Kubernetes + ML = Kubeflow = Win● Composability

○ Choose from existing popular tools○ Uses ksonnet packaging for easy setup

● Portability○ Build using cloud native, portable Kubernetes APIs○ Let K8s community solve for your deployment

● Scalability○ TF already supports CPU/GPU/distributed○ K8s scales to 5k nodes with same stack

Portability

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

HW HW HW

Model Model Model

Laptop Training Rig Cloud

Portability

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

HW HW HW

Model Model Model

Laptop Training Rig Cloud

Portability

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

HW HW HW

Model Model Model

Laptop Training Rig Cloud

Kubeflow

Portability

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

HW HW HW

Model Model Model

Laptop Training Rig Cloud

Kubeflow

Kubeflow

Storage

Framework

Tooling

UX

Model

Portability

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Drivers

OS

Accelerator

Runtime

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

HW HW HW

Model Model

Laptop Training Rig Cloud

Kubeflow

Kubeflow Kubeflow

Storage

Framework

Tooling

UX

Model

Storage

Framework

Tooling

UX

Model

Portability

Storage

Drivers

OS

Accelerator

Runtime

Framework

Tooling

UX

Drivers

OS

Accelerator

Runtime

Drivers

OS

Accelerator

Runtime

HW HW HW

Model

Laptop Training Rig Cloud

Kubeflow

Kubeflow Kubeflow Kubeflow

What’s in the Box?● Jupyter Hub - for collaborative & interactive training● A TensorFlow Training Controller● A TensorFlow Serving Deployment● Argo for workflows● SeldonCore for complex inference and non TF models● Reverse Proxy (Ambassador)● Wiring to make it work on any Kubernetes anywhere

LoggingRoll-out Monitoring

DataIngestion

DataAnalysis

DataTransform

-ation

DataValidation

Data Splitting

What’s in the Box?

Buildinga

ModelTrainer Model

ValidationTrainingAt Scale

Serving

Using Kubeflow# Initialize a ksonnet APPAPP_NAME=my-kubeflowks init ${APP_NAME}cd ${APP_NAME}

# Install Kubeflow componentsks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflowks pkg install kubeflow/coreks pkg install kubeflow/tf-servingks pkg install kubeflow/tf-job

# Deploy KubeflowNAMESPACE=kubeflowkubectl create namespace ${NAMESPACE}ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}ks apply default -c kubeflow-core

Don’t Like TensorFlow?# Initialize a ksonnet APPAPP_NAME=my-kubeflowks init ${APP_NAME}cd ${APP_NAME}

# Install Kubeflow componentsks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflowks pkg install kubeflow/coreks pkg install kubeflow/tf-servingks pkg install kubeflow/tf-job

# Deploy KubeflowNAMESPACE=kubeflowkubectl create namespace ${NAMESPACE}ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}ks apply default -c kubeflow-core

ks pkg install kubeflow/sklearn-job # Soon

Don’t Like TF Serving?# Initialize a ksonnet APPAPP_NAME=my-kubeflowks init ${APP_NAME}cd ${APP_NAME}

# Install Kubeflow componentsks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflowks pkg install kubeflow/coreks pkg install kubeflow/tf-servingks pkg install kubeflow/tf-job

# Deploy KubeflowNAMESPACE=kubeflowkubectl create namespace ${NAMESPACE}ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}ks apply default -c kubeflow-core

ks pkg install kubeflow/seldon-core # Soon

That’s It?

Yes…(For Now)

Yes…(For Now)

Yes…(For Now)

We’re Just Getting Started!

● Who’s helping?○ Redhat, Weave, CaiCloud, Canonical, many more

● What’s next...○ Easy to use accelerator integration○ Support for other popular tools like Spark ML, XGBoost, sklearn○ Autoscaled TF Serving○ tf.transform (programmatic data transforms)

● You tell us! (Or better yet, help!)

Kubeflow is Open- open community- open design- open source- open to ideas

https://github.com/kubeflow/kubeflowslack: kubeflow (http://kubeflow.slack.com)

twitter: @kubeflow@aronchick (aronchick@google.com)

@jeremylewi (jlewi@google.com)`

● As a data scientist, you want to use the right HW for the job

● Every variation is an opportunity for pain○ GPUs/FPGAs, ASICs, NICs ○ Kernel drivers, libraries, performance

● Even within an ML frameworks dependencies cause chaos○ Package management○ ML compilation

Container

Kernel

GPU FPGA Infiniband

Drivers

LibraryApp

App

Portability

Recommended