GPU cloud with Job scheduler and Container


Serverless GPU Cloud with Job Scheduler and Container

Andrew Yongjoon Kong
Cloud Computing Cell, kakao
andrew.kong@kakaocorp.com

Who am I

Andrew Yongjoon Kong
• Cloud technical advisory for the Government Broadcast Agency
• Adjunct professor, Ajou Univ.
• Korea Database Agency acting professor for big data
• Member of the National Information Agency big data advisory committee
• Kakao → Daum Kakao → Kakaocorp, Cloud Computing Cell lead
• Talks
  • Embrace Clouds (2017, OpenStack Days, Korea)
  • Full Route Based Network with Linux (2016, netdev, Tokyo)
  • SDN without SDN (2015, OpenStack Summit, Vancouver)

Supervised the Korean editions; 2nd editions are coming…

Serverless computing is rising

Serverless computing, GPU

Serverless computing, GPU, Docker

Serverless framework

Lots of serverless frameworks:
• Apache OpenWhisk
• Iron.io
• OpenStack's Picasso
• Gestalt (based on DC/OS)
• Fission (based on Kubernetes)
• Runway (kakao's private FaaS)

What is the purpose of these frameworks?
• connecting, mostly
• flow and automation

Serverless framework

Connection is a great virtue in the public cloud:
• there is no resource depletion in the public cloud

• connection/automation is directly related to cost savings

• In a private cloud, engineers scream for resources (especially GPUs).

• The thing is, "winner takes it all" → take care with scheduling

Job scheduler

Scheduling users' jobs based on an algorithm:
• FIFO
• Fair Share
• Backfill
• Preemption
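As a minimal sketch of why backfill beats strict FIFO, assume a cluster with two free nodes and three hypothetical jobs; real backfill schedulers also check that a backfilled job cannot delay the reserved start of the head job, which this sketch omits:

```shell
#!/bin/sh
# Backfill sketch: two nodes free, head-of-queue job needs four.
# Strict FIFO would idle the cluster; backfill runs later jobs that fit.
# (Simplified: no reservation check for the held head job.)
backfill() {
  free=$1; shift
  for entry in "$@"; do            # entries look like name:nodes_needed
    name=${entry%%:*}; need=${entry##*:}
    if [ "$need" -le "$free" ]; then
      echo "run $name"
      free=$((free - need))
    else
      echo "hold $name"
    fi
  done
}
backfill 2 big:4 small:1 medium:2
# prints: hold big / run small / hold medium
```

Under FIFO the same queue would dispatch nothing until four nodes freed up; backfill keeps one node busy with `small` in the meantime.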

Job

A job comprises two parts:
• The resources
  • CPU, compute nodes, memory, disk, and even walltime
  • The job scheduling system manages the quota per queue, user, and user group
• The runnable execution
  • Traditionally, the executable command
  • e.g. saved_model_cli run --dir /tmp/saved_model_dir --tag_set serve

Job sample

Sample Job script

The traditional issue is how we distribute the command and the data (you can't specify a node in a batch system).

#!/bin/bash
#PBS -l nodes=1:ppn=2          # resource
#PBS -l walltime=00:00:59      # resource
cd /home/rcf-proj3/pv/test/    # execution
mkdir /test/test/dir
source /usr/usc/sas/default/setup.sh
sas my.sas
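The split between resources and execution works because `#PBS` lines are comments to the shell but directives to the scheduler, which parses them before the script body ever runs. A sketch of that parsing step (the temp file stands in for a submitted script):

```shell
#!/bin/sh
# The scheduler extracts resource requests from '#PBS -l' comment lines;
# the shell would simply ignore them when the body executes.
parse_pbs() {
  grep '^#PBS -l ' "$1" | sed 's/^#PBS -l //'
}
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
sas my.sas
EOF
parse_pbs "$tmp"   # prints: nodes=1:ppn=2  then  walltime=00:00:59
rm -f "$tmp"
```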

Job scheduler system layout

A shared file system can handle the file-locating issue.
→ But a shared filesystem is too expensive.
→ In a modern environment, it is much easier with containers.

http://beagle.ci.uchicago.edu/using-beagle/

This can be changed by containers and a registry.

Job scheduler system, GPU and Container

Add the GPU resource to the job script, use NVIDIA Docker for the command… then the scheduler will do the job:

#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
#PBS -l gpus=8
NV_GPU=$NV_GPU nvidia-docker run --net host -e PASSWORD=root -e USERNAME=root -e PORT=$PORT idock.daumkakao.io/dkos/nvidia-cuda-sshd:dev
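A rough sketch of what `nvidia-docker` does with `NV_GPU`: the comma-separated GPU ids the scheduler assigned become device mappings passed down to plain `docker`. This is simplified; the real wrapper also mounts the driver volume and control devices such as /dev/nvidiactl, and the ids here are hypothetical:

```shell
#!/bin/sh
# Simplified view of NV_GPU handling: each assigned GPU id turns into a
# --device flag so the container only sees its allocated GPUs.
gpu_devices() {
  devs=""
  for id in $(echo "$1" | tr ',' ' '); do
    devs="$devs --device=/dev/nvidia$id"
  done
  echo $devs   # unquoted on purpose: collapses the leading space
}
NV_GPU="0,1"
echo "docker run $(gpu_devices "$NV_GPU") ..."
# prints: docker run --device=/dev/nvidia0 --device=/dev/nvidia1 ...
```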

[Diagram: the master (scheduler) dispatches a job to GPU compute nodes, which pull the image from the docker registry]

AI Development Cycle over compute resource

[Diagram: develop the model on a personal env. → train the model at large scale with massive data on GPU compute nodes → inference through the model]

Abstract these into "job(resource, executor)". The output is abstracted into a container.

[Diagram: docker registry and master (scheduler) behind the JOB abstraction, feeding an AI service]
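One way to picture the "job(resource, executor)" abstraction is a declarative spec; the format below is purely illustrative (the actual Runway/DKOS schema is not shown in the slides), reusing the resource values and image from the earlier GPU job script:

```shell
#!/bin/sh
# Hypothetical job spec: resources on one side, executor (a container
# image) on the other. The field names are illustrative only.
job_spec() {
cat <<'EOF'
resources:
  nodes: 1
  gpus: 8
  walltime: "00:00:59"
executor:
  image: idock.daumkakao.io/dkos/nvidia-cuda-sshd:dev
EOF
}
job_spec
```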

BTW, you need GPUs and other IT resources to show your effort to the public as well.

And what about monitoring & alerts?

The good thing is that if you make your effort with containers → kakao cloud can help you.

kakao cloud

[Diagram: the management plane (service repo, service catalog, notification, scheduling) sits over IaaS: KRANE, the centralized measuring system: KEMI, and the centralized deploying system: DKOS; the data center control/data plane handles events/alerts, initial setup, and changes between IT operations and IT services]

Some Numbers about kakao cloud

                              2016.8    2017.9
projects                      1,563     2,xxx
pull requests (since 2014.9)  632       913
VMs created/deleted per day   ~88       ~100
total VMs                     8,703     17,xxx
active cores                  -         9x,xxx

Some information about kakao cloud (Kakaocorp)

2016.8: from Grizzly to Kilo, upgraded 5 times; 4 regions in total; additional services: Heat/Trove/Sahara

2017.10: from Grizzly to Mitaka, upgraded 7 times; 4 regions in total; Heat/Trove/Octavia/Barbican

KEMI, kakao's event monitoring/alert platform

[Diagram: KEMI stats and KEMI log collect monitoring data from physical servers, virtual instances, containers, and others (switches, logs); together with IMS (the kakao CMDB API), a rule engine, notification, and ETL, they form a data center information abstraction layer whose API drives predicting, scheduling, and control of data center (or service) management activity through OpenStack Heat and other service APIs]
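The rule-engine step in a KEMI-like pipeline can be sketched as a threshold check that turns a metric sample into either a notification or a no-op; the metric names and thresholds below are hypothetical, not KEMI's actual rules:

```shell
#!/bin/sh
# RuleEngine sketch: compare a metric sample against a threshold rule
# and emit an alert line when the rule fires. Names are illustrative.
check() { # check <metric> <value> <threshold>
  if [ "$2" -gt "$3" ]; then
    echo "ALERT $1=$2 exceeds $3"
  else
    echo "ok $1=$2"
  fi
}
check gpu_util 97 90       # prints: ALERT gpu_util=97 exceeds 90
check mem_used_pct 40 80   # prints: ok mem_used_pct=40
```

In the real platform the alert line would flow to the notification component rather than stdout.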

Deployment abstraction in Kakao, DKOS

DKOS Architecture

[Diagram: the user defines resources via a service catalogue; the centralized deploying system (DKOS), with a resource pool queue, scheduler, and manager, deploys onto VMs, PMs, and containers in the data center; various services run over DKOS]

DKOS Situation

• Active clusters: 3 digits

• Total compute nodes: 4 digits (VM + PM)

• Container count: 5 digits

• Managed by?

DKOS Situation

• Why use DKOS (containers)?
  • Containers are easy
  • Containers are cool
  • DC/OS is great

• Nope!
  • It is the very summit of an integrated/automated infra service API:
    • authentication, authorization, compute resources, network, volumes
    • metering, logging
    • monitoring, notifications

kakao cloud now supports GPU as well

[Diagram: the same kakao cloud platform layout as before: management plane (service repo, service catalog, notification, scheduling) over KRANE, KEMI, and DKOS]

Thanks

Where are you, from a CMMI-Cloud perspective?

For CMM4, it's time to embrace clouds, not a cloud

CMM0: legacy (output: cloud TF)
CMM1: self-service dev resources (output: KRANE, the OpenStack cloud)
CMM2: limited prod resources (output: KEMI, MaaS)
CMM3: automated cloud usage (output: DKOS, CaaS)
CMM4: manual cloud usage (output: --)
CMM5: federated cloud usage (output: --)
