CS 744: NEXUS
Shivaram Venkataraman, Fall 2021
Hi!
ADMINISTRIVIA
Course Project Proposals
- Due Oct 25 (Monday)!
- See Piazza for template

Midterm details
- Oct 28th: Includes papers from "Datacenter as a Computer" to Nexus
- Open book, open notes
- Held in class time, 9.30-10.45am Central Time
MACHINE LEARNING: INFERENCE

(ML training so far: PyTorch DDP → PipeDream)
EXAMPLE APPLICATION
Video analysis service
- Thousands of streams, thousands of tenants
- Each stream is processed by a DNN-based "query"
- Latency SLOs (10s to 100s of ms)

E.g., traffic cameras stream data to the cloud, and each tenant is running a different model. An object detector model finds bounding boxes and classifies them ("car").
GOAL: HIGH GPU UTILIZATION

Scheduler to determine placement and batching

Placement
- Especially effective when you use an accelerator like a GPU
- Overhead of moving a model to the GPU
- Decide which model goes to which GPU

Batching
- Larger batch size is computationally efficient
- Wait until you can form this batch (e.g., at 10 frames per sec)
- Batch latency = β (fixed overhead) + α (per-example cost) per example in the batch
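The batching notes above imply a simple latency model: executing a batch of size b costs β + α·b. A minimal Python sketch (the α, β values are hypothetical stand-ins for profiled numbers; the 2× factor reflects that a request may wait one full batch before executing in the next):

```python
def batch_latency_ms(b, alpha=2.0, beta=10.0):
    """Latency of one batch of size b: fixed overhead beta plus
    per-example cost alpha (hypothetical profiled values, in ms)."""
    return beta + alpha * b

def max_batch_for_slo(slo_ms, alpha=2.0, beta=10.0):
    """Largest batch whose worst-case request latency fits the SLO.
    Worst case ~ two batch durations: one waiting, one executing."""
    b = 1
    while 2 * batch_latency_ms(b + 1, alpha, beta) <= slo_ms:
        b += 1
    return b
```

With these numbers, a 100 ms SLO admits a batch of 20 (2 × (10 + 2·20) = 100 ms), illustrating the tradeoff: larger batches amortize β but eat into the latency budget.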
SCHEDULING BATCHED EXECUTION

Consider A and B being allocated to one GPU:
- A for 75 ms, B for 50 ms, A for 75 ms, ...
- latency(A) and latency(B) will both meet their SLOs
A and C cannot be co-located, because that would lead to an SLO violation.
BATCH-AWARE SCHEDULING

Inputs: Request rate, SLO for each model, profiles at each batch size
- Need to satisfy every request's latency ≤ SLO; if not, drop
- Profile gives latency at various batch sizes

Approach: Allocate "full" GPUs based on load. Handle residuals

Greedy approximation
- E.g., request rate = 1100 req/s: allocate full GPUs at bs = 16 (here, 8 of them) and serve the residual load at a smaller batch size (bs = 6)
- Start by giving every residual workload its own GPU (occupancy < 100%)
- Merge workloads together while combined occupancy stays ≤ 100%, adjusting duty cycles
- Workloads are "squishy"
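The residual-merging step above can be sketched as first-fit-decreasing bin packing over fractional GPU occupancies. This is an illustrative model of "squishy bin packing", not Nexus's exact algorithm (which also shrinks batch sizes and duty cycles as it merges):

```python
def squishy_pack(occupancies):
    """Greedy merge of residual workloads: each workload starts with a
    fractional GPU occupancy; pack workloads onto shared GPUs while a
    GPU's total occupancy stays <= 1.0. Returns per-GPU occupancy sums."""
    gpus = []
    for occ in sorted(occupancies, reverse=True):  # largest first
        for i, load in enumerate(gpus):
            if load + occ <= 1.0:   # fits on an existing GPU
                gpus[i] += occ
                break
        else:                       # needs a fresh GPU
            gpus.append(occ)
    return gpus
```

E.g., residuals of 0.5, 0.5, 0.25, 0.25 pack onto two GPUs instead of four, which is exactly the utilization win the slide is after.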
HANDLING COMPLEX QUERIES

Challenge: How do we set latency SLOs for complex queries?

A simple query invokes a single model (e.g., an object detector). A complex query chains models: detect an object, then classify it ("car", "face"). Additional challenge: we don't know in advance what fraction of detections calls the "car" model and what fraction calls the "face" model.
SCHEDULING COMPLEX QUERIES

Query analysis to determine latency SLO splits
Inputs: Models with request rate Ri, latency SLO L
- Pick the best batch size for each stage
- Minimize resources used
- Solved with dynamic programming (cf. PipeDream's planner)
ADAPTIVE BATCHING

Clipper: Adapt the batch size based on the oldest request
- There is a fixed cost per request: if α is low and β is high, the drop rate is high
- E.g., if the oldest request needs to be processed in 5 ms, lower the batch size in order to meet this SLO
- When "falling behind", lower the batch size
BATCH-AWARE DISPATCH

Early-dropping scheme
1. Scan the queue using a sliding window of batch size
2. Stop at the first request that can execute the entire window

Drop requests at the head of the queue which cannot meet the SLO. (If you drop this frame, the next frame's contents might be similar.)
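The early-dropping scan might look like the following simplified sketch: walk the queue from the head, dropping requests whose deadline cannot survive one full window execution, and dispatch the first window that fits (the deadline-list representation is an assumption for illustration):

```python
def early_drop(deadlines_ms, batch_size, batch_ms, now_ms):
    """Simplified early-drop dispatch: deadlines_ms is the queue of
    request deadlines, oldest first; batch_ms is the execution time of
    one window of batch_size. Drop head requests that cannot meet
    their deadline even if the next window runs immediately."""
    dropped = []
    while deadlines_ms and deadlines_ms[0] < now_ms + batch_ms:
        dropped.append(deadlines_ms.pop(0))   # head request is doomed: drop early
    return deadlines_ms[:batch_size], dropped # dispatch the surviving window
```

Dropping early frees the GPU for requests that can still make their SLOs, and (per the annotation) a dropped frame's successor often carries similar content anyway.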
OTHER FEATURES

Prefix Batching
- Transfer learning → many models share the same "prefix" layers and differ only in the last layer(s)
- Detect shared prefixes by hashing the models

GPU Multiplexing
- Schedule models on a GPU in round-robin fashion

Overlapping CPU and GPU computation
- Pre-processing and post-processing overlap with GPU execution
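The prefix-hashing idea can be sketched with cumulative layer hashes: two models fine-tuned from the same base produce identical signatures up to the point where their weights diverge. This is an illustration of the idea, not Nexus's exact hashing scheme:

```python
import hashlib

def prefix_signature(layer_weights):
    """Cumulatively hash each layer's bytes, so models sharing the
    first k layers produce identical signatures at positions 0..k-1."""
    h = hashlib.sha256()
    sigs = []
    for w in layer_weights:       # w: serialized weights of one layer
        h.update(w)
        sigs.append(h.hexdigest())
    return sigs

def shared_prefix_len(sig_a, sig_b):
    """Number of leading layers two models have in common."""
    n = 0
    for a, b in zip(sig_a, sig_b):
        if a != b:
            break
        n += 1
    return n
```

Requests to models with a common prefix can then be batched through the shared layers and split only at the final, model-specific layers.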
NEXUS ARCHITECTURE

- Adapting the number of GPUs is tied to video stream throughput
- Complex queries → greedy scheduling algorithm
SUMMARY

• ML Inference goals: latency SLO, GPU utilization
• Nexus: Handle multiple tenants, multiple DNNs
• Schedule using squishy bin packing
• Break down SLO for complex queries; adaptive batching

Course recap:
- PyTorch: Distributed DataParallel training API; overlap compute, communication
- PipeDream: Generalize parallelism (pipeline parallel); reduce network, maintain consistency
- Ray: Reinforcement learning applications; actors and tasks; local and global scheduling
- Pollux: Schedule ML training jobs in a cluster; co-adaptive scheduling to set batch size, LR
- Nexus: System for ML inference; meet latency SLOs while ensuring high utilization
DISCUSSION
https://forms.gle/XQ4CfNzTTFsSrVv7A

Consider a scenario where you have a model that takes a variable amount of time depending on the input. For example, if a frame contains 100 cars it takes 250 ms to process, but if the frame has 1 car then it finishes in 10 ms. What could be one shortcoming in using Nexus to schedule this model? (SLO = 250 ms)

- When data switches from frames with 1 car to frames with many cars, the "profile" only saw batches with few cars; the batch size was estimated based on this, so you might miss the SLO when using a large batch
- Can Nexus be more "dynamic"?
- If we take a "worst case" view → under-utilization
Next class: SQL