CS 744: NEXUS
Shivaram Venkataraman, Fall 2021
Hi!
ADMINISTRIVIA
Course Project Proposals
- Due Oct 25 (Monday)!
- See Piazza for template

Midterm details
- Oct 28th: Includes papers from "Datacenter as a Computer" to Nexus
- Open book, open notes
- Held in class time, 9.30-10.45am Central Time
MACHINE LEARNING: INFERENCE

(ML training so far: PyTorch DDP → PipeDream)
EXAMPLE APPLICATION
Video analysis service
- Thousands of streams, thousands of tenants
- Each stream is processed by a DNN-based "query"
- Latency SLOs (10s to 100s of ms)

E.g., traffic cameras stream data to the cloud, and each tenant is running a different model. An object detector model finds bounding boxes and classifies them ("car").
GOAL: HIGH GPU UTILIZATION

Scheduler to determine placement and batching

Placement
- Especially effective when you use an accelerator like a GPU
- Overhead of moving a model to the GPU
- Decide which model goes to which GPU

Batching
- Larger batch size is computationally efficient
- Wait until you can form this batch (e.g., at 10 frames per sec)
- Batch latency = β (fixed overhead) + α (per-example cost) per example in the batch
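The batching notes above imply a simple latency model: executing a batch of size b costs β + α·b. A minimal Python sketch (the α, β values are hypothetical stand-ins for profiled numbers; the 2× factor reflects that a request may wait one full batch before executing in the next):

```python
def batch_latency_ms(b, alpha=2.0, beta=10.0):
    """Latency of one batch of size b: fixed overhead beta plus
    per-example cost alpha (hypothetical profiled values, in ms)."""
    return beta + alpha * b

def max_batch_for_slo(slo_ms, alpha=2.0, beta=10.0):
    """Largest batch whose worst-case request latency fits the SLO.
    Worst case ~ two batch durations: one waiting, one executing."""
    b = 1
    while 2 * batch_latency_ms(b + 1, alpha, beta) <= slo_ms:
        b += 1
    return b
```

With these numbers, a 100 ms SLO admits a batch of 20 (2 × (10 + 2·20) = 100 ms), illustrating the tradeoff: larger batches amortize β but eat into the latency budget.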
SCHEDULING BATCHED EXECUTION

Consider A and B being allocated to one GPU:
- A for 75 ms, B for 50 ms, A for 75 ms, ...
- latency(A) and latency(B) will both meet their SLOs
A and C cannot be co-located, because that would lead to an SLO violation.
BATCH-AWARE SCHEDULING

Inputs: Request rate, SLO for each model, profiles at each batch size
- Need to satisfy every request's latency ≤ SLO; if not, drop
- Profile gives latency at various batch sizes

Approach: Allocate "full" GPUs based on load. Handle residuals

Greedy approximation
- E.g., request rate = 1100 req/s: allocate full GPUs at bs = 16 (here, 8 of them) and serve the residual load at a smaller batch size (bs = 6)
- Start by giving every residual workload its own GPU (occupancy < 100%)
- Merge workloads together while combined occupancy stays ≤ 100%, adjusting duty cycles
- Workloads are "squishy"
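The residual-merging step above can be sketched as first-fit-decreasing bin packing over fractional GPU occupancies. This is an illustrative model of "squishy bin packing", not Nexus's exact algorithm (which also shrinks batch sizes and duty cycles as it merges):

```python
def squishy_pack(occupancies):
    """Greedy merge of residual workloads: each workload starts with a
    fractional GPU occupancy; pack workloads onto shared GPUs while a
    GPU's total occupancy stays <= 1.0. Returns per-GPU occupancy sums."""
    gpus = []
    for occ in sorted(occupancies, reverse=True):  # largest first
        for i, load in enumerate(gpus):
            if load + occ <= 1.0:   # fits on an existing GPU
                gpus[i] += occ
                break
        else:                       # needs a fresh GPU
            gpus.append(occ)
    return gpus
```

E.g., residuals of 0.5, 0.5, 0.25, 0.25 pack onto two GPUs instead of four, which is exactly the utilization win the slide is after.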
HANDLING COMPLEX QUERIES

Challenge: How do we set latency SLOs for complex queries?

A simple query invokes a single model (e.g., an object detector). A complex query chains models: detect an object, then classify it ("car", "face"). Additional challenge: we don't know in advance what fraction of detections calls the "car" model and what fraction calls the "face" model.
SCHEDULING COMPLEX QUERIES

Query analysis to determine latency SLO splits
Inputs: Models with request rate Ri, latency SLO L
- Pick the best batch size for each stage
- Minimize resources used
- Solved with dynamic programming (cf. PipeDream's planner)
ADAPTIVE BATCHING

Clipper: Adapt the batch size based on the oldest request
- There is a fixed cost per request: if α is low and β is high, the drop rate is high
- E.g., if the oldest request needs to be processed in 5 ms, lower the batch size in order to meet this SLO
- When "falling behind", lower the batch size
BATCH-AWARE DISPATCH

Early-dropping scheme
1. Scan the queue using a sliding window of batch size
2. Stop at the first request that can execute the entire window

Drop requests at the head of the queue which cannot meet the SLO. (If you drop this frame, the next frame's contents might be similar.)
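The early-dropping scan might look like the following simplified sketch: walk the queue from the head, dropping requests whose deadline cannot survive one full window execution, and dispatch the first window that fits (the deadline-list representation is an assumption for illustration):

```python
def early_drop(deadlines_ms, batch_size, batch_ms, now_ms):
    """Simplified early-drop dispatch: deadlines_ms is the queue of
    request deadlines, oldest first; batch_ms is the execution time of
    one window of batch_size. Drop head requests that cannot meet
    their deadline even if the next window runs immediately."""
    dropped = []
    while deadlines_ms and deadlines_ms[0] < now_ms + batch_ms:
        dropped.append(deadlines_ms.pop(0))   # head request is doomed: drop early
    return deadlines_ms[:batch_size], dropped # dispatch the surviving window
```

Dropping early frees the GPU for requests that can still make their SLOs, and (per the annotation) a dropped frame's successor often carries similar content anyway.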
OTHER FEATURES

Prefix Batching
- Transfer learning → many models share the same "prefix" layers and differ only in the last layer(s)
- Detect shared prefixes by hashing the models

GPU Multiplexing
- Schedule models on a GPU in round-robin fashion

Overlapping CPU and GPU computation
- Pre-processing and post-processing overlap with GPU execution
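The prefix-hashing idea can be sketched with cumulative layer hashes: two models fine-tuned from the same base produce identical signatures up to the point where their weights diverge. This is an illustration of the idea, not Nexus's exact hashing scheme:

```python
import hashlib

def prefix_signature(layer_weights):
    """Cumulatively hash each layer's bytes, so models sharing the
    first k layers produce identical signatures at positions 0..k-1."""
    h = hashlib.sha256()
    sigs = []
    for w in layer_weights:       # w: serialized weights of one layer
        h.update(w)
        sigs.append(h.hexdigest())
    return sigs

def shared_prefix_len(sig_a, sig_b):
    """Number of leading layers two models have in common."""
    n = 0
    for a, b in zip(sig_a, sig_b):
        if a != b:
            break
        n += 1
    return n
```

Requests to models with a common prefix can then be batched through the shared layers and split only at the final, model-specific layers.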
NEXUS ARCHITECTURE

- Adapting the number of GPUs is tied to video stream throughput
- Complex queries → greedy scheduling algorithm
SUMMARY

• ML Inference goals: latency SLO, GPU utilization
• Nexus: Handle multiple tenants, multiple DNNs
• Schedule using squishy bin packing
• Break down SLO for complex queries; adaptive batching

Course recap:
- PyTorch: Distributed DataParallel training API; overlap compute, communication
- PipeDream: Generalize parallelism (pipeline parallel); reduce network, maintain consistency
- Ray: Reinforcement learning applications; actors and tasks; local and global scheduling
- Pollux: Schedule ML training jobs in a cluster; co-adaptive scheduling to set batch size, LR
- Nexus: System for ML inference; meet latency SLOs while ensuring high utilization
DISCUSSION
https://forms.gle/XQ4CfNzTTFsSrVv7A

Consider a scenario where you have a model that takes a variable amount of time depending on the input. For example, if a frame contains 100 cars it takes 250 ms to process, but if the frame has 1 car then it finishes in 10 ms. What could be one shortcoming in using Nexus to schedule this model? (SLO = 250 ms)

- When data switches from frames with 1 car to frames with many cars, the "profile" only saw batches with few cars; the batch size was estimated based on this, so you might miss the SLO when using a large batch
- Can Nexus be more "dynamic"?
- If we take a "worst case" view → under-utilization
Next class: SQL