46
PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1

PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications

Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin

1

Page 2: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Deep learning powers intelligent applications in many domains

2

Page 3: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Training and inference

3

Training Inference

High throughput Low latency

Page 4: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

GPUs clusters for DL workloads

4

GPU

GPU

GPU

GPU

GPU

GPU

GPU

GPU

Page 5: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Separate clusters for training and inference

5

GPU

GPU

GPU

GPU

GPU

GPU

GPU

GPU

Cluster for training

Cluster for inference

Page 6: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Utilization of GPU clusters is low

6

Training

Inference

100%

75%

Daytime Midnight

50%

25%

Daytime Midnight

50%

25%

Today: separate clusters Ideal: shared clusters

Daytime Midnight

50%

25%

Page 7: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Context switching overhead is high

7

New model

Old model

Page 8: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Context switching overhead is high

8

InferResNet

TrainBERT

NVIDIA T4

Latency: 6s

Page 9: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Drawbacks of existing solutions

9

Latency: 6s

• NVIDIA MPS• High overhead due to contention

• Salus[MLSys’20]• Requires all the models to be preloaded into the GPU memory

Page 10: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Goal: fast context switching

10

Latency: 6s

• Enable GPU-efficient multiplexing of multiple DL apps with fine-grained time-sharing

• Achieve millisecond-scale context switching latencies and high throughput

Page 11: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch overview: architecture

11

Controller

ActiveWorker

GPU

MemoryDaemon

StandbyWorker

StandbyWorker

NewTask

Page 12: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch overview: execution

• Stop the current task and prepare for the next task.

• Execute the task with pipelined model transmission.

• Clean the environment for the previous task.

12

Controller

ActiveWorker

GPU

MemoryDaemon

StandbyWorker

StandbyWorker

NewTask

Page 13: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Sources of context switching overhead

13

Task cleaning

Task initialization

Memory allocation

Model transmission

Page 14: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

How to reduce the overhead?

14

Pipelinedmodel transmission

Model transmission

Page 15: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

DL models have layered structures

15

Input

Layer-1

Layer-2

Layer-N

Output

ForwardPropagation

BackwardPropagation

Page 16: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Sequential model transmission and execution

16

T0 E0

model transmissionover PCIe

task executionon GPU

T1 Tn-1 E1 En-1T2 E2

Transmit layer 0 Execute layer 0

Page 17: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Pipelined model transmission and execution

17

PCIe

GPU

T0 T1 Tn-1T2

E0 E1 En-1E2

Page 18: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Pipelined model transmission and execution

18

PCIe

GPU

Transmit layer 0

T0 T1 Tn-1T2

E0 E1 En-1E2

Page 19: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Pipelined model transmission and execution

19

Transmit layer 1

Execute layer 0

PCIe

GPU

T0 T1 Tn-1T2

E0 E1 En-1E2

Page 20: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Pipelined model transmission and execution

20

Transmit layer 2

Execute layer 1

PCIe

GPU

T0 T1 Tn-1T2

E0 E1 En-1E2

Page 21: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Pipelined model transmission and execution

21

1.Multiple calls to PCIe;2.Synchronize transmission and execution.

Page 22: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Pipelined model transmission and execution

22

PCIe

GPU Group(0, i)

Group(i+1, j)

Group(0, i)

Group(i+1, j)

Group(k, n-1)

Group(k, n-1)

Page 23: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Pipelined model transmission and execution

23

• Exponential time to find the optimal strategy• Two heuristics for pruning

Page 24: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

How to reduce the overhead?

24

Unifiedmemory management

Task cleaning

Task initialization

Memory allocation

Model transmission

Page 25: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Unified memory management

25

GPU memory

MemoryDaemon

WorkersPointer

Offset

Manage model parameters.Allocate GPU memory.

Page 26: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

How to reduce the overhead?

26

Active-standbyworker switching

Task cleaning

Task initialization

Memory allocation

Model transmission

Page 27: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Active-standby worker switching

27

Old Task

New Task

Init. Execute

Init. Execute Clean

Time

Clean

New Task Starts

Page 28: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Active-standby worker switching

28

Old Task

New Task

Init. Execute

Time

Clean

New Task Starts

Init.2

Init.1

Execute Clean

Page 29: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Active-standby worker switching

29

Old Task

New Task

Init. Execute

Time

Clean

New Task Starts

Init.2

Init.1

Execute Clean

Launch the process.Create CUDA context.

Allocate GPU memory.

Page 30: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Active-standby worker switching

30

Old Task

New Task

Init. Execute

Time

Clean

New Task Starts

Init.2

Init.1

Execute Clean

Page 31: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Implementation

• Testbed: AWS EC2• p3.2xlarge: PCIe 3.0x16, NVIDIA Tesla V100 GPU

• g4dn.2xlarge: PCIe 3.0x8, NVIDIA Tesla T4 GPU

• Software• CUDA 10.1

• PyTorch 1.3.0

• Models• ResNet-152

• Inception-v3

• BERT-base

31

Page 32: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Evaluation

• Can PipeSwitch satisfy SLOs?

• Can PipeSwitch provide high utilization?

• How well do the design choices of PipeSwitch work?

32

Page 33: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Evaluation

• Can PipeSwitch satisfy SLOs?

• Can PipeSwitch provide high utilization?

33

Page 34: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch satisfies SLOs

NVIDIA Tesla V100 NVIDIA Tesla T4

34

Page 35: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch satisfies SLOs

NVIDIA Tesla V100 NVIDIA Tesla T4

35

33ms

Page 36: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch satisfies SLOs

NVIDIA Tesla V100 NVIDIA Tesla T4

36

39ms

Page 37: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch satisfies SLOs

NVIDIA Tesla V100 NVIDIA Tesla T4

37

340ms

Page 38: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch satisfies SLOs

NVIDIA Tesla V100 NVIDIA Tesla T4

38

6.5s

Page 39: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch satisfies SLOs

39

PipeSwitch achieves low context switching latency.

Page 40: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch provide high utilization

40

Scheduling cycles

Page 41: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch provide high utilization

41

Scheduling cycles

Page 42: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch provide high utilization

42

Scheduling cycles

Page 43: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch provide high utilization

43

Scheduling cycles

Page 44: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

PipeSwitch provide high utilization

44

PipeSwitch achieves near 100% utilization.

Page 45: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

Summary

• GPU clusters for DL applications suffer from low utilization• Limited share between training and inference workloads

• PipeSwitch introduces pipelined context switching• Enable GPU-efficient multiplexing of DL apps with fine-grained time-sharing

• Achieve millisecond-scale context switching latencies and high throughput

45

Page 46: PipeSwitch: Fast Pipelined Context Switching for Deep ... · Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin 1. Deep learning powers intelligent applications in many domains 2. Training

46

Thank you!

[email protected]