
Page 1

Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization

Paras Jain. Joint work with: Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez

Page 2

BigGAN (2018), image generation (Brock et al. 2019)

VideoBERT (2019), video generation (Sun et al. 2019)

GPT-2 (2019), text generation (Radford et al. 2019)

Page 3

[Figure: parameter count (millions) for ResNet-50, R-FCN, DeepLab v2, BERT-Large, and GPT-2.]

Emerging trend: rapid growth in model size.

Figure adapted from NVIDIA.

Page 4

[Figure: per-GPU memory usage for AlexNet (2012), VGG19 (2014), Inception v3 (2015), ResNet-152 (2015), DenseNet-201 (2016), ResNeXt-101 (2016), FCN8s (2017), Transformer (2017), RoBERTa (2018), and BigGAN (2018), plotted against the GPU memory limit.]

State-of-the-art models have hit a memory capacity wall.

Limited GPU memory is slowing progress in new deep learning models! Chen et al. 2016, Gomez et al. 2017, Pohlen et al. 2017, Liu et al. 2019, Dai et al. 2019, and Child et al. 2019 cited memory as a limiting factor.

Page 5

[Same figure as Page 4.]

Problem: How do we efficiently train large models beyond memory limits?

Page 6

[Figure: TOPS per GiB of memory capacity over time.]

Compute is outstripping DRAM capacity growth.

Page 7

Backprop is optimized for compute efficiency, not RAM usage.

[Figure: RAM vs. compute axes, with compute-optimized backprop marked at one extreme.]

Page 8

[Figure: RAM vs. compute axes, with compute-optimized backprop at one extreme and RAM-optimized backprop at the other.]

Ideal: a scalable algorithm for backprop that adapts to RAM constraints.

Page 9

This work: an optimal space-time trade-off for backpropagation.

[Figure: RAM vs. compute axes. Checkmate explores the optimal trade-off curve between compute-optimized and RAM-optimized backprop, e.g. 5x larger inputs with 2x cost.]

Page 10

RAM-hungry backprop policy: keep all layers in RAM.

[Figure: RAM vs. compute axes; this policy sits at the compute-optimized extreme.]

Pages 11-14

RAM-hungry backpropagation policy: keep all layers in RAM.

[Animation: the forward pass computes A, B, C, D, E and the loss against the label; the backward pass then computes ∇E, ∇D, ∇C, ∇B, ∇A. The RAM-used timeline grows through the forward pass, since every activation stays resident, and peaks ("Peak RAM") at the start of the backward pass.]

Page 15

RAM-optimized backpropagation policy: recompute all layers as needed.

[Figure: RAM vs. compute axes; this policy sits at the RAM-optimized extreme.]

Pages 16-19

RAM-optimized backpropagation policy: recompute all layers.

How can we use less memory? Free early and recompute.

[Animation: the forward pass computes A, B, C, D, E and the loss but frees activations eagerly. After ∇E and ∇D, the schedule recomputes A, B, and C in order to compute ∇C, and continues this way. The RAM-used timeline stays well below the "Peak RAM (no recomputation)" level, at the cost of extra forward computation.]
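The trade-off in Pages 11-19 can be made concrete with a small simulation. The sketch below is illustrative only and is not Checkmate's scheduler: it assumes a linear chain with unit-cost, unit-size activations and reports peak RAM and total compute for a chosen set of checkpointed layers, covering the keep-all and recompute-all extremes as well as the classic square-root checkpointing heuristic that reappears on Page 47.

```python
# Illustrative only: a toy model of the space-time trade-off on a linear chain
# of n layers (not Checkmate's scheduler). Every activation costs one unit of
# RAM and one unit of compute; gradients are consumed immediately.
import math

def schedule_cost(n, checkpoints):
    """Peak RAM and total compute when only `checkpoints` stay resident and all
    other activations are replayed from the nearest earlier checkpoint."""
    kept = sorted(set(checkpoints) | {0})   # the chain's input is always available
    compute = n                             # the original forward pass
    peak = 0
    for i in reversed(range(n)):            # backward pass, one layer at a time
        extra = 0
        if i not in kept:
            start = max(c for c in kept if c < i)
            compute += i - start            # replay forward from the checkpoint
            extra = 1                       # one transiently recomputed activation
        peak = max(peak, len(kept) + extra + 1)   # resident values + current gradient
    return peak, compute

n = 64
print(schedule_cost(n, range(n)))                        # keep all: RAM O(n), no recompute
print(schedule_cost(n, []))                              # recompute all: RAM O(1), compute O(n^2)
print(schedule_cost(n, range(0, n, int(math.sqrt(n)))))  # sqrt(n) checkpoints: a middle ground
```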

Page 20

[Figure: the forward and backward graph (A through E, the loss, and ∇E through ∇A), with candidate layers such as C/∇C highlighted for recomputation.]

How to choose which layers to recompute?

Pages 21-22

How to choose which layers to recompute?

[Figure: the forward pass and then the backward pass of a real computation graph.]

Pages 23-25

Challenges of heuristics:

1. Variable runtime per layer (differences of up to ~10^6x between layers)
2. Variable RAM usage per layer (differences of up to ~10^3x between layers)
3. Real DNNs are non-linear
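To see why challenges 1 and 2 resist a one-size-fits-all rule, here is a minimal NumPy-only sketch. It is not Checkmate's profiler and the layer shapes are made up; it simply contrasts the runtime of a large dense layer with an elementwise ReLU, and the activation memory of an early convolutional feature map with the final logits.

```python
# Illustrative only (not Checkmate's profiler): per-layer runtime and per-layer
# activation memory each vary by orders of magnitude, which is why Checkmate
# uses a profile-guided cost model instead of treating all layers alike.
import time
import numpy as np

x = np.random.rand(512, 4096).astype(np.float32)
w = np.random.rand(4096, 4096).astype(np.float32)

t0 = time.perf_counter()
_ = x @ w                        # a large dense layer
t_matmul = time.perf_counter() - t0

t0 = time.perf_counter()
_ = np.maximum(x, 0)             # an elementwise ReLU over the same input
t_relu = time.perf_counter() - t0
print(f"dense matmul vs. ReLU runtime: {t_matmul / t_relu:.0f}x")

# Activation memory, float32, batch 32: an early ResNet-style feature map
# (64 channels at 112x112) vs. a 1000-way logit vector -- roughly 800x apart.
early_map_bytes = 32 * 64 * 112 * 112 * 4
logit_bytes = 32 * 1000 * 4
print(f"feature map vs. logits activation memory: {early_map_bytes / logit_bytes:.0f}x")
```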

Page 26

Prior work is suboptimal in the general setting!

• Greedy heuristics [Chen 2016] [XLA authors 2017, 2020]
• Divide-and-conquer heuristics [Griewank 2000] [Kowarz 2006] [Siskind 2018] [Kumar 2019]
• Optimal only for specific architectures [Gruslys 2016] [Feng 2018] [Beaumont 2019]

Challenges: 1. variable runtime per layer; 2. variable RAM usage per layer; 3. real DNNs are non-linear.

Page 27

[Figure: RAM vs. compute axes, with "checkpoint every node" at one extreme and "recompute all layers" at the other.]

Can we optimally trade off RAM for compute?

Let's be: 1. hardware-aware, 2. RAM-aware, 3. flexible over arbitrary DAGs.

Page 28

Checkmate: a system for optimal tensor rematerialization

• Accurate cost model (hardware + RAM aware; GPU, CPU, TPU support)
• Flexible search space (graph rewrite)
• Optimal solver: integer linear program (solve for 10s-1hr, train for 1 month)
• Near-optimal solver: two-phase rounding

Page 29

[Same system overview as Page 28: accurate cost model, flexible search space, graph rewrite, optimal ILP solver, near-optimal two-phase-rounding solver.]

Page 30

A system for optimal tensor rematerialization

[Figure: example graph A → B → C → ∇C → ∇B → ∇A.]

Pages 31-32

A system for optimal tensor rematerialization

[Figure: the graph A → B → C → ∇C → ∇B → ∇A replicated once per stage t = 1, ..., 6, forming a stage-by-layer grid.]

R_{t,i} ∈ {0, 1}, indexed by stage t and layer i.

Page 33

[Figure: the same stage-by-layer grid, now also showing which values are carried between stages.]

R_{t,i} ∈ {0, 1} and S_{t,i} ∈ {0, 1}, indexed by stage t and layer i.

Page 34

R = what is computed? S = what is in memory?

R_{t,i} ∈ {0, 1}, S_{t,i} ∈ {0, 1}, indexed by stage t and layer i.

[Figure: the stage-by-layer grid annotated with compute (R) and storage (S) decisions.]

Page 35

R_{t,i} ∈ {0, 1}, S_{t,i} ∈ {0, 1}.

[Figure: an example of the optimal S matrix for SegNet.]
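For intuition about what R and S encode, here is a toy, illustrative-only construction of the two matrices for the six-node example under the "checkpoint all" policy (Checkmate's solver is what actually chooses them); gA, gB, gC stand for the gradient nodes.

```python
# Illustrative only: R and S for the six-node example A, B, C, gC, gB, gA under
# the "checkpoint all" policy: each value is computed exactly once and then
# kept resident for every later stage.
import numpy as np

nodes = ["A", "B", "C", "gC", "gB", "gA"]     # one node introduced per stage
T = len(nodes)

R = np.eye(T, dtype=int)                      # R[t, i] = 1: node i is computed in stage t
S = np.tril(np.ones((T, T), dtype=int), -1)   # S[t, i] = 1: node i is resident entering stage t

print(R)
print(S)
# A rematerializing schedule would instead zero out some S[t, i] (free early)
# and set R[t', i] = 1 again in a later stage t' (recompute).
```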

Page 36

[Same system overview as Page 28; the next slides cover the optimal ILP solver.]

Page 37

A system for optimal tensor rematerialization

Use the R matrix to create a linear objective.

Decision variables:
R_{t,i} ∈ {0, 1}: layer i is (re)computed in stage t
S_{t,i} ∈ {0, 1}: layer i is stored in RAM for stage t

Objective (minimize forward + backward cost):
min_{R,S} Σ_t Σ_i C_i · R_{t,i}

Page 38

A system for optimal tensor rematerialization

Decision variables R_{t,i}, S_{t,i} and the objective min_{R,S} Σ_t Σ_i C_i · R_{t,i} as on Page 37.

Correctness constraints:
• "A layer's dependencies must be computed before evaluation": R_{t,j} ≤ R_{t,i} + S_{t,i} for every edge (i, j)
• "A layer must be computed before it can be stored in RAM": S_{t,i} ≤ R_{t-1,i} + S_{t-1,i}

Page 39

A system for optimal tensor rematerialization

Additional implicit variable U_{t,k} ∈ ℝ+: memory usage in stage t. Constrain memory via this implicit variable, which models memory usage at each point of each stage.

min_{R,S,U} Σ_t Σ_i C_i · R_{t,i}   (minimize forward + backward cost)
R_{t,j} ≤ R_{t,i} + S_{t,i},  S_{t,i} ≤ R_{t-1,i} + S_{t-1,i}   (correctness)
U_{t,k} ≤ budget, ...   (memory limit)

Page 40

[Same formulation as Page 39.]

Memory accounting details are in the paper.

Page 41

How long is the solve time? 9 hours 😳

[Recap of the ILP: minimize Σ_t Σ_i C_i · R_{t,i}, subject to the correctness constraints and U_{t,k} ≤ budget.]

Page 42

Tractability: partition the schedule into frontier-advancing stages.

R_{t,t} = 1, and R, S, U are lower triangular.

This prunes the n! permutations of nodes; the solve time drops from 9 hours to 0.2 seconds.
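Pulling Pages 37-42 together, here is a compact sketch of the frontier-advancing ILP written with the open-source PuLP modeler. It is not Checkmate's implementation: the memory constraint is a deliberately coarse stand-in for the paper's U_{t,k} accounting, and the example graph, costs, sizes, and budget are toy values invented for illustration.

```python
# Simplified sketch of the frontier-advancing rematerialization ILP (PuLP/CBC).
# Not Checkmate's implementation; the memory model here is intentionally coarse.
import pulp

def solve_remat(edges, cost, size, budget):
    """edges: (i, j) pairs meaning node j reads node i, with nodes 0..T-1 in
    topological order; cost[i] and size[i] are node i's runtime and output size."""
    T = len(cost)
    idx = [(t, i) for t in range(T) for i in range(T)]
    R = pulp.LpVariable.dicts("R", idx, cat="Binary")    # node i computed in stage t
    S = pulp.LpVariable.dicts("S", idx, cat="Binary")    # node i resident entering stage t

    prob = pulp.LpProblem("rematerialization", pulp.LpMinimize)
    prob += pulp.lpSum(cost[i] * R[(t, i)] for t, i in idx)      # total compute cost

    for t in range(T):
        prob += R[(t, t)] == 1                 # frontier-advancing: stage t produces node t
        for i in range(t + 1, T):              # lower-triangular R and S
            prob += R[(t, i)] == 0
            prob += S[(t, i)] == 0
        for i, j in edges:                     # inputs must be resident or recomputed
            prob += R[(t, j)] <= R[(t, i)] + S[(t, i)]
        for i in range(T):                     # only keep values that already existed
            prev = R[(t - 1, i)] + S[(t - 1, i)] if t > 0 else 0
            prob += S[(t, i)] <= prev
        # Coarse memory limit: everything resident or produced during stage t.
        prob += pulp.lpSum(size[i] * (S[(t, i)] + R[(t, i)]) for i in range(T)) <= budget

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [[int(R[(t, i)].value()) for i in range(T)] for t in range(T)]

# Toy chain A, B, C, gC, gB, gA: each gradient reads the previous gradient and
# one forward activation. A budget of 3 unit tensors forces recomputing A.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (1, 4), (0, 5)]
print(solve_remat(edges, cost=[1] * 6, size=[1] * 6, budget=3))
```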

Page 43

[Same system overview as Page 28; the next slides cover the near-optimal solver (two-phase rounding).]

Page 44

ILP optimization is NP-hard (combinatorial search). Is there a polynomial-time approximation?

Proposed method: two-phase rounding. Round S, then solve the other variables optimally.
Insight: given S, the optimal R is easy to compute.

1. Relax the boolean constraints. 2. Solve the LP. 3. Round the solution.

How to maintain feasibility?
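A sketch of the insight behind phase two follows. It assumes the rounded S is already boolean and the stages are frontier-advancing; it is illustrative only, not Checkmate's code, and it does not address how feasibility is restored after rounding.

```python
# Sketch of phase two: once S has been rounded to booleans, the cheapest
# feasible R follows directly -- each stage computes its frontier node plus,
# transitively, every dependency that is not resident. Illustrative only.
def recover_R(edges, S, T):
    """edges: (i, j) pairs meaning node j reads node i; S[t][i] in {0, 1} says
    whether node i is resident entering stage t. Returns R as a list of lists."""
    deps = {j: [i for (i, jj) in edges if jj == j] for j in range(T)}
    R = [[0] * T for _ in range(T)]
    for t in range(T):
        stack = [t]                        # frontier-advancing: stage t produces node t
        while stack:
            j = stack.pop()
            if R[t][j]:
                continue
            R[t][j] = 1                    # compute j in stage t
            for i in deps[j]:
                if not S[t][i] and not R[t][i]:
                    stack.append(i)        # missing input: recompute it as well
    return R
```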

Page 45

[Same per-GPU memory figure as Page 4.]

Evaluation questions:
1. What is the memory vs. compute trade-off?
2. How much can we increase batch/model size?
3. How well does two-phase rounding do?

Page 46

Evaluation: what is the memory vs. compute trade-off?

[Figure: compute overhead (x) vs. GPU memory budget (GB) for U-Net at batch size 32 and MobileNet at batch size 512. Checkmate's curve lies at or below the best heuristic across budgets, reaching a 1.2x speedup.]

Page 47

Evaluation: how much can we increase batch size? (VGG19, 224x224 images)

[Figure: R_{t,i} schedules, stage by layer, for each strategy.]

• No rematerialization: batch size 167
• Square-root heuristic: batch size 197 (1.18x larger)
• Checkmate: batch size 289 (1.73x larger; 10-second solve)

Page 48

Evaluation: how much can we increase batch size?

Maximum batch size by rematerialization strategy (shown normalized to "Checkpoint all" in the chart):

Strategy         | U-Net | FCN8 | SegNet | VGG19 | ResNet50 | MobileNet
Checkpoint all   |    16 |   29 |     21 |   167 |       98 |       215
AP √n            |    18 |   51 |     33 |   197 |      116 |       452
Lin. greedy      |    35 |   51 |     43 |   266 |      199 |       640
Checkmate (ours) |    61 |   60 |     62 |   289 |      225 |      1105

Up to 1.73x larger than the best heuristic, and up to 5.1x larger than checkpointing all activations.

Page 49

Evaluation: how much can we increase batch size?

[Same chart as Page 48: up to 1.73x larger batches than the best heuristic, 5.1x larger than checkpointing all activations.]

*Ongoing work: BERT. 2.3x larger batch size over TF 2.0; train BERT-Large without model parallelism.

Page 50

Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization

[Paper excerpt shown on this slide; the figure is the batch-size chart from Page 48.]

Figure 6. Maximum batch size possible on a single NVIDIA V100 GPU when using different generalized rematerialization strategies with at most a single extra forward pass. We enable increasing batch size by up to 5.1× over the current practice of caching all activations (on MobileNet), and up to 1.73× over the best checkpointing scheme (on U-Net).

For fair comparison on the non-linear graphs used in U-Net, FCN, and ResNet, we use the AP √n and linearized greedy baseline generalizations described in Section 6.1. Let M_fixed = 2·M_param, as in (2), and let M@1 be the memory a baseline strategy uses at batch size 1. The maximum baseline batch size is estimated with (11), where the minimization is taken with respect to hyperparameters, if any.

max B = ⌊(16 GB − M_fixed) / (min M@1 − M_fixed)⌋   (11)
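A worked instance of Eq. (11), with made-up numbers (a 16 GB device, M_fixed = 1 GB, and a baseline that uses M@1 = 1.35 GB at batch size 1):

```python
# Worked example of Eq. (11) with assumed numbers; only the formula comes from
# the paper, the inputs below are invented for illustration.
import math

GB = 2 ** 30
M_fixed = 1 * GB                 # assumed fixed memory (2 * M_param)
M_at_1 = 1.35 * GB               # assumed baseline memory at batch size 1
max_B = math.floor((16 * GB - M_fixed) / (M_at_1 - M_fixed))
print(max_B)                     # floor(15 / 0.35) = 42 for these inputs
```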

Costs are measured in FLOPs, determined statically. The U-Net, FCN8, and SegNet semantic segmentation networks use a resolution of 416×608, and the classification networks ResNet50, VGG19, and MobileNet use resolution 224×224.

Takeaways: We can increase the batch size of U-Net to 61 at a high resolution, an unprecedented result. For many tasks such as semantic segmentation, where U-Net is commonly used, it is not possible to use batch sizes greater than 16, depending on resolution. This is sub-optimal for batch normalization layers, and being able to increase the batch size by 3.8× (61 vs. 16 for a representative resolution) is quite significant. Orthogonal approaches to achieve this include model parallelism and distributed-memory batch normalization, which can be significantly more difficult to implement and have high communication costs. Furthermore, for MobileNet, Checkmate allows a batch size of 1105, which is 1.73× higher than the best baseline solution (a greedy heuristic) and 5.1× higher than common practice (checkpointing all activations). The same schedules can also be used to increase image resolution rather than batch size.

Model     | AP √n | AP greedy | Griewank log n | Two-phase LP rounding
MobileNet | 1.14× | 1.07×     | 7.07×          | 1.06×
VGG16     | 1.28× | 1.06×     | 1.44×          | 1.01×
VGG19     | 1.54× | 1.39×     | 1.75×          | 1.00×
U-Net     | 1.27× | 1.23×     | -              | 1.03×
ResNet50  | 1.20× | 1.25×     | -              | 1.05×

Table 2. Approximation ratios for baseline heuristics and our LP rounding strategy. Results are given as the geometric mean speedup of the optimal ILP across feasible budgets.

6.5 How well can we approximate the optimal rematerialization policy?

To understand how well our LP rounding strategy (Section 5) approximates the ILP, we measure the ratio COST_approx / COST_opt, i.e. the speedup of the optimal schedule, in FLOPs. As in Section 6.3, we solve each strategy at a range of memory budgets, then compute the geometric mean of the ratio across budgets. The aggregated ratio is used because some budgets are feasible via the ILP but not via the approximations. Table 2 shows the results. The two-phase deterministic rounding approach has approximation factors close to optimal, at most 1.06× for all tested architectures.
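The aggregation used in Table 2 is simply a geometric mean of per-budget cost ratios; a small sketch with invented ratios:

```python
# Sketch of the Table 2 aggregation: per-budget ratios of approximate schedule
# cost to optimal ILP cost, combined by geometric mean. The ratios are made up.
import numpy as np

ratios = np.array([1.00, 1.02, 1.08, 1.04])   # COST_approx / COST_opt per feasible budget
geomean = float(np.exp(np.log(ratios).mean()))
print(f"geometric-mean approximation factor: {geomean:.2f}x")
```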

7 CONCLUSIONS

One of the main challenges when training large neural networks is the limited capacity of high-bandwidth memory on accelerators such as GPUs and TPUs. This has created a memory wall that limits the size of the models that can be trained. The bottleneck for state-of-the-art model development is now memory rather than data and compute availability, and we expect this trend to worsen in the future.

To address this challenge, we proposed a novel rematerialization algorithm which allows large models to be trained with limited available memory. Our method does not make the strong assumptions required in prior work, supporting general non-linear computation graphs such as residual networks and capturing the impact of non-uniform memory usage and computation cost throughout the graph with a hardware-aware, profile-guided cost model. We presented an ILP formulation for the problem, implemented the Checkmate system for optimal rematerialization in TensorFlow, and tested the proposed system on a range of neural network models. In evaluation, we find that optimal rematerialization has minimal computational overhead at a wide range of memory budgets, and we showed that Checkmate enables practitioners to train high-resolution models with significantly larger batch sizes. Finally, a novel two-phase rounding strategy closely approximates the optimal solver.

Evaluation: how well does two-phase rounding approximate the ILP?

Within 6% of optimal cost (geometric mean). 43x speedup for ResNet50; 440x speedup for MobileNet.

Page 51

Checkmate: key ideas

• GPU memory limits are preventing the development of new deep learning models.

• We present the first general solution for optimal & near-optimal graph rematerialization.

• Formulation supports arbitrary DAGs and is both hardware-aware and memory-aware

• Integration with just one line of code

Code and paper: checkmateai.github.io

Email me: [email protected]

Paras Jain, Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez