checkmateai.github.io
Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
Paras Jain. Joint work with: Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez
Checkmate
BigGAN (2018): image generation (Brock et al. 2019)
VideoBERT (2019): video generation (Sun et al. 2019)
GPT-2 (2019): text generation (Radford et al. 2019)
[Figure: parameter count (10^6), 0 to 1600, for ResNet50, R-FCN, DeepLab V2, BERT-Large, and GPT-2.]
Emerging trend: rapid growth in model size
Figure adapted from NVIDIA
[Figure: per-GPU memory usage, 0 to 15 GB, for AlexNet (2012), VGG19 (2014), Inception v3 (2015), ResNet-152 (2015), DenseNet-201 (2016), ResNeXt-101 (2016), FCN8s (2017), Transformer (2017), RoBERTa (2018), and BigGAN (2018), against the GPU memory limit.]
State-of-the-art models have hit a memory capacity wall.
Limited GPU memory is slowing progress in new deep learning models!
Cited memory as a limiting factor: Chen et al. 2016; Gomez et al. 2017; Pohlen et al. 2017; Liu et al. 2019; Dai et al. 2019; Child et al. 2019
Problem: How do we efficiently train large models beyond memory limits?
[Figure: TOPS per GiB of memory capacity over time.]
Compute is outstripping DRAM capacity growth
Backprop is optimized for compute efficiency, not RAM usage
[Chart: compute vs. RAM, showing compute-optimized backprop.]
[Chart: compute vs. RAM, with compute-optimized and RAM-optimized backprop as the two endpoints.]
Ideal: a scalable algorithm for backprop that adapts to RAM constraints
This work: optimal space-time tradeoff for backpropagation
[Chart: compute vs. RAM. Checkmate explores the optimal trade-off between compute-optimized and RAM-optimized backprop; e.g., 5x larger inputs with 2x cost.]
RAM-hungry backprop policy: keep all layers in RAM
[Chart: compute vs. RAM, compute-optimized backprop endpoint.]
RAM-hungry backpropagation policy: keep all layers in RAM
Forward pass: A → B → C → D → E (+ Label) → Loss; backward pass: ∇E, ∇D, ∇C, ∇B, ∇A.
[Animation: RAM used over time. Every forward activation A through E stays resident; RAM peaks when the first gradient ∇E is computed, then falls as ∇D, ∇C, ∇B, ∇A consume and free activations.]
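The keep-all policy above can be captured with a toy accounting model. This is a hedged sketch under simplifying assumptions (every gradient is the same size as its activation, and each gradient is freed as soon as it is consumed); it is not Checkmate's cost model.

```python
def peak_ram_keep_all(layer_sizes):
    """Peak resident memory for the 'keep all layers in RAM' backprop policy."""
    live = 0
    peak = 0
    # Forward pass: every activation A..E stays resident.
    for size in layer_sizes:
        live += size
        peak = max(peak, live)
    # Backward pass: materialize each gradient, then free the matching
    # activation together with the consumed gradient (simplified timing).
    for size in reversed(layer_sizes):
        live += size
        peak = max(peak, live)
        live -= 2 * size
    return peak

# Peak RAM grows linearly with depth: 6 units for a 5-layer chain of
# unit-size activations (all 5 activations plus the first gradient).
print(peak_ram_keep_all([1, 1, 1, 1, 1]))
```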
RAM-optimized backpropagation policy: recompute all layers as needed
[Chart: compute vs. RAM, RAM-optimized backprop endpoint.]
RAM-optimized backpropagation policy: recompute all layers
How can we use less memory? Free early and recompute.
[Animation: RAM used over time. The forward pass frees each activation as soon as its successor is produced. During the backward pass, each needed activation is rematerialized by re-running the forward prefix (e.g., A, B, C are recomputed before ∇C). Peak RAM stays well below the no-recomputation peak, at the cost of extra forward computation.]
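The opposite extreme can be sketched the same way. Assuming a linear chain where each backward step re-runs the forward prefix from the input, compute grows quadratically while only a couple of tensors are live at once. This is an illustrative model of my own, not Checkmate's accounting:

```python
def recompute_all_cost(n):
    """(total forward-layer evaluations, rough peak live tensors) for an
    n-layer chain under the recompute-all policy."""
    evals = n                    # the initial forward pass
    for i in range(n, 0, -1):    # backward step for layer i
        evals += i - 1           # rematerialize layers 1..i-1 from the input
    peak = 2                     # roughly one activation plus one gradient
    return evals, peak

evals, peak = recompute_all_cost(5)   # 15 evaluations vs. 5 for keep-all
```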
How to choose which layers to recompute?
[Diagram: forward pass A → B → C → D → E → Loss; backward pass ∇E through ∇A, with layer C and ∇C recomputed repeatedly along the backward pass.]
How to choose which layers to recompute?
Challenges of heuristics:
1. Variable runtime per layer (up to 10^6x slower)
2. Variable RAM usage per layer (up to 10^3x more RAM)
3. Real DNNs are non-linear
Challenges:
1. Variable runtime per layer
2. Variable RAM usage per layer
3. Real DNNs are non-linear
Prior work is suboptimal in the general setting!
- Greedy heuristic [Chen 2016; XLA authors 2017, 2020]
- Divide-and-conquer heuristic [Griewank 2000; Kowarz 2006; Siskind 2018; Kumar 2019]
- Optimal only for specific architectures [Gruslys 2016; Feng 2018; Beaumont 2019]
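The sqrt(n) checkpointing heuristic cited above [Chen 2016] is easy to state for a linear chain: keep roughly every sqrt(n)-th activation, so resident checkpoints and per-segment recomputation are both O(sqrt(n)). A minimal sketch (my own simplification of the heuristic):

```python
import math

def sqrt_n_checkpoints(n):
    """Indices of activations to keep for an n-layer linear chain."""
    stride = max(1, round(math.sqrt(n)))
    return [i for i in range(n) if i % stride == 0]

# For a 16-layer chain, keep 4 evenly spaced checkpoints.
print(sqrt_n_checkpoints(16))   # [0, 4, 8, 12]
```

Note that this implicitly assumes uniform cost and memory per layer, which is exactly the assumption the challenges above say real DNNs violate.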
Can we optimally trade off RAM for compute?
[Chart: compute vs. RAM, spanning "checkpoint every node" to "recompute all layers".]
Let's be:
1. Hardware-aware
2. RAM-aware
3. DAG-flexible
Checkmate: a system for optimal tensor rematerialization
- Accurate cost model: hardware- and RAM-aware; GPU, CPU, TPU support
- Flexible search space
- Optimal solver: integer linear program (solve for 10s-1hr, train for ~1 month)
- Near-optimal solver: two-phase rounding
- Graph rewrite
A system for optimal tensor rematerialization
[Diagram: the schedule as stages t = 1…6, each re-evaluating part of the graph A → B → C → ∇C → ∇B → ∇A.]
Decision variable R_{t,i} ∈ {0, 1}, indexed by stage t and layer i.
Decision variables, indexed by stage t and layer i:
  R_{t,i} ∈ {0, 1}: what is computed?
  S_{t,i} ∈ {0, 1}: what is in memory?
[Figure: example of an optimal S matrix (SegNet).]
Use the R matrix to create a linear objective.
Decision variables: R_{t,i} ∈ {0, 1} (layer i (re)computed in stage t), S_{t,i} ∈ {0, 1} (layer i stored for stage t).
Minimize forward + backward cost:
  min_{R,S} Σ_t Σ_i C_i R_{t,i}
Correctness constraints:
  R_{t,j} ≤ R_{t,i} + S_{t,i} for each dependency i → j  ("a layer's dependencies must be computed before evaluation")
  S_{t,i} ≤ R_{t-1,i} + S_{t-1,i}  ("a layer must be computed before it can be stored in RAM")
Memory limit: constrain memory via an implicit variable U_{t,i} ∈ ℝ+ that models memory usage at each stage:
  U_{t,i} ≤ budget, for all t and i
Memory accounting details in the paper.
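To make the formulation concrete, here is a toy brute-force version for a short linear chain: enumerate the binary R (computed) and S (stored) matrices, keep the schedules that satisfy the two correctness constraints and a simplified per-stage memory budget, and return the minimum total compute cost. This is my own illustrative stand-in; the real system hands the same constraints to an ILP solver instead of enumerating.

```python
from itertools import product

def feasible(R, S, mem, budget, n):
    """Check the two correctness constraints plus the memory budget."""
    for t in range(n):
        if R[t][t] != 1:                 # frontier-advancing: stage t computes layer t
            return False
        for i in range(1, n):
            if R[t][i] > R[t][i - 1] + S[t][i - 1]:
                return False             # dependency neither recomputed nor stored
        for i in range(n):
            prev = R[t - 1][i] + S[t - 1][i] if t > 0 else 0
            if S[t][i] > prev:
                return False             # a stored tensor must have been live
        if sum(mem[i] * (R[t][i] + S[t][i]) for i in range(n)) > budget:
            return False                 # simplified per-stage memory accounting
    return True

def optimal_schedule(cost, mem, budget):
    """Brute-force the minimum-cost feasible (R, S) schedule; None if infeasible."""
    n = len(cost)
    best = None
    for bits in product([0, 1], repeat=2 * n * n):
        R = [list(bits[t * n:(t + 1) * n]) for t in range(n)]
        S = [list(bits[n * n + t * n:n * n + (t + 1) * n]) for t in range(n)]
        if feasible(R, S, mem, budget, n):
            total = sum(cost[i] * R[t][i] for t in range(n) for i in range(n))
            best = total if best is None else min(best, total)
    return best
```

With costs [1, 2, 3], unit-size tensors, and a budget of 3, the optimum is 6 (compute each layer once, storing its predecessor), while a budget of 1 is infeasible and returns None.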
How long is the solve time? 9 hours 😳
Tractability: partition the schedule into frontier-advancing stages:
  R_{t,t} = 1;  R, S, U lower triangular
This prunes the n! permutations of nodes.
Solve time: 9 hours → 0.2 seconds
ILP optimization is NP-hard (combinatorial search). Is there a polynomial-time approximation?
1. Relax the boolean constraints. 2. Solve the LP. 3. Round the solution. But how do we maintain feasibility?
Proposed method: two-phase rounding. Round S, then solve the other variables optimally.
Insight: given S, the optimal R is easy to compute.
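A hedged sketch of that idea for a linear chain (the function and variable names are mine, and the repair rule is a simplification of what the paper does): phase 1 rounds the fractional storage plan S from the LP relaxation, dropping any checkpoint the rounding made infeasible; phase 2 then derives the cheapest R for the fixed S by rematerializing, at each stage, only the missing prefix.

```python
def round_and_repair(S_frac, threshold=0.5):
    """Two-phase rounding on an n-layer chain: round S, then derive R."""
    n = len(S_frac)
    # Phase 1: deterministic rounding of the relaxed storage variables.
    S = [[1 if S_frac[t][i] >= threshold else 0 for i in range(n)]
         for t in range(n)]
    S[0] = [0] * n                        # nothing is stored before stage 0
    R = [[0] * n for _ in range(n)]
    for t in range(n):
        # Phase 2: stage t computes its frontier layer t, then walks back,
        # rematerializing dependencies until a stored tensor is found.
        R[t][t] = 1
        i = t
        while i > 0 and not (S[t][i - 1] or R[t][i - 1]):
            R[t][i - 1] = 1
            i -= 1
        # Repair: keep a checkpoint only if the tensor was live this stage.
        if t + 1 < n:
            for j in range(n):
                if S[t + 1][j] and not (R[t][j] or S[t][j]):
                    S[t + 1][j] = 0
    return S, R
```

With an all-zero relaxed S this degenerates to recompute-all (a lower-triangular R); with S near 1 everywhere it degenerates to checkpoint-all (a diagonal R).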
Evaluation: Questions
1. What is the memory vs. compute trade-off?
2. How much can we increase batch/model size?
3. How well does two-phase rounding do?
Evaluation: What is the memory vs. compute trade-off?
[Plots: compute overhead (x) vs. GPU memory available (GB). U-Net at batch size 32 (budgets 10-20 GB) and MobileNet at batch size 512 (budgets 10-40 GB), Checkmate vs. the best heuristic. Checkmate achieves up to a 1.2x speedup.]
Evaluation: How much can we increase batch size? (VGG19, 224x224 images)
[R_{t,i} schedules:]
No rematerialization: batch size 167
Square-root heuristic: batch size 197 (1.18x larger)
Checkmate: batch size 289 (1.73x larger, 10-second solve)
Evaluation: How much can we increase batch size?
[Bar chart: normalized maximum batch size (0x to 5x) for U-Net, FCN8, SegNet, VGG19, ResNet50, and MobileNet under four strategies: checkpoint all, AP √n, linearized greedy, and Checkmate (ours). Checkmate enables batches up to 1.73x larger than the best heuristic and up to 5.1x larger than checkpointing all activations.]
*Ongoing work: BERT: 2.3x larger batch size over TF 2.0; train BERT-Large without model parallelism.
Figure 6. Maximum batch size possible on a single NVIDIA V100 GPU when using different generalized rematerialization strategies with at most a single extra forward pass. We enable increasing batch size by up to 5.1x over the current practice of caching all activations (on MobileNet), and up to 1.73x over the best checkpointing scheme (on U-Net).

For fair comparison on the non-linear graphs used in U-Net, FCN, and ResNet, we use the AP √n and linearized greedy baseline generalizations described in Section 6.1. Let M_fixed = 2 M_param, as in (2), and let M@1 be the memory a baseline strategy uses at batch size 1. The maximum baseline batch size is estimated with (11), where the minimization is taken with respect to hyperparameters, if any.

  max B = ⌊(16 GB − M_fixed) / (min M@1 − M_fixed)⌋    (11)

Costs are measured in FLOPs, determined statically. The U-Net, FCN8 and SegNet semantic segmentation networks use a resolution of 416x608, and the classification networks ResNet50, VGG19 and MobileNet use resolution 224x224.

Takeaways: We can increase the batch size of U-Net to 61 at a high resolution, an unprecedented result. For many tasks such as semantic segmentation, where U-Net is commonly used, it is not possible to use batch sizes greater than 16, depending on resolution. This is sub-optimal for batch normalization layers, and being able to increase the batch size by 3.8x (61 vs. 16 for a representative resolution) is quite significant. Orthogonal approaches to achieve this include model parallelism and distributed-memory batch normalization, which can be significantly more difficult to implement and have high communication costs. Furthermore, for MobileNet, Checkmate allows a batch size of 1105, which is 1.73x higher than the best baseline solution (a greedy heuristic) and 5.1x higher than common practice (checkpointing all activations). The same schedules can also be used to increase image resolution rather than batch size.
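Eq. (11) above is a one-line calculation. A sketch with made-up placeholder numbers (the 2 GiB fixed footprint and 64 MiB per-sample activation cost are illustrative assumptions, not measurements from the paper):

```python
GiB = 2 ** 30
MiB = 2 ** 20

def max_batch_size(gpu_ram, m_fixed, m_at_1):
    """Eq. (11): floor((gpu_ram - m_fixed) / (m_at_1 - m_fixed)), where
    m_at_1 - m_fixed is the per-sample activation memory of the strategy."""
    return (gpu_ram - m_fixed) // (m_at_1 - m_fixed)

# 16 GiB card, 2 GiB fixed (weights + gradients), 64 MiB activations/sample.
print(max_batch_size(16 * GiB, 2 * GiB, 2 * GiB + 64 * MiB))   # 224
```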
Table 2. Approximation ratios for baseline heuristics and our LP rounding strategy. Results are given as the geometric mean speedup of the optimal ILP across feasible budgets.

Model       AP √n   AP greedy   Griewank log n   Two-phase LP rounding
MobileNet   1.14x   1.07x       7.07x            1.06x
VGG16       1.28x   1.06x       1.44x            1.01x
VGG19       1.54x   1.39x       1.75x            1.00x
U-Net       1.27x   1.23x       -                1.03x
ResNet50    1.20x   1.25x       -                1.05x
6.5 How well can we approximate the optimal rematerialization policy?

To understand how well our LP rounding strategy (Section 5) approximates the ILP, we measure the ratio COST_approx / COST_opt, i.e. the speedup of the optimal schedule, in FLOPs. As in Section 6.3, we solve each strategy at a range of memory budgets, then compute the geometric mean of the ratio across budgets. The aggregated ratio is used because some budgets are feasible via the ILP but not via the approximations. Table 2 shows the results. The two-phase deterministic rounding approach has approximation factors close to optimal, at most 1.06x for all tested architectures.
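The aggregation used above is a plain geometric mean across feasible budgets; a minimal sketch:

```python
import math

def geomean(ratios):
    """Geometric mean of per-budget approximation ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# e.g. ratios of 1.0 and 4.0 aggregate to 2.0, not the arithmetic mean 2.5
```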
7 CONCLUSIONS

One of the main challenges when training large neural networks is the limited capacity of high-bandwidth memory on accelerators such as GPUs and TPUs. This has created a memory wall that limits the size of the models that can be trained. The bottleneck for state-of-the-art model development is now memory rather than data and compute availability, and we expect this trend to worsen in the future.

To address this challenge, we proposed a novel rematerialization algorithm which allows large models to be trained with limited available memory. Our method does not make the strong assumptions required in prior work, supporting general non-linear computation graphs such as residual networks and capturing the impact of non-uniform memory usage and computation cost throughout the graph with a hardware-aware, profile-guided cost model. We presented an ILP formulation for the problem, implemented the Checkmate system for optimal rematerialization in TensorFlow, and tested the proposed system on a range of neural network models. In evaluation, we find that optimal rematerialization has minimal computational overhead at a wide range of memory budgets, and we showed that Checkmate enables practitioners to train high-resolution models with significantly larger batch sizes. Finally, a novel two-phase rounding strategy closely approximates the optimal solver.
Evaluation: How well does two-phase rounding approximate the ILP?
Within 6% of optimal cost (geometric mean)
Solver speedup: 43x for ResNet50, 440x for MobileNet
Checkmate. Key ideas:
• GPU memory limits are preventing the development of new deep learning models.
• We present the first general solution for optimal and near-optimal graph rematerialization.
• The formulation supports arbitrary DAGs and is both hardware-aware and memory-aware.
• Integration with just one line of code.
Code and paper: checkmateai.github.io
Email me: [email protected]
Paras Jain, Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez