DSA: More Efficient Budgeted Pruning via Differentiable Sparsity Allocation
Xuefei Ning1*, Tianchen Zhao2*, Wenshuo Li1, Peng Lei2, Yu Wang1, Huazhong Yang1
1Department of Electronic Engineering, Tsinghua University
2Department of Electronic Engineering, Beihang University
ECCV 2020 Spotlight Presentation
2020/8/29
Reviewing Pruning

• Fine-grained pruning [Han et al., ICLR 2016]
  − Requires a specialized sparse accelerator
• Structured pruning: pruning structured subsets (e.g., kernels/channels)
  − Easy to accelerate
• Budgeted pruning: pruning under explicit budget control
  − e.g., FLOPs (1E9) / storage (2.7Mb)

[Figure: fine-grained parameter pruning vs. channel-wise structural pruning, both evaluated against a budget (1E9 FLOPs / 2.7Mb params); Han et al., "Deep Compression", ICLR 2016]
Reviewing Pruning

• Kernel pruning: remove redundant (less important) kernels
• Intra-layer decision: the importance criterion within a layer
  − ranges from a simple criterion (e.g., L1 norm) to self-designed criteria
• Inter-layer decision: the sparsity allocation, i.e., the per-layer prune ratios
  − usually uniform among layers (e.g., 50%/50%/50%), or self-designed (e.g., 60%/20%/80%)

[Figure: two pruning pipelines over a Conv 3x3 / Conv 1x1 / Conv 3x3 network, contrasting the intra-layer importance criterion with the inter-layer sparsity allocation]
Sparsity Allocation

• DSA focuses on the sparsity allocation method of the budgeted channel pruning problem
  − Intra-layer: a naive importance criterion (L1 norm)
  − Inter-layer: the focus of this work, allocating per-layer prune ratios (e.g., 60%/20%/80%) under a budget (e.g., 50% FLOPs)

[Figure: a Conv 3x3 / Conv 1x1 / Conv 3x3 network with per-layer prune ratios allocated under a 50% FLOPs budget]
Sparsity Allocation
• The traditional sparsity allocation method follows an iterative scheme
  − Layer sensitivity analysis: iteratively sample a prune strategy and test the pruned model
  − Discrete search of prune ratios
  − Multi-stage & inefficient

He, Yihui, et al. "AMC: AutoML for model compression and acceleration on mobile devices." ECCV 2018.
Liu, Ning, et al. "AutoCompress: An automatic DNN structured pruning framework for ultra-high compression rates." AAAI 2020.
Liu, Zechun, et al. "MetaPruning: Meta learning for automatic neural network channel pruning." ICCV 2019.

[Figure: the iterative scheme — a sampler proposes layer-wise prune ratios (e.g., [0.8, 0.7, …, 0.9]), each sampled pruned model is tested on the validation set for sensitivity analysis (10²~10³ inner iterations), then the determined ratios are used to prune & train the pre-trained original model]

How to improve the sparsity allocation's efficiency?
Efficient Pruning - DSA
1. Gradient-based sparsity allocation
   − Discrete search → continuous optimization
2. Pruning from scratch
   − Saves the pre-training cost

[Figure: computational cost comparison — the traditional flow ("Iterative & Time-consuming", 1~N outer iterations) runs pre-training (300 epochs), sparsity allocation (10²~10³ inner iterations), and weight optimization (100~200 epochs); the DSA flow ("End2End & Efficient") starts from a network with random initialization and performs differentiable sparsity allocation together with weight optimization to obtain the pruned network]
DSA Flow

1. Differentiable pruning to derive the prune ratios' gradients
2. ADMM-inspired optimization to solve the constrained optimization
3. Topological grouping to handle shortcut connections

[Figure: DSA pipeline — the original network is trained once; differentiable pruning yields the gradient ∂L_t/∂α as the sensitivity analysis, with the task loss as the target and the budget model's budget loss as the constraint; ADMM-inspired optimization with topological grouping outputs the layer-wise prune ratios (e.g., [0.8, 0.7, …, 0.9]) and the pruned network]
DSA Flow
[Figure: end-to-end training — starting from a randomly initialized model, training with differentiable pruning produces the gradients ∂L_t/∂α as the sensitivity analysis; e.g., prune ratios [0.4, 0.5, 0.3] plus the update [+0.1, -0.3, -0.2] give [0.5, 0.2, 0.1]; ADMM-inspired optimization with topological grouping continuously optimizes the ratios so that the budget model satisfies the budget constraint (1E9 FLOPs / 2.3Mb)]
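The budget model makes the FLOPs constraint differentiable in the prune ratios. Below is a minimal PyTorch sketch under assumed conventions (hypothetical layer shapes, keep ratio = 1 − prune ratio, and a simple conv FLOPs formula); the paper's actual budget model may differ:

```python
import torch

def conv_flops(c_in, c_out, k, h_out, w_out):
    # FLOPs of a dense k x k conv layer (multiply-accumulates).
    return c_in * c_out * k * k * h_out * w_out

def budget_model(keep_ratios, layer_shapes):
    """Differentiable FLOPs estimate as a function of per-layer channel
    keep ratios. `layer_shapes` holds hypothetical
    (c_in, c_out, k, h_out, w_out) tuples for a plain chain of convs."""
    total = 0.0
    prev_keep = torch.tensor(1.0)  # the first layer's input channels are all kept
    for keep, (c_in, c_out, k, h, w) in zip(keep_ratios, layer_shapes):
        # A conv layer's FLOPs scale with both its input keep ratio
        # (the previous layer's output) and its own output keep ratio.
        total = total + prev_keep * keep * conv_flops(c_in, c_out, k, h, w)
        prev_keep = keep
    return total

# Usage: the budget loss penalizes exceeding the FLOPs budget.
keep_ratios = torch.tensor([0.9, 0.9, 0.9], requires_grad=True)
shapes = [(3, 16, 3, 32, 32), (16, 32, 3, 16, 16), (32, 64, 3, 8, 8)]
flops = budget_model(keep_ratios, shapes)
budget = 0.5 * budget_model(torch.ones(3), shapes)  # 50% FLOPs budget
budget_loss = torch.relu(flops - budget)
budget_loss.backward()   # over budget here, so gradients push the keep ratios down
print(keep_ratios.grad)
```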
1. Differentiable Pruning
Pruning (forward) process:
• Inputs: weights w, prune ratio α
• Base importance b_i: L1 norm of the BN scale γ
• Indicator function f = Sigmoid ∘ Log with parameters β₁, β₂:
  f(b_i, β₁, β₂) = 1 / (1 + (b_i / β₁)^(−β₂))
  − β₁ is the soft threshold; β₂ is the steepness (controls the inexactness of pruning)
• Kernel-wise keeping probability: p_i = f(b_i, β₁, β₂)
• Outputs: mask m_i ~ Bernoulli(p_i), which should satisfy E[∑ m_i] = ∑ p_i = α·C (expectation condition)
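A minimal PyTorch sketch of this forward pass (hypothetical values; β₁ is assumed given here, and the next slide shows how it is determined):

```python
import torch

def soft_indicator(b, beta1, beta2):
    # f(b_i, beta1, beta2) = sigmoid(beta2 * log(b_i / beta1))
    #                      = 1 / (1 + (b_i / beta1) ** (-beta2))
    return torch.sigmoid(beta2 * torch.log(b / beta1))

# Hypothetical example: base importances b_i = |gamma_i| of a BN layer
# with C channels (gamma values made up for illustration).
C = 8
gamma = torch.randn(C)
b = gamma.abs() + 1e-6          # keep importances strictly positive

beta1 = b.median()              # soft threshold (assumed given here)
beta2 = 10.0                    # steepness: larger -> closer to hard 0/1 masks

p = soft_indicator(b, beta1, beta2)  # kernel-wise keeping probabilities p_i
m = torch.bernoulli(p)               # sampled mask m_i ~ Bernoulli(p_i)
print(p.sum() / C)                   # expected keep fraction E[sum m_i] / C
```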
1. Differentiable Pruning
1. Expectation condition: ∑ p_i = α·C
   − In every forward pass, the expectation condition should be satisfied; thus β₁ is determined by solving ∑ f(b_i, β₁, β₂) = α·C for β₁, with f(b_i, β₁, β₂) = 1 / (1 + (b_i / β₁)^(−β₂))
2. Pruning inexactness:
   − Eventually, the inexactness should go to 0
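Since ∑ p_i is monotonically decreasing in β₁, the expectation condition can be solved numerically. Below is a sketch using bisection (the paper's exact solver may differ):

```python
import torch

def solve_beta1(b, beta2, target_keep, iters=50):
    """Solve sum_i f(b_i, beta1, beta2) = target_keep for beta1 by bisection.
    The sum is monotonically decreasing in beta1, so bisection applies.
    (A sketch; the exact solver used in the paper may differ.)"""
    lo, hi = b.min() * 1e-3, b.max() * 1e3
    for _ in range(iters):
        mid = (lo + hi) / 2
        keep = torch.sigmoid(beta2 * torch.log(b / mid)).sum()
        if keep > target_keep:
            lo = mid   # threshold too low: too many channels kept
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical usage: keep alpha * C channels in expectation.
C, alpha, beta2 = 8, 0.5, 10.0
b = torch.rand(C) + 0.1
beta1 = solve_beta1(b, beta2, target_keep=alpha * C)
p = torch.sigmoid(beta2 * torch.log(b / beta1))
print(p.sum())  # ~ alpha * C
```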
1. Differentiable Pruning
Backward process:
• p_i = f(b_i, β₁, β₂)
• Backward through the random variable m_i:
  1) Monte Carlo estimated re-parametrization gradients
  2) Straight-through gradients
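A minimal sketch of the straight-through variant, assuming the common formulation where the hard sample is used in the forward pass and the backward pass treats the sampling as the identity:

```python
import torch

def sample_mask_st(p):
    """Straight-through Bernoulli sampling: the forward pass uses the
    hard sample m_i ~ Bernoulli(p_i); the backward pass treats dm/dp
    as 1, so gradients flow through p_i (and onward to the prune ratio)."""
    m_hard = torch.bernoulli(p)
    # (m_hard - p).detach() + p equals m_hard in value,
    # but its gradient w.r.t. p is the identity.
    return (m_hard - p).detach() + p

# Hypothetical usage inside a pruned layer:
p = torch.tensor([0.9, 0.2, 0.7], requires_grad=True)
m = sample_mask_st(p)
loss = (m * torch.tensor([1.0, 2.0, 3.0])).sum()  # stand-in task loss
loss.backward()
print(p.grad)  # straight-through gradients w.r.t. the keep probabilities
```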
2. ADMM-based Optimization
• Alternating Direction Method of Multipliers (ADMM): an iterative algorithm framework widely used to solve constrained convex optimization problems
  − Although our problem is non-convex and stochastic, we follow ADMM's methodology to solve it
• Inner optimization for z
• Dynamic regularization: adjusting the coefficient u_k
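A hedged sketch of an ADMM-style alternating scheme on a toy problem, assuming a standard variable-splitting formulation (α = z, with z confined to the budget-feasible set); the paper's exact splitting, inner solver for z, and coefficient adjustment may differ:

```python
import torch

# ADMM-style alternating updates for: min_a L_t(a)  s.t.  flops(a) <= B.
# Split the variable as a = z, confining z to {z : flops(z) <= B}.

def admm_step(a, z, u, task_grad, project, rho=1.0, lr=0.05):
    # a-step: gradient step on L_t(a) + (rho/2) * ||a - z + u||^2
    a = a - lr * (task_grad(a) + rho * (a - z + u))
    # z-step (inner optimization for z): project a + u onto the feasible set
    z = project(a + u)
    # u-step: dual update on the consensus constraint a = z
    u = u + (a - z)
    return a, z, u

# Toy problem: per-layer keep ratios a, a linear FLOPs model, budget B.
w = torch.tensor([4.0, 2.0, 1.0])                       # per-layer FLOPs weights
B = 3.0                                                 # FLOPs budget
task_grad = lambda a: -torch.tensor([1.0, 3.0, 2.0])    # task loss prefers keeping all

def project(z):
    z = z.clamp(0.01, 1.0)
    f = (w * z).sum()
    return z * min(1.0, (B / f).item())  # uniformly rescale into the budget

a = z = torch.full((3,), 0.9)
u = torch.zeros(3)
for _ in range(100):
    a, z, u = admm_step(a, z, u, task_grad, project)
print(a, (w * a).sum())  # keep ratios settle near the budget boundary
```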
Experiments

• ResNet-20, ResNet-56, ResNet-18, VGG-16 (CIFAR-10)
• ResNet-18, ResNet-50 (ImageNet)
• Comparable or even better results than current methods
Computing Efficiency

• Traditional 3-stage flow — up to 10 GPU hours in total
  − Pre-training (300 epochs / 3+ GPU hours)
  − Sparsity allocation (10²~10³ tests / about 3 GPU hours)
  − Weight optimization (100~200 epochs / 2~3 GPU hours)
• DSA end2end flow — only 3 GPU hours
  − No pre-training: starts from scratch
  − Sparsity allocation merged into weight optimization

[Figure: the same pipeline comparison as before, annotated with per-stage cost — the traditional flow is "Iterative & Time-consuming" (1~N outer iterations), while DSA is "End2End & Efficient"]
Thanks for listening!
https://arxiv.org/abs/2004.02164 https://github.com/walkerning/differentiable-sparsity-allocation