

Page 1:
Page 2:

DSA: More Efficient Budgeted Pruning via Differentiable Sparsity Allocation

Xuefei Ning1*, Tianchen Zhao2*, Wenshuo Li1, Peng Lei2, Yu Wang1, Huazhong Yang1

Department of Electronic Engineering, Tsinghua University1

Department of Electronic Engineering, Beihang University2

2020/8/29

ECCV 2020 Spotlight Presentation

[email protected], [email protected]

Page 3:

• Fine-grained pruning [Han et al., ICLR 2016]

− Requires a specialized sparse accelerator

• Structured pruning: pruning structured subsets (e.g., kernels/channels)

− Easy acceleration

• Budgeted pruning – pruning under explicit budget control

− e.g., FLOPs (1E9) / storage (2.7 Mb)

Reviewing Pruning


Han et al., "Deep Compression," ICLR 2016.

[Figure: fine-grained parameter pruning vs. channel-wise structured pruning, both measured against the budget (1E9 FLOPs / 2.7 Mb params)]

Page 4:

Reviewing Pruning


• Kernel pruning – remove redundant (less important) kernels

[Figure: the two pruning decisions, illustrated on a Conv 3x3 -> Conv 1x1 -> Conv 3x3 stack.
Intra-layer decision (which kernels to remove within a layer): a self-designed importance criterion vs. a simple criterion such as the L1 norm.
Inter-layer decision (sparsity allocation, i.e., the per-layer prune ratios): usually uniform among layers (e.g., 50% / 50% / 50%) vs. a self-designed allocation (e.g., 60% / 20% / 80%).]

Page 5:

Sparsity Allocation


• DSA focuses on the sparsity allocation step of the budgeted channel pruning problem

[Figure: DSA focuses on the inter-layer decision: allocating prune ratios (e.g., 60% / 20% / 80%) across the Conv 3x3 -> Conv 1x1 -> Conv 3x3 stack under a budget (50% FLOPs), while using a naïve intra-layer importance criterion (L1 norm).]

Page 6:

Sparsity Allocation


• Traditional sparsity allocation methods follow an iterative scheme

• Layer sensitivity analysis: iteratively sample a pruning strategy and test the pruned model

• Discrete search over prune ratios

• Multi-stage & inefficient

He, Yihui, et al. "AMC: AutoML for Model Compression and Acceleration on Mobile Devices." ECCV 2018.
Liu, Ning, et al. "AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates." AAAI 2020.
Liu, Zechun, et al. "MetaPruning: Meta Learning for Automatic Neural Network Channel Pruning." ICCV 2019.

[Diagram: the traditional iterative pipeline. Starting from a pre-trained original model, a sampler proposes discrete layer-wise prune ratios (e.g., [0.8, 0.7, …, 0.9]); the pruned model is tested on the validation set for sensitivity analysis; this inner loop runs 10^2~10^3 times before the final prune ratios are determined, followed by prune & train.]
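For illustration only (a hypothetical sketch, not any of the cited systems), the discrete sample-prune-test loop can be written as follows; `prune_fn`, `eval_fn`, and `flops_fn` are assumed callables supplied by the surrounding framework:

```python
# Hypothetical sketch of the traditional discrete search over prune ratios.
import random

def discrete_search(model, num_layers, budget_flops,
                    prune_fn, eval_fn, flops_fn,
                    choices=(0.1, 0.3, 0.5, 0.7, 0.9), iterations=500):
    """Sample layer-wise prune ratios, reject over-budget ones, keep the best."""
    best_ratios, best_acc = None, float("-inf")
    for _ in range(iterations):                      # inner loop, ~10^2-10^3 trials
        ratios = [random.choice(choices) for _ in range(num_layers)]
        pruned = prune_fn(model, ratios)             # e.g., L1-norm channel pruning
        if flops_fn(pruned) > budget_flops:          # budget check
            continue
        acc = eval_fn(pruned)                        # test on the validation set
        if acc > best_acc:
            best_ratios, best_acc = ratios, acc
    return best_ratios
```

Each trial requires pruning and evaluating (often also fine-tuning) the model, which is why the whole procedure is multi-stage and slow.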

How to Improve the Sparsity Allocation’s Efficiency?

Page 7:

Efficient Pruning - DSA


1. Gradient-based Sparsity Allocation
− Discrete search -> Continuous optimization

2. Pruning from Scratch
− Save the pre-training cost

[Diagram: computational cost comparison. Traditional pipeline ("Iterative & Time-consuming"): pre-training (300 epochs) -> sparsity allocation (10^2~10^3 inner iterations) -> weight optimization (100~200 epochs), repeated over 1~N outer iterations. DSA ("End2End & Efficient"): network with random initialization -> differentiable sparsity allocation together with weight optimization -> pruned network.]

Page 8:

DSA Flow
1. Differentiable Pruning to derive gradients w.r.t. the prune ratios
2. ADMM-inspired optimization to solve the constrained optimization problem
3. Topological Grouping to handle shortcut connections (a sketch follows below)
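As a purely illustrative sketch (the grouping rule and helpers below are assumptions, not the released code), topological grouping can be implemented by tying together layers whose outputs are summed by shortcut connections, so that each group keeps the same channels and shares one prune ratio:

```python
# Hypothetical sketch: group layers connected by element-wise additions
# (shortcuts), so each group shares a single prune ratio alpha.
def topological_groups(num_layers, shortcut_pairs):
    """shortcut_pairs: list of (i, j) layer indices whose outputs are added."""
    parent = list(range(num_layers))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for i, j in shortcut_pairs:
        union(i, j)                          # layers added together -> same group

    groups = {}
    for layer in range(num_layers):
        groups.setdefault(find(layer), []).append(layer)
    return list(groups.values())             # each group shares one prune ratio

# Example: a 6-layer net where layers 1, 3, 5 feed one residual addition chain.
print(topological_groups(6, [(1, 3), (3, 5)]))  # -> [[0], [1, 3, 5], [2], [4]]
```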


[Diagram: the DSA flow. The original network is trained once. Topological grouping defines the pruning units; differentiable pruning provides the gradient ∂L_t/∂α of the task loss w.r.t. the prune ratios (serving as sensitivity analysis); the task loss is the target and the budget loss, computed by a budget model, is the constraint; ADMM-inspired optimization uses these gradients to optimize the prune ratios (e.g., [0.8, 0.7, …, 0.9]) and yields the pruned network.]
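To make "budget loss as constraint" concrete, here is a minimal, hypothetical sketch of a differentiable budget model: the expected FLOPs of a plain convolution stack written as a function of per-layer keep ratios, so a budget violation can be penalized and back-propagated to the allocation variables (the layer description format and the hinge-style penalty are assumptions, not the paper's code):

```python
# Hypothetical differentiable budget model for a plain conv stack.
import torch

def expected_flops(keep, layers):
    """layers: list of dicts with in_ch, out_ch, kernel, out_h, out_w.
    Returns a scalar tensor; gradients flow back into `keep`."""
    total = keep.new_zeros(())
    prev_keep = keep.new_ones(())        # the network input channels are not pruned
    for k, layer in enumerate(layers):
        in_ch = layer["in_ch"] * prev_keep       # expected surviving input channels
        out_ch = layer["out_ch"] * keep[k]       # expected surviving output channels
        total = total + in_ch * out_ch * layer["kernel"] ** 2 * layer["out_h"] * layer["out_w"]
        prev_keep = keep[k]              # this layer's outputs feed the next layer
    return total

# keep[k] = expected fraction of output channels kept in layer k (1 - prune ratio).
keep = torch.tensor([0.9, 0.9, 0.9], requires_grad=True)
layers = [dict(in_ch=3,  out_ch=16, kernel=3, out_h=32, out_w=32),
          dict(in_ch=16, out_ch=32, kernel=3, out_h=16, out_w=16),
          dict(in_ch=32, out_ch=64, kernel=3, out_h=8,  out_w=8)]

budget = 0.5 * expected_flops(torch.ones(3), layers)              # e.g., 50% FLOPs budget
budget_loss = torch.relu(expected_flops(keep, layers) - budget)   # hinge penalty on violation
budget_loss.backward()                                            # gradients w.r.t. keep ratios
print(keep.grad)
```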

Page 9:

DSA Flow


[Diagram: one continuous optimization step. Starting from a randomly initialized model, end-to-end training with differentiable pruning yields gradients ∂L_t/∂α of the validation/task loss (sensitivity analysis) w.r.t. the current prune ratios, e.g., [0.4, 0.5, 0.3]; a gradient step of, e.g., [+0.1, -0.3, -0.2], processed by the ADMM-inspired optimization with topological grouping and the budget model (1E9 FLOPs / 2.3 Mb), gives updated prune ratios [0.5, 0.2, 0.1] that satisfy the constraint.]

Page 10:

1. Differentiable Pruning


Pruning (forward) process:
• Inputs: weights w, pruning ratio α
• Base importance bᵢ: L1 norm of the BN scale γ
• Indicator function f = Sigmoid ∘ Log with parameters β₁, β₂
• Kernel-wise keeping probability: pᵢ = f(bᵢ; β₁, β₂)
• Outputs: mask mᵢ ~ Bernoulli(pᵢ), which should satisfy $E[\sum_i m_i] = \sum_i p_i = \alpha \cdot C$ (expectation condition)

$f(b_i; \beta_1, \beta_2) = \dfrac{1}{1 + (b_i / \beta_1)^{-\beta_2}}$

β₁ acts as the soft threshold; β₂ as the steepness (controls the inexactness of pruning)
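A minimal PyTorch-style sketch of this forward pass (an illustration under the definitions above, not the released implementation; β₁ is treated as given here and is fixed by the expectation condition on the next slide):

```python
# Hypothetical sketch of the differentiable pruning forward pass.
import torch

def keep_prob(b, beta1, beta2):
    """Indicator f = Sigmoid ∘ Log: p_i = 1 / (1 + (b_i / beta1)^(-beta2))."""
    return torch.sigmoid(beta2 * torch.log(b / beta1))

def prune_forward(x, bn_gamma, beta1, beta2):
    """x: feature map (N, C, H, W); bn_gamma: BN scale per channel, shape (C,)."""
    b = bn_gamma.abs()                       # base importance: L1 norm of BN gamma
    p = keep_prob(b, beta1, beta2)           # kernel-wise keeping probability
    m = torch.bernoulli(p)                   # sampled channel mask m_i ~ Bernoulli(p_i)
    return x * m.view(1, -1, 1, 1), p, m     # mask the pruned channels
```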

Page 11:

1. Differentiable Pruning


1. Expectation condition: $\sum_i p_i = \alpha \cdot C$
In every forward pass, the expectation condition should be satisfied; thus β₁ is determined by solving this equation.

2. Pruning inexactness:
Eventually, the inexactness should go to 0 (as the steepness β₂ grows, f approaches a hard 0/1 threshold).

$f(b_i; \beta_1, \beta_2) = \dfrac{1}{1 + (b_i / \beta_1)^{-\beta_2}}$
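Since $\sum_i p_i$ is monotonic in β₁, one simple way to satisfy the expectation condition is a bisection search for β₁; this is an assumed illustration of "solving the equation", not necessarily the solver used in the paper:

```python
# Hypothetical sketch: solve sum_i p_i(beta1) = alpha * C for beta1 by bisection.
import torch

def keep_prob(b, beta1, beta2):
    return torch.sigmoid(beta2 * torch.log(b / beta1))

def solve_beta1(b, alpha, beta2, lo=1e-6, hi=1e3, iters=50):
    """sum_i p_i is monotonically decreasing in beta1, so bisection converges."""
    target = alpha * b.numel()
    for _ in range(iters):
        mid = (lo + hi) / 2
        if keep_prob(b, mid, beta2).sum() > target:
            lo = mid              # too many channels kept -> raise the threshold
        else:
            hi = mid
    return (lo + hi) / 2

b = torch.rand(64) + 1e-3                # base importances (e.g., |BN gamma|)
beta1 = solve_beta1(b, alpha=0.5, beta2=4.0)
print(keep_prob(b, beta1, 4.0).sum())    # ≈ 0.5 * 64
```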

Page 12:

1. Differentiable Pruning


Backward process:

pᵢ = f(bᵢ; β₁, β₂)

Backward through the random variable mᵢ:
1) Monte Carlo estimated re-parametrization gradients;
2) Straight-through gradients.
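A minimal sketch of the second option, a straight-through estimator for the Bernoulli mask (hypothetical, PyTorch autograd-function style; the slide lists it alongside Monte Carlo re-parametrization gradients):

```python
# Hypothetical straight-through estimator for the Bernoulli channel mask:
# forward samples m ~ Bernoulli(p); backward passes the gradient to p unchanged.
import torch

class BernoulliST(torch.autograd.Function):
    @staticmethod
    def forward(ctx, p):
        return torch.bernoulli(p)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output            # straight-through: d m / d p ≈ 1

p = torch.full((8,), 0.7, requires_grad=True)
m = BernoulliST.apply(p)              # discrete 0/1 mask in the forward pass
loss = (m * torch.randn(8)).sum()
loss.backward()                       # gradients reach p (and then beta1, beta2, alpha)
print(p.grad)
```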

Page 13:

2. ADMM-based Optimization


• Alternating Direction Method of Multipliers (ADMM): an iterative algorithm framework widely used to solve constrained convex optimization problems

− Although this problem is non-convex and stochastic, we follow ADMM's methodology to solve it

[Equations: inner optimization for z; dynamic regularization via the adjusting coefficient u]
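To make the ADMM pattern concrete, here is a toy, hypothetical sketch (not the paper's formulation): an auxiliary variable z carries the budget constraint, a dual variable u couples it to the allocation variables, and the three updates alternate. The task-loss proxy, linear budget model, and projection below are all stand-ins:

```python
# Hypothetical ADMM-style sketch: allocate per-layer keep ratios under a FLOPs budget.
import torch

def task_loss(alpha):
    # Toy proxy: pruning more (smaller keep ratio) hurts accuracy; layer 1 is most sensitive.
    sens = torch.tensor([1.0, 3.0, 1.5])
    return (sens * (1 - alpha) ** 2).sum()

def flops(alpha, per_layer=torch.tensor([4.0, 2.0, 1.0])):
    return (per_layer * alpha).sum()                 # toy linear budget model

def project_to_budget(z, budget):
    z = z.clamp(0.05, 1.0)                           # keep ratios stay in a sane range
    f = flops(z)
    return z if f <= budget else z * (budget / f)    # crude feasibility projection

alpha = torch.full((3,), 0.9, requires_grad=True)    # per-layer keep ratios
z = alpha.detach().clone()                           # auxiliary (feasible) copy
u = torch.zeros(3)                                   # scaled dual variable
rho, budget = 1.0, 0.5 * flops(torch.ones(3))        # 50% FLOPs budget

for _ in range(200):
    # alpha-step: task loss plus augmented term pulling alpha toward the feasible z - u
    loss = task_loss(alpha) + (rho / 2) * ((alpha - z + u) ** 2).sum()
    loss.backward()
    with torch.no_grad():
        alpha -= 0.05 * alpha.grad
        alpha.grad.zero_()
    # z-step: project onto the budget constraint; u-step: dual update
    z = project_to_budget(alpha.detach() + u, budget)
    u = u + alpha.detach() - z

print(alpha.detach(), flops(alpha.detach()))         # near-feasible, sensitivity-aware ratios
```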

Page 14:

ResNet-20, ResNet-56, ResNet-18, VGG-16 (CIFAR-10)

ResNet-18, ResNet-50 (ImageNet)

• Comparable to or even better than current methods

Experiments


Page 15:

Computing Efficiency


• Traditional 3-stage flow
− Pre-training
− Sparsity allocation (10^2~10^3 tests)
− Weight optimization

[Diagram: cost breakdown. Traditional pipeline ("Iterative & Time-consuming"): pre-training (300 epochs, 3+ GPU hours) -> sparsity allocation (10^2~10^3 inner iterations, about 3 GPU hours) -> weight optimization (100~200 epochs, 2-3 GPU hours), over 1~N outer iterations. DSA ("End2End & Efficient"): network with random initialization -> differentiable sparsity allocation together with weight optimization -> pruned network.]

• DSA end-to-end flow
− No pre-training: start from scratch
− Sparsity allocation merged into weight optimization
− Weight optimization

Traditional flow: up to 10 GPU hours. DSA end-to-end flow: only 3 GPU hours.

Page 16:


Thanks for listening!

https://arxiv.org/abs/2004.02164 https://github.com/walkerning/differentiable-sparsity-allocation