99
LIMITING THE DEEP LEARNING YIPING LU PEKING UNIVERSITY

LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

LIMITING THE DEEP LEARNINGYIPING LU PEKING UNIVERSITY

Page 2: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

DEEP LEARNING IS SUCCESSFUL, BUT…

GPU

PHD

Professor

PHD

Page 3: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

LARGER THE BETTER?

Test error

decreasing

Page 4: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

LARGER THE BETTER?

Test error

decreasing

Why Not Limiting To 𝒑

𝒏= +∞

Page 5: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

TWO LIMITING

Depth Limiting

Wid

th L

imit

Page 6: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

DEPTH LIMITING: ODE

[He et al. 2015]

[E. 2017] [Haber et al. 2017] [Lu et al. 2017]

[Chen et al. 2018]

Page 7: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

BEYOND FINITE LAYER NEURAL NETWORK

Going into

infinite layer

Differential Equation

Page 8: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

ALSO THEORETICAL GUARANTEED

Page 9: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

TRADITIONAL WISDOM IN DEEP LEARNING

?

Page 10: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

BECOMING HOT NOW…

NeurIPS2018 Best Paper Award

Page 11: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

BRIDGING CONTROL AND LEARNING

Feedback is the

learning loss

Neural Network

Optimization

Page 12: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

BRIDGING CONTROL AND LEARNING

Feedback is the

learning loss

Neural Network

Optimization

Interpretability? PDE-Net ICML2018

Page 13: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

BRIDGING CONTROL AND LEARNING

Feedback is the

learning loss

Neural Network

Optimization

A better model?LM-ResNet ICML2018

DURR ICLR2019

Macaroon submitted

Interpretability? PDE-Net ICML2018

Chang B, Meng L, Haber E, et al. Reversible

architectures for arbitrarily deep residual neural

networks[C]//Thirty-Second AAAI Conference

on Artificial Intelligence. 2018.

Tao Y, Sun Q, Du Q, et al. Nonlocal Neural

Networks, Nonlocal Diffusion and Nonlocal

Modeling[C]//Advances in Neural Information

Processing Systems. 2018: 496-506.

Page 14: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

BRIDGING CONTROL AND LEARNING

Feedback is the

learning loss

Neural Network

Optimization

A better model?

New Algorithm?

LM-ResNet ICML2018

DURR ICLR2019

Macaroon submitted

YOPO Submitted

Interpretability? PDE-Net ICML2018

An W, Wang H, Sun Q, et al. A pid controller approach for stochastic

optimization of deep networks[C]//Proceedings of the IEEE Conference

on Computer Vision and Pattern Recognition. 2018: 8522-8531.

Chen T Q, Rubanova Y, Bettencourt J, et al. Neural ordinary differential

equations[C]//Advances in Neural Information Processing Systems.

2018: 6571-6583.

Li Q, Hao S. An optimal control approach to deep learning and

applications to discrete-weight neural networks[J]. arXiv preprint

arXiv:1803.01299, 2018.

Chang B, Meng L, Haber E, et al. Multi-level residual networks from

dynamical systems view[J]. arXiv preprint arXiv:1710.10348, 2017.

Chang B, Meng L, Haber E, et al. Reversible

architectures for arbitrarily deep residual neural

networks[C]//Thirty-Second AAAI Conference

on Artificial Intelligence. 2018.

Tao Y, Sun Q, Du Q, et al. Nonlocal Neural

Networks, Nonlocal Diffusion and Nonlocal

Modeling[C]//Advances in Neural Information

Processing Systems. 2018: 496-506.

Page 15: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

HOW DIFFERENTIAL EQUATION VIEW HELPS

NEURAL NETWORK DESIGNINGPRINCIPLED NEURAL ARCHITECTURE DESIGN

Page 16: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

NEURAL NETWORK AS SOLVING ODES

Dynamic System Nueral Network

Continuous limit Numerical Approximation

WRN, ResNeXt, Inception-ResNet, PolyNet, SENet etc…… :

New scheme to Approximate the right hand side term

Why not change the way to discrete u_t?

Page 17: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

MULTISTEP ARCHITECTURE?

𝒙𝒏+𝟏 = 𝒙𝒏 + 𝒇(𝒙𝒏)

𝒙𝒕 = 𝒇(𝒙)

Page 18: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

MULTISTEP ARCHITECTURE?

𝒙𝒏+𝟏 = 𝒙𝒏 + 𝒇(𝒙𝒏)

𝒙𝒕 = 𝒇(𝒙) 𝒙𝒏+𝟏 = (𝟏 − 𝒌𝒏)𝒙𝒏 + 𝒌𝒏𝒙𝒏−𝟏 + 𝒇(𝒙𝒏)Linear Multi-step Scheme

Linear Multi-step Residual Network

Page 19: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

MULTISTEP ARCHITECTURE?

𝒙𝒏+𝟏 = 𝒙𝒏 + 𝒇(𝒙𝒏)

𝒙𝒕 = 𝒇(𝒙) 𝒙𝒏+𝟏 = (𝟏 − 𝒌𝒏)𝒙𝒏 + 𝒌𝒏𝒙𝒏−𝟏 + 𝒇(𝒙𝒏)Linear Multi-step Scheme

Linear Multi-step Residual Network

Only One More

Parameter

Page 20: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

EXPERIMENT

Page 21: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

PLOT THE MOMENTUM

Analysis by zero stability

Page 22: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

MODIFIED EQUATION VIEW

𝟏 + 𝒌𝒏 ሶ𝒖 + 𝟏 − 𝒌𝒏𝚫𝒕

𝟐ሷ𝒖𝒏 + 𝒐 𝚫𝒕𝟑 = 𝒇(𝒖)

Learn A Momentum

𝒙𝒏+𝟏 = (𝟏 − 𝒌𝒏)𝒙𝒏+𝒌𝒏𝒙𝒏−𝟏 + 𝚫𝐭𝒇(𝒙𝒏)

[Su et al. 2016] [Dong et al. 2017]

Page 23: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

UNDERSTANDING DROPOUT

Noise can avoid overfit? ሶ𝑋 𝑡 = 𝑓 𝑋 𝑡 , 𝑎 𝑡 + 𝑔(𝑋 𝑡 , 𝑡)𝑑𝐵𝑡 , 𝑋 0 = 𝑋0

The numerical scheme is only need

to be weak convergence!

𝑬𝒅𝒂𝒕𝒂(𝑙𝑜𝑠𝑠(𝑋 𝑇 )

Page 24: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

STOCHASTIC DEPTH AS AN EXAMPLE

𝒙𝒏+𝟏 = 𝒙𝒏 + 𝜼𝒏𝒇 𝒙

= 𝒙𝒏 + 𝑬𝜼𝒏𝒇 𝒙𝒏 + 𝜼𝒏 − 𝑬𝜼𝒏 𝒇(𝒙𝒏)

Page 25: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

Natural Language

Page 26: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

UNDERSTANDING NATURAL LANGUAGE PROCESSING

Using a Neural Network to extract the feature of document!

Emb Emb Emb Emb Emb

Deep Neural Network

y1 y2 y3 y4 y5

Idea: Consider every word in a document as a particle in the n-body system.

Page 27: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

TRANSFORMER AS A SPLITTING SCHEME

FFN: advection

Self-Attention: particle interaction

Label

[Vaswani et al.2017]

[Devlin et al.2017]

Page 28: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

A BETTER SPLITTING SCHEME

Page 29: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

RESULT

Page 30: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

Computer Vision

Page 31: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

ONE NOISE LEVEL ONE NET

Network 1 𝝈 = 𝟐𝟓

Page 32: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

ONE NOISE LEVEL ONE NET

Network 1 𝝈 = 𝟐𝟓

Network 2 𝝈 = 𝟑𝟓

Page 33: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

WE WANT

One Model

Page 34: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

WE ALSO WANT GENERALIZATION

BM3D DnCNN

Page 35: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

RETHINKING TRADITIONAL FILTERING APPROACH

Perona-Malik Equation

input

processing

Page 36: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

RETHINKING TRADITIONAL FILTERING APPROACH

Perona-Malik Equation

input

processing

Noisy

Page 37: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

RETHINKING TRADITIONAL FILTERING APPROACH

Perona-Malik Equation

input

processing

Noisy

Page 38: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

RETHINKING TRADITIONAL FILTERING APPROACH

Perona-Malik Equation

input

processing

Over-smoothNoisy

Page 39: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

MOVING ENDPOINT CONTROL VS FIXED ENDPOINT CONTROL

Control dynamic:

Physic Law

𝝉 is the time

arrives the moon

minන0

𝜏

𝑅 𝑤 𝑡 , 𝑡 𝑑𝑡

𝑠. 𝑡. ሶ𝑋 = 𝑓 𝑋 𝑡 , 𝑤 𝑡 ,

𝜏 is the first time that dynamic meets X

Page 40: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

DURR

Page 41: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

GENERALIZE TO UNSEEN NOISE LEVEL

DURR

DnCNN

Page 42: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

Physics

Page 43: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

PHYSICS DISCOVERY

Our Work

Page 44: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

CONV FILTERS AS DIFFERENTIAL OPERATORS

1

1 -4 1

1

Δ𝑢 = 𝑢𝑥𝑥 + 𝑢𝑦𝑦

Page 45: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

PDE-NET

Page 46: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

PDE-NET

Page 47: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

PDE-NET 2.0

- Constrain the function space

- Theoretical Recover Guarantee(Coming Soon)

- Symbolic Discovery

Page 48: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

ON GOING WORK

Shock Wave Architecture SearcHYufeiWang, Ziju Shen, Zichao Long, Bin Dong Learning to

Solve Conservation Laws

Page 49: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

HOW DIFFERENTIAL EQUATION VIEW HELPS

OPTIMIZATIONCONSIDERING THE NEURAL NETWORK STRUCTURE

Page 50: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

RELATED WORKS

Maximal Principle [Li et al. 2018a] [li et al.2018b]

Adjoin Method [Chen et al.2018]

Multigrid [Chang et al.2017]

Domain decomposition

Our Work: ODE captures the composition structure of neural network!

Page 51: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

TOWARDS DEEP LEARNING MODELS RESISTANT TO ADVERSARIAL

ATTACKS

Problem:

- More capacity and stronger adversaries decrease transferability. Always 10 times wider

- PGD training is expansive!

Can adversarial

training cheaper

Robust Optimization

Page 52: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

TAKE NEURAL NETWORK ARCHITECTURE INTO CONSIDERATION

Adversary updater

Black box

Previous Work

Heavy gradient calculation

Page 53: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

TAKE NEURAL NETWORK ARCHITECTURE INTO CONSIDERATION

Adversary updater

Adversary updater

Black box

Previous Work

YOPO

Heavy gradient calculation

Page 54: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

DIFFERENTIAL GAME

Player1

Player2Trajectory

Page 55: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

DIFFERENTIAL GAME

Goal

Player1

Player2Trajectory

Page 56: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

DIFFERENTIAL GAME

Trajectory

Goal

Player1

Player2

Network

Layer2Network

Layer1

Composition

Structure

Page 57: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

DECOUPLE TRAINING

Synthetic gradients [Jaderberg et al.2017]

Lifted Neural Network [Askari et al.2018] [Gu et al.2018] [li et al.2019]

Delayed Gradient [Huo et al.2018]

Block Coordinate Descent Approach [Lau et al. 2018]

Can Control perspective helps us to understand decoupling?

Our idea: Decouple the gradient back propagation with the adversary updating.

Page 58: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

𝒑𝒔𝒕

𝒙𝒔𝒕

First Iteration

𝒑𝒔𝟏

YOPO(YOU ONLY PROPAGATE ONCE)

Page 59: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

𝒑𝒔𝒕

𝒙𝒔𝒕

𝒑𝒔𝟏

First Iteration Other Iteration

copy

𝒑𝒔𝟏

YOPO(YOU ONLY PROPAGATE ONCE)

Page 60: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

𝒑𝒔𝒕

𝒙𝒔𝒕

𝒑𝒔𝟏

First Iteration Other Iteration

copy

𝒑𝒔𝟏

YOPO(YOU ONLY PROPAGATE ONCE)

Page 61: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

WHY DECOUPLING

Page 62: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

RESULT

Page 63: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

SO, WHAT IS INFINITE WIDE NEURAL NETWORK

Neural Network As particle system Linearized Neural Network

Rotskoff G M, Vanden-Eijnden E. Neural networks as interacting particle

systems: Asymptotic convexity of the loss landscape and universal

scaling of the approximation error[J]. arXiv preprint arXiv:1805.00915,

2018.

Mei S, Montanari A, Nguyen P M. A mean field view of the landscape of

two-layer neural networks[J]. Proceedings of the National Academy of

Sciences, 2018, 115(33): E7665-E7671.

Infinite wide neural network=Kernel Method

Radom Feature + Neural Tangent Kernel

Jacot A, Gabriel F, Hongler C. Neural tangent kernel: Convergence and

generalization in neural networks[C]//Advances in neural information

processing systems. 2018: 8571-8580.

Du S S, Lee J D, Li H, et al. Gradient descent finds global minima of deep

neural networks[J]. arXiv preprint arXiv:1811.03804, 2018.

Allen-Zhu Z, Li Y, Song Z. A convergence theory for deep learning via

over-parameterization[J]. arXiv preprint arXiv:1811.03962, 2018.

Page 64: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

SO, WHAT IS INFINITE WIDE NEURAL NETWORK

Neural Network As particle system Linearized Neural Network

Rotskoff G M, Vanden-Eijnden E. Neural networks as interacting particle

systems: Asymptotic convexity of the loss landscape and universal

scaling of the approximation error[J]. arXiv preprint arXiv:1805.00915,

2018.

Mei S, Montanari A, Nguyen P M. A mean field view of the landscape of

two-layer neural networks[J]. Proceedings of the National Academy of

Sciences, 2018, 115(33): E7665-E7671.

Infinite wide neural network=Kernel Method

Radom Feature + Neural Tangent Kernel

Jacot A, Gabriel F, Hongler C. Neural tangent kernel: Convergence and

generalization in neural networks[C]//Advances in neural information

processing systems. 2018: 8571-8580.

Du S S, Lee J D, Li H, et al. Gradient descent finds global minima of deep

neural networks[J]. arXiv preprint arXiv:1811.03804, 2018.

Allen-Zhu Z, Li Y, Song Z. A convergence theory for deep learning via

over-parameterization[J]. arXiv preprint arXiv:1811.03962, 2018.

Page 65: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

TWO DIFFERENTIAL EQUATION FOR DEEP LEARNING

Feature Space

Data Space

Neural ODE

Cla

ssifie

r

Nonlocal PDE

Dog

Cat

Evolving Kernel

Page 66: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

WHAT FEATURE DOES NONLOCAL PDE CAPTURES

Page 67: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

WHAT FEATURE DOES NONLOCAL PDE CAPTURES

Page 68: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

WHAT FEATURE DOES NONLOCAL PDE CAPTURES

[Osher et al.2016]

Harmonic Equation -> Dimensionality

Page 69: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

WHAT FEATURE DOES NONLOCAL PDE CAPTURES

[Osher et al.2016]

Harmonic Equation -> Dimensionality

Dimension optimization is not

enough

Page 70: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

WHAT FEATURE DOES NONLOCAL PDE CAPTURES

[Osher et al.2016]

Harmonic Equation -> Dimensionality

Dimension optimization is not

enough

Biharmonic Equation -> Curvature

Page 71: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

SEMI-SUPERVISED LEARNING

Page 72: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

IMAGE INPAINTING

Weighted Nonlocal Laplacian

PSNR=23.51dB,SSIM=0.62

Weighted Curvature Regularzation

PSNR=24.07dB,SSIM=0.68

Page 73: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

IMAGE INPAINTING

Page 74: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

GOING INTO NEURAL!

Page 75: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

LOW FREQUENCY FIRST

Multigrid Method [Xu 1996]

Inverse Scale Space [Scherzer et al. 2001] [Burfer et al. 2006]

Ridge Regression [Smale et al. 2007] [Yao et al. 2007]

Experiment In Neural Network [Rahaman et al.2018] [Xu et al 2019]

Page 76: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

TEACHER STUDENT OPTIMIZATION

Train

Page 77: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

TEACHER STUDENT OPTIMIZATION

Train

Page 78: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

TEACHER STUDENT OPTIMIZATION

Train

Train

Page 79: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

TEACHER STUDENT OPTIMIZATION

Train

Train

Dark Knowledge

Page 80: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

NO EARLY STOPPING, NO DISTILLATION

Teach

er te

st accu

racy

Page 81: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

NO EARLY STOPPING, NO DISTILLATION

Teach

er te

st accu

racy

Page 82: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

NO EARLY STOPPING, NO DISTILLATION

Teach

er te

st accu

racy

Page 83: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

NO EARLY STOPPING, NO DISTILLATION

Teach

er te

st accu

racy

Page 84: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

NO EARLY STOPPING, NO DISTILLATION

Stu

den

t test a

ccu

racy

Teach

er te

st accu

racy

Page 85: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

NO EARLY STOPPING, NO DISTILLATION

Stu

den

t test a

ccu

racy

Teach

er te

st accu

racy

Page 86: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

LABEL DENOISING

NOISY LABLEinterpolate

NOISY LABLEinterpolate

NOISY LABLEinterpolate

Page 87: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

THEORY CAN GUARANTEE OUR METHOD!

Separate

Clusterable Dataset

Width enough neural netwrok

Page 88: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

THEORY CAN GUARANTEE OUR METHOD!

Separate

Clusterable Dataset

Width enough neural netwrok

Ensures margin

Page 89: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

LABEL DENOISING

We use a extremely large

network for experiment!

Page 90: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

FUTURE WORK: META-LEARING

Aggregating AIR and GT Label

[Tanaka et al.2018] [Sun et al. 2018]

[Han et al.2018] [Han et al.2019]

Page 91: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

FUTURE WORK: META-LEARING

Aggregating AIR and GT Label

[Tanaka et al.2018] [Sun et al. 2018]

[Han et al.2018] [Han et al.2019]

Meta-learner

Aggregated signal

Supervise Signal: Validation set Robustness

Page 92: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

NOISY LABEL LEADS TO ADVERSARIAL EXAMPLE?

Page 93: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

NOISY LABEL LEADS TO ADVERSARIAL EXAMPLE?

Page 94: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

NOISY LABEL LEADS TO ADVERSARIAL EXAMPLE?

initialization

Good Lip Constant

Page 95: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

NOISY LABEL LEADS TO ADVERSARIAL EXAMPLE?

initialization

Clean label training

Good Lip Constant

Page 96: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

NOISY LABEL LEADS TO ADVERSARIAL EXAMPLE?

initialization

Clean label training

Noisy label training

Bad Lip Constant

Good Lip Constant

Page 97: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

TAKE HOME MESSAGE

Feature Space

Data Space

Neural ODE

Cla

ssifie

r

Nonlocal PDE

Dog

Cat

Evolving Kernel

Optimization

Architecture

Geometry Feature

Implicit

Regularization

Depth Limit Width Limit

Page 98: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

RESEARCH INTEREST

Geometry Of Data

Control View Of Deep Learning

Kernel Learning

PDEs On Graphs

Learning with limited data(noisy, semi-supervise, weak supervise)

Computational Photography, Image Processing, Graphics.

Page 99: LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

THANK YOU AND QUESTIONS?

Always looking forward to

cooperation opportunities

Contact: [email protected]

Lu Y, Zhong A, Li Q, Dong B. Beyond Finite Layer Neural Networks: Bridging Deep

Architectures and Numerical Differential Equations arXiv:1710.10121. ICML2018

Long Z, Lu Y, Ma X, Dong B. PDE-Net: Learning PDEs from Data arXiv:1710.09668.

ICML2018

Zhang S, Lu Y, Liu J, Dong B. Dynamically Unfolding Recurrent Restorer: A Moving

Endpoint Control Method for Image Restoration arXiv:1805.07709. ICLR2019

Long Z, Lu Y, Dong B. " PDE-Net 2.0: Learning PDEs from Data with A Numeric-

Symbolic Hybrid Deep Network“arXiv:1812.04426. Major Revision JCP.

Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, Bin Dong. You Only

Propogate Once: Accelerating Adversarial Training via Maximal Principle

arXiv:1905.00877

Yiping Lu, Di He, Zhuohan Li, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, Tie-

yan Liu. OreoNet: Understanding NLP from a neural ODE viewpoint. (Submitted)

Bin Dong, Haochen Ju, Yiping Lu, Zuoqiang Shi. CURE: Curvature Regularization

For Missing Data Recovery arXiv preprint arXiv:1901.09548, 2019.

Bin Dong, Jikai Hou, Yiping Lu, Zhihua Zhang. Distillation $\approx$ Early Stopping?

Extracting Knowledge Utilizing Anisotropic Information Retrieval.(Submitted)

Aoxiao Zhong Zhiqing Sun Zhanxing Zhu

Acknowledgement

Bin Dong Di He Xiaoshuai Zhang Liwei WangZhuohan LiJikai Hou

Tianyuan Zhang

Dinghuai Zhang

Quanzheng Li Zhihua Zhang