LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al

LIMITING THE DEEP LEARNINGYIPING LU PEKING UNIVERSITY

DEEP LEARNING IS SUCCESSFUL, BUT…

GPU

PHD

Professor

PHD

LARGER THE BETTER?

Test error

decreasing

LARGER THE BETTER?

Test error

decreasing

Why Not Limiting To 𝒑

𝒏= +∞

TWO LIMITING

Depth Limiting

Wid

th L

imit

DEPTH LIMITING: ODE

[He et al. 2015]

[E. 2017] [Haber et al. 2017] [Lu et al. 2017]

[Chen et al. 2018]

BEYOND FINITE LAYER NEURAL NETWORK

Going into

infinite layer

Differential Equation

ALSO THEORETICAL GUARANTEED

TRADITIONAL WISDOM IN DEEP LEARNING

?

BECOMING HOT NOW…

NeurIPS2018 Best Paper Award

BRIDGING CONTROL AND LEARNING

Feedback is the

learning loss

Neural Network

Optimization


Feedback is the

learning loss

Neural Network

Optimization

Interpretability? PDE-Net ICML2018


Feedback is the

learning loss

Neural Network

Optimization

A better model?LM-ResNet ICML2018

DURR ICLR2019

Macaroon submitted


Chang B, Meng L, Haber E, et al. Reversible

architectures for arbitrarily deep residual neural

networks[C]//Thirty-Second AAAI Conference

on Artificial Intelligence. 2018.

Tao Y, Sun Q, Du Q, et al. Nonlocal Neural

Networks, Nonlocal Diffusion and Nonlocal

Modeling[C]//Advances in Neural Information

Processing Systems. 2018: 496-506.


Feedback is the

learning loss

Neural Network

Optimization

A better model?

New Algorithm?

LM-ResNet ICML2018

DURR ICLR2019

Macaroon submitted

YOPO Submitted


An W, Wang H, Sun Q, et al. A pid controller approach for stochastic

optimization of deep networks[C]//Proceedings of the IEEE Conference

on Computer Vision and Pattern Recognition. 2018: 8522-8531.

Chen T Q, Rubanova Y, Bettencourt J, et al. Neural ordinary differential

equations[C]//Advances in Neural Information Processing Systems.

2018: 6571-6583.

Li Q, Hao S. An optimal control approach to deep learning and

applications to discrete-weight neural networks[J]. arXiv preprint

arXiv:1803.01299, 2018.

Chang B, Meng L, Haber E, et al. Multi-level residual networks from

dynamical systems view[J]. arXiv preprint arXiv:1710.10348, 2017.

Chang B, Meng L, Haber E, et al. Reversible

architectures for arbitrarily deep residual neural

networks[C]//Thirty-Second AAAI Conference

on Artificial Intelligence. 2018.

Tao Y, Sun Q, Du Q, et al. Nonlocal Neural

Networks, Nonlocal Diffusion and Nonlocal

Modeling[C]//Advances in Neural Information

Processing Systems. 2018: 496-506.

HOW DIFFERENTIAL EQUATION VIEW HELPS

NEURAL NETWORK DESIGNINGPRINCIPLED NEURAL ARCHITECTURE DESIGN

NEURAL NETWORK AS SOLVING ODES

Dynamic System Nueral Network

Continuous limit Numerical Approximation

WRN, ResNeXt, Inception-ResNet, PolyNet, SENet etc…… :

New scheme to Approximate the right hand side term

Why not change the way to discrete u_t?

MULTISTEP ARCHITECTURE?

𝒙𝒏+𝟏 = 𝒙𝒏 + 𝒇(𝒙𝒏)

𝒙𝒕 = 𝒇(𝒙)



𝒙𝒕 = 𝒇(𝒙) 𝒙𝒏+𝟏 = (𝟏 − 𝒌𝒏)𝒙𝒏 + 𝒌𝒏𝒙𝒏−𝟏 + 𝒇(𝒙𝒏)Linear Multi-step Scheme

Linear Multi-step Residual Network



𝒙𝒕 = 𝒇(𝒙) 𝒙𝒏+𝟏 = (𝟏 − 𝒌𝒏)𝒙𝒏 + 𝒌𝒏𝒙𝒏−𝟏 + 𝒇(𝒙𝒏)Linear Multi-step Scheme

Linear Multi-step Residual Network

Only One More

Parameter

EXPERIMENT

PLOT THE MOMENTUM

Analysis by zero stability

MODIFIED EQUATION VIEW

𝟏 + 𝒌𝒏 ሶ𝒖 + 𝟏 − 𝒌𝒏𝚫𝒕

𝟐ሷ𝒖𝒏 + 𝒐 𝚫𝒕𝟑 = 𝒇(𝒖)

Learn A Momentum

𝒙𝒏+𝟏 = (𝟏 − 𝒌𝒏)𝒙𝒏+𝒌𝒏𝒙𝒏−𝟏 + 𝚫𝐭𝒇(𝒙𝒏)

[Su et al. 2016] [Dong et al. 2017]

UNDERSTANDING DROPOUT

Noise can avoid overfit? ሶ𝑋 𝑡 = 𝑓 𝑋 𝑡 , 𝑎 𝑡 + 𝑔(𝑋 𝑡 , 𝑡)𝑑𝐵𝑡 , 𝑋 0 = 𝑋0

The numerical scheme is only need

to be weak convergence!

𝑬𝒅𝒂𝒕𝒂(𝑙𝑜𝑠𝑠(𝑋 𝑇 )

STOCHASTIC DEPTH AS AN EXAMPLE

𝒙𝒏+𝟏 = 𝒙𝒏 + 𝜼𝒏𝒇 𝒙

= 𝒙𝒏 + 𝑬𝜼𝒏𝒇 𝒙𝒏 + 𝜼𝒏 − 𝑬𝜼𝒏 𝒇(𝒙𝒏)

Natural Language

UNDERSTANDING NATURAL LANGUAGE PROCESSING

Using a Neural Network to extract the feature of document!

Emb Emb Emb Emb Emb

Deep Neural Network

y1 y2 y3 y4 y5

Idea: Consider every word in a document as a particle in the n-body system.

TRANSFORMER AS A SPLITTING SCHEME

FFN: advection

Self-Attention: particle interaction

Label

[Vaswani et al.2017]

[Devlin et al.2017]

A BETTER SPLITTING SCHEME

RESULT

Computer Vision

ONE NOISE LEVEL ONE NET

Network 1 𝝈 = 𝟐𝟓

ONE NOISE LEVEL ONE NET

Network 1 𝝈 = 𝟐𝟓

Network 2 𝝈 = 𝟑𝟓

WE WANT

One Model

WE ALSO WANT GENERALIZATION

BM3D DnCNN

RETHINKING TRADITIONAL FILTERING APPROACH

Perona-Malik Equation

input

processing



input

processing

Noisy



input

processing

Noisy



input

processing

Over-smoothNoisy

MOVING ENDPOINT CONTROL VS FIXED ENDPOINT CONTROL

Control dynamic:

Physic Law

𝝉 is the time

arrives the moon

minන0

𝜏

𝑅 𝑤 𝑡 , 𝑡 𝑑𝑡

𝑠. 𝑡. ሶ𝑋 = 𝑓 𝑋 𝑡 , 𝑤 𝑡 ,

𝜏 is the first time that dynamic meets X

DURR

GENERALIZE TO UNSEEN NOISE LEVEL

DURR

DnCNN

Physics

PHYSICS DISCOVERY

Our Work

CONV FILTERS AS DIFFERENTIAL OPERATORS

1

1 -4 1

1

Δ𝑢 = 𝑢𝑥𝑥 + 𝑢𝑦𝑦

PDE-NET

PDE-NET

PDE-NET 2.0

- Constrain the function space

- Theoretical Recover Guarantee(Coming Soon)

- Symbolic Discovery

ON GOING WORK

Shock Wave Architecture SearcHYufeiWang, Ziju Shen, Zichao Long, Bin Dong Learning to

Solve Conservation Laws

HOW DIFFERENTIAL EQUATION VIEW HELPS

OPTIMIZATIONCONSIDERING THE NEURAL NETWORK STRUCTURE

RELATED WORKS

Maximal Principle [Li et al. 2018a] [li et al.2018b]

Adjoin Method [Chen et al.2018]

Multigrid [Chang et al.2017]

Domain decomposition

Our Work: ODE captures the composition structure of neural network!

TOWARDS DEEP LEARNING MODELS RESISTANT TO ADVERSARIAL

ATTACKS

Problem:

- More capacity and stronger adversaries decrease transferability. Always 10 times wider

- PGD training is expansive!

Can adversarial

training cheaper

Robust Optimization

TAKE NEURAL NETWORK ARCHITECTURE INTO CONSIDERATION

Adversary updater

Black box

Previous Work

Heavy gradient calculation

TAKE NEURAL NETWORK ARCHITECTURE INTO CONSIDERATION

Adversary updater

Adversary updater

Black box

Previous Work

YOPO

Heavy gradient calculation

DIFFERENTIAL GAME

Player1

Player2Trajectory

DIFFERENTIAL GAME

Goal

Player1

Player2Trajectory

DIFFERENTIAL GAME

Trajectory

Goal

Player1

Player2

Network

Layer2Network

Layer1

Composition

Structure

DECOUPLE TRAINING

Synthetic gradients [Jaderberg et al.2017]

Lifted Neural Network [Askari et al.2018] [Gu et al.2018] [li et al.2019]

Delayed Gradient [Huo et al.2018]

Block Coordinate Descent Approach [Lau et al. 2018]

Can Control perspective helps us to understand decoupling?

Our idea: Decouple the gradient back propagation with the adversary updating.

𝒑𝒔𝒕

𝒙𝒔𝒕

First Iteration

𝒑𝒔𝟏

YOPO(YOU ONLY PROPAGATE ONCE)

𝒑𝒔𝒕

𝒙𝒔𝒕

𝒑𝒔𝟏

First Iteration Other Iteration

copy

𝒑𝒔𝟏


𝒑𝒔𝒕

𝒙𝒔𝒕

𝒑𝒔𝟏

First Iteration Other Iteration

copy

𝒑𝒔𝟏


WHY DECOUPLING

RESULT

SO, WHAT IS INFINITE WIDE NEURAL NETWORK

Neural Network As particle system Linearized Neural Network

Rotskoff G M, Vanden-Eijnden E. Neural networks as interacting particle

systems: Asymptotic convexity of the loss landscape and universal

scaling of the approximation error[J]. arXiv preprint arXiv:1805.00915,

2018.

Mei S, Montanari A, Nguyen P M. A mean field view of the landscape of

two-layer neural networks[J]. Proceedings of the National Academy of

Sciences, 2018, 115(33): E7665-E7671.

Infinite wide neural network=Kernel Method

Radom Feature + Neural Tangent Kernel

Jacot A, Gabriel F, Hongler C. Neural tangent kernel: Convergence and

generalization in neural networks[C]//Advances in neural information

processing systems. 2018: 8571-8580.

Du S S, Lee J D, Li H, et al. Gradient descent finds global minima of deep

neural networks[J]. arXiv preprint arXiv:1811.03804, 2018.

Allen-Zhu Z, Li Y, Song Z. A convergence theory for deep learning via

over-parameterization[J]. arXiv preprint arXiv:1811.03962, 2018.

SO, WHAT IS INFINITE WIDE NEURAL NETWORK

Neural Network As particle system Linearized Neural Network

Rotskoff G M, Vanden-Eijnden E. Neural networks as interacting particle

systems: Asymptotic convexity of the loss landscape and universal

scaling of the approximation error[J]. arXiv preprint arXiv:1805.00915,

2018.

Mei S, Montanari A, Nguyen P M. A mean field view of the landscape of

two-layer neural networks[J]. Proceedings of the National Academy of

Sciences, 2018, 115(33): E7665-E7671.

Infinite wide neural network=Kernel Method

Radom Feature + Neural Tangent Kernel

Jacot A, Gabriel F, Hongler C. Neural tangent kernel: Convergence and

generalization in neural networks[C]//Advances in neural information

processing systems. 2018: 8571-8580.

Du S S, Lee J D, Li H, et al. Gradient descent finds global minima of deep

neural networks[J]. arXiv preprint arXiv:1811.03804, 2018.

Allen-Zhu Z, Li Y, Song Z. A convergence theory for deep learning via

over-parameterization[J]. arXiv preprint arXiv:1811.03962, 2018.

TWO DIFFERENTIAL EQUATION FOR DEEP LEARNING

Feature Space

Data Space

Neural ODE

Cla

ssifie

r

Nonlocal PDE

Dog

Cat

Evolving Kernel

WHAT FEATURE DOES NONLOCAL PDE CAPTURES



[Osher et al.2016]

Harmonic Equation -> Dimensionality


[Osher et al.2016]


Dimension optimization is not

enough


[Osher et al.2016]


Dimension optimization is not

enough

Biharmonic Equation -> Curvature

SEMI-SUPERVISED LEARNING

IMAGE INPAINTING

Weighted Nonlocal Laplacian

PSNR=23.51dB,SSIM=0.62

Weighted Curvature Regularzation

PSNR=24.07dB,SSIM=0.68

IMAGE INPAINTING

GOING INTO NEURAL!

LOW FREQUENCY FIRST

Multigrid Method [Xu 1996]

Inverse Scale Space [Scherzer et al. 2001] [Burfer et al. 2006]

Ridge Regression [Smale et al. 2007] [Yao et al. 2007]

Experiment In Neural Network [Rahaman et al.2018] [Xu et al 2019]

TEACHER STUDENT OPTIMIZATION

Train


Train


Train

Train


Train

Train

Dark Knowledge

NO EARLY STOPPING, NO DISTILLATION

Teach

er te

st accu

racy


Teach

er te

st accu

racy


Teach

er te

st accu

racy


Teach

er te

st accu

racy


Stu

den

t test a

ccu

racy

Teach

er te

st accu

racy


Stu

den

t test a

ccu

racy

Teach

er te

st accu

racy

LABEL DENOISING

NOISY LABLEinterpolate



THEORY CAN GUARANTEE OUR METHOD!

Separate

Clusterable Dataset

…

Width enough neural netwrok

THEORY CAN GUARANTEE OUR METHOD!

Separate

Clusterable Dataset

…

Width enough neural netwrok

Ensures margin

LABEL DENOISING

We use a extremely large

network for experiment!

FUTURE WORK: META-LEARING

Aggregating AIR and GT Label

[Tanaka et al.2018] [Sun et al. 2018]

[Han et al.2018] [Han et al.2019]

FUTURE WORK: META-LEARING

Aggregating AIR and GT Label

[Tanaka et al.2018] [Sun et al. 2018]

[Han et al.2018] [Han et al.2019]

Meta-learner

Aggregated signal

Supervise Signal: Validation set Robustness

NOISY LABEL LEADS TO ADVERSARIAL EXAMPLE?



initialization

Good Lip Constant


initialization

Clean label training

Good Lip Constant


initialization

Clean label training

Noisy label training

Bad Lip Constant

Good Lip Constant

TAKE HOME MESSAGE

Feature Space

Data Space

Neural ODE

Cla

ssifie

r

Nonlocal PDE

Dog

Cat

Evolving Kernel

Optimization

Architecture

Geometry Feature

Implicit

Regularization

Depth Limit Width Limit

RESEARCH INTEREST

Geometry Of Data

Control View Of Deep Learning

Kernel Learning

PDEs On Graphs

Learning with limited data(noisy, semi-supervise, weak supervise)

Computational Photography, Image Processing, Graphics.

THANK YOU AND QUESTIONS?

Always looking forward to

cooperation opportunities

Contact: [email protected]

Lu Y, Zhong A, Li Q, Dong B. Beyond Finite Layer Neural Networks: Bridging Deep

Architectures and Numerical Differential Equations arXiv:1710.10121. ICML2018

Long Z, Lu Y, Ma X, Dong B. PDE-Net: Learning PDEs from Data arXiv:1710.09668.

ICML2018

Zhang S, Lu Y, Liu J, Dong B. Dynamically Unfolding Recurrent Restorer: A Moving

Endpoint Control Method for Image Restoration arXiv:1805.07709. ICLR2019

Long Z, Lu Y, Dong B. " PDE-Net 2.0: Learning PDEs from Data with A Numeric-

Symbolic Hybrid Deep Network“arXiv:1812.04426. Major Revision JCP.

Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, Bin Dong. You Only

Propogate Once: Accelerating Adversarial Training via Maximal Principle

arXiv:1905.00877

Yiping Lu, Di He, Zhuohan Li, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, Tie-

yan Liu. OreoNet: Understanding NLP from a neural ODE viewpoint. (Submitted)

Bin Dong, Haochen Ju, Yiping Lu, Zuoqiang Shi. CURE: Curvature Regularization

For Missing Data Recovery arXiv preprint arXiv:1901.09548, 2019.

Bin Dong, Jikai Hou, Yiping Lu, Zhihua Zhang. Distillation $\approx$ Early Stopping?

Extracting Knowledge Utilizing Anisotropic Information Retrieval.(Submitted)

Aoxiao Zhong Zhiqing Sun Zhanxing Zhu

Acknowledgement

Bin Dong Di He Xiaoshuai Zhang Liwei WangZhuohan LiJikai Hou

Tianyuan Zhang

Dinghuai Zhang

Quanzheng Li Zhihua Zhang

Documents

LIMITINGTHE DEEP LEARNING - Stanford Universityyplu/limitingDeepLearning.pdfon Computer Vision and Pattern Recognition. 2018: 8522-8531. Chen T Q, Rubanova Y, Bettencourt J, et al