The Power of Stagewise Learning: From Support Vector Machine to Generative Adversarial Nets
Tianbao Yang
Department of Computer Science, The University of Iowa
Yang (CS@Uiowa) The Power of Stagewise Learning 1 / 44
Introduction
Outline
1 Introduction
2 Stagewise Learning for Convex Problems
3 Stagewise Learning for Non-Convex Problems
Gap between Practice vs Theory

[Figure: training error vs. epochs for SGD with stagewise step sizes (0.1, 0.01, 0.001) and the theoretical 1/√t decreasing step size.]
Gap between Practice vs Theory

[Figure: training and testing error vs. epochs for SGD with stagewise step sizes (0.1, 0.01, 0.001) and the theoretical 1/√t decreasing step size.]

Q1: Why does Stagewise Learning (SL) Converge Faster?
Q2: How to design better SL algorithms for DNN and Other problems?
The Evolution of Learning Methods

Over time, the complexity of learning methods has grown:
Convex problems: SVM, logistic regression, ... (Cortes & Vapnik, 1995)
Non-convex problems: Deep Neural Networks (DNN) (Krizhevsky et al., 2012) and Generative Adversarial Nets (GAN) (Goodfellow et al., 2014)
Big Data: Challenges and Opportunities

AlexNet for image classification (Krizhevsky et al., 2012): 1.2 million images
BigGAN for image generation (Brock et al., 2019): Google's JFT, 300 million images

Training on huge datasets becomes a bottleneck!
Learning a Predictive Model

input x ∈ R^d, label y
(x, y) is generated i.i.d.
predictive model: f(x) → y
Risk Minimization

    f* = arg min_{f∈F} R(f) := E_{x,y}[ℓ(f(x), y)]

F is a hypothesis class
loss function ℓ(z, y): measures the prediction error
R(f) is the risk of model f
Empirical Risk Minimization

Empirical Risk Minimization (Offline Learning)
Collect {(x1, y1), (x2, y2), ..., (xn, yn)}; find an approximate solution f to solve

    f*_n = arg min_{f∈F} R_n(f) := (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)   (Empirical Risk)
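As a concrete illustration, empirical risk minimization with a squared loss can be written in a few lines of NumPy; the synthetic dataset, dimensions, and noise level below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic i.i.d. data: y = <w_true, x> + noise (illustrative only)
n, d = 200, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

def empirical_risk(w, X, y):
    """R_n(w) = (1/n) * sum_i loss(f_w(x_i), y_i), with the squared loss."""
    return np.mean((X @ w - y) ** 2)

# Least squares minimizes the empirical risk for this model class
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
assert empirical_risk(w_hat, X, y) <= empirical_risk(w_true, X, y) + 1e-12
```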
Research Questions in Machine Learning

Iterative algorithms:
    f_{t+1} ← f_t + A(available information at iteration t)
T: total time for training (e.g., # of iterations)

1. How fast is learning? Faster training: training error.
2. How accurate is the learned model? Better generalization: testing error.
Shallow Model: Convex Methods

    x → f_w(x) = w^⊤ φ(x)

    min_{w∈R^D} (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i)
Deep Neural Networks: Non-Convex Methods

    x → f_w(x) = w_L ∘ σ(··· σ(w_3 ∘ σ(w_2 ∘ σ(w_1 ∘ x))))

    min_{w∈R^D} (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i)
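A minimal sketch of such a network's forward pass, assuming ReLU for σ and purely illustrative layer sizes:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dnn_forward(x, weights):
    """f_w(x) = w_L . sigma(... sigma(w_1 x)): alternate affine maps with a
    nonlinearity sigma (ReLU here); no nonlinearity after the last layer."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(0)
dims = [4, 8, 8, 1]  # illustrative layer sizes
weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
out = dnn_forward(rng.normal(size=4), weights)
```

The composition of nonlinearities is what makes the training objective non-convex in w.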
Stagewise Learning for Convex Problems
Outline
1 Introduction
2 Stagewise Learning for Convex Problems
3 Stagewise Learning for Non-Convex Problems
Convex Methods

Consider
    min_{w∈R^d} F(w) := E_{x,y∼P}[ℓ(w; x, y)]

F(w) is a convex function
SVM, logistic regression, least-squares, LASSO, etc.

Goal: for a sufficiently small ε > 0, find a solution w such that
    F(w) − min_{w∈R^d} F(w) ≤ ε
Stochastic Gradient Descent (Robbins & Monro, 1951)

    min_{w∈R^d} F(w) := E_{x,y∼P}[ℓ(w; x, y)]

Stochastic Gradient Descent (SGD) Method:
Sample (x_t, y_t) ∼ P
    w_{t+1} = w_t − η_t ∂ℓ(w_t; x_t, y_t)   (η_t: step size)
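The SGD update above fits in a few lines; a minimal sketch on synthetic least-squares data, where the dataset, step-size constant, and iteration count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares data: y_i = <w_star, x_i> (noiseless for clarity)
n, d = 1000, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star

def sgd(T, eta0=0.05):
    """SGD with the theoretical step size eta_t = eta0 / sqrt(t):
    sample (x_t, y_t), then w_{t+1} = w_t - eta_t * grad of the squared loss."""
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                 # sample (x_t, y_t) ~ P
        grad = (X[i] @ w - y[i]) * X[i]     # stochastic gradient
        w = w - (eta0 / np.sqrt(t)) * grad
    return w

w = sgd(5000)
```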
Slow Convergence of SGD

1. variance of the stochastic gradient
2. decreasing step size: standard theory prescribes η_t ∝ 1/√t
3. O(1/ε^2) iteration complexity (Nemirovski et al., 2009)
How to Improve the Convergence Speed?

Previous approaches:

Mini-batch SGD: sample multiple examples each iteration
• Pros: allows parallel speed-up
• Cons: cannot reduce the total time complexity

Making stronger assumptions (e.g., strong convexity, smoothness, using full gradients)
• Pros: speed-up for some families of problems
• Cons: the assumptions may not hold

Can we do better without imposing these strong assumptions? (ICML 2017)
Stagewise Stochastic Gradient (ICML 2017)

One-Stage SGD(w1, η, D, T):
for τ = 1, ..., T:
    w_{τ+1} = Proj_{‖w − w1‖2 ≤ D}[w_τ − η ∂ℓ(w_τ; x_{iτ}, y_{iτ})]   (projection onto a ball centered at w1)
Output: ŵ = Σ_{τ=1}^T w_τ / T
Stagewise Stochastic Gradient (ICML 2017)

SSGD:
Set η1, D1, and T
for k = 0, ..., K − 1 do
    w_{k+1} = One-Stage SGD(w_k, η_k, D_k, T)
    Set η_{k+1} = η_k/2, D_{k+1} = D_k/2
end for
Output: w_K
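A sketch of One-Stage SGD and the stagewise driver, assuming a toy absolute-loss regression problem (a sharp objective, so the growth condition holds); the stage count, initial step size, and ball radius are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star                              # noiseless toy data

def one_stage_sgd(w1, eta, D, T):
    """SGD on the absolute loss |x.w - y|, each iterate projected back
    onto the ball {w : ||w - w1||_2 <= D}; returns the averaged iterate."""
    w, w_sum = w1.copy(), np.zeros_like(w1)
    for _ in range(T):
        i = rng.integers(n)
        g = np.sign(X[i] @ w - y[i]) * X[i]  # subgradient of |x_i.w - y_i|
        w = w - eta * g
        r = np.linalg.norm(w - w1)           # projection onto the ball
        if r > D:
            w = w1 + (D / r) * (w - w1)
        w_sum += w
    return w_sum / T

def ssgd(K=8, eta=0.1, D=4.0, T=2000):
    """Stagewise driver: halve the step size and ball radius each stage."""
    w = np.zeros(d)
    for _ in range(K):
        w = one_stage_sgd(w, eta, D, T)
        eta, D = eta / 2, D / 2
    return w

w = ssgd()
```

The halving of both η and D is what lets the average iterate of each stage start the next stage much closer to the optimum.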
Theoretical Result: Faster Convergence

Growth/Sharpness Condition: ∃ c < ∞ and θ ∈ (0, 1] such that
    ‖w − w*‖2 ≤ c (F(w) − F(w*))^θ

[Figure: F(x) = |x| (θ = 1), |x|^{1.5} (θ = 2/3), and |x|^2 (θ = 1/2) near x = 0.]

Iteration complexity: O((1/ε^{2(1−θ)}) log(1/ε)) for SSGD vs. O(1/ε^2) for SGD
Machine Learning Problems Satisfy GC

SVM for high-dimensional data: θ = 1
    min_w (1/n) Σ_{i=1}^n max(0, 1 − y_i w^⊤ x_i) + λ‖w‖1
    O(log(1/ε)) vs. O(1/ε^2)

LASSO: θ = 1/2
    min_w (1/n) Σ_{i=1}^n (y_i − w^⊤ x_i)^2 + λ‖w‖1
    O(1/ε) vs. O(1/ε^2)

and many many more
Empirical Results: SSGD vs SGD

[Figure: log10(objective gap) vs. number of iterations for SGD and SSGD on four problems: robust loss + ℓ1 norm on million songs (n = 463,715) and E2006-tfidf (n = 16,087); hinge loss + ℓ1 norm on covtype (n = 581,012) and real-sim (n = 72,309).]
From Balanced Data to Imbalanced Data

Minimizing Error Rate is not a Good idea!
Optimization of Suitable Measures for Imbalanced Data

Maximize AUC (area under the ROC curve):
    AUC = Prob(score of + > score of −)
Maximize F-measure:
    F = 2 · Precision · Recall / (Precision + Recall)
etc.

These measures are non-decomposable over individual examples.
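Both measures can be computed directly from scores and predictions; a minimal sketch (the pairwise sum inside AUC is exactly what makes these measures non-decomposable over individual examples):

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """AUC = Prob(score of a random positive > score of a random negative),
    counting ties as 1/2. Note the sum runs over *pairs* of examples."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return np.mean((diff > 0) + 0.5 * (diff == 0))

def f_measure(y_true, y_pred):
    """F = 2 * Precision * Recall / (Precision + Recall), for 0/1 labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / np.sum(y_pred == 1)
    recall = tp / np.sum(y_true == 1)
    return 2 * precision * recall / (precision + recall)
```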
Our Contributions

Our approaches: low memory, low computation, low iteration complexity.

1. Fast Stochastic/Online AUC Maximization (ICML 2018)
• based on a zero-sum convex-concave game formulation
• stagewise learning for solving a convex-concave game
• improves complexity from O(1/ε^2) to O(1/ε)

2. Fast Stochastic/Online F-measure Maximization (NeurIPS 2018)
• decomposes into two tasks: learning a posterior probability and learning a threshold
• stagewise learning for optimizing the threshold faster
• improves complexity from O(1/ε^2) to O(1/ε)
Experiments: Stochastic AUC Maximization

[Figure: AUC vs. iteration for FSAUC, SOLAM, SOLAM_l1, and OPAUC on the w8a, ijcnn1, and real-sim datasets (positive-class proportions p = 2.97%, 9.49%, and 30.68%, respectively).]
Experiments: Stochastic F-measure Optimization
Stagewise Learning for Non-Convex Problems
Outline
1 Introduction
2 Stagewise Learning for Convex Problems
3 Stagewise Learning for Non-Convex Problems
Generative Adversarial Nets (GAN)

Generative modeling (density estimation) and sample generation.
(Slides courtesy of Ian Goodfellow's NIPS 2016 tutorial.)
Generative Adversarial Nets
Formulation of Generative Adversarial Nets

Zero-Sum Game Formulation (Goodfellow et al., 2014):
    min_G max_D E_{x∼p_data}[log D(x)] + E_{z∼p_random}[log(1 − D(G(z)))]

z: random noise; x: real data
G is the generator: G(z) is a fake image
D is the discriminator: D(x) is, e.g., the probability that x is a real image
Ideally, at a Nash equilibrium: p(G(z)) = p(x)
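For a fixed generator, Goodfellow et al. (2014) show the inner maximization is solved pointwise by D*(x) = p_data(x)/(p_data(x) + p_G(x)); a small numeric check on a grid, with two illustrative 1-D Gaussian densities standing in for p_data and the generator's density:

```python
import numpy as np

# Two illustrative 1-D densities on a grid: "real" data vs. generator output
xs = np.linspace(-5, 5, 2001)
dx = xs[1] - xs[0]
p_data = np.exp(-0.5 * xs**2) / np.sqrt(2 * np.pi)       # N(0, 1)
p_g = np.exp(-0.5 * (xs - 1) ** 2) / np.sqrt(2 * np.pi)  # N(1, 1)

def gan_value(D):
    """E_{x~p_data}[log D(x)] + E_{x~p_g}[log(1 - D(x))], on the grid."""
    eps = 1e-12  # guard against log(0)
    return np.sum(p_data * np.log(D + eps) * dx) + np.sum(p_g * np.log(1 - D + eps) * dx)

D_opt = p_data / (p_data + p_g)  # pointwise maximizer of the inner max
# Any other discriminator achieves a smaller (or equal) value:
for D_other in (np.full_like(xs, 0.5), np.clip(D_opt + 0.1, 0.0, 1.0)):
    assert gan_value(D_other) <= gan_value(D_opt) + 1e-9
```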
Non-Convex Non-Concave Games

    min_w max_u F(w, u) := E_{x∼p_data, z∼p_random}[ℓ(w, u; x, z)]

Non-convex w.r.t. w, non-concave w.r.t. u
Finding a Nash equilibrium is NP-hard
Existing studies are mostly heuristics carried over from convex-concave games
No convergence guarantee (could be divergent)

First convergence theory for finding nearly stationary points (stationary point: ∇F(w*, u*) = 0)
(Lin-Liu-Rafique-Y., arXiv 2018, presented at the NeurIPS 2018 SGO&ML workshop)
A Stagewise Learning Algorithm for GAN

for k = 0, ..., K − 1 do
    F_k(w, u) = F(w, u) + (λ/2)‖w − w_k‖^2 − (λ/2)‖u − u_k‖^2
    (w_{k+1}, u_{k+1}) = A(F_k, w_k, u_k, η_k, T_k)
end for

Each stage targets the proximal-point fixed point
    (w*, u*) = arg min_w max_u F(w, u) + (λ/2)‖w − w*‖^2 − (λ/2)‖u − u*‖^2

1. λ > 0 is an algorithmic regularization parameter
2. Different choices of A (e.g., Primal-Dual SGD, Adam)
3. Use a variational inequality for the analysis
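A toy illustration of why the proximal terms matter, using the bilinear game F(w, u) = w·u (an assumed stand-in, not a GAN): plain simultaneous gradient descent-ascent spirals outward, while running the same inner updates on the λ-regularized F_k converges to the stationary point (0, 0). Step sizes and stage counts are illustrative:

```python
# Toy min-max: F(w, u) = w * u. Plain gradient descent-ascent (GDA)
# diverges here, while the stagewise proximal-point scheme converges.

def gda(w, u, eta, steps, lam=0.0, w_c=0.0, u_c=0.0):
    """Gradient descent on w / ascent on u for
    F_k(w, u) = w*u + lam/2*(w - w_c)^2 - lam/2*(u - u_c)^2."""
    for _ in range(steps):
        gw = u + lam * (w - w_c)      # dF_k/dw
        gu = w - lam * (u - u_c)      # dF_k/du
        w, u = w - eta * gw, u + eta * gu
    return w, u

# Plain GDA: the iterate norm grows (divergence)
w, u = gda(1.0, 1.0, eta=0.1, steps=200)
diverged = w * w + u * u > 1.0 + 1e-6

# Stagewise: each stage solves the lam-regularized game around (w_k, u_k),
# i.e., A = GDA applied to F_k
wk, uk = 1.0, 1.0
for _ in range(30):
    wk, uk = gda(wk, uk, eta=0.1, steps=200, lam=1.0, w_c=wk, u_c=uk)
```

The λ-terms make each stage's subproblem strongly-convex-strongly-concave, so the inner loop contracts instead of spiraling.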
Experiments: Image Generation

[Figure: Inception Score vs. number of epochs for WGAN training with Adam(5d), Adam, and IPP-Adam on CIFAR10 and LSUN Bedroom data. Generalization performance.]
Why Does Stagewise Learning Improve Testing Error?

Optimizing deep neural networks (non-convex):
    min_{w∈R^d} F(w) := (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i)

[Figure: testing error vs. epochs for SGD with stagewise step sizes (0.1, 0.01, 0.001) and the 1/√t decreasing step size.]
A Stagewise Learning Algorithm for DNN

for k = 0, ..., K − 1 do
    F_k(w) = F(w) + (λ/2)‖w − w_k‖^2
    w_{k+1} = SGD(F_k, w_k, η_k, T_k)
    η_{k+1} = η_k/2
end for

1. λ ≥ 0: algorithmic regularization
2. the step size decreases geometrically across stages
3. this is a more general framework
4. convergence to a stationary point established in ICLR 2019
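A sketch of this stagewise scheme on a toy non-convex objective F(w) = (w^2 − 1)^2 / 4, with a noisy gradient oracle standing in for SGD over data; λ, the step sizes, and the stage lengths are illustrative:

```python
import random

random.seed(0)

def grad_F(w):
    """Gradient of the non-convex toy objective F(w) = (w^2 - 1)^2 / 4."""
    return w * (w * w - 1.0)

def stoch_grad(w):
    return grad_F(w) + random.gauss(0.0, 0.5)  # noisy gradient oracle

def stagewise_sgd(w0=3.0, lam=0.1, eta=0.05, K=10, T=2000):
    """Each stage runs SGD on F_k(w) = F(w) + lam/2 * (w - w_k)^2,
    then halves the step size (geometric decrease across stages)."""
    wk = w0
    for _ in range(K):
        w = wk
        for _ in range(T):
            g = stoch_grad(w) + lam * (w - wk)  # gradient of F_k
            w -= eta * g
        wk = w          # restart the next stage at the last solution
        eta /= 2
    return wk

w = stagewise_sgd()
```

Starting from w0 = 3, the iterates settle near the stationary point w = 1; the shrinking step size damps the gradient noise stage by stage.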
Why Does Stagewise Learning Improve Testing Error?

1. Explore a growth condition:
    μ‖w − w*‖2^2 ≤ F(w) − F(w*)
2. Explore an almost-convexity condition:
    (−∇F(w)^⊤(w* − w)) / (F(w) − F(w*)) ≥ γ > 0
3. Use stability for the generalization analysis:
    Testing Error = Training Error + Generalization Error

[Figure: training curve near a flat minimum, illustrating the growth condition with μ = 0.00027.]
Why Does Stagewise Learning Improve Testing Error?

Faster convergence of training error: O(1/(με)) vs. O(1/(μ^2 ε))
With the same number of iterations T = √n/μ, testing error comparison:
    O(1/(√n μ^{1/2})) for the stagewise step size vs. O(1/(√n μ^{3/2})) for the decreasing step size

First Theory for Explaining Stagewise Learning for DNN
(Y.-Yan-Yuan-Jin, arXiv 2018)
Better Stagewise Learning Algorithms

[Figure: error vs. epochs for three stagewise variants (V1, V2, V3).]

V1: standard, no algorithmic regularization, restart at the last solution
V2: algorithmic regularization, restart at the last solution
V3: algorithmic regularization, restart at the averaged solution
Conclusions

Stagewise Learning is Powerful
Theory: faster convergence for both convex and non-convex methods
Practice: SVM, AUC, F-measure, DNN, GAN
Open problems: e.g., Generalization of SL for GAN
Acknowledgements

Students: Yi Xu, Mingrui Liu, Xiaoxuan Zhang, Zhuoning Yuan, Yan Yan, Hassan Rafique
Collaborators: Qihang Lin (UIowa), Rong Jin (Alibaba Group)
Funding agency: NSF (CRII, Big Data)
Thank You!
Questions?
References

Brock, Andrew, Donahue, Jeff, and Simonyan, Karen. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.

Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20(3):273–297, September 1995.

Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), pp. 2672–2680, 2014.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1106–1114, 2012.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.