
The Power of Stagewise Learning: From Support Vector Machine to Generative Adversarial Nets

Tianbao Yang

Department of Computer Science, The University of Iowa


Introduction

Outline

1 Introduction

2 Stagewise Learning for Convex Problems

3 Stagewise Learning for Non-Convex Problems


Gap between Practice and Theory

[Figure: error vs. number of epochs for SGD with step sizes 0.1, 0.01, 0.001 and the 1/√t schedule used by theory.]


Gap between Practice and Theory

[Figures: error vs. number of epochs (two panels) for SGD with step sizes 0.1, 0.01, 0.001 and the 1/√t schedule used by theory.]

Q1: Why does Stagewise Learning (SL) Converge Faster?

Q2: How to design better SL algorithms for DNNs and other problems?


The Evolution of Learning Methods

[Timeline: complexity of learning methods over time]

Convex Problems: SVM, logistic regression, ... (Cortes & Vapnik, 1995)

Non-Convex Problems: Deep Neural Networks (DNN) (Krizhevsky et al., 2012); Generative Adversarial Nets (GAN) (Goodfellow et al., 2014)


Big Data: Challenges and Opportunities

AlexNet for Image Classification (Krizhevsky et al., 2012): 1.2 million images

BigGAN for Image Generation (Brock et al., 2019): Google's JFT, 300 million images

Training on huge datasets becomes a bottleneck!


Learning a Predictive Model

input x ∈ R^d, output y

(x, y) is generated i.i.d.

predictive model: f(x) → y


Risk Minimization

f∗ = argmin_{f∈F} R(f) := E_{x,y}[ℓ(f(x), y)]    (risk of model f)

F is a hypothesis class

loss function ℓ(z, y): measures the prediction error


Empirical Risk Minimization

Empirical Risk Minimization (Offline Learning)

Collect {(x1, y1), (x2, y2), . . . , (xn, yn)}. Find an approximate solution f to solve

f∗_n = argmin_{f∈F} R_n(f) := (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i)    (empirical risk)
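As a concrete illustration, here is a minimal sketch of computing the empirical risk R_n for a linear model; the logistic loss and the toy data are illustrative assumptions, not the specific setup used later in the talk.

```python
import numpy as np

def logistic_loss(z, y):
    # ell(z, y) = log(1 + exp(-y * z)) for labels y in {-1, +1} (illustrative choice of loss)
    return np.log1p(np.exp(-y * z))

def empirical_risk(w, X, Y):
    # R_n(w) = (1/n) * sum_i ell(w^T x_i, y_i)
    return np.mean(logistic_loss(X @ w, Y))

# toy data: n = 100 examples in d = 5 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = np.sign(rng.normal(size=100))
print(empirical_risk(np.zeros(5), X, Y))  # equals log(2) at w = 0
```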


Research Questions in Machine Learning

Iterative Algorithms:

f_{t+1} ← f_t + A(available information at iteration t)

T: total time for training (e.g., # of iterations)

1. How fast is learning? Faster Training: Training Error

2. How accurate is the learned model? Better Generalization: Testing Error


Shallow Model: Convex Methods

x → f_w(x) = w^⊤ φ(x)

min_{w∈R^D} (1/n) ∑_{i=1}^n ℓ(w; x_i, y_i)


Deep Neural Networks: Non-Convex Methods

x → f_w(x) = w_L ◦ σ(· · · σ(w_3 ◦ σ(w_2 ◦ σ(w_1 ◦ x))))

min_{w∈R^D} (1/n) ∑_{i=1}^n ℓ(w; x_i, y_i)
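For comparison, a minimal numpy sketch of the two model classes: the shallow linear model w^⊤φ(x) and the deep composition of layers with an elementwise nonlinearity σ. The layer sizes and the choice of ReLU for σ are illustrative assumptions.

```python
import numpy as np

def shallow_model(w, phi_x):
    # f_w(x) = w^T phi(x): the objective is convex in w for a convex loss
    return w @ phi_x

def deep_model(weights, x):
    # f_w(x) = w_L(sigma(... sigma(w_1 x) ...)): non-convex in the weights
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)  # sigma = ReLU (illustrative choice)
    return weights[-1] @ h

x = np.ones(4)
layers = [np.ones((3, 4)), np.ones((2, 3)), np.ones((1, 2))]
print(shallow_model(np.ones(4), x), deep_model(layers, x))
```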


Stagewise Learning for Convex Problems

Outline

1 Introduction

2 Stagewise Learning for Convex Problems

3 Stagewise Learning for Non-Convex Problems


Convex Methods

Consider

min_{w∈R^d} F(w) := E_{x,y∼P}[ℓ(w; x, y)]

F(w) is a convex function

SVM, Logistic regression, Least-squares, LASSO, etc.

Goal: For a sufficiently small ε > 0, find a solution w such that

F(w) − min_{w∈R^d} F(w) ≤ ε


Stochastic Gradient Descent (Robbins & Monro, 1951)

min_{w∈R^d} F(w) := E_{x,y∼P}[ℓ(w; x, y)]

Stochastic Gradient Descent (SGD) Method

Sample (x_t, y_t) ∼ P
w_{t+1} = w_t − η_t ∂ℓ(w_t; x_t, y_t), where η_t is the step size
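A minimal sketch of this update with the theoretically prescribed η_t ∝ 1/√t step size; the stochastic gradient oracle and the toy least-squares example are placeholders, not the problems studied later.

```python
import numpy as np

def sgd(grad_oracle, w0, eta0, T, rng):
    # Plain SGD: w_{t+1} = w_t - eta_t * g_t with eta_t = eta0 / sqrt(t)
    w = w0.copy()
    for t in range(1, T + 1):
        g = grad_oracle(w, rng)          # stochastic (sub)gradient at the current w
        w = w - (eta0 / np.sqrt(t)) * g  # decreasing step size prescribed by theory
    return w

def toy_oracle(w, rng):
    # gradient of the squared loss on a freshly sampled (x, y) pair; "true" model is all-ones
    x = rng.normal(size=w.shape)
    y = x.sum()
    return 2.0 * (w @ x - y) * x

w_hat = sgd(toy_oracle, np.zeros(5), eta0=0.1, T=10000, rng=np.random.default_rng(0))
```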


Slow Convergence of SGD

1 variance of stochastic gradient

2 decreasing step size: standard theory ηt ∝ 1/√t

3 O(1/ε²) iteration complexity (Nemirovski et al., 2009)


How to Improve the Convergence Speed?

Previous approaches

Mini-batch SGD: sample multiple examples each iteration
• Pros: allows parallel speed-up
• Cons: cannot reduce the total time complexity

Making stronger assumptions, e.g., strong convexity, smoothness, using full gradients
• Pros: speed-up for some families of problems
• Cons: the assumptions may not hold

Can we do better without imposing these strong assumptions? (ICML 2017)


Stagewise Stochastic Gradient (ICML 2017)

One-Stage SGD(w_1, η, D, T)

for τ = 1, . . . , T:
    w_{τ+1} = Proj_{‖w − w_1‖_2 ≤ D}[ w_τ − η ∂ℓ(w_τ; x_{i_τ}, y_{i_τ}) ]    (projection onto a ball around w_1)

Output: w̄ = ∑_{τ=1}^T w_τ / T
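A minimal sketch of one stage, assuming a generic stochastic subgradient oracle and Euclidean projection onto the ball {w : ‖w − w_1‖_2 ≤ D}; names and signatures are illustrative.

```python
import numpy as np

def project_ball(w, center, D):
    # Euclidean projection onto {w : ||w - center||_2 <= D}
    diff = w - center
    norm = np.linalg.norm(diff)
    return w if norm <= D else center + (D / norm) * diff

def one_stage_sgd(grad_oracle, w1, eta, D, T, rng):
    # Fixed step size eta inside the stage; return the averaged iterate
    w, w_sum = w1.copy(), np.zeros_like(w1)
    for _ in range(T):
        g = grad_oracle(w, rng)
        w = project_ball(w - eta * g, w1, D)
        w_sum += w
    return w_sum / T
```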


Stagewise Stochastic Gradient (ICML 2017)

SSGD: set η_1, D_1 and T

for k = 0, . . . , K − 1 do
    w_{k+1} = One-Stage SGD(w_k, η_k, D_k, T)
    Set η_{k+1} = η_k / 2, D_{k+1} = D_k / 2
end for

Output: w_K
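The stagewise driver can then be sketched on top of the one_stage_sgd routine above: each stage restarts from the previous stage's averaged solution and halves both the step size and the ball radius.

```python
def ssgd(grad_oracle, w0, eta1, D1, T, K, rng):
    # Stagewise SGD: restart each stage from the averaged solution of the previous stage
    w, eta, D = w0, eta1, D1
    for _ in range(K):
        w = one_stage_sgd(grad_oracle, w, eta, D, T, rng)
        eta, D = eta / 2.0, D / 2.0  # geometric decrease per stage
    return w
```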


Theoretical Result: Faster Convergence

Growth/Sharpness Condition: ∃ c < ∞ and θ ∈ (0, 1] such that

‖w − w∗‖_2 ≤ c (F(w) − F(w∗))^θ

[Figure: F(x) = |x| (θ = 1), |x|^1.5 (θ = 2/3), |x|^2 (θ = 1/2) near x = 0.]

Iteration complexity: SSGD O(1/ε^{2(1−θ)} · log(1/ε)) vs. SGD O(1/ε²)


Machine Learning Problems that Satisfy the Growth Condition (GC)

SVM for high-dimensional data: θ = 1

min_w (1/n) ∑_{i=1}^n max(0, 1 − y_i w^⊤ x_i) + λ‖w‖_1

O(log(1/ε)) vs O(1/ε²)

LASSO: θ = 1/2

min_w (1/n) ∑_{i=1}^n (y_i − w^⊤ x_i)² + λ‖w‖_1

O(1/ε) vs O(1/ε²)

and many more


Empirical Results: SSGD vs SGD

[Figures: log10(objective gap) vs. number of iterations for SGD and SSGD on four problems: robust loss + ℓ1 norm on million songs (n = 463,715) and E2006-tfidf (n = 16,087); hinge loss + ℓ1 norm on covtype (n = 581,012) and real-sim (n = 72,309).]


From Balanced Data to Imbalanced Data

Minimizing Error Rate is not a Good idea!


Optimization of Suitable Measures for Imbalanced Data

maximize AUC (area under ROC curve):

AUC = Prob.(score of + > score of −)

maximize F-measure

F = (2 · Precision · Recall) / (Precision + Recall)

etc

Non-Decomposable over individual examples
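To see what non-decomposability means, a small sketch that computes AUC (the empirical probability that a positive scores above a negative) and the F-measure from an entire set of predictions; neither reduces to an average of per-example losses. The scores, labels, and threshold here are toy values.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    # Prob.(score of + > score of -), counting ties as 1/2
    diffs = scores_pos[:, None] - scores_neg[None, :]
    return np.mean((diffs > 0) + 0.5 * (diffs == 0))

def f_measure(y_true, y_pred):
    # F = 2 * Precision * Recall / (Precision + Recall) for labels in {0, 1}
    tp = np.sum((y_pred == 1) & (y_true == 1))
    precision = tp / max(np.sum(y_pred == 1), 1)
    recall = tp / max(np.sum(y_true == 1), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

print(auc(np.array([0.9, 0.4]), np.array([0.3, 0.8])))                 # 0.75
print(f_measure(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0])))       # 0.5
```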


Our Contributions

Our Approaches: low memory, low computation, low iteration complexity.

1 Fast Stochastic/Online AUC Maximization (ICML 2018)
• based on a zero-sum convex-concave game formulation
• stagewise learning for solving a convex-concave game
• improves complexity from O(1/ε²) to O(1/ε)

2 Fast Stochastic/Online F-measure Maximization (NeurIPS 2018)
• decomposes the problem into two tasks: learning a posterior probability and learning a threshold
• stagewise learning for optimizing the threshold faster
• improves complexity from O(1/ε²) to O(1/ε)


Experiments: Stochastic AUC Maximization

[Figures: AUC vs. number of iterations for FSAUC, SOLAM, SOLAM_l1, and OPAUC on the w8a, ijcnn1, and real-sim datasets; positive-class proportions p = 2.97%, 9.49%, 30.68%.]


Experiments: Stochastic F-measure Optimization


Stagewise Learning for Non-Convex Problems

Outline

1 Introduction

2 Stagewise Learning for Convex Problems

3 Stagewise Learning for Non-Convex Problems


Generative Adversarial Nets (GAN)

Generative Modeling (Density Estimation)

Sample Generation

Slides courtesy of Ian Goodfellow's NIPS 2016 tutorial


Generative Adversarial Nets


Formulation of Generative Adversarial Nets

Zero-Sum Game Formulation (Goodfellow et al. (2014))

min_G max_D  E_{x∼p_data}[log D(x)] + E_{z∼p_random}[log(1 − D(G(z)))]

z: random noise; x: real data

G generator: G(z) is a fake image

D discriminator: D(x) is the probability of x being a real image

Ideally, at a Nash Equilibrium: p(G(z)) = p(x)


Non-Convex Non-Concave Games

min_w max_u  F(w, u) := E_{x∼p_data, z∼p_random}[ℓ(w, u; x, z)]

Non-convex w.r.t. w, non-concave w.r.t. u

Finding a Nash Equilibrium is NP-hard

Existing studies are mostly heuristics borrowed from convex-concave games

No convergence guarantee (could be divergent)

First convergence theory for finding nearly stationary points (stationary point: ∇F(w∗, u∗) = 0):

(Lin-Liu-Rafique-Y., arXiv 2018, presented at NeurIPS 2018 SGO&ML)


A Stagewise Learning Algorithm for GAN

for k = 0, . . . , K − 1 do
    F_k(w, u) = F(w, u) + (λ/2)‖w − w_k‖² − (λ/2)‖u − u_k‖²
    (w_{k+1}, u_{k+1}) = A(F_k, w_k, u_k, η_k, T_k)
end for

(w∗, u∗) = argmin_w max_u  F(w, u) + (λ/2)‖w − w∗‖² − (λ/2)‖u − u∗‖²

1 λ > 0 is an algorithmic regularization parameter

2 Different A (e.g., Primal-Dual SGD, Adam)

3 Use variational inequality for analysis
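A minimal sketch of the stagewise scheme with a plain primal-dual (simultaneous) SGD step playing the role of the inner solver A. The gradient oracles, the stage schedule shown, and all constants are illustrative placeholders, not the exact solver or tuning used in the paper.

```python
import numpy as np

def primal_dual_sgd(grad_w, grad_u, w0, u0, lam, eta, T, rng):
    # Inner solver A applied to the regularized game
    # F_k(w, u) = F(w, u) + (lam/2)||w - w_k||^2 - (lam/2)||u - u_k||^2
    w, u = w0.copy(), u0.copy()
    for _ in range(T):
        gw = grad_w(w, u, rng) + lam * (w - w0)  # descend in w
        gu = grad_u(w, u, rng) - lam * (u - u0)  # ascend in u
        w, u = w - eta * gw, u + eta * gu
    return w, u

def stagewise_minimax(grad_w, grad_u, w0, u0, lam, eta0, T0, K, rng):
    w, u, eta, T = w0, u0, eta0, T0
    for _ in range(K):
        w, u = primal_dual_sgd(grad_w, grad_u, w, u, lam, eta, T, rng)
        eta, T = eta / 2.0, 2 * T  # one possible per-stage schedule (assumption)
    return w, u
```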


Experiments: Image Generation

[Figures: Inception Score vs. number of epochs for WGAN training on CIFAR10 and LSUN Bedroom data, comparing Adam(5d), Adam, and IPP-Adam (generalization performance).]


Why Does Stagewise Learning Improve Testing Error?

Optimizing Deep Neural Networks

min_{w∈R^d} F(w) := (1/n) ∑_{i=1}^n ℓ(w; x_i, y_i)

[Figure: error vs. number of epochs for SGD with step sizes 0.1, 0.01, 0.001 and the 1/√t schedule on a non-convex DNN training problem.]


A Stagewise Learning Algorithm for DNN

for k = 0, . . . , K − 1 do
    F_k(w) = F(w) + (λ/2)‖w − w_k‖²
    w_{k+1} = SGD(F_k, w_k, η_k, T_k)
    η_{k+1} = η_k / 2
end for

1 λ ≥ 0: algorithmic regularization

2 step size decreases geometrically in stages

3 a more general framework

4 convergence to a stationary point established in ICLR 2019
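A minimal sketch of this stagewise SGD scheme for a non-convex objective: each stage adds the algorithmic regularization (λ/2)‖w − w_k‖², runs SGD from the previous stage's output, and halves the step size. The gradient oracle is generic, and the fixed inner iteration count T (instead of a per-stage T_k) is a simplifying assumption.

```python
import numpy as np

def stagewise_sgd(grad_oracle, w0, lam, eta0, T, K, rng):
    # Stage k minimizes F_k(w) = F(w) + (lam/2)||w - w_k||^2 with SGD,
    # restarting each stage from the previous stage's output
    w_k, eta = w0.copy(), eta0
    for _ in range(K):
        w = w_k.copy()
        for _ in range(T):
            g = grad_oracle(w, rng) + lam * (w - w_k)  # stochastic gradient of F_k
            w = w - eta * g
        w_k, eta = w, eta / 2.0  # geometric step-size decrease per stage
    return w_k
```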


Why Does Stagewise Learning Improve Testing Error?

1 Explore Growth Condition!

µ‖w − w∗‖_2² ≤ F(w) − F(w∗)

2 Explore Almost-Convexity Condition

−∇F(w)^⊤(w∗ − w) / (F(w) − F(w∗)) ≥ γ > 0

3 Use stability for generalization analysis

Testing Error = Training Error + Generalization Error

Flat minimum

[Figure: error vs. number of epochs, with growth parameter µ = 0.00027.]


Why Does Stagewise Learning Improve Testing Error?

Faster Convergence of Training Error: O(1/(µε)) vs O(1/(µ²ε))

With the same number of iterations T = √(…)

Testing Error Comparison

Stagewise Step Size: O(1/(√n µ^{1/2}))  vs.  Decreasing Step Size: O(1/(√n µ^{3/2}))

First Theory for Explaining Stagewise Learning for DNN

(Y.-Yan-Yuan-Jin, arXiv 2018)


Better Stagewise Learning Algorithms

[Figures: testing error and training error vs. number of epochs for the three stagewise variants V1, V2, V3 described below.]

V1: standard, no alg. regularization, restart at last solution

V2: algorithmic regularization, restart at last solution

V3: algorithmic regularization, restart at averaged solution


Conclusions

Stagewise Learning is Powerful

Theory: faster convergence for both convex and non-convex methods

Practice: SVM, AUC, F-measure, DNN, GAN

Open problems: e.g., Generalization of SL for GAN


Acknowledgements

Students: Yi Xu, Mingrui Liu, Xiaoxuan Zhang, Zhuoning Yuan, Yan Yan, Hassan Rafique

Collaborators: Qihang Lin (UIowa), Rong Jin (Alibaba Group)

Funding Agency: NSF (CRII, Big Data)


Thank You!

Questions?


References I

Brock, Andrew, Donahue, Jeff, and Simonyan, Karen. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.

Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Mach. Learn., 20(3):273–297, September 1995. ISSN 0885-6125.

Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pp. 2672–2680, 2014.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1106–1114, 2012.


References II

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.
