The Power of Stagewise Learning: From Support Vector Machine to Generative Adversarial Nets
Tianbao Yang
Department of Computer Science, The University of Iowa
Yang (CS@Uiowa) The Power of Stagewise Learning 1 / 44
Introduction
Outline
1 Introduction
2 Stagewise Learning for Convex Problems
3 Stagewise Learning for Non-Convex Problems
Gap between Practice vs Theory

[Figure: training error vs. epochs for SGD with stagewise step sizes (0.1, 0.01, 0.001) and the theoretical 1/√t decreasing step size.]
Gap between Practice vs Theory

[Figure: training and testing error vs. epochs for SGD with stagewise step sizes (0.1, 0.01, 0.001) and the theoretical 1/√t decreasing step size.]

Q1: Why does Stagewise Learning (SL) Converge Faster?
Q2: How to design better SL algorithms for DNN and Other problems?
The Evolution of Learning Methods

Over time, the complexity of learning methods has grown:
Convex problems: SVM, logistic regression, ... (Cortes & Vapnik, 1995)
Non-convex problems: Deep Neural Networks (DNN) (Krizhevsky et al., 2012) and Generative Adversarial Nets (GAN) (Goodfellow et al., 2014)
Big Data: Challenges and Opportunities

AlexNet for image classification (Krizhevsky et al., 2012): 1.2 million images
BigGAN for image generation (Brock et al., 2019): Google's JFT, 300 million images

Training on huge datasets becomes a bottleneck!
Learning a Predictive Model

input x ∈ R^d, label y
(x, y) is generated i.i.d.
predictive model: f(x) → y
Risk Minimization

    f* = arg min_{f∈F} R(f) := E_{x,y}[ℓ(f(x), y)]

F is a hypothesis class
loss function ℓ(z, y): measures the prediction error
R(f) is the risk of model f
Empirical Risk Minimization

Empirical Risk Minimization (Offline Learning)
Collect {(x1, y1), (x2, y2), ..., (xn, yn)}; find an approximate solution f to solve

    f*_n = arg min_{f∈F} R_n(f) := (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)   (Empirical Risk)
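As a concrete illustration, empirical risk minimization with a squared loss can be written in a few lines of NumPy; the synthetic dataset, dimensions, and noise level below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic i.i.d. data: y = <w_true, x> + noise (illustrative only)
n, d = 200, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

def empirical_risk(w, X, y):
    """R_n(w) = (1/n) * sum_i loss(f_w(x_i), y_i), with the squared loss."""
    return np.mean((X @ w - y) ** 2)

# Least squares minimizes the empirical risk for this model class
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
assert empirical_risk(w_hat, X, y) <= empirical_risk(w_true, X, y) + 1e-12
```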
Research Questions in Machine Learning

Iterative algorithms:
    f_{t+1} ← f_t + A(available information at iteration t)
T: total time for training (e.g., # of iterations)

1. How fast is learning? Faster training: training error.
2. How accurate is the learned model? Better generalization: testing error.
Shallow Model: Convex Methods

    x → f_w(x) = w^⊤ φ(x)

    min_{w∈R^D} (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i)
Deep Neural Networks: Non-Convex Methods

    x → f_w(x) = w_L ∘ σ(··· σ(w_3 ∘ σ(w_2 ∘ σ(w_1 ∘ x))))

    min_{w∈R^D} (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i)
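A minimal sketch of such a network's forward pass, assuming ReLU for σ and purely illustrative layer sizes:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dnn_forward(x, weights):
    """f_w(x) = w_L . sigma(... sigma(w_1 x)): alternate affine maps with a
    nonlinearity sigma (ReLU here); no nonlinearity after the last layer."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(0)
dims = [4, 8, 8, 1]  # illustrative layer sizes
weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
out = dnn_forward(rng.normal(size=4), weights)
```

The composition of nonlinearities is what makes the training objective non-convex in w.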
Stagewise Learning for Convex Problems
Outline
1 Introduction
2 Stagewise Learning for Convex Problems
3 Stagewise Learning for Non-Convex Problems
Convex Methods

Consider
    min_{w∈R^d} F(w) := E_{x,y∼P}[ℓ(w; x, y)]

F(w) is a convex function
SVM, logistic regression, least-squares, LASSO, etc.

Goal: for a sufficiently small ε > 0, find a solution w such that
    F(w) − min_{w∈R^d} F(w) ≤ ε
Stochastic Gradient Descent (Robbins & Monro, 1951)

    min_{w∈R^d} F(w) := E_{x,y∼P}[ℓ(w; x, y)]

Stochastic Gradient Descent (SGD) Method:
Sample (x_t, y_t) ∼ P
    w_{t+1} = w_t − η_t ∂ℓ(w_t; x_t, y_t)   (η_t: step size)
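The SGD update above fits in a few lines; a minimal sketch on synthetic least-squares data, where the dataset, step-size constant, and iteration count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares data: y_i = <w_star, x_i> (noiseless for clarity)
n, d = 1000, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star

def sgd(T, eta0=0.05):
    """SGD with the theoretical step size eta_t = eta0 / sqrt(t):
    sample (x_t, y_t), then w_{t+1} = w_t - eta_t * grad of the squared loss."""
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                 # sample (x_t, y_t) ~ P
        grad = (X[i] @ w - y[i]) * X[i]     # stochastic gradient
        w = w - (eta0 / np.sqrt(t)) * grad
    return w

w = sgd(5000)
```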
Slow Convergence of SGD

1. variance of the stochastic gradient
2. decreasing step size: standard theory prescribes η_t ∝ 1/√t
3. O(1/ε^2) iteration complexity (Nemirovski et al., 2009)
How to Improve the Convergence Speed?

Previous approaches:

Mini-batch SGD: sample multiple examples each iteration
• Pros: allows parallel speed-up
• Cons: cannot reduce the total time complexity

Making stronger assumptions (e.g., strong convexity, smoothness, using full gradients)
• Pros: speed-up for some families of problems
• Cons: the assumptions may not hold

Can we do better without imposing these strong assumptions? (ICML 2017)
Stagewise Stochastic Gradient (ICML 2017)

One-Stage SGD(w1, η, D, T):
for τ = 1, ..., T:
    w_{τ+1} = Proj_{‖w − w1‖2 ≤ D}[w_τ − η ∂ℓ(w_τ; x_{iτ}, y_{iτ})]   (projection onto a ball centered at w1)
Output: ŵ = Σ_{τ=1}^T w_τ / T
Stagewise Stochastic Gradient (ICML 2017)

SSGD:
Set η1, D1, and T
for k = 0, ..., K − 1 do
    w_{k+1} = One-Stage SGD(w_k, η_k, D_k, T)
    Set η_{k+1} = η_k/2, D_{k+1} = D_k/2
end for
Output: w_K
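A sketch of One-Stage SGD and the stagewise driver, assuming a toy absolute-loss regression problem (a sharp objective, so the growth condition holds); the stage count, initial step size, and ball radius are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star                              # noiseless toy data

def one_stage_sgd(w1, eta, D, T):
    """SGD on the absolute loss |x.w - y|, each iterate projected back
    onto the ball {w : ||w - w1||_2 <= D}; returns the averaged iterate."""
    w, w_sum = w1.copy(), np.zeros_like(w1)
    for _ in range(T):
        i = rng.integers(n)
        g = np.sign(X[i] @ w - y[i]) * X[i]  # subgradient of |x_i.w - y_i|
        w = w - eta * g
        r = np.linalg.norm(w - w1)           # projection onto the ball
        if r > D:
            w = w1 + (D / r) * (w - w1)
        w_sum += w
    return w_sum / T

def ssgd(K=8, eta=0.1, D=4.0, T=2000):
    """Stagewise driver: halve the step size and ball radius each stage."""
    w = np.zeros(d)
    for _ in range(K):
        w = one_stage_sgd(w, eta, D, T)
        eta, D = eta / 2, D / 2
    return w

w = ssgd()
```

The halving of both η and D is what lets the average iterate of each stage start the next stage much closer to the optimum.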
Theoretical Result: Faster Convergence

Growth/Sharpness Condition: ∃ c < ∞ and θ ∈ (0, 1] such that
    ‖w − w*‖2 ≤ c (F(w) − F(w*))^θ

[Figure: F(x) = |x| (θ = 1), |x|^{1.5} (θ = 2/3), and |x|^2 (θ = 1/2) near x = 0.]

Iteration complexity: O((1/ε^{2(1−θ)}) log(1/ε)) for SSGD vs. O(1/ε^2) for SGD
Machine Learning Problems Satisfy GC

SVM for high-dimensional data: θ = 1
    min_w (1/n) Σ_{i=1}^n max(0, 1 − y_i w^⊤ x_i) + λ‖w‖1
    O(log(1/ε)) vs. O(1/ε^2)

LASSO: θ = 1/2
    min_w (1/n) Σ_{i=1}^n (y_i − w^⊤ x_i)^2 + λ‖w‖1
    O(1/ε) vs. O(1/ε^2)

and many many more
Empirical Results: SSGD vs SGD

[Figure: log10(objective gap) vs. number of iterations for SGD and SSGD on four problems: robust loss + ℓ1 norm on million songs (n = 463,715) and E2006-tfidf (n = 16,087); hinge loss + ℓ1 norm on covtype (n = 581,012) and real-sim (n = 72,309).]
From Balanced Data to Imbalanced Data

Minimizing Error Rate is not a Good idea!
Optimization of Suitable Measures for Imbalanced Data

Maximize AUC (area under the ROC curve):
    AUC = Prob(score of + > score of −)
Maximize F-measure:
    F = 2 · Precision · Recall / (Precision + Recall)
etc.

These measures are non-decomposable over individual examples.
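Both measures can be computed directly from scores and predictions; a minimal sketch (the pairwise sum inside AUC is exactly what makes these measures non-decomposable over individual examples):

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """AUC = Prob(score of a random positive > score of a random negative),
    counting ties as 1/2. Note the sum runs over *pairs* of examples."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return np.mean((diff > 0) + 0.5 * (diff == 0))

def f_measure(y_true, y_pred):
    """F = 2 * Precision * Recall / (Precision + Recall), for 0/1 labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / np.sum(y_pred == 1)
    recall = tp / np.sum(y_true == 1)
    return 2 * precision * recall / (precision + recall)
```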
Our Contributions

Our approaches: low memory, low computation, low iteration complexity.

1. Fast Stochastic/Online AUC Maximization (ICML 2018)
• based on a zero-sum convex-concave game formulation
• stagewise learning for solving a convex-concave game
• improves complexity from O(1/ε^2) to O(1/ε)

2. Fast Stochastic/Online F-measure Maximization (NeurIPS 2018)
• decomposes into two tasks: learning a posterior probability and learning a threshold
• stagewise learning for optimizing the threshold faster
• improves complexity from O(1/ε^2) to O(1/ε)
Experiments: Stochastic AUC Maximization

[Figure: AUC vs. iteration for FSAUC, SOLAM, SOLAM_l1, and OPAUC on the w8a, ijcnn1, and real-sim datasets (positive-class proportions p = 2.97%, 9.49%, and 30.68%, respectively).]
Experiments: Stochastic F-measure Optimization
Stagewise Learning for Non-Convex Problems
Outline
1 Introduction
2 Stagewise Learning for Convex Problems
3 Stagewise Learning for Non-Convex Problems
Generative Adversarial Nets (GAN)

Generative modeling (density estimation) and sample generation.
(Slides courtesy of Ian Goodfellow's NIPS 2016 tutorial.)
Generative Adversarial Nets
Formulation of Generative Adversarial Nets

Zero-Sum Game Formulation (Goodfellow et al., 2014):
    min_G max_D E_{x∼p_data}[log D(x)] + E_{z∼p_random}[log(1 − D(G(z)))]

z: random noise; x: real data
G is the generator: G(z) is a fake image
D is the discriminator: D(x) is, e.g., the probability that x is a real image
Ideally, at a Nash equilibrium: p(G(z)) = p(x)
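For a fixed generator, Goodfellow et al. (2014) show the inner maximization is solved pointwise by D*(x) = p_data(x)/(p_data(x) + p_G(x)); a small numeric check on a grid, with two illustrative 1-D Gaussian densities standing in for p_data and the generator's density:

```python
import numpy as np

# Two illustrative 1-D densities on a grid: "real" data vs. generator output
xs = np.linspace(-5, 5, 2001)
dx = xs[1] - xs[0]
p_data = np.exp(-0.5 * xs**2) / np.sqrt(2 * np.pi)       # N(0, 1)
p_g = np.exp(-0.5 * (xs - 1) ** 2) / np.sqrt(2 * np.pi)  # N(1, 1)

def gan_value(D):
    """E_{x~p_data}[log D(x)] + E_{x~p_g}[log(1 - D(x))], on the grid."""
    eps = 1e-12  # guard against log(0)
    return np.sum(p_data * np.log(D + eps) * dx) + np.sum(p_g * np.log(1 - D + eps) * dx)

D_opt = p_data / (p_data + p_g)  # pointwise maximizer of the inner max
# Any other discriminator achieves a smaller (or equal) value:
for D_other in (np.full_like(xs, 0.5), np.clip(D_opt + 0.1, 0.0, 1.0)):
    assert gan_value(D_other) <= gan_value(D_opt) + 1e-9
```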
Non-Convex Non-Concave Games

    min_w max_u F(w, u) := E_{x∼p_data, z∼p_random}[ℓ(w, u; x, z)]

Non-convex w.r.t. w, non-concave w.r.t. u
Finding a Nash equilibrium is NP-hard
Existing studies are mostly heuristics carried over from convex-concave games
No convergence guarantee (could be divergent)

First convergence theory for finding nearly stationary points (stationary point: ∇F(w*, u*) = 0)
(Lin-Liu-Rafique-Y., arXiv 2018, presented at the NeurIPS 2018 SGO&ML workshop)
A Stagewise Learning Algorithm for GAN

for k = 0, ..., K − 1 do
    F_k(w, u) = F(w, u) + (λ/2)‖w − w_k‖^2 − (λ/2)‖u − u_k‖^2
    (w_{k+1}, u_{k+1}) = A(F_k, w_k, u_k, η_k, T_k)
end for

Each stage targets the proximal-point fixed point
    (w*, u*) = arg min_w max_u F(w, u) + (λ/2)‖w − w*‖^2 − (λ/2)‖u − u*‖^2

1. λ > 0 is an algorithmic regularization parameter
2. Different choices of A (e.g., Primal-Dual SGD, Adam)
3. Use a variational inequality for the analysis
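A toy illustration of why the proximal terms matter, using the bilinear game F(w, u) = w·u (an assumed stand-in, not a GAN): plain simultaneous gradient descent-ascent spirals outward, while running the same inner updates on the λ-regularized F_k converges to the stationary point (0, 0). Step sizes and stage counts are illustrative:

```python
# Toy min-max: F(w, u) = w * u. Plain gradient descent-ascent (GDA)
# diverges here, while the stagewise proximal-point scheme converges.

def gda(w, u, eta, steps, lam=0.0, w_c=0.0, u_c=0.0):
    """Gradient descent on w / ascent on u for
    F_k(w, u) = w*u + lam/2*(w - w_c)^2 - lam/2*(u - u_c)^2."""
    for _ in range(steps):
        gw = u + lam * (w - w_c)      # dF_k/dw
        gu = w - lam * (u - u_c)      # dF_k/du
        w, u = w - eta * gw, u + eta * gu
    return w, u

# Plain GDA: the iterate norm grows (divergence)
w, u = gda(1.0, 1.0, eta=0.1, steps=200)
diverged = w * w + u * u > 1.0 + 1e-6

# Stagewise: each stage solves the lam-regularized game around (w_k, u_k),
# i.e., A = GDA applied to F_k
wk, uk = 1.0, 1.0
for _ in range(30):
    wk, uk = gda(wk, uk, eta=0.1, steps=200, lam=1.0, w_c=wk, u_c=uk)
```

The λ-terms make each stage's subproblem strongly-convex-strongly-concave, so the inner loop contracts instead of spiraling.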
Experiments: Image Generation

[Figure: Inception Score vs. number of epochs for WGAN training with Adam(5d), Adam, and IPP-Adam on CIFAR10 and LSUN Bedroom data. Generalization performance.]
Why Does Stagewise Learning Improve Testing Error?

Optimizing deep neural networks (non-convex):
    min_{w∈R^d} F(w) := (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i)

[Figure: testing error vs. epochs for SGD with stagewise step sizes (0.1, 0.01, 0.001) and the 1/√t decreasing step size.]
A Stagewise Learning Algorithm for DNN

for k = 0, ..., K − 1 do
    F_k(w) = F(w) + (λ/2)‖w − w_k‖^2
    w_{k+1} = SGD(F_k, w_k, η_k, T_k)
    η_{k+1} = η_k/2
end for

1. λ ≥ 0: algorithmic regularization
2. the step size decreases geometrically across stages
3. this is a more general framework
4. convergence to a stationary point established in ICLR 2019
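A sketch of this stagewise scheme on a toy non-convex objective F(w) = (w^2 − 1)^2 / 4, with a noisy gradient oracle standing in for SGD over data; λ, the step sizes, and the stage lengths are illustrative:

```python
import random

random.seed(0)

def grad_F(w):
    """Gradient of the non-convex toy objective F(w) = (w^2 - 1)^2 / 4."""
    return w * (w * w - 1.0)

def stoch_grad(w):
    return grad_F(w) + random.gauss(0.0, 0.5)  # noisy gradient oracle

def stagewise_sgd(w0=3.0, lam=0.1, eta=0.05, K=10, T=2000):
    """Each stage runs SGD on F_k(w) = F(w) + lam/2 * (w - w_k)^2,
    then halves the step size (geometric decrease across stages)."""
    wk = w0
    for _ in range(K):
        w = wk
        for _ in range(T):
            g = stoch_grad(w) + lam * (w - wk)  # gradient of F_k
            w -= eta * g
        wk = w          # restart the next stage at the last solution
        eta /= 2
    return wk

w = stagewise_sgd()
```

Starting from w0 = 3, the iterates settle near the stationary point w = 1; the shrinking step size damps the gradient noise stage by stage.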
Why Does Stagewise Learning Improve Testing Error?

1. Explore a growth condition:
    μ‖w − w*‖2^2 ≤ F(w) − F(w*)
2. Explore an almost-convexity condition:
    (−∇F(w)^⊤(w* − w)) / (F(w) − F(w*)) ≥ γ > 0
3. Use stability for the generalization analysis:
    Testing Error = Training Error + Generalization Error

[Figure: training curve near a flat minimum, illustrating the growth condition with μ = 0.00027.]
Why Does Stagewise Learning Improve Testing Error?

Faster convergence of training error: O(1/(με)) vs. O(1/(μ^2 ε))
With the same number of iterations T = √n/μ, testing error comparison:
    O(1/(√n μ^{1/2})) for the stagewise step size vs. O(1/(√n μ^{3/2})) for the decreasing step size

First Theory for Explaining Stagewise Learning for DNN
(Y.-Yan-Yuan-Jin, arXiv 2018)
Better Stagewise Learning Algorithms

[Figure: error vs. epochs for three stagewise variants (V1, V2, V3).]

V1: standard, no algorithmic regularization, restart at the last solution
V2: algorithmic regularization, restart at the last solution
V3: algorithmic regularization, restart at the averaged solution
Conclusions

Stagewise Learning is Powerful
Theory: faster convergence for both convex and non-convex methods
Practice: SVM, AUC, F-measure, DNN, GAN
Open problems: e.g., Generalization of SL for GAN
Acknowledgements

Students: Yi Xu, Mingrui Liu, Xiaoxuan Zhang, Zhuoning Yuan, Yan Yan, Hassan Rafique
Collaborators: Qihang Lin (UIowa), Rong Jin (Alibaba Group)
Funding agency: NSF (CRII, Big Data)
Thank You!
Questions?
References

Brock, Andrew, Donahue, Jeff, and Simonyan, Karen. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.

Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20(3):273–297, September 1995.

Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), pp. 2672–2680, 2014.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1106–1114, 2012.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.