
Big learning: challenges and opportunities

Francis Bach

SIERRA Project-team, INRIA - École Normale Supérieure

October 2013


Scientific context

Big data

• Omnipresent digital media

– Multimedia, sensors, indicators, social networks

– All levels: personal, professional, scientific, industrial

– Too large and/or complex for manual processing

• Computational challenges

– Dealing with large databases

• Statistical challenges

– What can be predicted from such databases and how?

– Looking for hidden information

• Opportunities (and threats)

Machine learning for big data

• Large-scale machine learning: large p, large n, large k

– p : dimension of each observation (input)

– n : number of observations

– k : number of tasks (dimension of outputs)

• Examples: computer vision, bioinformatics, etc.

Search engines - advertising

Advertising - recommendation

Object recognition

Learning for bioinformatics - Proteins

• Crucial components of cell life

• Predicting multiple functions and interactions

• Massive data: up to 1 million for humans!

• Complex data

– Amino-acid sequence

– Link with DNA

– Tri-dimensional molecule

Machine learning for big data

• Large-scale machine learning: large p, large n, large k

– p : dimension of each observation (input)

– n : number of observations

– k : number of tasks (dimension of outputs)

• Examples: computer vision, bioinformatics, etc.

• Two main challenges:

1. Computational: ideal running-time complexity = O(pn + kn)

2. Statistical: meaningful results

Big learning: challenges and opportunities

Outline

• Scientific context

– Big data: need for supervised and unsupervised learning

• Beyond stochastic gradient for supervised learning

– Few passes through the data

– Provable robustness and ease of use

• Matrix factorization for unsupervised learning

– Looking for hidden information through dictionary learning

– Feature learning

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n

• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rp

• (regularized) empirical risk minimization: find θ solution of

$\min_{\theta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$

(convex data fitting term + regularizer)


• Applications to any data-oriented field

– Computer vision, bioinformatics

– Natural language processing, etc.


• Main practical challenges

– Designing/learning good features Φ(x)

– Efficiently solving the optimization problem
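A minimal NumPy sketch of this objective for the particular choice of the logistic loss and Ω(θ) = ½‖θ‖²₂, minimized by plain batch gradient descent; the function names, step size, and iteration count are illustrative and not from the slides:

```python
import numpy as np

def erm_objective(theta, Phi, y, mu):
    """(1/n) sum_i log(1 + exp(-y_i theta^T Phi(x_i))) + (mu/2) ||theta||^2, with y_i in {-1, +1}."""
    margins = y * (Phi @ theta)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * mu * theta @ theta

def erm_gradient(theta, Phi, y, mu):
    margins = y * (Phi @ theta)
    coeffs = -y / (1.0 + np.exp(margins))        # derivative of the logistic loss w.r.t. the margin
    return Phi.T @ coeffs / len(y) + mu * theta

def batch_gradient_descent(Phi, y, mu=1e-3, step=1.0, iters=200):
    """Minimize the regularized empirical risk by full-gradient steps."""
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        theta -= step * erm_gradient(theta, Phi, y, mu)
    return theta
```

In practice the step size would be tied to the smoothness constant of the loss; the sketch is only meant to make the "data fitting term + regularizer" structure concrete.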


Stochastic vs. deterministic methods

• Minimizing $g(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$ with $f_i(\theta) = \ell(y_i, \theta^\top \Phi(x_i)) + \mu \Omega(\theta)$

• Batch gradient descent: $\theta_t = \theta_{t-1} - \gamma_t \, g'(\theta_{t-1}) = \theta_{t-1} - \frac{\gamma_t}{n} \sum_{i=1}^{n} f_i'(\theta_{t-1})$

– Linear (e.g., exponential) convergence rate in $O(e^{-\alpha t})$

– Iteration complexity is linear in n (with line search)

• Stochastic gradient descent: $\theta_t = \theta_{t-1} - \gamma_t \, f_{i(t)}'(\theta_{t-1})$

– Sampling with replacement: i(t) random element of {1, . . . , n}

– Convergence rate in O(1/t)

– Iteration complexity is independent of n (step size selection?)

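A minimal sketch contrasting the two updates, assuming a gradient oracle grad_fi(theta, i) for f_i'(θ); the oracle name and the O(1/√t) step-size schedule are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def batch_gd(grad_fi, n, p, step, iters):
    """theta_t = theta_{t-1} - (gamma_t/n) sum_i f_i'(theta_{t-1}): every iteration touches all n points."""
    theta = np.zeros(p)
    for _ in range(iters):
        theta -= step * np.mean([grad_fi(theta, i) for i in range(n)], axis=0)
    return theta

def sgd(grad_fi, n, p, step0, iters, seed=0):
    """theta_t = theta_{t-1} - gamma_t f_{i(t)}'(theta_{t-1}): one random point per iteration."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(p)
    for t in range(1, iters + 1):
        i = rng.integers(n)                      # sampling with replacement
        theta -= (step0 / np.sqrt(t)) * grad_fi(theta, i)
    return theta
```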


Stochastic vs. deterministic methods

• Goal = best of both worlds: Linear rate with O(1) iteration cost

Robustness to step size

[Figure: log(excess cost) vs. time for stochastic, deterministic, and hybrid methods]


Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n

– Random selection i(t) ∈ {1, . . . , n} with replacement

– Iteration: $\theta_t = \theta_{t-1} - \frac{\gamma_t}{n} \sum_{i=1}^{n} y_i^t$ with $y_i^t = \begin{cases} f_i'(\theta_{t-1}) & \text{if } i = i(t) \\ y_i^{t-1} & \text{otherwise} \end{cases}$

• Stochastic version of incremental average gradient (Blatt et al., 2008)

• Simple implementation

– Extra memory requirement: same size as original data (or less)

– Simple/robust constant step-size
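A minimal sketch of the SAG iteration above, again assuming a gradient oracle grad_fi(theta, i) for f_i'(θ): the memory of stored gradients is refreshed lazily and the step uses their running average.

```python
import numpy as np

def sag(grad_fi, n, p, step, iters, seed=0):
    """Stochastic average gradient: store one gradient per data point, refresh only the sampled one."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(p)
    y = np.zeros((n, p))          # y_i^t: last gradient computed for f_i
    g_avg = np.zeros(p)           # (1/n) sum_i y_i^t, maintained incrementally
    for _ in range(iters):
        i = rng.integers(n)                      # random selection with replacement
        g_new = grad_fi(theta, i)
        g_avg += (g_new - y[i]) / n              # O(p) update of the average
        y[i] = g_new
        theta -= step * g_avg                    # theta_t = theta_{t-1} - (gamma/n) sum_i y_i^t
    return theta
```

The extra memory is the n × p array of stored gradients; for linear predictions it reduces to n scalars, since f_i'(θ) is a multiple of Φ(x_i).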

Stochastic average gradient

Convergence analysis

• Assume each $f_i$ is L-smooth and $g = \frac{1}{n} \sum_{i=1}^{n} f_i$ is $\mu$-strongly convex

• Constant step size $\gamma_t = \frac{1}{16L}$. If $\mu > \frac{2L}{n}$, there exists $C \in \mathbb{R}$ such that

$\forall t > 0, \quad \mathbb{E}\big[ g(\theta_t) - g(\theta_*) \big] \leq C \exp\left( -\frac{t}{8n} \right)$

• Linear convergence rate with iteration cost independent of n

– After each pass through the data, constant error reduction

– Breaking two lower bounds

Stochastic average gradient

Simulation experiments

• protein dataset (n = 145751, p = 74)

• Dataset split in two (training/testing)

[Figure: training cost (objective minus optimum, log scale) and testing cost (test logistic loss) vs. effective passes through the data, comparing Steepest descent, AFG, L-BFGS, Pegasos, RDA, SAG with step size 2/(L + nµ), and SAG with line search (SAG-LS)]

Stochastic average gradient

Simulation experiments

• covertype dataset (n = 581012, p = 54)

• Dataset split in two (training/testing)

[Figure: training cost (objective minus optimum, log scale) and testing cost (test logistic loss) vs. effective passes through the data, for the same methods (Steepest descent, AFG, L-BFGS, Pegasos, RDA, SAG, SAG-LS)]


Large-scale supervised learning

Convex optimization

• Simplicity

– Few lines of code

• Robustness

– Step-size

– Adaptivity to problem difficulty

• On-going work

– Single pass through the data (Bach and Moulines, 2013)

– Distributed algorithms

• Convexity as a solution to all problems?

– Need good features Φ(x) for linear predictions θ⊤Φ(x) !

Unsupervised learning through matrix factorization

• Given data matrix $X = (x_1^\top, \ldots, x_n^\top) \in \mathbb{R}^{n \times p}$

– Principal component analysis: $x_i \approx D\alpha_i$

– K-means: $x_i \approx d_k$ ⇒ $X = DA$
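A minimal NumPy sketch of both special cases, with rows of X as observations, rows of D as dictionary elements, and A the code matrix; the rank k, iteration count, and initialization are arbitrary illustrative choices:

```python
import numpy as np

def pca_factorization(X, k):
    """PCA: x_i ≈ D^T alpha_i with D the top-k principal directions (rows); X ≈ A D after centering."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    D = Vt[:k]                       # k x p dictionary of principal directions
    A = Xc @ D.T                     # n x k codes (projections onto the directions)
    return A, D

def kmeans_factorization(X, k, iters=20, seed=0):
    """K-means: x_i ≈ d_{c(i)}, i.e. each row of A is a one-hot assignment to a centroid."""
    rng = np.random.default_rng(seed)
    D = X[rng.choice(len(X), size=k, replace=False)].astype(float)    # initial centroids
    for _ in range(iters):
        dists = ((X[:, None, :] - D[None, :, :]) ** 2).sum(axis=-1)   # n x k squared distances
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                D[j] = X[labels == j].mean(axis=0)
    A = np.eye(k)[labels]            # n x k one-hot codes
    return A, D
```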


Learning dictionaries for uncovering hidden structure

• Fact: many natural signals may be approximately represented as a

superposition of few atoms from a dictionary D = (d1, . . . ,dk)

– Decomposition $x = \sum_{i=1}^{k} \alpha_i d_i = D\alpha$ with $\alpha \in \mathbb{R}^k$ sparse

– Natural signals (sounds, images) (Olshausen and Field, 1997)

• Decoding problem: given a dictionary D, find α through regularized convex optimization $\min_{\alpha \in \mathbb{R}^k} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1$

• Dictionary learning problem: given n signals x1, . . . , xn

– Estimate both the dictionary D and the codes α1, . . . , αn:

$\min_{D} \sum_{j=1}^{n} \min_{\alpha_j \in \mathbb{R}^k} \left\| x_j - D\alpha_j \right\|_2^2 + \lambda \|\alpha_j\|_1$
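A minimal batch sketch of this objective by alternating between the two blocks: ISTA for the sparse codes with D fixed, then a least-squares update of D with the codes fixed, followed by column normalization. This only illustrates the structure of the problem, not the online algorithm of Mairal et al. (2009a); λ, k, and iteration counts are arbitrary:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_codes(X, D, lam, iters=100):
    """ISTA on min_A ||X - D A||_F^2 + lam ||A||_1, with X (p x n), D (p x k), A (k x n)."""
    L = np.linalg.norm(D, 2) ** 2                # squared spectral norm of D
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(iters):
        # gradient of the quadratic term is 2 D^T (D A - X); its Lipschitz constant is 2 L
        A = soft_threshold(A - D.T @ (D @ A - X) / L, lam / (2.0 * L))
    return A

def dictionary_learning(X, k, lam, outer_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((k, n))
    for _ in range(outer_iters):
        A = sparse_codes(X, D, lam)                       # codes with D fixed
        D = X @ A.T @ np.linalg.pinv(A @ A.T)             # least-squares dictionary with A fixed
        D /= np.linalg.norm(D, axis=0) + 1e-12            # keep atoms normalized
    return D, A
```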

Challenges of dictionary learning

$\min_{D} \sum_{j=1}^{n} \min_{\alpha_j \in \mathbb{R}^k} \left\| x_j - D\alpha_j \right\|_2^2 + \lambda \|\alpha_j\|_1$

• Algorithmic challenges

– Large number of signals ⇒ online learning (Mairal et al., 2009a)

• Theoretical challenges

– Identifiabiliy/robustness (Jenatton et al., 2012)

• Domain-specific challenges

– Going beyond plain sparsity ⇒ structured sparsity

(Jenatton, Mairal, Obozinski, and Bach, 2011)

Dictionary learning for image denoising

$\underbrace{x}_{\text{measurements}} = \underbrace{y}_{\text{original image}} + \underbrace{\varepsilon}_{\text{noise}}$

Dictionary learning for image denoising

• Solving the denoising problem (Elad and Aharon, 2006)

– Extract all overlapping 8 × 8 patches $x_i \in \mathbb{R}^{64}$

– Form the matrix $X = (x_1^\top, \ldots, x_n^\top) \in \mathbb{R}^{n \times 64}$

– Solve a matrix factorization problem:

$\min_{D, A} \|X - DA\|_F^2 = \min_{D, A} \sum_{i=1}^{n} \|x_i - D\alpha_i\|_2^2$

where A is sparse and D is the dictionary

– Each patch is decomposed into xi = Dαi

– Average all Dαi to reconstruct a full-sized image

• The number of patches n is large (= number of pixels)

• Online learning (Mairal, Bach, Ponce, and Sapiro, 2009a)
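A minimal sketch of this reconstruction pipeline for a small grayscale image: extract every overlapping patch, sparse-code it on a given dictionary, and average the overlapping reconstructions. The encode argument stands for any sparse coder (for instance the ISTA routine sketched earlier); a real 12-Mpixel image would be processed in mini-batches or online rather than all at once:

```python
import numpy as np

def denoise(image, D, encode, patch=8):
    """Patch-based denoising: x_i ≈ D alpha_i for every overlapping patch, then average the overlaps.
    encode(X, D) must return the code matrix A (k x num_patches) for the patch matrix X."""
    h, w = image.shape
    recon = np.zeros((h, w))
    counts = np.zeros((h, w))
    coords = [(i, j) for i in range(h - patch + 1) for j in range(w - patch + 1)]
    # columns of X are the vectorized overlapping patch x patch blocks
    X = np.stack([image[i:i + patch, j:j + patch].ravel() for i, j in coords], axis=1)
    patches = D @ encode(X, D)                   # reconstructed patches, one per column
    for idx, (i, j) in enumerate(coords):
        recon[i:i + patch, j:j + patch] += patches[:, idx].reshape(patch, patch)
        counts[i:i + patch, j:j + patch] += 1.0
    return recon / counts                        # average all D alpha_i over overlaps
```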

Denoising result

(Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009b)


Inpainting a 12-Mpixel photograph


Why structured sparsity?

• Interpretability

– Structured dictionary elements (Jenatton et al., 2009b)

– Dictionary elements “organized” in a tree or a grid (Kavukcuoglu

et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

Structured sparse PCA (Jenatton et al., 2009b)

raw data sparse PCA

• Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability


Structured sparse PCA (Jenatton et al., 2009b)

raw data Structured sparse PCA

• Enforce selection of convex nonzero patterns ⇒ robustness to

occlusion in face identification


Modelling of text corpora - Dictionary tree


Why structured sparsity?

• Interpretability

– Structured dictionary elements (Jenatton et al., 2009b)

– Dictionary elements “organized” in a tree or a grid (Kavukcuoglu

et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

• Stability and identifiability

• Prediction or estimation performance

– When prior knowledge matches data (Haupt and Nowak, 2006;

Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)

• How?

– Design of sparsity-inducing norms

Structured sparsity

• Sparsity-inducing behavior from “corners” of constraint sets

Structured dictionary learning - Efficient optimization

$\min_{A \in \mathbb{R}^{k \times n},\, D \in \mathbb{R}^{p \times k}} \ \sum_{i=1}^{n} \|x_i - D\alpha_i\|_2^2 + \lambda \psi(\alpha_i) \quad \text{s.t.} \ \forall j, \ \|d_j\|_2 \leq 1$

• Minimization with respect to αi : regularized least-squares

– Many algorithms dedicated to the ℓ1-norm ψ(α) = ‖α‖1

• Proximal methods : first-order methods with optimal convergence

rate (Nesterov, 2007; Beck and Teboulle, 2009)

– Requires solving many times $\min_{\alpha \in \mathbb{R}^k} \frac{1}{2} \|y - \alpha\|_2^2 + \lambda \psi(\alpha)$

• Efficient algorithms for structured sparse problems

– Bach, Jenatton, Mairal, and Obozinski (2011)

– Code available: http://www.di.ens.fr/willow/SPAMS/
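The inner problem above is the proximal operator of ψ. For the plain ℓ1-norm it is elementwise soft-thresholding; for a group norm with non-overlapping groups it is the block soft-thresholding sketched below (the group structure is an illustrative toy example; overlapping or hierarchical norms require the dedicated algorithms of Jenatton et al. (2011) or the SPAMS toolbox):

```python
import numpy as np

def prox_group_l1l2(y, groups, lam):
    """argmin_a 0.5 ||y - a||_2^2 + lam * sum_g ||a_g||_2, for non-overlapping index groups:
    each block is shrunk toward zero, and set exactly to zero if its norm is below lam."""
    a = np.zeros_like(y, dtype=float)
    for g in groups:                             # g: array of indices forming one group
        norm = np.linalg.norm(y[g])
        if norm > lam:
            a[g] = (1.0 - lam / norm) * y[g]
    return a

# toy usage: three groups of two coordinates each
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
a = prox_group_l1l2(np.array([3.0, 4.0, 0.1, -0.2, 1.0, -1.0]), groups, lam=1.0)
```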

Extensions - Digital Zooming

Digital Zooming (Couzinie-Devy et al., 2011)

Extensions - Task-driven dictionaries

inverse half-toning (Mairal et al., 2011)


Big learning: challenges and opportunities

Conclusion

• Scientific context

– Big data: need for supervised and unsupervised learning

• Beyond stochastic gradient for supervised learning

– Few passes through the data

– Provable robustness and ease of use

• Matrix factorization for unsupervised learning

– Looking for hidden information through dictionary learning

– Feature learning

References

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical Report 00831977, HAL, 2013.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2011.

R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. Technical report, arXiv:0808.3572, 2008.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2008.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.

J. Haupt and R. Nowak. Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory, 52(9):4036–4048, 2006.

J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.

R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523, 2009a.

R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. Technical report, arXiv:0909.1440, 2009b.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML), 2010.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334, 2011.

K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of CVPR, 2009.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical report, HAL, 2012.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML), 2009a.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In International Conference on Computer Vision (ICCV), 2009b.

J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In NIPS, 2010.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
