
Parameter-free Machine Learning through Coin Betting

Francesco Orabona

Boston University, Boston, MA, USA

Geometric Analysis Approach to AI Workshop


Motivation of my Recent Work: Parameter-free Algorithms

Most of the algorithms we know/use in machine learning have parameters

E.g., the regularization parameter in LASSO, k in k-Nearest Neighbors, the network topology in deep learning

Most of the time, a large enough validation set can be used to tune the parameters

But, at what computational price? Are they really necessary?

Are you ignoring some computational/sample complexities?

When was the last time you needed to tune a learning rate to sort a vector? Or to invert a matrix?


My Dream: Truly Automatic Machine Learning

No parameters to tune

No humans in the loop

Some guarantees

Sometimes this is possible

This talk is about a method to achieve this dream in many interesting cases

Neither the only way nor the only parameter-free method I have proposed


Outline

1 Optimization of Non-smooth Convex Functions

2 Coin Betting
    Betting on a Coin
    From Betting to Optimization
    COCOB

3 Experiments


How Does Subgradient Descent Work in Machine Learning?

[Figure: a non-smooth convex objective F(w) plotted against w, illustrating subgradient descent steps.]
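As an illustration (not part of the original slides), here is a minimal Python sketch of subgradient descent on a non-smooth convex function; the objective F(w) = |w - 3| and the constant step size are illustrative choices.

import numpy as np

def subgradient_descent(subgrad_F, w0, eta, T):
    # Basic subgradient descent: repeatedly step against a subgradient of F.
    w = w0
    for _ in range(T):
        g = subgrad_F(w)   # any subgradient of F at w
        w = w - eta * g
    return w

# F(w) = |w - 3| is convex but non-smooth at its minimizer w* = 3.
# sign(w - 3) is a valid subgradient everywhere (0 at w = 3 is allowed).
subgrad_F = lambda w: np.sign(w - 3.0)
print(subgradient_descent(subgrad_F, w0=0.0, eta=0.1, T=1000))

With a constant step size the iterates only reach a neighborhood of size roughly eta around the minimizer, which is why the theory discussed below turns to decreasing step sizes.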


What About “Adaptive” Algorithms?

The stochastic setting is even more challenging: the function at each round is the same only in expectation

What about AdaGrad?


AdaGrad

[Figure: AdaGrad iterates on a non-smooth convex objective F(w) plotted against w.]
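A minimal diagonal-AdaGrad sketch in Python (illustrative, not the exact variant analyzed in the talk); eta and eps are conventional constants that still need to be chosen.

import numpy as np

def adagrad(subgrad_F, w0, eta=1.0, eps=1e-8, T=1000):
    # Diagonal AdaGrad: each coordinate's step size shrinks with the
    # accumulated squared subgradients seen so far on that coordinate.
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)          # running sum of squared subgradients
    for _ in range(T):
        g = subgrad_F(w)
        G += g * g
        w -= eta * g / (np.sqrt(G) + eps)
    return w

subgrad_F = lambda w: np.sign(w - 3.0)    # subgradient of F(w) = |w - 3|
print(adagrad(subgrad_F, w0=np.zeros(1)))

Note that AdaGrad adapts the per-coordinate scale but still carries the global step size eta, so it is not parameter-free in the sense pursued in this talk.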


What Does the Theory Say?

Only strategy known: use a decreasing step size, $O(\eta/\sqrt{t})$

Convergence after $T$ iterations is $O\!\left(\frac{1}{\sqrt{T}}\left(\frac{\|w^*\|^2}{\eta} + \eta\right)\right)$

$w^*$ is the best solution

The optimal learning rate is $\eta = \|w^*\|$, which would give a rate of $O\!\left(\frac{\|w^*\|}{\sqrt{T}}\right)$ ...but you don't know $w^*$...

Why can't we have an optimization algorithm that self-tunes its learning rate?
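To make the dependence on $\eta$ concrete, here is an illustrative 1-D Python experiment (the numbers are not from the slides): subgradient descent with step size $\eta/\sqrt{t}$ on $F(w) = |w - w^*|$ with $\|w^*\| = 10$, run with $\eta$ too small, roughly optimal, and too large.

import numpy as np

def sgd_decreasing(subgrad_F, w0, eta, T):
    # Subgradient descent with the decreasing step size eta / sqrt(t).
    w = float(w0)
    for t in range(1, T + 1):
        w -= eta / np.sqrt(t) * subgrad_F(w)
    return w

w_star = 10.0                                 # unknown in practice
subgrad_F = lambda w: np.sign(w - w_star)     # subgradient of |w - w*|

# Theory: eta = ||w*|| is optimal; both under- and over-shooting hurt.
for eta in (0.01, 10.0, 1000.0):
    w_T = sgd_decreasing(subgrad_F, w0=0.0, eta=eta, T=10_000)
    print(f"eta = {eta:7.2f} -> |w_T - w*| = {abs(w_T - w_star):.3f}")

Too small an $\eta$ never reaches $w^*$ within the iteration budget, too large an $\eta$ keeps overshooting, and only $\eta \approx \|w^*\|$, which we cannot know in advance, recovers the $O(\|w^*\|/\sqrt{T})$ rate.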
