Parameter-free Machine Learning through Coin Betting

  • Parameter-free Machine Learning through Coin Betting

    Francesco Orabona

    Boston University, Boston, MA, USA

    Geometric Analysis Approach to AI Workshop

  • Motivation of my Recent Work: Parameter-free Algorithms

    Most of the algorithms we know and use in machine learning have parameters

    E.g., the regularization parameter in LASSO, k in k-Nearest Neighbors, the network topology in deep learning

    Most of the time, a large enough validation set can be used to tune the parameters

    But at what computational price? Are they really necessary?

    Are you ignoring some computational or sample complexities?

    When was the last time you needed to tune a learning rate to sort a vector? Or to invert a matrix?

  • My Dream: Truly Automatic Machine Learning

    No parameters to tune

    No humans in the loop

    Some guarantees

    Sometimes this is possible

    This talk is about a method to achieve this dream in many interesting cases

    It is not the only way, nor the only parameter-free method I have proposed

  • Outline

    1. Optimization of Non-smooth Convex Functions

    2. Coin Betting: Betting on a Coin, From Betting to Optimization, COCOB

    3. Experiments

  • How Does Subgradient Descent Work in Machine Learning?

    [Figure: subgradient descent iterates on a one-dimensional non-smooth convex function F(w), plotted over w]
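As a concrete reference, here is a minimal sketch of the update in the figure. The objective F(w) = |w - 3|, the starting point, and the eta/sqrt(t) schedule are illustrative assumptions, not details from the talk.

```python
import numpy as np

def subgradient_descent(subgrad, w0, eta, T):
    """Subgradient descent with decreasing steps: w_{t+1} = w_t - (eta / sqrt(t)) * g_t."""
    w = w0
    for t in range(1, T + 1):
        g = subgrad(w)                    # any subgradient of F at the current iterate
        w = w - (eta / np.sqrt(t)) * g    # decreasing step size, as in the rates below
    return w

# F(w) = |w - 3| is convex but non-smooth at w = 3;
# sign(w - 3) is a valid subgradient everywhere (0 at the kink is also valid).
w_hat = subgradient_descent(lambda w: np.sign(w - 3.0), w0=0.0, eta=1.0, T=1000)
print(w_hat)  # hovers near 3, within roughly eta / sqrt(T)
```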

  • What About “Adaptive” Algorithms?

    The stochastic setting is even more challenging: the function observed at each round equals the objective only in expectation

    What about AdaGrad?

  • AdaGrad

    [Figure: AdaGrad iterates on a one-dimensional non-smooth convex function F(w), plotted over w]
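For contrast, a minimal sketch of diagonal AdaGrad: each coordinate is scaled by the root of its accumulated squared gradients. The two-coordinate test function is an illustrative assumption; note that the base step size eta still has to be chosen by hand.

```python
import numpy as np

def adagrad(subgrad, w0, eta, T, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate steps eta * g / sqrt(sum of past g^2)."""
    w = np.asarray(w0, dtype=float)
    g_sq = np.zeros_like(w)                       # running sum of squared (sub)gradients
    for _ in range(T):
        g = subgrad(w)
        g_sq += g ** 2
        w = w - eta * g / (np.sqrt(g_sq) + eps)   # coordinate-wise adaptive step
    return w

# Coordinates with very different gradient scales get automatically rescaled steps,
# yet the base step size eta remains a parameter to tune.
sg = lambda w: np.array([np.sign(w[0] - 3.0), 10.0 * np.sign(w[1] + 1.0)])
print(adagrad(sg, w0=[0.0, 0.0], eta=1.0, T=2000))  # approaches [3, -1]
```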

  • What Does the Theory Say?

    The only known strategy: use a decreasing step size $O(\eta/\sqrt{t})$

    Convergence after $T$ iterations is $O\!\left(\frac{1}{\sqrt{T}}\left(\frac{\|w^*\|^2}{\eta} + \eta\right)\right)$, where $w^*$ is the best solution

    The optimal learning rate is $\eta = \|w^*\|$, which would give a rate of $O\!\left(\frac{\|w^*\|}{\sqrt{T}}\right)$

    ...but you don't know $w^*$...

    Why can't we have an optimization algorithm that self-tunes its learning rate?
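The optimal $\eta$ follows from a one-line minimization of the $\eta$-dependent part of the bound, filled in here for completeness:

```latex
h(\eta) = \frac{\|w^*\|^2}{\eta} + \eta, \qquad
h'(\eta) = -\frac{\|w^*\|^2}{\eta^2} + 1 = 0
\;\Longrightarrow\; \eta = \|w^*\|, \qquad
h(\|w^*\|) = 2\,\|w^*\|.
```

Plugging back in gives the $O\!\left(\frac{\|w^*\|}{\sqrt{T}}\right)$ rate, with the factor of 2 absorbed into the $O(\cdot)$.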
