
  • Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

    John C. Duchi (1,2), Elad Hazan (3), Yoram Singer (2)

    (1) University of California, Berkeley
    (2) Google Research
    (3) Technion

    International Symposium on Mathematical Programming 2012

  • Setting: Online Convex Optimization

    Online learning task; repeat:

    • Learner plays point $x_t$
    • Receive function $f_t$
    • Suffer loss $f_t(x_t) + \varphi(x_t)$

    Running example (regularized logistic regression): $x_t$ is a parameter vector for the features
    • Receive label $y_t$, features $\phi_t$
    • Suffer regularized logistic loss
      $\log\left[1 + \exp(-y_t \langle \phi_t, x_t \rangle)\right] + \lambda \|x_t\|_1$

    Goal: attain small regret
    $$\sum_{t=1}^T \big[f_t(x_t) + \varphi(x_t)\big] \;-\; \inf_{x \in \mathcal{X}} \Big[\sum_{t=1}^T f_t(x) + \varphi(x)\Big]$$
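    As a concrete instance of the loss above, a minimal NumPy sketch (not from the slides) of the regularized logistic loss and a subgradient of its smooth part; the $\lambda\|x\|_1$ term is usually handled separately by the composite updates discussed later in the talk.

    ```python
    import numpy as np

    def regularized_logistic_loss(x, phi, y, lam):
        # f_t(x) + varphi(x) = log(1 + exp(-y <phi, x>)) + lam * ||x||_1
        margin = y * phi.dot(x)
        return np.logaddexp(0.0, -margin) + lam * np.abs(x).sum()

    def logistic_subgradient(x, phi, y):
        # gradient of the smooth part log(1 + exp(-y <phi, x>)) with respect to x
        margin = y * phi.dot(x)
        return -y * phi / (1.0 + np.exp(margin))
    ```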


  • Motivation

    Text data (sample snippet):

      "The most unsung birthday in American business and technological history
      this year may be the 50th anniversary of the Xerox 914 photocopier."
      (The Atlantic, July/August 2010)

    High-dimensional image features [figure]

    Other motivation: selecting advertisements in online advertising, document
    ranking, problems with parameterizations of many magnitudes...


  • Goal? Flipping around the usual sparsity game

    $$\min_x \; \|Ax - b\|, \qquad A = [a_1\; a_2\; \cdots\; a_n]^\top \in \mathbb{R}^{n \times d}$$

    Bounds in the usual sparsity-focused setting depend on
    $$\underbrace{\|a_i\|_\infty}_{\text{dense}} \cdot \underbrace{\|x\|_1}_{\text{sparse}}$$

    What we would like:
    $$\underbrace{\|a_i\|_1}_{\text{sparse}} \cdot \underbrace{\|x\|_\infty}_{\text{dense}}$$

    (In general, impossible)


  • Approaches: Gradient Descent and Dual Averaging

    Let $g_t \in \partial f_t(x_t)$:
    $$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \tfrac{1}{2}\|x - x_t\|^2 + \eta_t \langle g_t, x \rangle \Big\}$$
    or
    $$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \frac{\eta_t}{t} \sum_{\tau=1}^t \langle g_\tau, x \rangle + \frac{1}{2t}\|x\|^2 \Big\}$$

    [Figures: $f(x)$ with its linear lower bound $f(x_t) + \langle g_t, x - x_t \rangle$, and the proximal objective $\langle g_t, x \rangle + \tfrac{1}{2}\|x - x_t\|^2$]


  • What is the problem?

    • Gradient steps treat all features as equal

    • They are not!


  • Adapting to Geometry of Space



  • Why adapt to geometry?

    [Figures: level sets of a "hard" geometry vs. a "nice" geometry]

    y_t   φ_{t,1}  φ_{t,2}  φ_{t,3}
     1       1        0        0
    -1       .5       0        1
     1      -.5       1        0
    -1       0        0        0
     1       .5       0        0
    -1       1        0        0
     1      -1        1        0
    -1      -.5       0        1

    Feature 1: frequent, irrelevant
    Feature 2: infrequent, predictive
    Feature 3: infrequent, predictive


  • Adapting to Geometry of the Space

    • Receive $g_t \in \partial f_t(x_t)$
    • Earlier:
      $$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \tfrac{1}{2}\|x - x_t\|^2 + \eta \langle g_t, x \rangle \Big\}$$
    • Now: let $\|x\|_A^2 = \langle x, Ax \rangle$ for $A \succeq 0$. Use
      $$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \tfrac{1}{2}\|x - x_t\|_A^2 + \eta \langle g_t, x \rangle \Big\}$$
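    For intuition, a minimal sketch (assuming $\mathcal{X} = \mathbb{R}^d$, so the argmin has a closed form, and a strictly positive diagonal $A$) of the $A$-weighted step above; the constrained case additionally requires a projection onto $\mathcal{X}$ in the norm $\|\cdot\|_A$.

    ```python
    import numpy as np

    def weighted_gradient_step(x_t, g_t, A_diag, eta):
        # One unconstrained step of
        #   x_{t+1} = argmin_x { 0.5 * ||x - x_t||_A^2 + eta * <g_t, x> }
        # for A = diag(A_diag) with A_diag > 0, i.e. x_{t+1} = x_t - eta * A^{-1} g_t.
        return x_t - eta * g_t / A_diag
    ```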


  • Regret Bounds

    What does adaptation buy?

    • Standard regret bound:
      $$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \frac{1}{2\eta}\|x_1 - x^*\|_2^2 + \frac{\eta}{2}\sum_{t=1}^T \|g_t\|_2^2$$

    • Regret bound with matrix:
      $$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \frac{1}{2\eta}\|x_1 - x^*\|_A^2 + \frac{\eta}{2}\sum_{t=1}^T \|g_t\|_{A^{-1}}^2$$


  • Meta Learning Problem

    • Have regret:
      $$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \frac{1}{\eta}\|x_1 - x^*\|_A^2 + \frac{\eta}{2}\sum_{t=1}^T \|g_t\|_{A^{-1}}^2$$

    • What happens if we minimize over $A$ in hindsight?
      $$\min_A \; \sum_{t=1}^T \big\langle g_t, A^{-1} g_t \big\rangle \quad \text{subject to } A \succeq 0,\; \operatorname{tr}(A) \le C$$

  • Meta Learning Problem

    • What happens if we minimize over $A$ in hindsight?
      $$\min_A \; \sum_{t=1}^T \big\langle g_t, A^{-1} g_t \big\rangle \quad \text{subject to } A \succeq 0,\; \operatorname{tr}(A) \le C$$

    • The solution is of the form
      $$A = c \operatorname{diag}\Big(\sum_{t=1}^T g_t g_t^\top\Big)^{1/2} \;\text{(diagonal)} \qquad A = c \Big(\sum_{t=1}^T g_t g_t^\top\Big)^{1/2} \;\text{(full)}$$
      (where $c$ is chosen to satisfy the trace constraint)

    • Let $g_{1:T,j}$ be the vector of $j$th gradient components. Optimal diagonal:
      $$A_{j,j} \propto \|g_{1:T,j}\|_2$$
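    For the diagonal case, the closed form can be checked directly with Cauchy-Schwarz (a sketch of the standard argument, not spelled out on the slide). Writing $c_j = \|g_{1:T,j}\|_2$ and $A = \operatorname{diag}(s)$,
    $$\sum_{t=1}^T \big\langle g_t, \operatorname{diag}(s)^{-1} g_t \big\rangle = \sum_{j=1}^d \frac{\sum_{t=1}^T g_{t,j}^2}{s_j} = \sum_{j=1}^d \frac{c_j^2}{s_j} \;\ge\; \frac{\big(\sum_{j=1}^d c_j\big)^2}{\sum_{j=1}^d s_j} \;\ge\; \frac{\big(\sum_{j=1}^d c_j\big)^2}{C},$$
    with equality when $s_j \propto c_j$, which is exactly the stated optimum $A_{j,j} \propto \|g_{1:T,j}\|_2$.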


  • Low regret to the best A

    • Let $g_{1:t,j}$ be the vector of $j$th gradient components. At time $t$, use
      $$s_t = \big[\|g_{1:t,j}\|_2\big]_{j=1}^d \quad\text{and}\quad A_t = \operatorname{diag}(s_t)$$
      $$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \tfrac{1}{2}\|x - x_t\|_{A_t}^2 + \eta \langle g_t, x \rangle \Big\}$$

    • Example:

      y_t   g_{t,1}  g_{t,2}  g_{t,3}
       1       1        0        0
      -1       .5       0        1
       1      -.5       1        0
      -1       0        0        0
       1       1        1        0
      -1       1        0        0

      $s_1 = \sqrt{3.5}$, $s_2 = \sqrt{2}$, $s_3 = 1$
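    A quick NumPy check of the per-coordinate scalings in the example above (a sketch using the six gradient rows from the table):

    ```python
    import numpy as np

    # gradient rows g_1, ..., g_6 from the example (columns g_{t,1}, g_{t,2}, g_{t,3})
    G = np.array([
        [ 1.0, 0.0, 0.0],
        [ 0.5, 0.0, 1.0],
        [-0.5, 1.0, 0.0],
        [ 0.0, 0.0, 0.0],
        [ 1.0, 1.0, 0.0],
        [ 1.0, 0.0, 0.0],
    ])

    s = np.linalg.norm(G, axis=0)      # s_j = ||g_{1:6,j}||_2 for each coordinate j
    print(s)                           # [1.8708..., 1.4142..., 1.0]
    print(np.sqrt([3.5, 2.0, 1.0]))    # matches the slide: sqrt(3.5), sqrt(2), 1
    ```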


  • Final Convergence Guarantee

    Algorithm: at time $t$, set
    $$s_t = \big[\|g_{1:t,j}\|_2\big]_{j=1}^d \quad\text{and}\quad A_t = \operatorname{diag}(s_t)$$
    $$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \tfrac{1}{2}\|x - x_t\|_{A_t}^2 + \eta \langle g_t, x \rangle \Big\}$$

    Define the radius
    $$R_\infty := \max_t \|x_t - x^*\|_\infty \le \sup_{x \in \mathcal{X}} \|x - x^*\|_\infty.$$

    Theorem. The final regret bound of AdaGrad:
    $$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le 2 R_\infty \sum_{j=1}^d \|g_{1:T,j}\|_2.$$
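    Putting the pieces together, a minimal NumPy sketch of the diagonal algorithm defined on this slide. Assumptions beyond the slide: $\mathcal{X} = \mathbb{R}^d$ (so no projection is needed), a small `eps` guards against dividing by zero before a coordinate has seen any gradient, and `subgradient_fn(x, t)` is a user-supplied oracle returning some $g_t \in \partial f_t(x_t)$, e.g. the logistic or hinge subgradients sketched elsewhere in this document.

    ```python
    import numpy as np

    def adagrad_diagonal(subgradient_fn, x0, eta, T, eps=1e-12):
        # Diagonal AdaGrad: A_t = diag(||g_{1:t,j}||_2), x_{t+1} = x_t - eta * A_t^{-1} g_t.
        # Unconstrained version; a projection in ||.||_{A_t} would handle a constraint set X.
        x = np.asarray(x0, dtype=float).copy()
        sum_sq = np.zeros_like(x)             # running per-coordinate sum of squared gradients
        for t in range(T):
            g = subgradient_fn(x, t)          # some g_t in the subdifferential of f_t at x_t
            sum_sq += g ** 2
            s = np.sqrt(sum_sq)               # s_t = [||g_{1:t,j}||_2]_{j=1}^d
            x = x - eta * g / (s + eps)       # closed form of the A_t-weighted proximal step
        return x
    ```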


  • Understanding the convergence guarantees I

    • Stochastic convex optimization:
      $$f(x) := \mathbb{E}_P[f(x;\xi)]$$
      Sample $\xi_t$ according to $P$, define $f_t(x) := f(x;\xi_t)$. Then
      $$\mathbb{E}\bigg[f\Big(\frac{1}{T}\sum_{t=1}^T x_t\Big)\bigg] - f(x^*) \le \frac{2 R_\infty}{T} \sum_{j=1}^d \mathbb{E}\big[\|g_{1:T,j}\|_2\big]$$


  • Understanding the convergence guarantees II

    Support vector machine example: define
    $$f(x;\xi) = [1 - \langle x, \xi \rangle]_+, \quad \text{where } \xi \in \{-1,0,1\}^d$$

    • If $\xi_j \ne 0$ with probability $\propto j^{-\alpha}$ for $\alpha > 1$:
      $$\mathbb{E}\bigg[f\Big(\frac{1}{T}\sum_{t=1}^T x_t\Big)\bigg] - f(x^*) = O\bigg(\frac{\|x^*\|_\infty}{\sqrt{T}} \cdot \max\big\{\log d,\; d^{1-\alpha/2}\big\}\bigg)$$

    • Previously best-known method:
      $$\mathbb{E}\bigg[f\Big(\frac{1}{T}\sum_{t=1}^T x_t\Big)\bigg] - f(x^*) = O\bigg(\frac{\|x^*\|_\infty}{\sqrt{T}} \cdot \sqrt{d}\bigg).$$
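    A small data-generation sketch for this example, under assumptions the slide does not pin down: the proportionality constant in $P(\xi_j \ne 0) \propto j^{-\alpha}$ is taken to be 1, and nonzero coordinates get independent random signs. Feeding the resulting hinge subgradients into the AdaGrad sketch above reproduces the setting being compared.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d, alpha = 1000, 1.5

    # P(xi_j != 0) proportional to j^{-alpha}; here the constant is simply 1
    p_nonzero = np.minimum(np.arange(1, d + 1, dtype=float) ** (-alpha), 1.0)

    def sample_xi():
        # one sparse example xi in {-1, 0, 1}^d with power-law coordinate frequencies
        support = rng.random(d) < p_nonzero
        return support * rng.choice([-1.0, 1.0], size=d)

    def hinge_subgradient(x, xi):
        # a subgradient of f(x; xi) = [1 - <x, xi>]_+ at x
        return -xi if 1.0 - xi @ x > 0 else np.zeros(d)
    ```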


  • Understanding the convergence guarantees III

    Back to regret minimization

    • Convergence almost as good as with the best geometry matrix:
      $$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le 2\sqrt{d}\,\|x^*\|_\infty \sqrt{\inf_s \Big\{\sum_{t=1}^T \|g_t\|_{\operatorname{diag}(s)^{-1}}^2 : s \succeq 0,\; \langle \mathbf{1}, s \rangle \le d\Big\}}$$

    • These (and other) bounds are minimax optimal


  • The AdaGrad Algorithms

    The analysis applies to several algorithms. With
    $$s_t = \big[\|g_{1:t,j}\|_2\big]_{j=1}^d, \qquad A_t = \operatorname{diag}(s_t):$$

    • Forward-backward splitting (Lions and Mercier 1979, Nesterov 2007, Duchi and Singer 2009, others)
      $$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \tfrac{1}{2}\|x - x_t\|_{A_t}^2 + \langle g_t, x \rangle + \varphi(x) \Big\}$$

    • Regularized Dual Averaging (Nesterov 2007, Xiao 2010)
      $$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \frac{1}{t}\sum_{\tau=1}^t \langle g_\tau, x \rangle + \varphi(x) + \frac{1}{2t}\|x\|_{A_t}^2 \Big\}$$


  • An Example and Experimental Results

    • ℓ1-regularization
    • Text classification
    • Image ranking
    • Neural network learning

  • AdaGrad with composite updates

    Recall the more general problem:
    $$\sum_{t=1}^T f_t(x_t) + \varphi(x_t) - \inf_{x \in \mathcal{X}} \Big[\sum_{t=1}^T f_t(x) + \varphi(x)\Big]$$

    • Must solve updates of the form
      $$\operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \langle g, x \rangle + \varphi(x) + \tfrac{1}{2}\|x\|_A^2 \Big\}$$

    • Luckily, these are often still simple


  • AdaGrad with ℓ1 regularization

    Set $\bar{g}_t = \frac{1}{t}\sum_{\tau=1}^t g_\tau$. Need to solve
    $$\min_x \; \langle \bar{g}_t, x \rangle + \lambda \|x\|_1 + \frac{1}{2t}\big\langle x, \operatorname{diag}(s_t)\, x \big\rangle$$

    • Coordinate-wise update yields sparsity and adaptivity:
      $$x_{t+1,j} = \operatorname{sign}(-\bar{g}_{t,j})\,\frac{t}{\|g_{1:t,j}\|_2}\,\big[|\bar{g}_{t,j}| - \lambda\big]_+$$

    [Figure: trajectory of $\bar{g}_{t,j}$ over $t$ with the $\lambda$ truncation threshold]
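    A direct NumPy transcription of this coordinate-wise update (a sketch: the step size is taken to be 1 as written on the slide, and coordinates that have not yet seen a nonzero gradient are left at zero). In the dual-averaging loop, `g_bar` is the running average of subgradients and `g_norm` the per-coordinate norms $\|g_{1:t,j}\|_2$, both updated each round.

    ```python
    import numpy as np

    def adagrad_l1_rda_update(g_bar, g_norm, t, lam):
        # Closed-form solution of
        #   min_x <g_bar, x> + lam * ||x||_1 + (1/(2t)) <x, diag(s_t) x>,  s_{t,j} = g_norm[j]:
        #   x_{t+1,j} = sign(-g_bar_j) * (t / g_norm[j]) * [|g_bar_j| - lam]_+
        shrunk = np.maximum(np.abs(g_bar) - lam, 0.0)        # soft-threshold [|g_bar_j| - lam]_+
        x_next = np.zeros_like(g_bar)
        seen = g_norm > 0                                    # skip untouched coordinates
        x_next[seen] = np.sign(-g_bar[seen]) * t / g_norm[seen] * shrunk[seen]
        return x_next
    ```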


  • Text Classification

    Reuters RCV1 document classification task: $d = 2 \cdot 10^6$ features,
    approximately 4000 non-zero features per document
    $$f_t(x) := [1 - \langle x, \xi_t \rangle]_+ \quad\text{where } \xi_t \in \{-1,0,1\}^d \text{ is the data sample}$$

                  FOBOS         AdaGrad       PA (1)   AROW (2)
    Economics     .058 (.194)   .044 (.086)   .059     .049
    Corporate     .111 (.226)   .053 (.105)   .107     .061
    Government    .056 (.183)   .040 (.080)   .066     .044
    Medicine      .056 (.146)   .035 (.063)   .053     .039

    Test set classification error rate (sparsity of final predictor in parentheses)

    (1) Crammer et al., 2006   (2) Crammer et al., 2009


  • Image Ranking

    ImageNet (Deng et al., 2009), large-scale hierarchical image database
    [Figure: hierarchy snippet with nodes Animal, Vertebrate, Invertebrate, Mammal, Fish, Cow]

    Train 15,000 rankers/classifiers to rank images for each noun
    (as in Grangier and Bengio, 2008)

    Data: $\xi = (z^1, z^2) \in \{0,1\}^d \times \{0,1\}^d$ is a pair of images
    $$f(x; z^1, z^2) = \big[1 - \langle x, z^1 - z^2 \rangle\big]_+$$


  • Image Ranking Results

    Precision at $k$: proportion of examples in the top $k$ that belong to the
    category. Average precision is the average placement of all positive examples.

    Algorithm   Avg. Prec.   P@1      P@5      P@10     Nonzero
    AdaGrad     0.6022       0.8502   0.8130   0.7811   0.7267
    AROW        0.5813       0.8597   0.8165   0.7816   1.0000
    PA          0.5581       0.8455   0.7957   0.7576   1.0000
    FOBOS       0.5042       0.7496   0.6950   0.6545   0.8996

  • Neural Network Learning

    Wildly non-convex problem:
    $$f(x;\xi) = \log\Big(1 + \exp\big(\big\langle [p(\langle x_1, \xi_1 \rangle) \;\cdots\; p(\langle x_k, \xi_k \rangle)],\; \xi_0 \big\rangle\big)\Big)$$
    where
    $$p(\alpha) = \frac{1}{1 + \exp(\alpha)}$$

    [Figure: one-hidden-layer network with inputs $\xi_1, \dots, \xi_5$, weight vectors $x_1, \dots, x_5$, and hidden units $p(\langle x_i, \xi_i \rangle)$]

    Idea: use stochastic gradient methods to solve it anyway
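    A direct NumPy transcription of this loss (a sketch: the weights are stored as a $k \times m$ matrix whose rows are $x_1, \dots, x_k$, the per-unit inputs as a matching $k \times m$ matrix of $\xi_1, \dots, \xi_k$, and $\xi_0 \in \mathbb{R}^k$ is the output weight vector):

    ```python
    import numpy as np

    def p(alpha):
        # the slide's link function p(alpha) = 1 / (1 + exp(alpha))
        return 1.0 / (1.0 + np.exp(alpha))

    def neural_net_loss(X, Xi, xi0):
        # f(x; xi) = log(1 + exp(<[p(<x_1, xi_1>), ..., p(<x_k, xi_k>)], xi_0>))
        # X:   (k, m) rows are the hidden-unit weight vectors x_1, ..., x_k
        # Xi:  (k, m) rows are the per-unit inputs xi_1, ..., xi_k
        # xi0: (k,)   output weights xi_0
        hidden = p(np.sum(X * Xi, axis=1))        # [p(<x_i, xi_i>)]_{i=1}^k
        return np.logaddexp(0.0, hidden @ xi0)    # log(1 + exp(<hidden, xi_0>))
    ```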


  • Neural Network Learning

    [Figure: average frame accuracy (%) on the test set vs. time (hours), comparing
    SGD, GPU training, Downpour SGD, Downpour SGD with AdaGrad, and Sandblaster
    L-BFGS (Dean et al. 2012)]

    Distributed, $d = 1.7 \cdot 10^9$ parameters. SGD and AdaGrad use 80 machines
    (1000 cores); L-BFGS uses 800 (10000 cores)

  • Conclusions and Discussion

    • Family of algorithms that adapt to the geometry of the data
    • Extendable to the full matrix case to handle feature correlation
    • Can derive many efficient algorithms for high-dimensional problems, especially with sparsity
    • Future: efficient full-matrix adaptivity, other types of adaptation


  • Thanks!


  • OGD Sketch: "Almost" Contraction

    • Have $g_t \in \partial f_t(x_t)$ (ignore $\varphi$, $\mathcal{X}$ for simplicity)
    • Before: $x_{t+1} = x_t - \eta g_t$
      $$\tfrac{1}{2}\|x_{t+1} - x^*\|_2^2 \le \tfrac{1}{2}\|x_t - x^*\|_2^2 + \eta\big(f_t(x^*) - f_t(x_t)\big) + \tfrac{\eta^2}{2}\|g_t\|_2^2$$
    • Now: $x_{t+1} = x_t - \eta A^{-1} g_t$
      $$\tfrac{1}{2}\|x_{t+1} - x^*\|_A^2 = \tfrac{1}{2}\|x_t - x^*\|_A^2 + \eta \langle g_t, x^* - x_t \rangle + \tfrac{\eta^2}{2}\|g_t\|_{A^{-1}}^2$$
      $$\le \tfrac{1}{2}\|x_t - x^*\|_A^2 + \eta\big(f_t(x^*) - f_t(x_t)\big) + \tfrac{\eta^2}{2}\underbrace{\|g_t\|_{A^{-1}}^2}_{\text{dual norm to } \|\cdot\|_A}$$
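    The equality above is just the quadratic expansion of the weighted norm (a worked step, using $x_{t+1} = x_t - \eta A^{-1} g_t$, $\|v\|_A^2 = \langle v, Av \rangle$, and symmetry of $A$):
    $$\tfrac{1}{2}\|x_{t+1} - x^*\|_A^2 = \tfrac{1}{2}\big\langle x_t - x^* - \eta A^{-1} g_t,\; A\big(x_t - x^* - \eta A^{-1} g_t\big)\big\rangle = \tfrac{1}{2}\|x_t - x^*\|_A^2 - \eta \langle g_t, x_t - x^* \rangle + \tfrac{\eta^2}{2}\big\langle g_t, A^{-1} g_t \big\rangle,$$
    where $\langle g_t, A^{-1} g_t \rangle = \|g_t\|_{A^{-1}}^2$. The inequality then follows from convexity: $f_t(x^*) \ge f_t(x_t) + \langle g_t, x^* - x_t \rangle$.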


  • Hindsight minimization

    • Focus on the diagonal case (the full matrix case is similar):
      $$\min_s \; \sum_{t=1}^T \big\langle g_t, \operatorname{diag}(s)^{-1} g_t \big\rangle \quad \text{subject to } s \succeq 0,\; \langle \mathbf{1}, s \rangle \le C$$

    • Let $g_{1:T,j}$ be the vector of $j$th components. The solution is of the form
      $$s_j \propto \|g_{1:T,j}\|_2$$

  • Low regret to the best A

    $$\sum_{t=1}^T f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)
      \le \underbrace{\frac{1}{2\eta}\sum_{t=1}^T \Big(\|x_t - x^*\|_{A_t}^2 - \|x_{t+1} - x^*\|_{A_t}^2\Big)}_{\text{Term I}}
      + \underbrace{\frac{\eta}{2}\sum_{t=1}^T \|g_t\|_{A_t^{-1}}^2}_{\text{Term II}}$$

  • Bounding Terms

    Define $D_\infty = \max_t \|x_t - x^*\|_\infty \le \sup_{x \in \mathcal{X}} \|x - x^*\|_\infty$

    • Term I:
      $$\sum_{t=1}^T \Big(\|x_t - x^*\|_{A_t}^2 - \|x_{t+1} - x^*\|_{A_t}^2\Big) \le D_\infty^2 \sum_{j=1}^d \|g_{1:T,j}\|_2$$

    • Term II:
      $$\sum_{t=1}^T \|g_t\|_{A_t^{-1}}^2 \le 2 \sum_{t=1}^T \|g_t\|_{A_T^{-1}}^2 = 2 \sum_{j=1}^d \|g_{1:T,j}\|_2$$
      $$= 2\sqrt{\Big(\sum_{j=1}^d \|g_{1:T,j}\|_2\Big) \cdot \inf_s \Big\{\sum_{t=1}^T \big\langle g_t, \operatorname{diag}(s)^{-1} g_t \big\rangle \;\Big|\; s \succeq 0,\; \langle \mathbf{1}, s \rangle \le \sum_{j=1}^d \|g_{1:T,j}\|_2\Big\}}$$
      i.e. within a constant factor of the best fixed diagonal matrix of comparable trace, chosen in hindsight.
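    The first inequality in Term II follows from a scalar lemma applied per coordinate (a sketch of the standard argument, not shown on the slide): for nonnegative $a_1, \dots, a_T$ (reading terms with a zero denominator as $0$),
    $$\sum_{t=1}^T \frac{a_t}{\sqrt{\sum_{\tau=1}^t a_\tau}} \;\le\; 2\sqrt{\sum_{t=1}^T a_t},$$
    which can be shown by induction on $T$ (or by comparison with $\int z^{-1/2}\,dz$). Taking $a_t = g_{t,j}^2$ and summing over $j$ gives $\sum_{t=1}^T \|g_t\|_{A_t^{-1}}^2 = \sum_{j=1}^d \sum_{t=1}^T g_{t,j}^2 / \|g_{1:t,j}\|_2 \le 2\sum_{j=1}^d \|g_{1:T,j}\|_2$.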


    Outline: Brief Review of Online Convex Optimization · AdaGrad Algorithm · Examples and Results · Conclusions