
SpicyMKL: Efficient multiple kernel learning method using dual augmented Lagrangian

Taiji Suzuki, Ryota Tomioka

The University of Tokyo, Graduate School of Information Science and Technology, Department of Mathematical Informatics

23rd Oct., 2009 @ ISM

SpicyMKL = Multiple Kernel Learning + Fast Algorithm

Kernel Learning

• Kernel methods are used for regression and classification.

Kernel Function

• Kernel functions: Gaussian, polynomial, chi-square, …
  – Parameters: Gaussian width, polynomial degree
• Features
  – Computer vision: color, gradient, sift (sift, hsvsift, huesift, scaling of sift), Geometric Blur, image regions, …

→ Thousands of kernels

Multiple Kernel Learning (Lanckriet et al. 2004)

Select important kernels and combine them.

Multiple Kernel Learning (MKL)

Single kernel: $f(x) = \sum_{i=1}^N \alpha_i k(x_i, x)$

Multiple kernels: $f(x) = \sum_{i=1}^N \alpha_i \sum_{m=1}^M d_m k_m(x_i, x)$, i.e., the combined kernel is $k = \sum_{m=1}^M d_m k_m$.

$d_m$: convex combination ($d_m \ge 0$, $\sum_m d_m = 1$), sparse (many 0 components)
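As a concrete illustration (ours, not from the slides), a minimal NumPy sketch of forming the combined Gram matrix $\sum_m d_m K_m$; all names are illustrative:

```python
import numpy as np

def combined_gram(grams, d):
    """Convex combination of Gram matrices: K = sum_m d_m * K_m,
    with d_m >= 0 and sum_m d_m = 1 (many d_m typically 0 in MKL)."""
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0) and np.isclose(d.sum(), 1.0)
    return sum(dm * Km for dm, Km in zip(d, grams))

# Toy usage: three Gaussian kernels of different widths on 1-D data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1))
sq = (X - X.T) ** 2
grams = [np.exp(-sq / (2.0 * s ** 2)) for s in (0.5, 1.0, 2.0)]
K = combined_gram(grams, [0.7, 0.3, 0.0])  # sparse weights: third kernel unused
```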


Sparse learning:

Lasso → (grouping) → Group Lasso → (kernelize) → Multiple Kernel Learning (MKL)

Problem setup:

$\{(x_i, y_i)\}_{i=1}^N$ : N samples
$k_m$ $(m = 1, \dots, M)$ : M kernels
$\mathcal{H}_m$ : RKHS associated with $k_m$
$\ell$ : convex loss (hinge, square, logistic)
$K_m$ : Gram matrix of the m-th kernel, $(K_m)_{ij} = k_m(x_i, x_j)$

L1-regularization → sparse:

$$\min_{f_m \in \mathcal{H}_m,\, b}\ \sum_{i=1}^N \ell\Big(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Big) + C \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}$$

Representer theorem: the optimal $f_m$ can be written as $f_m(\cdot) = \sum_{i=1}^N \alpha_{m,i} k_m(x_i, \cdot)$, so that $\|f_m\|_{\mathcal{H}_m} = \sqrt{\alpha_m^\top K_m \alpha_m}$.

→ Finite dimensional problem:

$$\min_{\alpha_1, \dots, \alpha_M,\, b}\ \sum_{i=1}^N \ell\Big(y_i,\ \sum_{m=1}^M (K_m \alpha_m)_i + b\Big) + C \sum_{m=1}^M \sqrt{\alpha_m^\top K_m \alpha_m}$$
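A minimal sketch of this finite-dimensional objective, taking the logistic loss for concreteness; the function and argument names are ours:

```python
import numpy as np

def mkl_objective(alphas, b, grams, y, C):
    """L1-regularized MKL objective with logistic loss, y in {-1, +1}.

    alphas : list of per-kernel coefficient vectors alpha_m
    grams  : list of Gram matrices K_m
    """
    f = sum(Km @ am for Km, am in zip(grams, alphas)) + b  # decision values
    loss = np.sum(np.logaddexp(0.0, -y * f))               # logistic loss
    # block-L1 penalty: C * sum_m ||f_m||_Hm = C * sum_m sqrt(alpha_m' K_m alpha_m)
    penalty = C * sum(np.sqrt(max(am @ Km @ am, 0.0))
                      for Km, am in zip(grams, alphas))
    return loss + penalty
```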

Proposed: SpicyMKL

• Scales well against # of kernels.
• Formulated in a general framework.

Outline

• Introduction
• Details of Multiple Kernel Learning
• Our method
  – Proximal minimization
  – Skipping inactive kernels
• Numerical Experiments
  – Benchmark datasets
  – Sparse or dense

Details of Multiple Kernel Learning (MKL)

Generalized Formulation

$$\min_{f_m \in \mathcal{H}_m,\, b}\ \sum_{i=1}^N \ell\Big(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Big) + \sum_{m=1}^M g\big(\|f_m\|_{\mathcal{H}_m}\big)$$

$g$ : sparse regularization

Examples:

loss
• hinge: $\ell(y, f) = \max(1 - yf,\ 0)$
• logistic: $\ell(y, f) = \log(1 + e^{-yf})$

regularization
• L1 regularization: $g(x) = C x$
• elastic net: $g(x) = C\big((1 - \theta)\, x + \tfrac{\theta}{2} x^2\big)$, $\theta \in [0, 1]$
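These are all one-liners; a small sketch (our own naming, with the elastic-net parameterization as reconstructed above):

```python
import numpy as np

def hinge(y, f):                   # max(1 - y f, 0)
    return np.maximum(1.0 - y * f, 0.0)

def logistic(y, f):                # log(1 + exp(-y f)), numerically stable
    return np.logaddexp(0.0, -y * f)

def l1_reg(x, C):                  # g(x) = C x, for x = ||f_m|| >= 0
    return C * x

def elastic_net_reg(x, C, theta):  # g(x) = C((1 - theta) x + theta x^2 / 2)
    return C * ((1.0 - theta) * x + 0.5 * theta * x ** 2)
```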

Rank 1 decomposition

If $g$ is monotonically increasing and the loss term is differentiable at the optimum, the solution of MKL is rank 1: there exist $\alpha \in \mathbb{R}^N$ and $d_m \ge 0$ such that

$$\hat f_m(\cdot) = d_m \sum_{i=1}^N \alpha_i\, k_m(x_i, \cdot) \quad (m = 1, \dots, M),$$

i.e., the learned function uses a convex combination of kernels, $\sum_m d_m k_m$ (up to scaling).

Proof: take the derivative w.r.t. $f_m$ and set it to zero.
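A sketch of the omitted derivative step, in the notation of the generalized formulation above (assuming $\hat f_m \neq 0$; kernels with $\hat f_m = 0$ simply get $d_m = 0$):

```latex
% Stationarity w.r.t. f_m, writing \hat y_i = \sum_m \hat f_m(x_i) + b
% and using the reproducing property of H_m:
\begin{align*}
0 &= \sum_{i=1}^N \ell'\big(y_i, \hat y_i\big)\, k_m(x_i, \cdot)
     + g'\big(\|\hat f_m\|\big)\, \frac{\hat f_m}{\|\hat f_m\|}
\\
\Rightarrow\quad
\hat f_m &= d_m \sum_{i=1}^N \alpha_i\, k_m(x_i, \cdot),
\qquad
\alpha_i = -\ell'\big(y_i, \hat y_i\big),
\qquad
d_m = \frac{\|\hat f_m\|}{g'\big(\|\hat f_m\|\big)} \ge 0 .
\end{align*}
% The coefficient vector (alpha_i) is shared by all kernels, so the
% solution is rank 1 and the effective kernel is proportional to
% \sum_m d_m k_m.
```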

Existing approach 1

Constraint-based formulation [Lanckriet et al. 2004, JMLR] [Bach, Lanckriet & Jordan 2004, ICML]: optimize the kernel weights $d_m \ge 0$, $\sum_m d_m = 1$ in the primal, and solve the resulting dual problem.

• Lanckriet et al. 2004: SDP
• Bach, Lanckriet & Jordan 2004: SOCP

Existing approach 2

Upper bound based formulation [Sonnenburg et al. 2006, JMLR] [Rakotomamonjy et al. 2008, JMLR] [Chapelle & Rakotomamonjy 2008, NIPS workshop]: upper-bound the block-norm regularizer in the primal by a weighted sum of squared norms (Jensen's inequality) and alternately optimize the weights.

• Sonnenburg et al. 2006: cutting plane (SILP)
• Rakotomamonjy et al. 2008: gradient descent (SimpleMKL)
• Chapelle & Rakotomamonjy 2008: Newton method (HessianMKL)
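For reference, the variational identity behind these wrapper methods (a standard Cauchy-Schwarz/Jensen argument, not spelled out on the slide):

```latex
\begin{align*}
\Big( \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m} \Big)^{2}
  = \min_{d_m \ge 0,\ \sum_m d_m = 1}\
    \sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^{2}}{d_m},
\qquad
d_m^{\ast} \propto \|f_m\|_{\mathcal{H}_m}.
\end{align*}
% Fixing d turns MKL into a smooth single-kernel problem with kernel
% \sum_m d_m k_m; the wrapper methods alternate between solving it
% and updating d.
```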

Problem of existing methods

• They do not make use of sparsity during the optimization.

→ They do not perform efficiently when # of kernels is large.

Outline

• Introduction
• Details of Multiple Kernel Learning
• Our method
  – Proximal minimization
  – Skipping inactive kernels
• Numerical Experiments
  – Benchmark datasets
  – Sparse or dense

Our method: SpicyMKL

• DAL (Tomioka & Sugiyama 09)
  – Dual Augmented Lagrangian
  – Lasso, group Lasso, trace norm regularization
• SpicyMKL = DAL + MKL
  – kernelization of DAL

Difficulty of MKL (Lasso)

The regularizer $g(\|f_m\|)$ is non-differentiable at 0 (like the L1 penalty, it has a kink at the origin).

→ Relax by proximal minimization (Rockafellar, 1976).

Our algorithm

SpicyMKL: Proximal Minimization (Rockafellar, 1976)

Set some initial solution $\alpha^{(0)}$ and choose an increasing sequence $\gamma_t$. Iterate the following update rule until convergence:

$$\alpha^{(t+1)} = \operatorname*{argmin}_{\alpha}\ \Big\{ F(\alpha) + \frac{1}{2\gamma_t}\,\|\alpha - \alpha^{(t)}\|^2 \Big\},$$

where $F$ is the MKL objective.

• The quadratic term regularizes toward the last update.
• Taking its dual, the update step can be solved efficiently.

It can be shown that the iterates converge super linearly!
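A toy sketch of the outer proximal-point loop (ours); the objective below is an L1 problem chosen so that each proximal step has a closed form, which illustrates only the update shape, not SpicyMKL's dual inner solver:

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft threshold: prox of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_minimization(x0, gammas, tol=1e-10):
    """Proximal point iterations on F(z) = ||z||_1 + 0.5 * ||z - 1||^2."""
    x = np.asarray(x0, dtype=float)
    for gamma in gammas:
        # argmin_z F(z) + ||z - x||^2 / (2 gamma): completing the square in
        # the smooth part gives a weighted average w of the data (1) and the
        # previous iterate x, and the L1 part soft-thresholds it.
        w = (gamma * 1.0 + x) / (gamma + 1.0)
        x_new = soft_threshold(w, gamma / (gamma + 1.0))
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

x_opt = proximal_minimization(np.zeros(3), gammas=[1, 2, 4, 8, 16, 32])
# exact minimizer of F is soft_threshold(1, 1) = 0, which the loop reproduces
```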


Fenchel's duality theorem

Split the objective into two parts, the loss term and the (non-differentiable) regularization term, and take the dual. In the dual, the non-differentiable penalty is replaced by its Moreau envelope, which is smooth!

Inner optimization

• The dual of the update step is twice differentiable (a.e.), so the Newton method is applicable.
• Even if the loss is not differentiable, we can apply almost the same algorithm.

Rigorous derivation of the dual: convolution and duality (the Moreau envelope).

Moreau envelope (Moreau 65)

For a convex function $f$, we define its Moreau envelope as

$$\tilde f_\gamma(x) = \min_y \Big\{ f(y) + \frac{1}{2\gamma}\,\|x - y\|^2 \Big\}.$$

Applied to the dual of the L1 penalty, the Moreau envelope is smooth!
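For intuition, the Moreau envelope of the absolute value is the Huber function, quadratic near 0 and linear in the tails; a small numerical sketch (helper names are ours):

```python
import numpy as np

def prox_abs(x, gamma):
    """Proximal operator of f(x) = |x|: the soft threshold."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def moreau_envelope_abs(x, gamma):
    """Moreau envelope of f(x) = |x|, evaluated via its minimizer:
    equals x^2 / (2 gamma) for |x| <= gamma and |x| - gamma/2 beyond."""
    y = prox_abs(x, gamma)
    return np.abs(y) + (x - y) ** 2 / (2.0 * gamma)

print(moreau_envelope_abs(np.linspace(-3, 3, 7), gamma=1.0))
```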

We arrived at …

Dual of the update rule in proximal minimization:

• Differentiable
• Only active kernels need to be calculated (next slide).

What we have observed

The dual update applies a soft threshold to each kernel's block of variables (while the primal weights show the corresponding hard-threshold sparsity).

If a kernel is inactive (its block falls below the threshold), its term is 0, and its gradient also vanishes.

Skipping inactive kernels

In the objective function of the dual update and in its derivatives, each kernel appears only through a soft-thresholded term; for inactive kernels both the term and its derivatives are exactly 0.

• Only active kernels contribute to the computations.
→ Efficient for a large number of kernels.
• Since the derivatives are available, we can apply the Newton method to the dual optimization.
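A schematic of the skipping pattern (ours): the per-kernel vector v_m below is illustrative rather than the paper's exact dual quantity, but the soft-threshold structure, and hence the skipping, is the point:

```python
import numpy as np

def dual_objective_and_grad(rho, grams, alpha_prev, gamma, C):
    """Accumulate only the active kernels' contributions."""
    obj, grad = 0.0, np.zeros_like(rho)
    for Km, am in zip(grams, alpha_prev):
        v = am + gamma * (Km @ rho)     # illustrative per-kernel vector
        norm_v = np.linalg.norm(v)
        if norm_v <= C:                 # inactive kernel: its term and its
            continue                    # gradient are both exactly 0 - skip
        shrink = (norm_v - C) / norm_v  # soft-threshold factor
        obj += (norm_v - C) ** 2 / (2.0 * gamma)
        grad += shrink * (Km @ v)       # chain rule through v
    return obj, grad
```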

Convergence result

Thm. For any proper convex loss and regularizer, and any non-decreasing sequence $\gamma_t$, the iterates converge to an optimal solution.

Thm. If the loss is strongly convex and $g$ is a sparse regularizer, for any increasing sequence $\gamma_t$, the convergence rate is super linear.

Non-differentiable loss

• If the loss is non-differentiable, almost the same algorithm is available by utilizing the Moreau envelope of the dual loss (primal → dual → Moreau envelope of the dual).

Dual of elastic net

The dual of the elastic net penalty is smooth!

Naïve optimization in the dual is therefore applicable, but we applied the proximal minimization method for small values of the elastic-net parameter to avoid numerical instability.
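With the elastic-net parameterization used above, the proximal operator also has a closed form, a soft threshold followed by a multiplicative shrink; a short sketch (ours):

```python
import numpy as np

def prox_elastic_net(v, gamma, C, theta):
    """prox of g(x) = C((1 - theta)|x| + theta x^2 / 2):
    theta = 0 recovers the L1 prox; theta = 1 is a pure shrink (dense)."""
    st = np.sign(v) * np.maximum(np.abs(v) - gamma * C * (1.0 - theta), 0.0)
    return st / (1.0 + gamma * C * theta)
```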

Why ‘Spicy’?

• SpicyMKL = DAL + MKL.
• “Dal” is also a major Indian cuisine → hot, spicy → SpicyMKL.

[Photo: “Dal”, from Wikipedia]

Outline

• Introduction
• Details of Multiple Kernel Learning
• Our method
  – Proximal minimization
  – Skipping inactive kernels
• Numerical Experiments
  – Benchmark datasets
  – Sparse or dense

Numerical Experiments

CPU time against # of kernels (IDA data set, L1 regularization)

[Figure: CPU time vs. # of kernels on Ringnorm and Splice.]

• SpicyMKL scales well against # of kernels.

CPU time against # of samples (IDA data set, L1 regularization)

[Figure: CPU time vs. # of samples on Ringnorm and Splice.]

• Scaling against # of samples is almost the same as for the existing methods.

UCI datasets; # of kernels per dataset: 189, 243, 918, 891, 1647.

Comparison in elastic net (sparse vs. dense)

Elastic net: $g(x) = C\big((1 - \theta)\, x + \tfrac{\theta}{2} x^2\big)$, where $\theta \in [0, 1]$.

Sparse or dense? $\theta = 0$ gives a sparse (L1) solution; $\theta = 1$ gives a dense solution.

Compare performances of sparse and dense solutions in a computer vision task.

caltech101 dataset (Fei-Fei et al., 2004)

• object recognition task
• 5 classes: anchor, ant, cannon, chair, cup
• 10 classification problems (10 = 5×(5-1)/2)
• # of kernels: 1760 = 4 (features) × 22 (regions) × 2 (kernel functions) × 10 (kernel parameters)
  – sift features [Lowe 99], computed with colorDescriptor [van de Sande et al. 10]: hsvsift, sift (scale = 4px, 8px, automatic scale)
  – regions: whole region, 4 divisions, 16 divisions, spatial pyramid (a histogram of features computed on each region)
  – kernel functions: Gaussian and χ² kernels, with 10 kernel parameters each

[Figure: accuracy against the elastic-net parameter (sparse → dense), averaged over all classification problems; medium density is best.]

• Medium density gave the best performance.
• Performance of the sparse solution is comparable with the dense solution.

Conclusion and Future work

Conclusion
• We proposed a new MKL algorithm that is efficient when # of kernels is large.
  – proximal minimization
  – neglecting ‘inactive’ kernels
• Medium density showed the best performance, but the sparse solution also works well.

Future work
• Second order update

• Technical report: T. Suzuki & R. Tomioka. SpicyMKL. arXiv: http://arxiv.org/abs/0909.5026

• DAL: R. Tomioka & M. Sugiyama. Dual Augmented Lagrangian Method for Efficient Sparse Reconstruction. IEEE Signal Processing Letters, 16(12), pp. 1067–1070, 2009.

Thank you for your attention!


(Appendix) Moreau envelope and proximal operator

Moreau envelope: $\tilde f_\gamma(x) = \min_y \big\{ f(y) + \frac{1}{2\gamma}\|x - y\|^2 \big\}$

Proximal operator: $\mathrm{prox}_{\gamma f}(x) = \operatorname*{argmin}_y \big\{ f(y) + \frac{1}{2\gamma}\|x - y\|^2 \big\}$

Some properties:

• (Fenchel duality th.) $x = \mathrm{prox}_{\gamma f}(x) + \gamma\, \mathrm{prox}_{\gamma^{-1} f^*}(x / \gamma)$
• (Fenchel duality th.) $\tilde f_\gamma(x) + \widetilde{(f^*)}_{1/\gamma}(x / \gamma) = \frac{\|x\|^2}{2\gamma}$
• $\tilde f_\gamma$ is differentiable! $\nabla \tilde f_\gamma(x) = \big(x - \mathrm{prox}_{\gamma f}(x)\big) / \gamma$
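A quick numerical check of the differentiability property for $f(x) = |x|$ (helper names are ours):

```python
import numpy as np

def prox_abs(x, gamma):
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def envelope_abs(x, gamma):
    y = prox_abs(x, gamma)
    return np.abs(y) + (x - y) ** 2 / (2.0 * gamma)

gamma, eps = 0.7, 1e-6
for x in (-2.0, -0.3, 0.05, 1.5):
    fd = (envelope_abs(x + eps, gamma) - envelope_abs(x - eps, gamma)) / (2 * eps)
    analytic = (x - prox_abs(x, gamma)) / gamma  # gradient of the envelope
    assert abs(fd - analytic) < 1e-5
```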
