Source: ml.typepad.com/Talks/ob_marseille.pdf

Machine Learning: Challenges and Applications

Olivier Bousquet, Pertinence

Outline

1 Motivation

2 How to do it?

3 Challenges

4 Applications

Spam Filtering

Spam Filtering Algorithm

List words that may indicate spam

Take into account possible variations

Count these words/variations in incoming messages

Choose a threshold above which the message is classified as spam

Constantly refine this algorithm as new spam gets ignored, or correct messages are rejected (update the list, change the thresholds...)
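The procedure above amounts to a hand-tuned classifier. A minimal sketch in Python (the word list, the crude plural handling, and the threshold are illustrative choices, not from the talk):

```python
import re

# Illustrative spam indicators and threshold; in practice both are
# refined by hand as misclassified messages are observed.
SPAM_WORDS = {"winner", "free", "viagra", "urgent"}
THRESHOLD = 2

def spam_score(message: str) -> int:
    """Count listed words in the message, case-insensitively, with a
    crude handling of variations (trailing 's' stripped)."""
    tokens = re.findall(r"[a-z]+", message.lower())
    return sum(1 for t in tokens if t.rstrip("s") in SPAM_WORDS)

def is_spam(message: str) -> bool:
    return spam_score(message) >= THRESHOLD

print(is_spam("URGENT: you are a winner, claim your free prize"))  # True
print(is_spam("Meeting moved to 3pm"))                             # False
```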

Manufacturing Process Control

High Pressure Die Casting

Inject liquid aluminium into a die, adjust temperature at various positions. Try to avoid bubbles.

Approach

Build a simulation model

Use Navier-Stokes + phase transitions (solid/liquid) + air/liquid interface, complex geometry, heat transfer...

Requires huge computational resources (finite element methods)

No guarantee that the model is correct (what about impurities?)

Is there a better way?

Can we solve such problems in a simpler and more direct way?

Can we build systems that can solve a wide variety of such problems (without starting from scratch for each new similar problem)?

Idea: Use an empirical approach

Induction

Start from Data or Observations

Build a model which

Agrees with the data

Predicts unobserved data

Scientific Method: laws are induced from observations. A law is correct as long as no experiment contradicts it.

Overfitting

Data is not always nicely separated. The figures (omitted here) contrast three fits: an underfitting model, an overfitting model, and a reasonable model.

Ingredients

Data representation: choose a way to represent the objects to be classified

Class of functions: choose a way to express the classification function

Preference: incorporate assumptions about the model regularity

Algorithm: set up the problem as an optimization problem

Formalization

Example: statistical formalization (not the only one)

Data: $(X_1, Y_1), \ldots, (X_n, Y_n)$

Function class: $\mathcal{F}$, $f : \mathcal{X} \to \mathcal{Y}$

Loss function: $\ell(f(x), y) = 1_{[f(x) \neq y]}$

Goal: find a function $f \in \mathcal{F}$ such that $\mathbb{E}\,\ell(f(X), Y)$ is as small as possible, and $f$ is regular enough

Optimization problem:
$$\min_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(X_i), Y_i) + \lambda \|f\|$$

Key question: convergence of $\mathbb{E}\,\ell(\hat{f}(X), Y)$ to $\min_{f \in \mathcal{F}} \mathbb{E}\,\ell(f(X), Y)$
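A numerical sketch of this optimization problem, with two assumptions not on the slide: f is taken linear, and the non-convex 0-1 loss is replaced by its standard logistic surrogate so plain gradient descent applies (the toy data is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data: labels in {-1, +1}, class-dependent means.
n, d = 200, 2
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * 1.5 + rng.normal(size=(n, d))

lam = 0.1  # regularization weight lambda

def gradient(w):
    # Gradient of (1/n) sum_i log(1 + exp(-y_i <w, X_i>)) + lam * ||w||^2
    margins = y * (X @ w)
    p = 1.0 / (1.0 + np.exp(margins))        # = sigmoid(-margin)
    return -(X * (y * p)[:, None]).mean(axis=0) + 2.0 * lam * w

w = np.zeros(d)
for _ in range(500):                          # plain gradient descent
    w -= 0.5 * gradient(w)

train_error = float(np.mean(np.sign(X @ w) != y))
print(train_error)  # low: the two classes are well separated
```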

High Dimensionality

Complexity of the data increases: requires bigger description vectors

Gathering data becomes easier: more and more descriptors available

This increase in the dimensionality comes at a price

Overfitting becomes the main issue (curse of dimensionality). But at the same time, interesting phenomena occur (blessing).

Curses

NP-completeness: optimization problems can be exponential in the dimension (e.g. search for the separating hyperplane with the minimum number of mistakes)

Statistical issue: exponentially slow convergence

Theorem

Let $\mathcal{F}$ be the set of all Lipschitz functions on $[0, 1]^d$. For any estimator,
$$\sup_{f \in \mathcal{F}} \mathbb{E}\,(\hat{f}(x) - f(x))^2 \geq C\, n^{-2/(2+d)} \quad \text{as } n \to \infty$$

Blessings

Concentration-of-measure phenomenon

On the n-sphere, for the usual rotationally-invariant measure, most of the mass is concentrated near any (n − 1)-sphere that is equatorial in it (analogous to a great circle on the Earth's surface, when n = 2). In other words, the 'poles' and their neighborhoods down to very small latitudes account for a tiny proportion of the 'area'.

Chernoff bounds: for i.i.d. bounded random variables,
$$\mathbb{P}\left[\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}X \geq \varepsilon\right] \leq e^{-nc\varepsilon^2}$$

General concentration inequality: under appropriate conditions on $f$ (e.g. Lipschitz),
$$\mathbb{P}\left[f(X_1, \ldots, X_n) - \mathbb{E}f \geq \varepsilon\right] \leq e^{-nc\varepsilon^2}$$
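A quick empirical illustration of the Chernoff bound (uniform [0, 1] variables, so EX = 1/2; a two-sided deviation is measured, and the trial count is an arbitrary choice):

```python
import numpy as np

def deviation_prob(n, eps=0.1, trials=5000, seed=1):
    """Estimate P[|(1/n) sum X_i - EX| >= eps] for X_i ~ Uniform[0, 1]
    by Monte Carlo over independent samples of size n."""
    rng = np.random.default_rng(seed)
    means = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
    return float((np.abs(means - 0.5) >= eps).mean())

# The deviation probability shrinks (roughly exponentially) as n grows.
for n in (10, 100, 1000):
    print(n, deviation_prob(n))
```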

Blessings

Random Projections

Lemma (Johnson & Lindenstrauss)

Given a set $S$ of points in $\mathbb{R}^d$, if we perform a random orthogonal projection of those points onto a subspace of dimension $m$, then $m = O(\gamma^{-2} \log |S|)$ is sufficient so that, with high probability, all pairwise distances are preserved up to a factor $1 \pm \gamma$.

→ a cheap and easy way to reduce the dimension while preserving the geometry
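The lemma is easy to check numerically. Here a Gaussian random matrix scaled by 1/√m stands in for the random orthogonal projection, a standard and nearly equivalent choice; the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n_points = 10000, 500, 50

X = rng.normal(size=(n_points, d))           # a set S of points in R^d
P = rng.normal(size=(d, m)) / np.sqrt(m)     # random projection to R^m
Xp = X @ P

def pairwise_dists(A):
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>
    sq = (A ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return np.sqrt(np.maximum(d2, 0.0))

mask = ~np.eye(n_points, dtype=bool)
ratios = pairwise_dists(Xp)[mask] / pairwise_dists(X)[mask]
print(float(ratios.min()), float(ratios.max()))  # both close to 1
```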

Connections

Theoretical

Logic
Information Theory / Compression
Algorithmic Information Theory
Mathematical Statistics

High-dimensional phenomena

Probability in Banach spaces
Random graphs, graph theory
Concentration

Algorithmics

Optimization
Approximate algorithms

Dealing with Dimensionality

$$\min_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(X_i), Y_i) + \lambda \|f\|$$

Important questions

How to choose the norm?

What can be said about convergence?

How can this be implemented?

Choosing the norm

Typical approach: linear methods
$$f(x) = \sum_{i=1}^d \alpha_i x_i$$

L2 norm: by duality everything can be expressed in terms of inner products
$$\sum_{i=1}^d x^i z^i$$
this allows generalization to reproducing kernel Hilbert spaces.

L1 norm: ensures sparsity; only a few $\alpha_i$ will be non-zero; this allows dealing with very high-dimensional representations.
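A small illustration of the sparsity effect (regression rather than classification, for simplicity; the data and λ are invented). Ridge (L2 penalty) keeps all coordinates, while the L1-penalized solution, computed here with ISTA (proximal gradient), keeps only a few:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 50
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.5, 1.0]                # only 3 relevant coordinates
y = X @ true_w + 0.1 * rng.normal(size=n)

lam = 5.0

# L2 penalty (ridge): closed form; coefficients shrink but stay dense.
w_l2 = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# L1 penalty (lasso) via ISTA: gradient step, then soft-thresholding.
step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1 / largest eigenvalue of X^T X
w_l1 = np.zeros(d)
for _ in range(2000):
    w_l1 -= step * (X.T @ (X @ w_l1 - y))
    w_l1 = np.sign(w_l1) * np.maximum(np.abs(w_l1) - step * lam, 0.0)

print(int((np.abs(w_l2) > 1e-6).sum()))      # dense: all coordinates non-zero
print(int((np.abs(w_l1) > 1e-6).sum()))      # sparse: only a few alpha_i survive
```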

L2: Kernels

Idea: replace inner products by positive definite kernels

Three reasons why this is interesting:

a good way to measure smoothness in high dimensions
allows dealing with complex objects
an easy way to non-linearize

Also, interesting geometric properties (balls are effectively low-dimensional in $L_2(P)$)
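A sketch of the idea with kernel ridge regression (the talk names no specific algorithm; this choice is an assumption): the method is linear in the RKHS, every inner product becomes a Gaussian (RBF) kernel evaluation, and the resulting fit is non-linear in the input:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(A, B, gamma=10.0):
    # k(x, z) = exp(-gamma * ||x - z||^2), a positive definite kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=100)

# Ridge regression in the RKHS: solve (K + lam I) alpha = y.
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + 0.1 * np.eye(100), y)

X_test = np.linspace(-1.0, 1.0, 50)[:, None]
pred = rbf_kernel(X_test, X) @ alpha
mse = float(np.mean((pred - np.sin(3.0 * X_test[:, 0])) ** 2))
print(mse)  # small: a non-linear function learned by a linear method
```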

L1: Boosting

Use sparsity to introduce many dimensions

Given a set $H$ of interesting features, replace $x$ by $(h(x))_{h \in H}$

Build a linear combination with a small L1 norm

Effective algorithms (e.g. AdaBoost); the geometry is not significantly altered by replacing $H$ by its convex hull
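A minimal AdaBoost with decision stumps as the feature set H (the 1-D toy problem is invented: no single stump fits it, but a small weighted combination of stumps does):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 1.0, size=n)
y = np.where((x < 0.3) | (x > 0.7), 1.0, -1.0)   # needs more than one stump

def stump_predict(x, thr, sign):
    return sign * np.where(x < thr, 1.0, -1.0)

weights = np.full(n, 1.0 / n)
ensemble = []                                     # (coefficient, thr, sign)
thresholds = np.linspace(0.0, 1.0, 101)
for _ in range(30):
    # Weak learner: the stump with the smallest weighted error.
    thr, s = min(((t, s) for t in thresholds for s in (1.0, -1.0)),
                 key=lambda ts: float(weights @ (stump_predict(x, *ts) != y)))
    pred = stump_predict(x, thr, s)
    err = float(weights @ (pred != y))
    a = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
    ensemble.append((a, thr, s))
    weights *= np.exp(-a * y * pred)              # up-weight the mistakes
    weights /= weights.sum()

F = sum(a * stump_predict(x, thr, s) for a, thr, s in ensemble)
train_error = float(np.mean(np.sign(F) != y))
print(train_error)  # small after 30 rounds
```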

Convergence Properties

Concentration ensures
$$\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i) - \mathbb{E}\,\ell(f(X), Y) \approx \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i) - \mathbb{E}\,\ell(f(X), Y)\right]$$

The expectation term can be written as $\mathbb{E}\,\|\sum X_i\|$

Key concept: the geometry induced by $\mathcal{F}$ and the distribution; the "size" of $\mathcal{F}$ can be measured by metric entropy, Rademacher averages, majorizing measures, ...

Computational Tricks

Typically use convex optimization (but the number of variables can be huge)

Kernels: fast way of computing inner products

Boosting: few relevant features, fast selection of the most important feature

Random Projections: cheap and easy way to reduce dimensionality

Olivier Bousquet, Pertinence Machine Learning: Challenges and Applications

Motivation How to do it? Challenges Applications

Computational Tricks

Typically use convex optimization (but number of variables can behuge)

Kernels: fast way of computing inner products

Boosting: few relevant features, fast selection of mostimportant feature

Random Projections: cheap and easy way to reducedimensionality

Olivier Bousquet, Pertinence Machine Learning: Challenges and Applications

Motivation How to do it? Challenges Applications

Computational Tricks

Typically use convex optimization (but number of variables can behuge)

Kernels: fast way of computing inner products

Boosting: few relevant features, fast selection of mostimportant feature

Random Projections: cheap and easy way to reducedimensionality

Olivier Bousquet, Pertinence Machine Learning: Challenges and Applications

Motivation How to do it? Challenges Applications

Computational Tricks

Typically use convex optimization (but number of variables can behuge)

Kernels: fast way of computing inner products

Boosting: few relevant features, fast selection of mostimportant feature

Random Projections: cheap and easy way to reducedimensionality

Olivier Bousquet, Pertinence Machine Learning: Challenges and Applications

Motivation How to do it? Challenges Applications

Outline

1 Motivation

2 How to do it?

3 Challenges

4 Applications


To be discussed next

A wide variety of applications

Future directions


Recommendation Engines

Olivier Bousquet, Pertinence Machine Learning: Challenges and Applications

Motivation How to do it? Challenges Applications

News Categorization


News


Autonomous Vehicles


Other Applications

Bioinformatics

Recognition problems (speech, handwritten text, images...)

Data Mining

Search Engines

...


Machine Learning Can Make You Rich

Mathematical Finance is the scientific topic of choice for getting a good job

Machine Learning is not too bad either these days!

Recruiting companies: big shots of the internet betting heavily on this technology

Google / Yahoo! / Microsoft
Also some start-ups: Pertinence

Money Prizes

DARPA Grand Challenge 2005: autonomous vehicle, $2M (http://www.grandchallenge.org)
Netflix Prize 2006: movie ratings, $1M (http://www.netflixprize.com)
Hutter Prize: compression of web documents, $50K (http://prize.hutter1.net/)


Future Directions

Practically

Massive databases: require specific algorithms
More and more structure: e.g. machine translation
Multimedia: e.g. video streams

Theoretically

Better connection between approximation/estimation/computation
Better understanding of model selection (e.g. cross-validation)
Formalization and understanding of feature selection
