
Page 1:

Introduction to Machine Learning

Laurent Risser
Institut de Mathématiques de Toulouse
[email protected]

Page 2:

0) From Statistics to Machine Learning

As a starter: From Statistics to Machine Learning

Page 3:

0) From Statistics to Machine Learning

1940–70: Classic Statistics dealing with tested hypotheses (e.g. are the students in class A significantly taller than the students in class B?). There are n ≈ 30 observations and p < 10 variables.

1970s: The use of computers becomes increasingly popular. Larger volumes of data are explored in Statistics. Expert Systems also start making automatic decisions based on rules injected by human experts (e.g. if blood pressure < threshold and spots on the skin, then prescribe a specific medication).

1980s: Expert Systems are made obsolete by Machine Learning and Neural Networks. Decision rules are automatically defined based on observed data.

1990s: 1st paradigm shift: Observed data are no longer planned but stored in databases and then re-used: from Data Mining to Knowledge Discovery.

2000s: 2nd paradigm shift: The number of variables p becomes increasingly large (typically in omic data, where p >> n). Prediction quality matters more than model interpretability (black-box models). The curse of dimensionality makes regularization important.

2010s: 3rd paradigm shift: The number of observations n is now increasingly large (e-commerce, geo-localisation, …). Databases are structured in clouds and computations run on clusters (big data). Decisions also have to be almost immediate.

Current challenges:
… fast and robust interpretation of videos (autonomous vehicles)
… explainability of black-box decision rules (social issues and certifiability)
… complex data (small data)

Course of P. Besse (INSA Toulouse) on Statistical Learning: http://www.math.univ-toulouse.fr/~besse/enseignement.html


Page 7:

Talk overview

• From statistics to machine learning
• Introductory examples
  • Supervised learning
  • Unsupervised learning
• Classic algorithms in machine learning
  • K-means
  • Decision trees and Random forests
  • SVM
  • Logistic regression
• Overfitting and cross-validation
  • Overfitting
  • Cross-validation
• High dimensionality and regularization
  • Modeling a real-life problem
  • Effect of regularization
  • Dimensionality reduction using PCA
• Supervised learning using Neural networks
• Conclusion

Page 8:

1) Introductory examples

Two introductory examples

Page 9:

1.a) Introductory examples — Supervised learning

Supervised learning — Training Data

n = 20 observations, p = 2 variables (problem dimension), label with 2 states

[Scatter plot: labelled observations, Variable 1 vs. Variable 2]

Page 10:

1.a) Introductory examples — Supervised learning

Supervised learning — Training Data

Example: Variable 1 = Age, Variable 2 = Monthly income, State = Buys a product at Christmas

n = 20 observations, p = 2 variables (problem dimension), label with 2 states

[Scatter plot: labelled observations, Variable 1 vs. Variable 2]

Page 11:

1.a) Introductory examples — Supervised learning

Supervised learning — Prediction

n = 20 observations, p = 2 variables (problem dimension), label with 2 states

[Scatter plot: a new, unlabelled observation "?" among the labelled points]

Most likely state of "?"

Page 12:

1.a) Introductory examples — Supervised learning

Supervised learning — Prediction

Supervised learning (here using a linear model), then prediction on new data

n = 20 observations, p = 2 variables (problem dimension), label with 2 states

[Scatter plot: linear decision boundary separating the two labels; most likely state of "?"]

Page 13:

1.a) Introductory examples — Supervised learning

Supervised learning (here using a linear model), then prediction on new data

To sum up:

1. We have labelled training data
2. Choose a model to classify the data
3. Optimise the model parameters (learning) according to a loss function (e.g. prediction error)
4. Validate the model on a test set
5. Predict on new observations
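To make these five steps concrete, here is a minimal Python sketch using scikit-learn (the synthetic data and the choice of a linear classifier are illustrative assumptions, not taken from the slides):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # 1. Labelled training data: n = 20 observations, p = 2 variables, 2 states
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
    y = np.array([0] * 10 + [1] * 10)

    # 4. Keep a test set aside for validation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # 2-3. Choose a (linear) model and optimise its parameters on the training data
    model = LogisticRegression().fit(X_train, y_train)

    # 4. Validate the model on the test set
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # 5. Predict on a new observation
    print("most likely state of '?':", model.predict([[1.5, 1.5]]))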

Page 14:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Input Data

n = 20 observations, p = 2 variables (problem dimension), no label

[Scatter plot: unlabelled observations, Variable 1 vs. Variable 2]

Page 15:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Input Data

n = 20 observations, p = 2 variables (problem dimension), no label

Is it reasonably possible to distinguish several sub-groups of observations?

[Scatter plot: unlabelled observations, Variable 1 vs. Variable 2]

Page 16:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Distance between the observations

n = 20 observations, p = 2 variables (problem dimension), no label

Is it reasonably possible to distinguish several sub-groups of observations?

Distance between the observations: {dist(x_i, x_j)}_{i,j ∈ {1,…,n}²}

[Scatter plot: the distances dist(x_i, x_j), for j ∈ {1, …, i−1, i+1, …, n}, from a given observation x_i to all the others]

Page 17:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Distance between the observations

[Scatter plot: the same distances, now computed from the next observation x_{i+1}]

Page 18:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Learning

n = 20 observations, p = 2 variables (problem dimension), no label

Is it reasonably possible to distinguish several sub-groups of observations?

[Scatter plot: observations partitioned into Group 1, Group 2 and Group 3]

Distance between the observations: {dist(x_i, x_j)}_{i,j ∈ {1,…,n}²}

Energy to minimize w.r.t. the labels y_i:

f(y_1, …, y_n, {dist(x_i, x_j)}_{i,j ∈ {1,…,n}²})

e.g. f(…) = ∑_{i=1}^{n} ∑_{j=1}^{n} 𝕀(y_i = y_j) ‖x_i − x_j‖₂²
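As a quick illustration of this energy, a short numpy sketch that evaluates it for two candidate labellings (the toy data are an illustrative assumption):

    import numpy as np

    def clustering_energy(X, labels):
        """f(...) = sum over pairs sharing a label of ||x_i - x_j||_2^2."""
        diff = X[:, None, :] - X[None, :, :]       # pairwise differences, (n, n, p)
        sq_dists = (diff ** 2).sum(axis=-1)        # squared Euclidean distances
        same_label = labels[:, None] == labels[None, :]
        return float(sq_dists[same_label].sum())

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
    good = np.array([0, 0, 1, 1])  # compact groups -> low energy
    bad = np.array([0, 1, 0, 1])   # mixed groups   -> high energy
    print(clustering_energy(X, good), "<", clustering_energy(X, bad))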

Page 19:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Link with graph clustering

[Graph view: observations as nodes, strong links between nearby points; graph clustering yields Group 1, Group 2 and Group 3]

Page 20:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Learning

[Same clustering example as on Page 18]

To sum up:

1. We have training data with no known label
2. Choose a model to classify the data
3. Optimise the model parameters (learning) based on an internal energy criterion
4. Predict on new observations

Page 21:

2) Classic algorithms

Classic algorithms in Machine Learning

Page 22:

2.a) Classic algorithms — K-means

K-means algorithm

[Scatter plot: unlabelled observations, Variable 1 vs. Variable 2]

Page 23:

2.a) Classic algorithms — K-means

K-means algorithm

N seeds are randomly drawn (in this example N = 4).

[Scatter plot: 4 random seeds among the observations]

Page 24:

2.a) Classic algorithms — K-means

K-means algorithm

For each observation, we consider the nearest seed.

Remark: Euclidean distances are used in this example.

[Scatter plot: observations coloured by their nearest seed]

Page 25:

2.a) Classic algorithms — K-means

K-means algorithm

Seeds are then re-centered on their corresponding observations…

[Scatter plot: seeds moved to the centroids of their clusters]

Page 26:

2.a) Classic algorithms — K-means

K-means algorithm

… for each observation, we again consider the nearest seed …

[Scatter plot: observations re-assigned to the nearest updated seed]

Page 27:

2.a) Classic algorithms — K-means

K-means algorithm

… and we iterate until convergence.

[Scatter plot: final clusters after convergence]
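The loop sketched on the previous slides fits in a few lines of numpy. A minimal sketch of the K-means iteration, assuming Euclidean distances and that no cluster becomes empty (the random toy data are also an assumption):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Draw k seeds at random among the observations
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each observation goes to its nearest seed
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
            labels = dists.argmin(axis=1)
            # Update step: re-center each seed on the mean of its observations
            # (for simplicity we assume no cluster becomes empty)
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):  # converged
                break
            centers = new_centers
        return labels, centers

    X = np.random.default_rng(1).normal(size=(20, 2))
    labels, centers = kmeans(X, k=4)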

Page 28:

2.b) Classic algorithms — Decision trees

Decision trees

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

The domain is subdivided into sub-domains to minimize the variance in each sub-domain (CART).

[Scatter plot: labelled observations, Variable 1 vs. Variable 2]

Page 29:

2.b) Classic algorithms — Decision trees

Decision trees

The domain is subdivided into sub-domains to minimize the variance in each sub-domain (CART).

[Scatter plot: first split of the domain into D1 and D2]
[Tree diagram: All obs. → (var2 < 3: D1) / (var2 > 3: D2)]

Page 30:

2.b) Classic algorithms — Decision trees

Decision trees

[Scatter plot: the domain further split into D2, D3 and D4]
[Tree diagram: All obs. splits on var2 (var2 < 3 / var2 > 3); one branch is then split on var1 (var1 < 1: D3, var1 > 1: D4)]

Page 31:

2.b) Classic algorithms — Decision trees

Decision trees

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

The domain is recursively subdivided into sub-domains to minimize the variance in each sub-domain (CART).

[Tree diagram: the splitting continues recursively (var2 < 3 / var2 > 3, then var1 < 1 / var1 > 1, …)]
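A minimal scikit-learn sketch of such a CART fit (the toy data and the depth limit are illustrative assumptions); export_text prints the learned splits, mirroring the tree diagrams above:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 5, size=(20, 2))   # 2D observations
    y = np.where(X[:, 1] > 3, 1, -1)      # labels 1 / -1, here driven by var2

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["var1", "var2"]))  # the learned splits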

Page 32:

2.c) Classic algorithms — Random forests

Random forests

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

High dimension: p >> 1 (p = 2 in the former example)

Learning:
• Several trees are independently defined.
• Dimensions are randomly drawn.

Prediction:
• The label predicted at a given point is the one predicted by the majority of trees (bagging).
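A corresponding scikit-learn sketch for a random forest (toy data and parameter values are illustrative assumptions):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 5, size=(20, 2))
    y = np.where(X[:, 1] > 3, 1, -1)

    # 100 trees; at each split only a random subset of the p variables is considered
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=0).fit(X, y)
    print(forest.predict([[1.0, 4.0]]))  # majority vote over the trees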

Page 33:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM)

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

[Scatter plot: linearly separable labelled observations]

Page 34:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM)

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

We estimate w and b such that:

y_i (w · x_i − b) ≥ 1 for all 1 ≤ i ≤ n

[Scatter plot: separating hyperplane (w · x − b) = 0 with margin lines (w · x − b) = 1 and (w · x − b) = −1; w is the normal vector]

Page 35:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM)

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

We estimate w and b such that:

y_i (w · x_i − b) ≥ 1 for all 1 ≤ i ≤ n

Full classification with a linear model is now impossible!

[Scatter plot: overlapping classes, no perfect linear separation]

Page 36:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM)

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

We estimate w and b that minimize:

[ n⁻¹ ∑_{i=1}^{n} max( 0 , 1 − y_i (w · x_i − b) ) ] + λ ‖w‖₂²

where max( 0 , 1 − y_i (w · x_i − b) ) > 0 if y_i is not well predicted.

[Scatter plot: soft-margin linear separation; w is the normal vector]

Page 37:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM)

We estimate w and b that minimize:

[ n⁻¹ ∑_{i=1}^{n} max( 0 , 1 − y_i (w · x_i − b) ) ] + λ ‖w‖₂²

Remark: It is possible (and common) to use non-linear separations by replacing the inner products with non-linear relations.
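A minimal scikit-learn sketch of this soft-margin linear SVM (the toy data are an illustrative assumption; LinearSVC's parameter C plays roughly the role of 1/λ):

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
    y = np.array([-1] * 10 + [1] * 10)

    # Soft-margin linear SVM: hinge loss + L2 penalty on w
    # (large C = weak regularization, roughly C ~ 1/lambda)
    svm = LinearSVC(C=1.0, loss="hinge").fit(X, y)
    print(svm.coef_, svm.intercept_)   # w and b of the separating hyperplane
    print(svm.predict([[1.5, 1.5]]))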

Page 38:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM) — Kernel methods

How to handle this case using a linear model?

[Scatter plot: one class surrounded by the other, not linearly separable]

Page 39:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM) — Kernel methods

How to handle this case using a linear model?

We denote an observation x_i = (x_i¹, x_i²).

We classify the Φ(x_i) = (x_i¹, x_i², (x_i¹)²) instead of the x_i: in the lifted 3D space (Variable 1, Variable 2, Variable 3) the classes become linearly separable.

… there exists a huge literature on this topic.
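In practice one rarely builds Φ explicitly: kernels replace the inner products directly. A minimal sketch with an RBF-kernel SVM on ring-shaped toy data (the data are an illustrative assumption):

    import numpy as np
    from sklearn.svm import SVC

    # Toy "class inside a ring" data: not linearly separable in 2D
    rng = np.random.default_rng(0)
    inner = rng.normal(0, 0.5, (20, 2))
    angles = rng.uniform(0, 2 * np.pi, 20)
    outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])
    X = np.vstack([inner, outer])
    y = np.array([1] * 20 + [-1] * 20)

    # The RBF kernel implicitly replaces inner products with a non-linear relation
    svm = SVC(kernel="rbf").fit(X, y)
    print(svm.score(X, y))  # training accuracy; should be close to 1.0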

Page 40:

2.e) Classic algorithms — Logistic regression

Logistic regression

For each observation i ∈ {1, …, n}:
• Explanatory variable: x_i = (x_i¹, x_i², …, x_iᵖ)
• Response variable: y_i ∈ {−1, 1}

Ideally, y_i = 1 when b + ∑_{j=1}^{p} w_j x_iʲ > 0 and y_i = −1 when b + ∑_{j=1}^{p} w_j x_iʲ < 0.

Estimate (w = {w_1, …, w_p}, b) as the minimizer of the negative log-likelihood.

Remarks:
• Linear classification, as with a linear SVM, but a different model with strong statistical insights
• Scales particularly well when n and/or p is very high
• Modeling constraints on w is the gold standard when p >> n
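A minimal scikit-learn sketch of this classifier (toy data are an illustrative assumption; scikit-learn optimizes the penalized log-likelihood internally):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
    y = np.array([-1] * 10 + [1] * 10)

    clf = LogisticRegression().fit(X, y)
    print(clf.coef_, clf.intercept_)        # w and b
    print(clf.predict_proba([[1.5, 1.5]]))  # P(y = -1), P(y = 1)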

Page 41:

3) Overfitting and cross-validation

Overfitting and cross-validation

Page 42:

3.a) Overfitting and cross-validation — Overfitting

Overfitting

Training data: [scatter plot of labelled observations]

[Three panels: Full decision tree | Truncated decision tree | Linear SVM or Logistic Regression]

Page 43:

3.a) Overfitting and cross-validation — Overfitting

Overfitting

Training data: [three panels with the learned decision boundaries]

• Full decision tree: 100% accuracy
• Truncated decision tree: 95% accuracy
• Linear SVM or Logistic Regression: 95% accuracy

Page 44:

3.a) Overfitting and cross-validation — Overfitting

Overfitting

Training data: [three panels with the learned decision boundaries]

• Full decision tree: 100% accuracy
• Truncated decision tree: 95% accuracy
• Linear SVM or Logistic Regression: 95% accuracy

Which strategy would you trust most to predict the label of new observations?

Page 45:

3.b) Overfitting and cross-validation — Cross-validation

Cross-validation: the fundamental paradigm of Machine Learning to validate trained models

[All available data to learn decision rules]

Page 46:

3.b) Overfitting and cross-validation — Cross-validation

Split the data into training data and test data.

[Data split: Training data | Test data]

Page 47:

3.b) Overfitting and cross-validation — Cross-validation

Learn the model parameters on the training data.

[Training data | Test data]

93.75% accuracy in this example

Page 48:

3.b) Overfitting and cross-validation — Cross-validation

Evaluate the model quality, with no risk of overfitting, on the test data.

[Training data | Test data]

100% accuracy in this example

Page 49:

3.b) Overfitting and cross-validation — Cross-validation

K-folds: K tests, to be more robust and additionally evaluate the model stability.

[Four rotations of a 4-fold split; in each rotation one fold is used for testing and the three others for learning. Test accuracies: 100%, 100%, 100%, 80%]

95% average accuracy and stable parameters… good stuff
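A minimal scikit-learn sketch of this K-fold procedure (model and toy data are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
    y = np.array([-1] * 10 + [1] * 10)

    # 4 folds: each fold is used once as the test set, the rest for learning
    scores = cross_val_score(LogisticRegression(), X, y, cv=4)
    print(scores, "-> average accuracy:", scores.mean())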

Page 50:

3.b) Overfitting and cross-validation — Cross-validation

Leave-1-out: n tests (recommended when n is small).

[n rotations; each learns on all observations but one and tests on the left-out x_i (leave x1, leave x2, leave x3, …). In this example, all tests succeed except one.]
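The same idea with scikit-learn's LeaveOneOut splitter (again with illustrative toy data):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
    y = np.array([-1] * 10 + [1] * 10)

    # n tests: each observation is left out once and predicted by the rest
    scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
    print(f"{int(scores.sum())} successes out of {len(scores)} tests")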

Page 51:

4) High dimensionality

High dimensionality… regularization, model selection, and/or dimensionality reduction

Page 52:

4.a) High dimensionality — Example of problem

Project context:
• Observations = MRI images of the brain at different acquisition times (ADNI*)
• Labels = patient state (MCI/AD)
• Prediction of Alzheimer's disease depending on the morphological evolution of the hippocampus?

Initial data:
• [Baseline]: n = 103 patients are MCI
• [Baseline + 12 months]: 84 patients are MCI / 19 patients are AD

[MRI slices at Baseline and Baseline + 12 months, hippocampus highlighted]

* http://adni.bmap.ucla.edu/

Page 53:

4.a) High dimensionality — Example of problem

Treatment 1: Estimate deformations between [Baseline] and [Baseline + 12 months] by registration, for each subject in the MCI and AD groups [Ourselin et al., Im. Vis. Comp., 2001], [Vialard et al., IJCV, 2012]

Treatment 2: Transport the evolution markers onto a template/average shape.

For each of the n = 103 observations (patients):
• x_i: evolution marker on the template, p = 20000 points
• y_i: state AD or MCI

Questions:
• Is it possible to discriminate MCI and AD patients based on the shape evolution?
• How to learn the most discriminant markers?

Page 54:

4.a) High dimensionality — Example of problem

Logistic regression: predictive model that defines the probability of the y_i depending on the x_i:

y = F( X·w + b )

where:
X ∈ R^{n×p}: matrix of the n = 103 observations (subjects) in dimension p = 20000 (points)
y ∈ {∓1}^n: state (AD = −1 / MCI = 1)
(w, b) ∈ R^p × R: parameters to estimate

Log-likelihood optimization, with a regularization parameter (mandatory since p >> n).

Page 55:

4.a) High dimensionality — Example of problem

Log-likelihood optimization, with a regularization parameter (mandatory since p >> n).

Why regularization is mandatory when p > n, by analogy with linear systems:

n = 1 and p = 1: 2x = 3 → OK (unique solution)
n = 1 and p = 2: 2x1 + 3x2 = 3 → KO (under-determined)
n = 2 and p = 2: 2x1 + 3x2 = 3 ; 3x1 + 1x2 = 1 → OK
n = 2 and p = 4: 2x1 + 3x2 + 1x3 − x4 = 1 ; 5x1 − x2 + 2x3 + x4 = 1 → KO (under-determined)

Page 56:

4.b) High dimensionality — Effect of regularization

Optimization of w using: Lewis & Overton, Nonsmooth optimization via quasi-Newton methods, Math. Programming, 2012

Tested regularization models (standard definitions, with ∇w the spatial gradient of w over the template):

(1) Ridge: λ ‖w‖₂²
(2) LASSO: λ ‖w‖₁
(3) Elastic net: λ₁ ‖w‖₁ + λ₂ ‖w‖₂²
(4) Sobolev semi-norm: λ ‖∇w‖₂²
(5) Total Variation: λ ‖∇w‖₁
(6) Fused LASSO: λ₁ ‖w‖₁ + λ₂ ‖∇w‖₁
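For the ridge, LASSO and elastic net penalties, penalized logistic regression is available directly in scikit-learn; a minimal sketch on a synthetic p >> n problem (the data are an illustrative assumption; the Sobolev, TV and fused-LASSO penalties require dedicated solvers such as the quasi-Newton scheme cited above):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic p >> n problem: n = 40 subjects, p = 500 "points"
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 500))
    y = np.sign(X[:, :5].sum(axis=1))  # only the first 5 variables matter

    ridge = LogisticRegression(penalty="l2", C=1.0)                      # (1)
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)  # (2)
    enet = LogisticRegression(penalty="elasticnet", solver="saga",
                              l1_ratio=0.5, C=1.0, max_iter=5000)        # (3)

    for name, model in [("ridge", ridge), ("lasso", lasso), ("elastic net", enet)]:
        model.fit(X, y)
        n_active = int((model.coef_ != 0).sum())
        print(f"{name}: {n_active} non-zero coefficients out of 500")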

Page 57:

4.b) High dimensionality — Effect of regularization (and model selection)

[Maps of w on a slice of the hippocampus, for three values of λ and each regularization model: (1) Ridge, (2) LASSO, (3) Elastic net, (4) Sobolev semi-norm, (5) Total Variation, (6) Fused LASSO]

• Blue and red: strong local influence
• Green: little or no local influence

Page 58:

4.b) High dimensionality — Effect of regularization

Results obtained using a cross-validation method (here leave-10%-out):
• Spec + Sens = 2: good prediction in 100% of the cases
• Spec + Sens = 1: coin flipping (heads or tails) has the same predictive power
• Spec + Sens = 0: good prediction in 0% of the cases

Best results are obtained using a regularization pertinent with regard to the data:
• The spatial distribution is taken into account
• Clear transitions are allowed

[Fiot J.B. et al., NeuroImage: Clinical, 2012]

Page 59:

4.c) High dimensionality — Dimensionality reduction using SVD

National records (in seconds) of p = 9 athletic events for n = 26 countries:

    Country           100m   200m   400m    800m   1500m   5000m   10000m  SemiMarathon  Marathon
    Australie         9.93  20.06  44.38  104.40  211.96  775.76  1649.73          3602      7671
    Belgique         10.02  20.19  44.78  103.86  214.13  769.71  1612.30          3605      7640
    Brésil           10.00  19.89  44.29  101.77  213.25  799.43  1648.12          3573      7565
    RoyaumeUni        9.87  19.87  44.36  101.73  209.67  780.41  1638.14          3609      7633
    Canada            9.84  20.17  44.44  103.68  211.71  793.96  1656.01          3650      7809
    Chine            10.17  20.54  45.25  106.44  216.49  805.14  1670.00          3635      7695
    Croatie          10.25  20.76  45.64  104.07  213.30  817.76  1704.32          3827      8225
    Ethiopie         10.50  21.08  45.89  106.08  211.13  757.35  1577.53          3535      7439
    France            9.99  20.16  44.46  103.15  208.98  778.83  1642.78          3658      7596
    Allemagne        10.06  20.20  44.33  103.65  211.58  774.70  1641.53          3634      7727
    Inde             10.30  20.73  45.48  105.77  218.00  809.70  1682.89          3672      7920
    Iran             10.29  21.11  46.37  104.74  218.80  833.40  1762.65          4103      8903
    Italie           10.01  19.72  45.19  103.17  212.78  785.59  1636.50          3620      7642
    Jamaïque          9.58  19.19  44.49  105.21  219.19  813.10  1712.44          3816      8199
    Japon            10.00  20.03  44.78  106.18  217.42  793.20  1655.09          3625      7576
    Kenya            10.26  20.43  44.18  102.01  206.34  759.74  1587.85          3513      7467
    Lituanie         10.33  20.88  45.73  106.64  220.90  797.90  1651.50          3851      7955
    NouvelleZélande  10.11  20.42  46.09  104.30  212.17  790.19  1661.95          3732      7815
    Portugal          9.86  20.01  46.11  104.91  210.07  782.86  1632.47          3665      7596
    Russie           10.10  20.23  44.60  102.47  212.28  791.99  1673.12          3675      7747
    AfriqueduSud     10.06  20.11  44.59  102.69  213.56  794.16  1649.94          3678      7593
    Espagne          10.14  20.59  44.96  103.83  208.95  782.54  1634.44          3592      7562
    Suède            10.18  20.30  44.56  105.54  216.49  797.59  1675.74          3655      7838
    Suisse           10.16  20.41  44.99  102.55  211.75  787.54  1673.16          3686      7643
    Ukraine          10.07  20.00  45.11  105.08  210.33  790.78  1679.80          3711      7635
    USA               9.69  19.32  43.18  102.60  209.30  776.27  1633.98          3583      7538

How to establish a general ranking between these countries?

Weighted sum of the scores, then ranking of these sums.

Page 60:

4.c) High dimensionality — Dimensionality reduction using SVD

A weighted sum of the scores is equivalent to a matrix × vector multiplication:

Vector containing the scores = M · w

(M is the n × p table of national records shown on Page 59.)

Page 61:

4.c) High dimensionality — Dimensionality reduction using SVD

One can also look for a vector of norm 1 that maximizes the variability between the scores:

Optimal vector = 1st eigenvector (v1) of the SVD
Variability level = 1st eigenvalue (λ1) of the SVD

Vector of scores with the highest variability = M · v1

Page 62:

4.c) High dimensionality — Dimensionality reduction using SVD

One can then search for the vector of norm 1, orthogonal to v1, that maximizes the variability:

Optimal vector = 2nd eigenvector (v2) of the PCA
Variability level = 2nd eigenvalue (λ2) of the PCA

… and so on. These vectors can be calculated analytically.

Page 63:

4.c) High dimensionality — Dimensionality reduction using SVD

[Biplot. Black: projection of the data on PC2 and PC3. Red: influence of the variables in PC2 and PC3.]

A fantastic tool to visualize and interpret high dimensional data …

Page 64:

4.c) High dimensionality — Dimensionality reduction using SVD

[Scree plot of the eigenvalues λ1, λ2, …, λ8: captured variability per principal component]

[Table sketch: each country's nine records replaced by its three projections proj.PC1, proj.PC2, proj.PC3]

Projection of the data from a 9D space to a 3D space preserves here 90% of their variability!

… and a powerful tool to reduce the problem dimensionality before training a M.L. model.
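A minimal numpy sketch of the whole procedure on an n × p matrix M (a random matrix stands in for the table of records; centering the columns first is the usual convention):

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.normal(size=(26, 9))   # stand-in for the n=26 x p=9 records table

    # Center each column (event), then take the SVD
    Mc = M - M.mean(axis=0)
    U, s, Vt = np.linalg.svd(Mc, full_matrices=False)

    # v1 = first right singular vector: the norm-1 direction of maximal variability
    v1 = Vt[0]
    scores = Mc @ v1               # scores with the highest variability
    ranking = np.argsort(scores)   # general ranking of the countries

    # Captured variability per component (the scree plot values)
    explained = s**2 / (s**2).sum()
    print("first 3 components capture", explained[:3].sum())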

Page 65:

5) Supervised learning using Neural Networks

Supervised learning using Neural Networks

Page 66:

5) Supervised learning using Neural Networks

Deep learning …

[Network diagram: known inputs → learned parameters → predicted outputs; figure: https://pythonprogramming.net/neural-networks-machine-learning-tutorial/]

• Very efficient in important applications (signal, images).
• Computationally heavy learning phase, but quick predictions.
• Very large number of parameters to learn.
• Requires large databases of annotated data (or wise network designs).

Page 67:

5) Supervised learning using Neural Networks — User point of view

The classifier (black-box) takes I = an RGB image of size 200×200 (of a dog or a cat) and returns two predictions, h1(I) and h2(I):
• If dog: h1(I) == 0
• If cat: h1(I) == 1
• If nice: h2(I) == 0
• If aggressive: h2(I) == 1

Page 68:

5) Supervised learning using Neural Networks — User point of view

[Example image: the black-box predicts h1(I) = 1 and h2(I) = 0, i.e. a nice cat]

Page 69:

5) Supervised learning using Neural Networks — User point of view

[Example image: the black-box predicts h1(I) = 1 and h2(I) = 1, i.e. an aggressive cat]

Page 70:

5) Supervised learning using Neural Networks — User point of view

[Example image: the black-box predicts h1(I) = 0 and h2(I) = 0, i.e. a nice dog]

Page 71:

5) Supervised learning using Neural Networks — User point of view

Training phase: the parameters are optimized to get the best predictions on average.

Input training data: many images of dogs and cats
Output training data: the labels (h1, h2) of each image

Page 72:

5) Supervised learning using Neural Networks — Into the black-box

[Network diagram: input layer, hidden layers, output layer]

The x_i are typically the intensities of an RGB image I in each of its channels; the output layer carries the predicted labels.

Pages 73–76:

5) Supervised learning using Neural Networks — Into the black-box

[Network diagram built up layer by layer: Layer 1, Layer 2, Layer 3, …, Layer L]

Page 77:

5) Supervised learning using Neural Networks — Into the black-box

Training: minimize the expectation of the prediction error (≈ its average over the K training observations) between the predicted labels and the known labels.

[Network diagram: Layer 1 … Layer L; predicted labels compared to known labels]

Page 78:

5) Supervised learning using Neural Networks — Into the black-box

Stochastic gradient descent, where the gradient of the loss with respect to the layer-l parameters is:
• calculated on a subsample (batch) of the K observations at each iteration
• calculated analytically if l = L−1
• back-propagated if l < L−1

Page 79:

5) Supervised learning using Neural Networks — Into the black-box

In practice:
• Various types of layers
• Various types of architectures
• Various strategies to perform the stochastic gradient descent

Hidden but important properties of Neural Networks:
• Prediction and training can be straightforwardly parallelized on GPUs
• The Nvidia cuDNN library is massively used by Keras, TensorFlow, Theano, PyTorch, …
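A minimal PyTorch sketch of the training loop just described: a small fully-connected network, a batch drawn at each iteration, back-propagation, and one SGD step (architecture and toy data are illustrative assumptions):

    import torch
    from torch import nn

    # Tiny multi-layer perceptron: input layer -> 2 hidden layers -> output layer
    model = nn.Sequential(
        nn.Linear(2, 16), nn.ReLU(),   # Layer 1
        nn.Linear(16, 16), nn.ReLU(),  # Layer 2
        nn.Linear(16, 1),              # Layer L: one logit per observation
    )

    # Toy training data: K = 200 observations, 2 variables, binary labels
    torch.manual_seed(0)
    X = torch.randn(200, 2)
    y = (X.sum(dim=1) > 0).float().unsqueeze(1)

    loss_fn = nn.BCEWithLogitsLoss()   # average prediction error
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(500):
        idx = torch.randint(0, 200, (32,))     # random batch of the K observations
        opt.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])  # predicted vs. known labels
        loss.backward()                        # back-propagation through the layers
        opt.step()                             # stochastic gradient descent update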

Page 80:

That's all for now.

THANK YOU!!!