
Page 1:

Introduction to Machine Learning

Laurent Risser
Institut de Mathématiques de Toulouse
[email protected]

Page 2:

0) From Statistics to Machine Learning

As a starter: From Statistics to Machine Learning

Page 3:

0) From Statistics to Machine Learning

1940–70: Classic Statistics dealing with tested hypotheses (e.g. are the students in class A significantly taller than the students in class B?). There are n ≈ 30 observations and p < 10 variables.

1970s: The use of computers becomes increasingly popular. Larger volumes of data are explored in Statistics. Expert Systems also start making automatic decisions based on rules injected by human experts (e.g. if blood pressure < threshold and spots on the skin, then prescribe a specific medication).

1980s: Expert Systems are made obsolete by Machine Learning and Neural Networks. Decision rules are automatically defined based on observed data.

1990s: 1st paradigm shift: Observed data are no longer planned but stored in databases and then re-used: from Data Mining to Knowledge Discovery.

2000s: 2nd paradigm shift: The number of variables p becomes increasingly large (typically in omic data, where p >> n). Prediction quality matters more than model interpretability (black-box models). The curse of dimensionality makes regularization important.

2010s: 3rd paradigm shift: The number of observations n is now increasingly large (e-commerce, geo-localisation, …). Databases are structured in clouds and computations run on clusters (big data). Decisions also have to be almost immediate.

Current challenges:
… fast and robust interpretation of videos (autonomous vehicles)
… explainability of black-box decision rules (social issues and certifiability)
… complex data (small data)

Course of P. Besse (INSA Toulouse) on Statistical Learning: http://www.math.univ-toulouse.fr/~besse/enseignement.html


Page 7:

Talk overview

• From statistics to machine learning
• Introductory examples
  • Supervised learning
  • Unsupervised learning
• Classic algorithms in machine learning
  • K-means
  • Decision trees and Random forests
  • SVM
  • Logistic regression
• Overfitting and cross-validation
  • Overfitting
  • Cross-validation
• High dimensionality and regularization
  • Modeling a real-life problem
  • Effect of regularization
  • Dimensionality reduction using PCA
• Supervised learning using Neural networks
• Conclusion

Page 8:

1) Introductory examples

Two introductory examples

Page 9:

1.a) Introductory examples — Supervised learning

Supervised learning — Training Data

n = 20 observations, p = 2 variables (problem dimension), label with 2 states

[Scatter plot: labelled observations, Variable 1 vs. Variable 2]

Page 10:

1.a) Introductory examples — Supervised learning

Supervised learning — Training Data

Example: Variable 1 = Age, Variable 2 = Monthly income, State = Buys a product at Christmas

n = 20 observations, p = 2 variables (problem dimension), label with 2 states

[Scatter plot: labelled observations, Variable 1 vs. Variable 2]

Page 11:

1.a) Introductory examples — Supervised learning

Supervised learning — Prediction

n = 20 observations, p = 2 variables (problem dimension), label with 2 states

[Scatter plot: a new, unlabelled observation "?" among the labelled points]

Most likely state of "?"

Page 12:

1.a) Introductory examples — Supervised learning

Supervised learning — Prediction

Supervised learning (here using a linear model), then prediction on new data

n = 20 observations, p = 2 variables (problem dimension), label with 2 states

[Scatter plot: linear decision boundary separating the two labels; most likely state of "?"]

Page 13:

1.a) Introductory examples — Supervised learning

Supervised learning (here using a linear model), then prediction on new data

To sum up:

1. We have labelled training data
2. Choose a model to classify the data
3. Optimise the model parameters (learning) according to a loss function (e.g. prediction error)
4. Validate the model on a test set
5. Predict on new observations
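To make these five steps concrete, here is a minimal Python sketch using scikit-learn (the synthetic data and the choice of a linear classifier are illustrative assumptions, not taken from the slides):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # 1. Labelled training data: n = 20 observations, p = 2 variables, 2 states
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
    y = np.array([0] * 10 + [1] * 10)

    # 4. Keep a test set aside for validation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # 2-3. Choose a (linear) model and optimise its parameters on the training data
    model = LogisticRegression().fit(X_train, y_train)

    # 4. Validate the model on the test set
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # 5. Predict on a new observation
    print("most likely state of '?':", model.predict([[1.5, 1.5]]))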

Page 14:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Input Data

n = 20 observations, p = 2 variables (problem dimension), no label

[Scatter plot: unlabelled observations, Variable 1 vs. Variable 2]

Page 15:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Input Data

n = 20 observations, p = 2 variables (problem dimension), no label

Is it reasonably possible to distinguish several sub-groups of observations?

[Scatter plot: unlabelled observations, Variable 1 vs. Variable 2]

Page 16:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Distance between the observations

n = 20 observations, p = 2 variables (problem dimension), no label

Is it reasonably possible to distinguish several sub-groups of observations?

Distance between the observations: {dist(x_i, x_j)}_{i,j ∈ {1,…,n}²}

[Scatter plot: the distances dist(x_i, x_j), for j ∈ {1, …, i−1, i+1, …, n}, from a given observation x_i to all the others]

Page 17:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Distance between the observations

[Scatter plot: the same distances, now computed from the next observation x_{i+1}]

Page 18:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Learning

n = 20 observations, p = 2 variables (problem dimension), no label

Is it reasonably possible to distinguish several sub-groups of observations?

[Scatter plot: observations partitioned into Group 1, Group 2 and Group 3]

Distance between the observations: {dist(x_i, x_j)}_{i,j ∈ {1,…,n}²}

Energy to minimize w.r.t. the labels y_i:

f(y_1, …, y_n, {dist(x_i, x_j)}_{i,j ∈ {1,…,n}²})

e.g. f(…) = ∑_{i=1}^{n} ∑_{j=1}^{n} 𝕀(y_i = y_j) ‖x_i − x_j‖₂²
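As a quick illustration of this energy, a short numpy sketch that evaluates it for two candidate labellings (the toy data are an illustrative assumption):

    import numpy as np

    def clustering_energy(X, labels):
        """f(...) = sum over pairs sharing a label of ||x_i - x_j||_2^2."""
        diff = X[:, None, :] - X[None, :, :]       # pairwise differences, (n, n, p)
        sq_dists = (diff ** 2).sum(axis=-1)        # squared Euclidean distances
        same_label = labels[:, None] == labels[None, :]
        return float(sq_dists[same_label].sum())

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
    good = np.array([0, 0, 1, 1])  # compact groups -> low energy
    bad = np.array([0, 1, 0, 1])   # mixed groups   -> high energy
    print(clustering_energy(X, good), "<", clustering_energy(X, bad))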

Page 19:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Link with graph clustering

[Graph view: observations as nodes, strong links between nearby points; graph clustering yields Group 1, Group 2 and Group 3]

Page 20:

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Learning

[Same clustering example as on Page 18]

To sum up:

1. We have training data with no known label
2. Choose a model to classify the data
3. Optimise the model parameters (learning) based on an internal energy criterion
4. Predict on new observations

Page 21:

2) Classic algorithms

Classic algorithms in Machine Learning

Page 22:

2.a) Classic algorithms — K-means

K-means algorithm

[Scatter plot: unlabelled observations, Variable 1 vs. Variable 2]

Page 23:

2.a) Classic algorithms — K-means

K-means algorithm

N seeds are randomly drawn (in this example N = 4).

[Scatter plot: 4 random seeds among the observations]

Page 24:

2.a) Classic algorithms — K-means

K-means algorithm

For each observation, we consider the nearest seed.

Remark: Euclidean distances are used in this example.

[Scatter plot: observations coloured by their nearest seed]

Page 25:

2.a) Classic algorithms — K-means

K-means algorithm

Seeds are then re-centered on their corresponding observations…

[Scatter plot: seeds moved to the centroids of their clusters]

Page 26:

2.a) Classic algorithms — K-means

K-means algorithm

… for each observation, we again consider the nearest seed …

[Scatter plot: observations re-assigned to the nearest updated seed]

Page 27:

2.a) Classic algorithms — K-means

K-means algorithm

… and we iterate until convergence.

[Scatter plot: final clusters after convergence]
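The loop sketched on the previous slides fits in a few lines of numpy. A minimal sketch of the K-means iteration, assuming Euclidean distances and that no cluster becomes empty (the random toy data are also an assumption):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Draw k seeds at random among the observations
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each observation goes to its nearest seed
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
            labels = dists.argmin(axis=1)
            # Update step: re-center each seed on the mean of its observations
            # (for simplicity we assume no cluster becomes empty)
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):  # converged
                break
            centers = new_centers
        return labels, centers

    X = np.random.default_rng(1).normal(size=(20, 2))
    labels, centers = kmeans(X, k=4)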

Page 28:

2.b) Classic algorithms — Decision trees

Decision trees

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

The domain is subdivided into sub-domains to minimize the variance in each sub-domain (CART).

[Scatter plot: labelled observations, Variable 1 vs. Variable 2]

Page 29:

2.b) Classic algorithms — Decision trees

Decision trees

The domain is subdivided into sub-domains to minimize the variance in each sub-domain (CART).

[Scatter plot: first split of the domain into D1 and D2]
[Tree diagram: All obs. → (var2 < 3: D1) / (var2 > 3: D2)]

Page 30:

2.b) Classic algorithms — Decision trees

Decision trees

[Scatter plot: the domain further split into D2, D3 and D4]
[Tree diagram: All obs. splits on var2 (var2 < 3 / var2 > 3); one branch is then split on var1 (var1 < 1: D3, var1 > 1: D4)]

Page 31:

2.b) Classic algorithms — Decision trees

Decision trees

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

The domain is recursively subdivided into sub-domains to minimize the variance in each sub-domain (CART).

[Tree diagram: the splitting continues recursively (var2 < 3 / var2 > 3, then var1 < 1 / var1 > 1, …)]
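A minimal scikit-learn sketch of such a CART fit (the toy data and the depth limit are illustrative assumptions); export_text prints the learned splits, mirroring the tree diagrams above:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 5, size=(20, 2))   # 2D observations
    y = np.where(X[:, 1] > 3, 1, -1)      # labels 1 / -1, here driven by var2

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["var1", "var2"]))  # the learned splits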

Page 32:

2.c) Classic algorithms — Random forests

Random forests

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

High dimension: p >> 1 (p = 2 in the former example)

Learning:
• Several trees are independently defined.
• Dimensions are randomly drawn.

Prediction:
• The label predicted at a given point is the one predicted by the majority of trees (bagging).
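A corresponding scikit-learn sketch for a random forest (toy data and parameter values are illustrative assumptions):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 5, size=(20, 2))
    y = np.where(X[:, 1] > 3, 1, -1)

    # 100 trees; at each split only a random subset of the p variables is considered
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=0).fit(X, y)
    print(forest.predict([[1.0, 4.0]]))  # majority vote over the trees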

Page 33:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM)

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

[Scatter plot: linearly separable labelled observations]

Page 34:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM)

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

We estimate w and b such that:

y_i (w · x_i − b) ≥ 1 for all 1 ≤ i ≤ n

[Scatter plot: separating hyperplane (w · x − b) = 0 with margin lines (w · x − b) = 1 and (w · x − b) = −1; w is the normal vector]

Page 35:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM)

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

We estimate w and b such that:

y_i (w · x_i − b) ≥ 1 for all 1 ≤ i ≤ n

Full classification with a linear model is now impossible!

[Scatter plot: overlapping classes, no perfect linear separation]

Page 36:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM)

x_1, x_2, …, x_N are the observations (here: x_i is a 2D point coordinate)
y_1, y_2, …, y_N are the labels (here: 1 and −1)

We estimate w and b that minimize:

[ n⁻¹ ∑_{i=1}^{n} max( 0 , 1 − y_i (w · x_i − b) ) ] + λ ‖w‖₂²

where max( 0 , 1 − y_i (w · x_i − b) ) > 0 if y_i is not well predicted.

[Scatter plot: soft-margin linear separation; w is the normal vector]

Page 37:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM)

We estimate w and b that minimize:

[ n⁻¹ ∑_{i=1}^{n} max( 0 , 1 − y_i (w · x_i − b) ) ] + λ ‖w‖₂²

Remark: It is possible (and common) to use non-linear separations by replacing the inner products with non-linear relations.
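A minimal scikit-learn sketch of this soft-margin linear SVM (the toy data are an illustrative assumption; LinearSVC's parameter C plays roughly the role of 1/λ):

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
    y = np.array([-1] * 10 + [1] * 10)

    # Soft-margin linear SVM: hinge loss + L2 penalty on w
    # (large C = weak regularization, roughly C ~ 1/lambda)
    svm = LinearSVC(C=1.0, loss="hinge").fit(X, y)
    print(svm.coef_, svm.intercept_)   # w and b of the separating hyperplane
    print(svm.predict([[1.5, 1.5]]))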

Page 38:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM) — Kernel methods

How to handle this case using a linear model?

[Scatter plot: one class surrounded by the other, not linearly separable]

Page 39:

2.d) Classic algorithms — Support Vector Machine (SVM)

Support Vector Machine (SVM) — Kernel methods

How to handle this case using a linear model?

We denote an observation x_i = (x_i¹, x_i²).

We classify the Φ(x_i) = (x_i¹, x_i², (x_i¹)²) instead of the x_i: in the lifted 3D space (Variable 1, Variable 2, Variable 3) the classes become linearly separable.

… there exists a huge literature on this topic.
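In practice one rarely builds Φ explicitly: kernels replace the inner products directly. A minimal sketch with an RBF-kernel SVM on ring-shaped toy data (the data are an illustrative assumption):

    import numpy as np
    from sklearn.svm import SVC

    # Toy "class inside a ring" data: not linearly separable in 2D
    rng = np.random.default_rng(0)
    inner = rng.normal(0, 0.5, (20, 2))
    angles = rng.uniform(0, 2 * np.pi, 20)
    outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])
    X = np.vstack([inner, outer])
    y = np.array([1] * 20 + [-1] * 20)

    # The RBF kernel implicitly replaces inner products with a non-linear relation
    svm = SVC(kernel="rbf").fit(X, y)
    print(svm.score(X, y))  # training accuracy; should be close to 1.0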

Page 40:

2.e) Classic algorithms — Logistic regression

Logistic regression

For each observation i ∈ {1, …, n}:
• Explanatory variable: x_i = (x_i¹, x_i², …, x_iᵖ)
• Response variable: y_i ∈ {−1, 1}

Ideally, y_i = 1 when b + ∑_{j=1}^{p} w_j x_iʲ > 0 and y_i = −1 when b + ∑_{j=1}^{p} w_j x_iʲ < 0.

Estimate (w = {w_1, …, w_p}, b) as the minimizer of the negative log-likelihood.

Remarks:
• Linear classification, as with a linear SVM, but a different model with strong statistical insights
• Scales particularly well when n and/or p is very high
• Modeling constraints on w is the gold standard when p >> n
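A minimal scikit-learn sketch of this classifier (toy data are an illustrative assumption; scikit-learn optimizes the penalized log-likelihood internally):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
    y = np.array([-1] * 10 + [1] * 10)

    clf = LogisticRegression().fit(X, y)
    print(clf.coef_, clf.intercept_)        # w and b
    print(clf.predict_proba([[1.5, 1.5]]))  # P(y = -1), P(y = 1)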

Page 41:

3) Overfitting and cross-validation

Overfitting and cross-validation

Page 42:

3.a) Overfitting and cross-validation — Overfitting

Overfitting

Training data: [scatter plot of labelled observations]

[Three panels: Full decision tree | Truncated decision tree | Linear SVM or Logistic Regression]

Page 43:

3.a) Overfitting and cross-validation — Overfitting

Overfitting

Training data: [three panels with the learned decision boundaries]

• Full decision tree: 100% accuracy
• Truncated decision tree: 95% accuracy
• Linear SVM or Logistic Regression: 95% accuracy

Page 44:

3.a) Overfitting and cross-validation — Overfitting

Overfitting

Training data: [three panels with the learned decision boundaries]

• Full decision tree: 100% accuracy
• Truncated decision tree: 95% accuracy
• Linear SVM or Logistic Regression: 95% accuracy

Which strategy would you trust most to predict the label of new observations?

Page 45:

3.b) Overfitting and cross-validation — Cross-validation

Cross-validation: the fundamental paradigm of Machine Learning to validate trained models

[All available data to learn decision rules]

Page 46:

3.b) Overfitting and cross-validation — Cross-validation

Split the data into training data and test data.

[Data split: Training data | Test data]

Page 47:

3.b) Overfitting and cross-validation — Cross-validation

Learn the model parameters on the training data.

[Training data | Test data]

93.75% accuracy in this example

Page 48:

3.b) Overfitting and cross-validation — Cross-validation

Evaluate the model quality, with no risk of overfitting, on the test data.

[Training data | Test data]

100% accuracy in this example

Page 49:

3.b) Overfitting and cross-validation — Cross-validation

K-folds: K tests, to be more robust and additionally evaluate the model stability.

[Four rotations of a 4-fold split; in each rotation one fold is used for testing and the three others for learning. Test accuracies: 100%, 100%, 100%, 80%]

95% average accuracy and stable parameters… good stuff
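A minimal scikit-learn sketch of this K-fold procedure (model and toy data are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
    y = np.array([-1] * 10 + [1] * 10)

    # 4 folds: each fold is used once as the test set, the rest for learning
    scores = cross_val_score(LogisticRegression(), X, y, cv=4)
    print(scores, "-> average accuracy:", scores.mean())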

Page 50:

3.b) Overfitting and cross-validation — Cross-validation

Leave-1-out: n tests (recommended when n is small).

[n rotations; each learns on all observations but one and tests on the left-out x_i (leave x1, leave x2, leave x3, …). In this example, all tests succeed except one.]
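The same idea with scikit-learn's LeaveOneOut splitter (again with illustrative toy data):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
    y = np.array([-1] * 10 + [1] * 10)

    # n tests: each observation is left out once and predicted by the rest
    scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
    print(f"{int(scores.sum())} successes out of {len(scores)} tests")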

Page 51:

4) High dimensionality

High dimensionality… regularization, model selection, and/or dimensionality reduction

Page 52:

4.a) High dimensionality — Example of problem

Project context:
• Observations = MRI images of the brain at different acquisition times (ADNI*)
• Labels = patient state (MCI/AD)
• Prediction of Alzheimer's disease depending on the morphological evolution of the hippocampus?

Initial data:
• [Baseline]: n = 103 patients are MCI
• [Baseline + 12 months]: 84 patients are MCI / 19 patients are AD

[MRI slices at Baseline and Baseline + 12 months, hippocampus highlighted]

* http://adni.bmap.ucla.edu/

Page 53:

4.a) High dimensionality — Example of problem

Treatment 1: Estimate deformations between [Baseline] and [Baseline + 12 months] by registration, for each subject in the MCI and AD groups [Ourselin et al., Im. Vis. Comp., 2001], [Vialard et al., IJCV, 2012]

Treatment 2: Transport the evolution markers onto a template/average shape.

For each of the n = 103 observations (patients):
• x_i: evolution marker on the template, p = 20000 points
• y_i: state AD or MCI

Questions:
• Is it possible to discriminate MCI and AD patients based on the shape evolution?
• How to learn the most discriminant markers?

Page 54:

4.a) High dimensionality — Example of problem

Logistic regression: predictive model that defines the probability of the y_i depending on the x_i:

y = F( X·w + b )

where:
X ∈ R^{n×p}: matrix of the n = 103 observations (subjects) in dimension p = 20000 (points)
y ∈ {∓1}^n: state (AD = −1 / MCI = 1)
(w, b) ∈ R^p × R: parameters to estimate

Log-likelihood optimization, with a regularization parameter (mandatory since p >> n).

Page 55:

4.a) High dimensionality — Example of problem

Log-likelihood optimization, with a regularization parameter (mandatory since p >> n).

Why regularization is mandatory when p > n, by analogy with linear systems:

n = 1 and p = 1: 2x = 3 → OK (unique solution)
n = 1 and p = 2: 2x1 + 3x2 = 3 → KO (under-determined)
n = 2 and p = 2: 2x1 + 3x2 = 3 ; 3x1 + 1x2 = 1 → OK
n = 2 and p = 4: 2x1 + 3x2 + 1x3 − x4 = 1 ; 5x1 − x2 + 2x3 + x4 = 1 → KO (under-determined)

Page 56:

4.b) High dimensionality — Effect of regularization

Optimization of w using: Lewis & Overton, Nonsmooth optimization via quasi-Newton methods, Math. Programming, 2012

Tested regularization models (standard definitions, with ∇w the spatial gradient of w over the template):

(1) Ridge: λ ‖w‖₂²
(2) LASSO: λ ‖w‖₁
(3) Elastic net: λ₁ ‖w‖₁ + λ₂ ‖w‖₂²
(4) Sobolev semi-norm: λ ‖∇w‖₂²
(5) Total Variation: λ ‖∇w‖₁
(6) Fused LASSO: λ₁ ‖w‖₁ + λ₂ ‖∇w‖₁
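For the ridge, LASSO and elastic net penalties, penalized logistic regression is available directly in scikit-learn; a minimal sketch on a synthetic p >> n problem (the data are an illustrative assumption; the Sobolev, TV and fused-LASSO penalties require dedicated solvers such as the quasi-Newton scheme cited above):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic p >> n problem: n = 40 subjects, p = 500 "points"
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 500))
    y = np.sign(X[:, :5].sum(axis=1))  # only the first 5 variables matter

    ridge = LogisticRegression(penalty="l2", C=1.0)                      # (1)
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)  # (2)
    enet = LogisticRegression(penalty="elasticnet", solver="saga",
                              l1_ratio=0.5, C=1.0, max_iter=5000)        # (3)

    for name, model in [("ridge", ridge), ("lasso", lasso), ("elastic net", enet)]:
        model.fit(X, y)
        n_active = int((model.coef_ != 0).sum())
        print(f"{name}: {n_active} non-zero coefficients out of 500")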

Page 57:

4.b) High dimensionality — Effect of regularization (and model selection)

[Maps of w on a slice of the hippocampus, for three values of λ and each regularization model: (1) Ridge, (2) LASSO, (3) Elastic net, (4) Sobolev semi-norm, (5) Total Variation, (6) Fused LASSO]

• Blue and red: strong local influence
• Green: little or no local influence

Page 58:

4.b) High dimensionality — Effect of regularization

Results obtained using a cross-validation method (here leave-10%-out):
• Spec + Sens = 2: good prediction in 100% of the cases
• Spec + Sens = 1: coin flipping (heads or tails) has the same predictive power
• Spec + Sens = 0: good prediction in 0% of the cases

Best results are obtained using a regularization pertinent with regard to the data:
• The spatial distribution is taken into account
• Clear transitions are allowed

[Fiot J.B. et al., NeuroImage: Clinical, 2012]

Page 59:

4.c) High dimensionality — Dimensionality reduction using SVD

National records (in seconds) of p = 9 athletic events for n = 26 countries:

    Country           100m   200m   400m    800m   1500m   5000m   10000m  SemiMarathon  Marathon
    Australie         9.93  20.06  44.38  104.40  211.96  775.76  1649.73          3602      7671
    Belgique         10.02  20.19  44.78  103.86  214.13  769.71  1612.30          3605      7640
    Brésil           10.00  19.89  44.29  101.77  213.25  799.43  1648.12          3573      7565
    RoyaumeUni        9.87  19.87  44.36  101.73  209.67  780.41  1638.14          3609      7633
    Canada            9.84  20.17  44.44  103.68  211.71  793.96  1656.01          3650      7809
    Chine            10.17  20.54  45.25  106.44  216.49  805.14  1670.00          3635      7695
    Croatie          10.25  20.76  45.64  104.07  213.30  817.76  1704.32          3827      8225
    Ethiopie         10.50  21.08  45.89  106.08  211.13  757.35  1577.53          3535      7439
    France            9.99  20.16  44.46  103.15  208.98  778.83  1642.78          3658      7596
    Allemagne        10.06  20.20  44.33  103.65  211.58  774.70  1641.53          3634      7727
    Inde             10.30  20.73  45.48  105.77  218.00  809.70  1682.89          3672      7920
    Iran             10.29  21.11  46.37  104.74  218.80  833.40  1762.65          4103      8903
    Italie           10.01  19.72  45.19  103.17  212.78  785.59  1636.50          3620      7642
    Jamaïque          9.58  19.19  44.49  105.21  219.19  813.10  1712.44          3816      8199
    Japon            10.00  20.03  44.78  106.18  217.42  793.20  1655.09          3625      7576
    Kenya            10.26  20.43  44.18  102.01  206.34  759.74  1587.85          3513      7467
    Lituanie         10.33  20.88  45.73  106.64  220.90  797.90  1651.50          3851      7955
    NouvelleZélande  10.11  20.42  46.09  104.30  212.17  790.19  1661.95          3732      7815
    Portugal          9.86  20.01  46.11  104.91  210.07  782.86  1632.47          3665      7596
    Russie           10.10  20.23  44.60  102.47  212.28  791.99  1673.12          3675      7747
    AfriqueduSud     10.06  20.11  44.59  102.69  213.56  794.16  1649.94          3678      7593
    Espagne          10.14  20.59  44.96  103.83  208.95  782.54  1634.44          3592      7562
    Suède            10.18  20.30  44.56  105.54  216.49  797.59  1675.74          3655      7838
    Suisse           10.16  20.41  44.99  102.55  211.75  787.54  1673.16          3686      7643
    Ukraine          10.07  20.00  45.11  105.08  210.33  790.78  1679.80          3711      7635
    USA               9.69  19.32  43.18  102.60  209.30  776.27  1633.98          3583      7538

How to establish a general ranking between these countries?

Weighted sum of the scores, then ranking of these sums.

Page 60:

4.c) High dimensionality — Dimensionality reduction using SVD

A weighted sum of the scores is equivalent to a matrix × vector multiplication:

Vector containing the scores = M · w

(M is the n × p table of national records shown on Page 59.)

Page 61:

4.c) High dimensionality — Dimensionality reduction using SVD

One can also look for a vector of norm 1 that maximizes the variability between the scores:

Optimal vector = 1st eigenvector (v1) of the SVD
Variability level = 1st eigenvalue (λ1) of the SVD

Vector of scores with the highest variability = M · v1

Page 62:

4.c) High dimensionality — Dimensionality reduction using SVD

One can then search for the vector of norm 1, orthogonal to v1, that maximizes the variability:

Optimal vector = 2nd eigenvector (v2) of the PCA
Variability level = 2nd eigenvalue (λ2) of the PCA

… and so on. These vectors can be calculated analytically.

Page 63:

4.c) High dimensionality — Dimensionality reduction using SVD

[Biplot. Black: projection of the data on PC2 and PC3. Red: influence of the variables in PC2 and PC3.]

A fantastic tool to visualize and interpret high dimensional data …

Page 64:

4.c) High dimensionality — Dimensionality reduction using SVD

[Scree plot of the eigenvalues λ1, λ2, …, λ8: captured variability per principal component]

[Table sketch: each country's nine records replaced by its three projections proj.PC1, proj.PC2, proj.PC3]

Projection of the data from a 9D space to a 3D space preserves here 90% of their variability!

… and a powerful tool to reduce the problem dimensionality before training a M.L. model.
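A minimal numpy sketch of the whole procedure on an n × p matrix M (a random matrix stands in for the table of records; centering the columns first is the usual convention):

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.normal(size=(26, 9))   # stand-in for the n=26 x p=9 records table

    # Center each column (event), then take the SVD
    Mc = M - M.mean(axis=0)
    U, s, Vt = np.linalg.svd(Mc, full_matrices=False)

    # v1 = first right singular vector: the norm-1 direction of maximal variability
    v1 = Vt[0]
    scores = Mc @ v1               # scores with the highest variability
    ranking = np.argsort(scores)   # general ranking of the countries

    # Captured variability per component (the scree plot values)
    explained = s**2 / (s**2).sum()
    print("first 3 components capture", explained[:3].sum())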

Page 65:

5) Supervised learning using Neural Networks

Supervised learning using Neural Networks

Page 66:

5) Supervised learning using Neural Networks

Deep learning …

[Network diagram: known inputs → learned parameters → predicted outputs; figure: https://pythonprogramming.net/neural-networks-machine-learning-tutorial/]

• Very efficient in important applications (signal, images).
• Computationally heavy learning phase, but quick predictions.
• Very large number of parameters to learn.
• Requires large databases of annotated data (or wise network designs).

Page 67:

5) Supervised learning using Neural Networks — User point of view

The classifier (black-box) takes I = an RGB image of size 200×200 (of a dog or a cat) and returns two predictions, h1(I) and h2(I):
• If dog: h1(I) == 0
• If cat: h1(I) == 1
• If nice: h2(I) == 0
• If aggressive: h2(I) == 1

Page 68:

5) Supervised learning using Neural Networks — User point of view

[Example image: the black-box predicts h1(I) = 1 and h2(I) = 0, i.e. a nice cat]

Page 69:

5) Supervised learning using Neural Networks — User point of view

[Example image: the black-box predicts h1(I) = 1 and h2(I) = 1, i.e. an aggressive cat]

Page 70:

5) Supervised learning using Neural Networks — User point of view

[Example image: the black-box predicts h1(I) = 0 and h2(I) = 0, i.e. a nice dog]

Page 71:

5) Supervised learning using Neural Networks — User point of view

Training phase: the parameters are optimized to get the best predictions on average.

Input training data: many images of dogs and cats
Output training data: the labels (h1, h2) of each image

Page 72:

5) Supervised learning using Neural Networks — Into the black-box

[Network diagram: input layer, hidden layers, output layer]

The x_i are typically the intensities of an RGB image I in each of its channels; the output layer carries the predicted labels.

Pages 73–76:

5) Supervised learning using Neural Networks — Into the black-box

[Network diagram built up layer by layer: Layer 1, Layer 2, Layer 3, …, Layer L]

Page 77:

5) Supervised learning using Neural Networks — Into the black-box

Training: minimize the expectation of the prediction error (≈ its average over the K training observations) between the predicted labels and the known labels.

[Network diagram: Layer 1 … Layer L; predicted labels compared to known labels]

Page 78:

5) Supervised learning using Neural Networks — Into the black-box

Stochastic gradient descent, where the gradient of the loss with respect to the layer-l parameters is:
• calculated on a subsample (batch) of the K observations at each iteration
• calculated analytically if l = L−1
• back-propagated if l < L−1

Page 79:

5) Supervised learning using Neural Networks — Into the black-box

In practice:
• Various types of layers
• Various types of architectures
• Various strategies to perform the stochastic gradient descent

Hidden but important properties of Neural Networks:
• Prediction and training can be straightforwardly parallelized on GPUs
• The Nvidia cuDNN library is massively used by Keras, TensorFlow, Theano, PyTorch, …
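A minimal PyTorch sketch of the training loop just described: a small fully-connected network, a batch drawn at each iteration, back-propagation, and one SGD step (architecture and toy data are illustrative assumptions):

    import torch
    from torch import nn

    # Tiny multi-layer perceptron: input layer -> 2 hidden layers -> output layer
    model = nn.Sequential(
        nn.Linear(2, 16), nn.ReLU(),   # Layer 1
        nn.Linear(16, 16), nn.ReLU(),  # Layer 2
        nn.Linear(16, 1),              # Layer L: one logit per observation
    )

    # Toy training data: K = 200 observations, 2 variables, binary labels
    torch.manual_seed(0)
    X = torch.randn(200, 2)
    y = (X.sum(dim=1) > 0).float().unsqueeze(1)

    loss_fn = nn.BCEWithLogitsLoss()   # average prediction error
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(500):
        idx = torch.randint(0, 200, (32,))     # random batch of the K observations
        opt.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])  # predicted vs. known labels
        loss.backward()                        # back-propagation through the layers
        opt.step()                             # stochastic gradient descent update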

Page 80:

That's all for now.

THANK YOU!!!