Lecture 1: Introduction to Machine Learning Isabelle Guyon isabelle@clopinet.com

Lecture 1: Introduction to

Machine Learning

Isabelle Guyonisabelle@clopinet.com

What is Machine Learning?

Learning

algorithm

TRAININGDATA Answer

Trained

machine

What for?

• Classification

• Time series prediction

• Regression

• Clustering

Applications

inputs

training examples

Bioinformatics

Ecology

OCRHWR

MarketAnalysis

TextCategorization

Machine Vision

10 102 103 104 105

Banking / Telecom / Retail

• Identify:– Prospective customers– Dissatisfied customers– Good customers– Bad payers

• Obtain:– More effective advertising– Less credit risk– Fewer fraud– Decreased churn rate

Biomedical / Biometrics

• Medicine:– Screening– Diagnosis and prognosis– Drug discovery

• Security:– Face recognition– Signature / fingerprint / iris

verification– DNA fingerprinting 6

Computer / Internet

• Computer interfaces:– Troubleshooting wizards – Handwriting and speech– Brain waves

• Internet– Hit ranking– Spam filtering– Text categorization– Text translation– Recommendation 7

Conventions

X={xij}

y ={yj}

Learning problem

Colon cancer, Alon et al 1999

Unsupervised learningIs there structure in data?

Supervised learningPredict an outcome y.

Data matrix: X

m lines = patterns (data points, examples): samples, patients, documents, images, …

n columns = features: (attributes, input variables): genes, proteins, words, pixels, …

Some Learning Machines

• Linear models

• Kernel methods

• Neural networks

• Decision trees

Linear Models

• f(x) = w x +b = j=1:n wj xj +b

Linearity in the parameters, NOT in the input components.

• f(x) = w (x) +b = j wj j(x) +b (Perceptron)

• f(x) = i=1:m i k(xi,x) +b (Kernel method)

Artificial Neurons

f(x) = w x + b

Synapses

Activation of other neurons Dendrites

Cell potential

Activation function

McCulloch and Pitts, 1943

Linear Decision Boundary

0.5-0.5

hyperplane

Perceptron

Rosenblatt, 1957

f(x) = w (x) + b

NL Decision Boundary

0.5-0.5

Hs.128749Hs.234680

Kernel Method

Potential functions, Aizerman et al 1964

f(x) = i i k(xi,x) + b

k(x1,x)

k(x2,x)

k(xm,x)

k(. ,. ) is a similarity measure or “kernel”.

A kernel is:• a similarity measure• a dot product in some feature space: k(s, t) = (s) (t)

But we do not need to know the representation.

Examples:

• k(s, t) = exp(-||s-t||2/2) Gaussian kernel

• k(s, t) = (s t)q Polynomial

kernel

What is a Kernel?

Hebb’s Rule

wj wj + yi xij

yxj wj

Synapse

Activation of another neuron

Dendrite

Link to “Naïve Bayes”

Kernel “Trick” (for Hebb’s rule)

• Hebb’s rule for the Perceptron:

w = i yi (xi)

f(x) = w (x) = i yi (xi) (x)

• Define a dot product:

k(xi,x) = (xi) (x)

f(x) = i yi k(xi,x)

Kernel “Trick” (general)

• f(x) = i i k(xi, x)

• k(xi, x) = (xi) (x)

• f(x) = w (x)

• w = i i (xi)

Dual forms

f(x) = i k(xi, x)

k(xi, x) = (xi).(x)

Potential Function algorithmi i + yi if yif(xi)<0(Aizerman et al 1964)

Dual minoveri i + yi for min yif(xi)

Dual LMSi i + (yi - f(xi))

Simple Kernel Methods

f(x) = w • (x)

Perceptron algorithm w w + yi (xi) if yif(xi)<0(Rosenblatt 1958)

Minover (optimum margin)w w + yi (xi) for min yif(xi)(Krauth-Mézard 1987)

LMS regressionw w + yi- f(xi)) (xi)

iw = i (xi)i

(ancestor of SVM 1992, similar to kernel Adatron, 1998,

and SMO, 1999)

Multi-Layer Perceptron

Back-propagation, Rumelhart et al, 1986

“hidden units”

internal “latent” variables

Chessboard Problem

Tree Classifiers

CART (Breiman, 1984) or C4.5 (Quinlan, 1993)

At each step, choose the feature that

“reduces entropy” most. Work towards “node purity”.

All the data

Choose f2

Choose f1

Iris Data (Fisher, 1936)

Linear discriminant Tree classifier

Gaussian mixture Kernel method (SVM)

setosavirginica

versicolor

Figure from Norbert Jankowski and Krzysztof Grabczewski

Fit / Robustness Tradeoff

Performance evaluation

f(x) = 0

f(x) > 0

f(x) < 0

f(x) = 0

f(x) > 0

f(x) < 0

f(x) = -1

f(x) > -1

f(x) < -1f(x) = -1

f(x) > -1

f(x) < -1

f(x) = 1

f(x) > 1

f(x) < 1

f(x) = 1

f(x) > 1

f(x) < 1

ROC Curve

For a given threshold on f(x), you get a point on the ROC curve. Actu

al ROC

Positive class success rate

(hit rate, sensitivity)

1 - negative class success rate (false alarm rate, 1-specificity)

Random ROC

Ideal ROC curve

ROC Curve

Ideal ROC curve (AUC=1)

0 AUC 1

Actual R

Random ROC (AUC=0.5)

For a given threshold on f(x), you get a point on the ROC curve.

Positive class success rate

(hit rate, sensitivity)

1 - negative class success rate (false alarm rate, 1-specificity)

What is a Risk Functional?

A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.

Examples:

• Classification: – Error rate: (1/m) i=1:m 1(F(xi)yi)

– 1- AUC

• Regression: – Mean square error: (1/m) i=1:m(f(xi)-yi)2

How to train?

• Define a risk functional R[f(x,w)]• Optimize it w.r.t. w (gradient descent,

mathematical programming, simulated annealing, genetic algorithms, etc.)

Parameter space (w)

R[f(x,w)]

w*(… to be continued in the next lecture)

Summary

• With linear threshold units (“neurons”) we can build:– Linear discriminant (including Naïve Bayes)– Kernel methods– Neural networks– Decision trees

• The architectural hyper-parameters may include:– The choice of basis functions (features)– The kernel – The number of units

• Learning means fitting:– Parameters (weights)– Hyper-parameters– Be aware of the fit vs. robustness tradeoff

Want to Learn More?

• Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html

• The Elements of statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, J. Friedman, Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/

• Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork, In Smola et al Eds. Advances in Large Margin Classiers. Pages 147--169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz

• Feature Extraction: Foundations and Applications. I. Guyon et al, Eds. Book for practitioners with datasets of NIPS 2003 challenge, tutorials, best performing methods, Matlab code, teaching material. http://clopinet.com/fextract-book

Lecture 1: Introduction to Machine Learning Isabelle Guyon isabelle@clopinet.com

Documents

Causality Workbenchclopinet.com/causality Challenges in Causality Isabelle Guyon, Clopinet Constantin Aliferis and Alexander Statnikov, Vanderbilt Univ

RESULTS OF THE NIPS 2006 MODEL SELECTION GAME Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley, Olivier Guyon, and many other volunteers, see

Looking at people (again!)clopinet.com/isabelle/Projects/CVPR2011/slides/Forsight.pdfComputational Behavioural Science • Observe people • Using vision, physiological markers ‒

Higgs Machine Learning Challenge Claire Adam-Bourdarios, Glen Cowan, Cécile Germain, Isabelle Guyon, Balazs Kegl, David Rousseau higgsml@lal.in2p3.frhiggsml@lal.in2p3.fr

Machine Learning Crash Course Computer Vision James Hays Slides: Isabelle Guyon, Erik Sudderth, Mark Johnson, Derek Hoiem Photo: CMU Machine Learning Department

Learning Theory Put to Work Isabelle Guyon isabelle@clopinet.com

An Introduction to Variable and Feature Selection - MIT CSAIL · An Introduction to Variable and Feature Selection Isabelle Guyon ISABELLE@CLOPINET.COM Clopinet 955 Creston Road Berkeley,

Causal feature selection - ClopiNet · Causal feature selection Isabelle Guyon, Clopinet, California ... mechanisms, distinguishing between actual features and experimental artifacts,

Isabelle Guyon, President, ChaLearn at MLconf SF - 11/13/15

A computer-based system to support forensic …clopinet.com/isabelle/Projects/WANDA/proper/proper_white.pdfKatrin Franke, Mario K¨oppen: A computer-based system to support forensic

Feature selection and causal discovery fundamentals and applications Isabelle Guyon isabelle@clopinet.com

Causality Workbenchclopinet.com/causality Cause-Effect Pair Challenge Isabelle Guyon, ChaLearn IJCNN 2013 IEEE/INNS

Feature Selection and Causal discovery Isabelle Guyon, Clopinet André Elisseeff, IBM Zürich Constantin Aliferis, Vanderbilt University

RESULTS OF THE WCCI 2006 PERFORMANCE PREDICTION CHALLENGE Isabelle Guyon Amir Reza Saffari Azar Alamdari Gideon Dror

Unsupervised and Transfer Learning Challenges in · PDF fileUnsupervised and Transfer Learning Challenges in Machine Learning, Volume 7 Isabelle Guyon, Gideon Dror, Vincent Lemaire,

Causality Workbenchclopinet.com/causality The LOCANET task (Pot-luck challenge, NIPS 2008) Isabelle Guyon, Clopinet Alexander Statnikov, Vanderbilt Univ

Detecting Stable Clusters Using Principal Component Analysis · Detecting Stable Clusters Using Principal Component Analysis Asa Ben-Hur and Isabelle Guyon 1 Introduction Clustering

A practical guide to model selection - Semantic …...A practical guide to model selection Isabelle Guyon ClopiNet, Berkeley, CA 94708, USA email: isabelle@clopinet.com 1. Introduction

Clinical trials with non-adherence & unblinding: a ...clopinet.com/isabelle/Projects/NIPS2013/slides/Silver_NIPS2013.pdfIntro Overview Overview Even with perfect randomization, clinical

Distinguishing between Cause and Effect: Estimation of ...clopinet.com/isabelle/Projects/NIPS2013/slides/Peters_NIPS2013.pdf · Y non-Gaussian X Y Then there is no X = ˚Y + N X N