Lecture 1: Introduction to Machine Learning
Isabelle Guyon, isabelle@clopinet.com


Page 1:

Lecture 1: Introduction to Machine Learning

Isabelle Guyon
isabelle@clopinet.com

Page 2:

What is Machine Learning?

[Diagram: TRAINING DATA feeds a learning algorithm, which produces a trained machine; the trained machine takes a Query and returns an Answer.]

Page 3:

What for?

• Classification

• Time series prediction

• Regression

• Clustering

Page 4:

Applications

[Chart: application domains placed by number of inputs (10 to 10^5) versus number of training examples (10 to 10^5): System diagnosis, Ecology, Bioinformatics, OCR/HWR, Market Analysis, Text Categorization, Machine Vision.]

Page 5:

Banking / Telecom / Retail

• Identify:
  – Prospective customers
  – Dissatisfied customers
  – Good customers
  – Bad payers

• Obtain:
  – More effective advertising
  – Less credit risk
  – Less fraud
  – Decreased churn rate

Page 6:

Biomedical / Biometrics

• Medicine:
  – Screening
  – Diagnosis and prognosis
  – Drug discovery

• Security:
  – Face recognition
  – Signature / fingerprint / iris verification
  – DNA fingerprinting

Page 7:

Computer / Internet

• Computer interfaces:
  – Troubleshooting wizards
  – Handwriting and speech
  – Brain waves

• Internet:
  – Hit ranking
  – Spam filtering
  – Text categorization
  – Text translation
  – Recommendation

Page 8:

Conventions

[Diagram: the data matrix X = {xij} has m rows and n columns; xi denotes one row (one example); y = {yj} is the vector of target values; w is the vector of model parameters (weights).]

Page 9:

Learning problem

[Heat map of a data matrix: colon cancer data, Alon et al., 1999.]

Unsupervised learning: is there structure in the data?

Supervised learning: predict an outcome y.

Data matrix: X

m lines = patterns (data points, examples): samples, patients, documents, images, …

n columns = features (attributes, input variables): genes, proteins, words, pixels, …

Page 10:

Some Learning Machines

• Linear models

• Kernel methods

• Neural networks

• Decision trees

Page 11:

Linear Models

• f(x) = w · x + b = Σj=1:n wj xj + b

Linearity in the parameters, NOT in the input components.

• f(x) = w · Φ(x) + b = Σj wj φj(x) + b   (Perceptron)

• f(x) = Σi=1:m αi k(xi, x) + b   (Kernel method)
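As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of the three forms; the function names are invented for the example.

    import numpy as np

    def linear_f(x, w, b):
        # f(x) = w . x + b
        return np.dot(w, x) + b

    def basis_expansion_f(x, w, b, phi):
        # f(x) = w . Phi(x) + b, for a user-supplied basis expansion phi
        return np.dot(w, phi(x)) + b

    def kernel_f(x, X_train, alpha, b, k):
        # f(x) = sum_i alpha_i k(x_i, x) + b
        return sum(a * k(xi, x) for a, xi in zip(alpha, X_train)) + b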

Page 12:

Artificial Neurons

[Diagram: inputs x1, x2, …, xn and a constant input 1 are weighted by w1, w2, …, wn and bias b, summed, and passed through an activation function to give f(x). Biological analogy: dendrites collect the activation of other neurons, synapses carry the weights, the cell potential is the weighted sum, and the axon transmits the output.]

f(x) = w · x + b

McCulloch and Pitts, 1943
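A toy threshold unit in the spirit of this slide could look like the following sketch; the step activation and the ±1 output convention are assumptions, not stated on the slide.

    import numpy as np

    def neuron(x, w, b):
        potential = np.dot(w, x) + b           # "cell potential": weighted sum plus bias
        return 1.0 if potential > 0 else -1.0  # step activation (assumed +1 / -1 outputs)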

Page 13:

Linear Decision Boundary

[3-D scatter plot: two classes of examples plotted against three features x1, x2, x3 (genes X1, X2, X3), separated by a hyperplane.]

Page 14:

Perceptron

Rosenblatt, 1957

[Diagram: inputs x1, x2, …, xn are mapped through basis functions φ1(x), φ2(x), …, φN(x); these outputs and a constant input 1 are weighted by w1, w2, …, wN and bias b and summed to give f(x).]

f(x) = w · Φ(x) + b

Page 15:

NL Decision Boundary

[3-D scatter plot: two classes of examples plotted against features x1, x2, x3 (genes Hs.128749, Hs.234680, Hs.7780), separated by a non-linear decision boundary.]

Page 16:

Kernel Method

Potential functions, Aizerman et al., 1964

f(x) = Σi αi k(xi, x) + b

[Diagram: the input x is compared to each training example through k(x1, x), k(x2, x), …, k(xm, x); these similarities and a constant input 1 are weighted by α1, α2, …, αm and bias b and summed to give f(x).]

k(· , ·) is a similarity measure or "kernel".

Page 17:

What is a Kernel?

A kernel is:
• a similarity measure
• a dot product in some feature space: k(s, t) = Φ(s) · Φ(t)

But we do not need to know the representation Φ.

Examples:

• k(s, t) = exp(-||s-t||² / 2σ²)   Gaussian kernel

• k(s, t) = (s · t)^q   Polynomial kernel
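A minimal sketch of the two example kernels, assuming numpy; sigma and q are hyper-parameters chosen by the user.

    import numpy as np

    def gaussian_kernel(s, t, sigma=1.0):
        # k(s, t) = exp(-||s - t||^2 / (2 sigma^2))
        return np.exp(-np.sum((s - t) ** 2) / (2.0 * sigma ** 2))

    def polynomial_kernel(s, t, q=2):
        # k(s, t) = (s . t)^q
        return np.dot(s, t) ** q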

Page 18:

Hebb's Rule

wj ← wj + yi xij

[Diagram: a synapse with weight wj connects the activation xj of another neuron (via its axon) to a dendrite of the neuron with output y.]

Link to "Naïve Bayes"
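A sketch of Hebb's rule accumulated over a whole training set, assuming a numpy data matrix and labels in {-1, +1}:

    import numpy as np

    def hebb_train(X, y):
        # w_j <- w_j + y_i * x_ij, applied for every example i and every feature j
        w = np.zeros(X.shape[1])
        for xi, yi in zip(X, y):
            w += yi * xi
        return w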

Page 19:

Kernel "Trick" (for Hebb's rule)

• Hebb's rule for the Perceptron:

w = Σi yi Φ(xi)

f(x) = w · Φ(x) = Σi yi Φ(xi) · Φ(x)

• Define a dot product:

k(xi, x) = Φ(xi) · Φ(x)

f(x) = Σi yi k(xi, x)

Page 20:

Kernel "Trick" (general)

• f(x) = Σi αi k(xi, x)

• k(xi, x) = Φ(xi) · Φ(x)

• f(x) = w · Φ(x)

• w = Σi αi Φ(xi)

Dual forms
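A small sanity check of the two equivalent forms, using an explicit toy feature map Φ so both sides can be computed; the feature map and the random data are purely illustrative.

    import numpy as np

    def phi(x):
        # toy feature map, chosen only for illustration
        return np.array([x[0], x[1], x[0] * x[1]])

    def k(s, t):
        # k(s, t) = Phi(s) . Phi(t)
        return np.dot(phi(s), phi(t))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))       # 5 "training" points
    alpha = rng.normal(size=5)        # dual coefficients
    x = rng.normal(size=2)            # query point

    w = sum(a * phi(xi) for a, xi in zip(alpha, X))         # w = sum_i alpha_i Phi(x_i)
    f_primal = np.dot(w, phi(x))                            # f(x) = w . Phi(x)
    f_dual = sum(a * k(xi, x) for a, xi in zip(alpha, X))   # f(x) = sum_i alpha_i k(x_i, x)
    assert np.isclose(f_primal, f_dual)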

Page 21:

Simple Kernel Methods

Primal form: f(x) = w · Φ(x)

• Perceptron algorithm (Rosenblatt, 1958): w ← w + yi Φ(xi) if yi f(xi) < 0
• Minover, optimum margin (Krauth-Mézard, 1987): w ← w + yi Φ(xi) for min yi f(xi)
• LMS regression: w ← w + (yi - f(xi)) Φ(xi)

Dual form: f(x) = Σi αi k(xi, x), with k(xi, x) = Φ(xi) · Φ(x) and w = Σi αi Φ(xi)

• Potential Function algorithm (Aizerman et al., 1964): αi ← αi + yi if yi f(xi) < 0
• Dual Minover: αi ← αi + yi for min yi f(xi)
• Dual LMS: αi ← αi + (yi - f(xi))

(Ancestor of SVM, 1992; similar to kernel Adatron, 1998, and SMO, 1999.)
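As an illustration, here is a sketch of the Perceptron update in its primal and dual forms, assuming a user-supplied feature map phi and kernel k; using ≤ 0 instead of < 0 simply lets the algorithm start from zero weights.

    import numpy as np

    def perceptron_primal(X, y, phi, epochs=10):
        # w <- w + y_i Phi(x_i) whenever y_i f(x_i) is not positive
        w = np.zeros(len(phi(X[0])))
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * np.dot(w, phi(xi)) <= 0:
                    w += yi * phi(xi)
        return w

    def perceptron_dual(X, y, k, epochs=10):
        # alpha_i <- alpha_i + y_i whenever y_i f(x_i) is not positive,
        # with f(x) = sum_j alpha_j k(x_j, x)
        alpha = np.zeros(len(X))
        for _ in range(epochs):
            for i, (xi, yi) in enumerate(zip(X, y)):
                f_xi = sum(a * k(xj, xi) for a, xj in zip(alpha, X))
                if yi * f_xi <= 0:
                    alpha[i] += yi
        return alpha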

Page 22:

Multi-Layer Perceptron

Back-propagation, Rumelhart et al., 1986

[Diagram: inputs xj feed a layer of "hidden units" (internal "latent" variables), whose outputs feed the output unit.]

Page 23:

Chessboard Problem

Page 24:

Tree Classifiers

CART (Breiman, 1984) or C4.5 (Quinlan, 1993)

At each step, choose the feature that "reduces entropy" most. Work towards "node purity".

[Diagram: all the data is split recursively, choosing feature f2 at one node and f1 at another, yielding progressively purer nodes, as sketched below.]
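A sketch of the "reduce entropy most" criterion for a single binary split on one feature; the threshold and the function names are illustrative.

    import numpy as np

    def entropy(y):
        # Shannon entropy of a vector of class labels
        if len(y) == 0:
            return 0.0
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(x, y, threshold):
        # entropy reduction obtained by splitting feature x at the given threshold
        left, right = y[x <= threshold], y[x > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        return entropy(y) - weighted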

Page 25:

Iris Data (Fisher, 1936)

[Figure: the three iris classes (setosa, versicolor, virginica) separated by four methods: linear discriminant, tree classifier, Gaussian mixture, and kernel method (SVM). Figure from Norbert Jankowski and Krzysztof Grabczewski.]

Page 26:

Fit / Robustness Tradeoff

[Figure: two decision boundaries in the (x1, x2) plane, illustrating the tradeoff between fitting the training data closely and keeping the boundary simple and robust.]

Page 27:

Performance evaluation

[Figure: decision boundaries in the (x1, x2) plane with the regions f(x) > 0, f(x) = 0, and f(x) < 0 marked; examples are classified according to the sign of f(x).]

Page 28:

Performance evaluation

[Figure: the same plots with the decision threshold shifted to f(x) = -1, separating the regions f(x) > -1 and f(x) < -1.]

Page 29:

Performance evaluation

[Figure: the same plots with the decision threshold shifted to f(x) = 1, separating the regions f(x) > 1 and f(x) < 1.]

Page 30:

ROC Curve

For a given threshold on f(x), you get a point on the ROC curve.

[Plot: positive class success rate (hit rate, sensitivity) on the vertical axis versus 1 - negative class success rate (false alarm rate, 1 - specificity) on the horizontal axis, both from 0 to 100%; the actual ROC curve lies between the diagonal random ROC and the ideal ROC curve.]

Page 31:

ROC Curve

For a given threshold on f(x), you get a point on the ROC curve.

0 ≤ AUC ≤ 1

[Plot: same axes as before; the ideal ROC curve has AUC = 1, the random (diagonal) ROC has AUC = 0.5, and the actual ROC curve lies in between.]
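A minimal sketch of how an ROC curve and its AUC could be computed by sweeping a threshold on the scores f(x), assuming numpy and labels in {-1, +1}:

    import numpy as np

    def roc_curve(scores, labels):
        # one ROC point per threshold: hit rate vs. false alarm rate
        thresholds = np.sort(np.unique(scores))[::-1]
        n_pos, n_neg = np.sum(labels == 1), np.sum(labels == -1)
        false_alarms, hits = [0.0], [0.0]
        for t in thresholds:
            pred_pos = scores >= t
            hits.append(np.sum(pred_pos & (labels == 1)) / n_pos)           # sensitivity
            false_alarms.append(np.sum(pred_pos & (labels == -1)) / n_neg)  # 1 - specificity
        return np.array(false_alarms), np.array(hits)

    def auc(scores, labels):
        fa, hr = roc_curve(scores, labels)
        return np.trapz(hr, fa)  # area under the curve by the trapezoidal rule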

Page 32:

What is a Risk Functional?

A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.

Examples:

• Classification:
  – Error rate: (1/m) Σi=1:m 1(F(xi) ≠ yi)
  – 1 - AUC

• Regression:
  – Mean square error: (1/m) Σi=1:m (f(xi) - yi)²
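The two example risks written as empirical averages over a sample, in a minimal sketch assuming numpy arrays of predictions and targets:

    import numpy as np

    def error_rate(predictions, y):
        # (1/m) sum_i 1(F(x_i) != y_i)
        return np.mean(predictions != y)

    def mean_square_error(f_values, y):
        # (1/m) sum_i (f(x_i) - y_i)^2
        return np.mean((f_values - y) ** 2)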

Page 33:

How to train?

• Define a risk functional R[f(x,w)]
• Optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.)

[Plot: the risk R[f(x,w)] as a function of the parameter space (w), with the optimum at w*.]

(… to be continued in the next lecture)
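As one concrete instance of "optimize the risk w.r.t. w", here is a gradient-descent sketch for the mean square error of a linear model; the learning rate and step count are arbitrary illustrative choices.

    import numpy as np

    def train_gradient_descent(X, y, lr=0.01, steps=1000):
        # minimize R[w] = (1/m) sum_i (w . x_i - y_i)^2 by gradient descent
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            residuals = X @ w - y                     # f(x_i) - y_i
            grad = 2.0 * (X.T @ residuals) / len(y)   # gradient of the risk
            w -= lr * grad                            # step downhill in parameter space
        return w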

Page 34:

Summary

• With linear threshold units ("neurons") we can build:
  – Linear discriminant (including Naïve Bayes)
  – Kernel methods
  – Neural networks
  – Decision trees

• The architectural hyper-parameters may include:
  – The choice of basis functions (features)
  – The kernel
  – The number of units

• Learning means fitting:
  – Parameters (weights)
  – Hyper-parameters
  – Be aware of the fit vs. robustness tradeoff

Page 35:

Want to Learn More?

• Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html

• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, and J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, and clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/

• Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al., Eds., Advances in Large Margin Classifiers, pages 147-169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz

• Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book