
Artificial Intelligence (CSC9YE)
Machine Learning: Lectures 6 and 7

Fabio [email protected]


Overview

Part I. Introduction
  • Definitions
  • Supervised Learning
  • Model Selection

Part II. Decision Trees
  • Recursive Partitioning
  • Classification And Regression Trees
  • Ensembles

Part I

Motivation
figure from Andrew Ng, Coursera

Where the archetypal startup of 2008 was “x but on a phone” and the startup of 2014 was “Uber but for x”, this year is the year of “doing x with machine learning.”

from “Google says machine learning is the future. So I tried it myself”, The Guardian, 28 June 2016

https://www.theguardian.com/technology/2016/jun/28/google-says-machine-learning-is-the-future-so-i-tried-it-myself

Definition
from (T. Mitchell 1997)

• Machine learning is concerned with building computer programs that can automatically improve with experience.

• A machine learning algorithm is an algorithm that is able to learn from data. What does it mean?

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Learning Paradigms

Supervised Learning: the machine is presented with a series of input-output examples and learns a function that matches inputs to outputs. success = max accuracy
  • regression
  • classification

Unsupervised Learning: the machine is presented with a series of inputs and learns how they are organised. success = ?
  • clustering (or segmentation)
  • dimensionality reduction

Reinforcement Learning: the machine learns to determine the ideal behaviour based on feedback from the environment, rewards or punishments. success = max reward
  • game playing
  • on-line control

The Two Cultures

[Figure: X → Nature → y, with the data observed as pairs < y, X >]

Machine Learning: subfield of Artificial Intelligence
  • emphasis: on algorithms and applications at scale
  • goal: prediction

Statistical Learning: subfield of Statistics
  • emphasis: on models, assumptions and interpretability
  • goal: inference

Supervised Learning setting

$$
X = \begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1p} \\
x_{21} & x_{22} & \dots & x_{2p} \\
x_{31} & x_{32} & \dots & x_{3p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \dots & x_{np}
\end{pmatrix},
\qquad
y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
$$

Data: list of observations in the form $L = \{\langle X, y \rangle\}$

$X_{n \times p}$: feature matrix / design matrix
  • n samples / examples / data points
  • p features / predictors / covariates

$y_{n \times 1}$: target vector / labels
  • regression: continuous values
  • classification: finite set of types

Problem: learn $y = f(X)$

A Simple Regression Task

E.g., < x, y > continuous variables, n = 20 points, p = 1 feature.
How to automatically find a mapping f from x to y?

[Figure: scatter plot of the 20 points; x axis from 0 to 500, y axis from 10 to 40]

Parametric Models

• Assume $f(x) = \beta_0 + \beta_1 x$

• Find the parameters $\beta$ that best fit the observed data

[Figure: scatter plot of the data; x axis from 0 to 500, y axis from 10 to 40]
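The slide does not say which criterion defines the "best fit". A minimal sketch, assuming ordinary least squares (the usual choice for a straight line) and a hypothetical helper name `fit_line`, uses the closed-form solution for a single feature. The toy data are invented for illustration and are not the data plotted in the lecture.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the model f(x) = b0 + b1 * x.

    Assumption: "best fit" means minimising the sum of squared residuals.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical toy data lying exactly on y = 10 + 0.05 * x.
xs = [0.0, 100.0, 200.0, 300.0, 400.0]
ys = [10.0, 15.0, 20.0, 25.0, 30.0]
b0, b1 = fit_line(xs, ys)
```

With noisy data the recovered parameters would only approximate the generating ones, which is the situation the later slides on model accuracy address.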

Assessing Model Accuracy
for a regression task

• Unknown data-generating process: $y_i = f(x_i) + \varepsilon_i$, where $f(x_i)$ is the “signal” and $\varepsilon_i$ is the “noise”

• Predictions from the model: $\hat{y}_i = \hat{f}(x_i)$

• How well is the model doing? Compare predictions $\hat{y}_i$ to their corresponding true values $y_i$:

  Mean Squared Error: $\mathrm{MSE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

  Mean Absolute Error: $\mathrm{MAE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$

  or other loss functions

• Learning a model seeks to minimise a loss function, which gives the cost of predicting $\hat{y}$ instead of $y$
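The two loss functions above translate directly into code. The sketch below is illustrative only: the function names and the three example values are assumptions, not part of the lecture.

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of squared differences."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute differences."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n


# Hypothetical true values and predictions; the errors are 1, 0 and -3.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 10.0]
mse_value = mse(y_true, y_pred)  # (1 + 0 + 9) / 3
mae_value = mae(y_true, y_pred)  # (1 + 0 + 3) / 3
```

Because the errors are squared, MSE penalises the single large error more heavily than MAE does.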

Model Selection

[Figure: 3×3 grid of panels labelled degree=1 to degree=9, each showing the same 20 data points; x axis from 0 to 500, y axis from 10 to 40]

Prediction Error: Train Error vs Test Error

Train Error: average error on the same observations that are used to build the model

Test Error: average error when making predictions on new data, not used to build the model

[Figure: train and test MSE (y axis, ticks from 25 to 75) against polynomial degree (x axis, 1 to 8)]

Underfitting and Overfitting

• small training error and large test error indicate overfitting, i.e. learning the noise instead of the signal

[Figure: train and test MSE (y axis, ticks from 25 to 75) against polynomial degree (x axis, 1 to 8)]

• how to estimate the test error? hold-out or resampling!

Hold-out approach: Train/Test Split
to evaluate a single model

[Figure: the data (X, y) divided into Seen Data (Training Set) and Unseen Data (Validation Set)]

• build the model using only a subset of available data (training set)

• measure model accuracy on the held-out data (validation set)

• expected accuracy = accuracy on the validation set

+ simple to code

+ fast to evaluate

- does not exploit all data

- results depend on split
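The hold-out procedure can be sketched in a few lines. The function name `train_test_split`, the 25% validation fraction and the fixed seed are assumptions made for this illustration; the slide does not prescribe a split ratio.

```python
import random


def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and hold out a fraction for validation."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    split = len(shuffled) - n_test
    return shuffled[:split], shuffled[split:]


# Hypothetical data set of 20 (x, y) pairs.
data = [(x, 2.0 * x + 1.0) for x in range(20)]
train, valid = train_test_split(data)
```

Changing `seed` changes which observations land in the validation set, which is the "results depend on split" weakness listed above.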

Cross-Validation approach: e.g., K-Fold
to evaluate a model building procedure

[Figure: the data (X, y) split into K subsets; in each of Iteration 1, 2, 3, ..., K a different subset serves as the Validation Set (Out-Of-Fold) and the rest as the Training Set (Fold), giving Accuracy 1, 2, 3, ..., K]

• split the data into K subsets and repeat K times:
  1. build a model using (K − 1) subsets as training set
  2. measure model accuracy on the held-out subset

• expected accuracy = average accuracy over iterations

+ good estimation of the generalisation error

- data efficient but computationally expensive
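The K-fold loop can be sketched as follows. All names here (`k_fold_indices`, `cross_validate`, and the mean-predicting base model) are hypothetical and chosen only to keep the example self-contained; the folds are contiguous, so in practice the data would normally be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, valid_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n))
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


def cross_validate(xs, ys, k, fit, predict, loss):
    """Average the validation loss over the K iterations."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [predict(model, xs[i]) for i in valid_idx]
        scores.append(loss([ys[i] for i in valid_idx], preds))
    return sum(scores) / len(scores)


# Hypothetical base model: always predicts the mean training target.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


def squared_loss(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


xs = list(range(6))
ys = [1.0] * 6  # constant targets, so the mean model makes no error
cv_error = cross_validate(xs, ys, 3, fit_mean, predict_mean, squared_loss)
fold_sizes = [len(valid) for _, valid in k_fold_indices(7, 3)]
```

Each observation is used for validation exactly once, which is why the estimate uses the data efficiently at the price of fitting K models.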

Expected Prediction Error

• In practice:
  1. set aside a validation set, hidden from the training set
  2. use cross-validation on the training set for model selection
  3. refit the selected final model on the whole training set
  4. validate the final model on the held-out cases

• In theory:

$$
\mathrm{Error}(x) = \underbrace{\mathrm{Bias}(\hat{f}(x))^2 + \mathrm{Var}(\hat{f}(x))}_{\text{model dependent}} + \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible}}
$$

where the last term is “noise” and the first terms indicate:
  “bias”: how close are predictions and their corresponding true values?
  “variance”: how much do predictions vary if training data change?

• a good model has low bias and low variance
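The bias and variance terms can be estimated by simulation when the data-generating process is known. The sketch below is an illustration under invented assumptions: a true function f(x) = 2x, Gaussian noise with unit variance, and a deliberately too-simple model that always predicts the mean of its training targets. None of these choices come from the lecture.

```python
import random


def true_f(x):
    """Assumed "signal" for this illustration."""
    return 2.0 * x


def sample_training_set(rng, n=20, noise_sd=1.0):
    """Draw n points with x uniform on [0, 1] and y = f(x) + noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_constant(xs, ys):
    """Too-simple model: ignores x and predicts the mean target."""
    return sum(ys) / len(ys)


rng = random.Random(0)
x0 = 1.0  # point at which the error is decomposed

# Refit the model on many independent training sets and record its prediction.
preds = [fit_constant(*sample_training_set(rng)) for _ in range(2000)]
mean_pred = sum(preds) / len(preds)

bias_sq = (mean_pred - true_f(x0)) ** 2
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
```

In this setup the squared bias should come out close to 1 and the variance well below 0.1, so bias dominates the error, as expected for a model that is too simple.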

Bias and Variance

[Figure: 2×2 grid of dot clusters illustrating the combinations of Low Bias / High Bias and Low Variance / High Variance]

The Bias-Variance Trade-off

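[Figure: three columns for degree = 1, degree = 3 and degree = 9, each with a plot of y against x and a plot of error(x), bias²(x), var(x) and noise(x) against x]

• degree = 1: model is too simple, bias dominates error
• degree = 3: optimal tradeoff
• degree = 9: model is too complex, variance dominates error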

Test and Training Error
figure from (Hastie et al. 2009), p. 38, Chapter 2 “Overview of Supervised Learning”

[Figure 2.11: Prediction Error (y axis) against Model Complexity (x axis, Low to High) for the Training Sample and the Test Sample; low complexity is marked High Bias, Low Variance and high complexity is marked Low Bias, High Variance]

FIGURE 2.11. Test and training error as a function of model complexity.

be close to $f(x_0)$. As k grows, the neighbors are further away, and then anything can happen.

The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias–variance tradeoff.

More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error $\frac{1}{N} \sum_i (y_i - \hat{y}_i)^2$. Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions $\hat{f}(x_0)$ will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.

Part II

A Binary Classification Task

E.g., < x1, x2 > ∈ R features, < y > ∈ {red, blue} labels, n = 200.
How to automatically find a mapping f from (x1, x2) to y?

[Figure: scatter plot of the 200 points coloured by label y (red or blue); x1 and x2 axes both from 0.00 to 1.00]

Performance of a Binary Classifier

• confusion matrix:

|                        | actual class y: yes  | actual class y: no   |
| predicted class ŷ: yes | True Positives (TP)  | False Positives (FP) |
| predicted class ŷ: no  | False Negatives (FN) | True Negatives (TN)  |

• metrics from the confusion matrix:

  Accuracy: $\frac{TP + TN}{n}$

  Misclassification Rate: $\frac{FP + FN}{n} = 1 - \text{Accuracy}$

  TP Rate (aka “Sensitivity” aka “Recall”): $\frac{TP}{n_{yes}} = \frac{TP}{TP + FN}$

  TN Rate (aka “Specificity”): $\frac{TN}{n_{no}} = \frac{TN}{FP + TN}$

  FP Rate: $\frac{FP}{n_{no}} = \frac{FP}{FP + TN} = 1 - \text{Specificity}$

  Precision: $\frac{TP}{TP + FP}$

  F1 Score: $2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

• ROC curve: TP Rate (y-axis) vs FP Rate (x-axis)

  AUC (aka AUROC): area under the curve
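The metrics above follow mechanically from the four cells of the confusion matrix. The sketch below uses a hypothetical function name and invented counts; it assumes every denominator is non-zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the confusion-matrix metrics listed above.

    Assumes no denominator is zero (e.g. at least one predicted positive).
    """
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # TP rate, sensitivity
    specificity = tn / (fp + tn)  # TN rate
    return {
        "accuracy": (tp + tn) / n,
        "misclassification_rate": (fp + fn) / n,
        "recall": recall,
        "specificity": specificity,
        "fp_rate": fp / (fp + tn),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for n = 100 observations.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

For these counts accuracy is 0.7 while recall is only 40/60, which shows why a single number rarely summarises a classifier.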

Decision Tree: Divide and Conquer

[Figure: the scatter plot of the 200 red and blue points (x1 and x2 axes from 0.00 to 1.00) next to the decision tree that partitions it; each node shows its predicted class followed by its two class counts]

x2 < 0.5? (red, 125 75)
  yes: red, 94 6
  no: x1 >= 0.7? (blue, 31 69)
    yes: red, 28 1
    no: blue, 3 68

• Divide and Conquer:
  1. recursively partition the input data
  2. fit a simple model within each partition

• In particular, for simplicity:
  1a. binary splits (yes/no) that induce axis-parallel partitions
  1b. greedy selection of the split that maximises node “purity”
  2. same prediction for all samples within a partition

Tree Building Algorithm
code from (G. Louppe 2014)

function BuildDecisionTree(L)
    Create node t
    if the stopping criterion is met for t then
        Assign a model to $\hat{y}_t$
    else
        Find the split on L that maximizes impurity decrease
            $s^* = \arg\max_s \; i(t) - p_L \, i(t_L^s) - p_R \, i(t_R^s)$
        Partition L into $L_{t_L} \cup L_{t_R}$ according to $s^*$
        $t_L$ = BuildDecisionTree($L_{t_L}$)
        $t_R$ = BuildDecisionTree($L_{t_R}$)
    end if
    return t
end function
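The pseudocode above can be made concrete for a small special case. The sketch below is not Louppe's implementation: it assumes a single numeric feature, a regression task with mean squared error as the impurity i(t), the node mean as the leaf model, and a stopping criterion based on hypothetical `max_depth` and `min_samples` parameters. The toy data are invented.

```python
def mse_impurity(ys):
    """i(t) for regression: mean squared deviation from the node mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def best_split(xs, ys):
    """Return (threshold, decrease) maximising i(t) - p_L i(t_L) - p_R i(t_R)."""
    parent = mse_impurity(ys)
    best = (None, 0.0)
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        p_left = len(left) / len(ys)
        p_right = len(right) / len(ys)
        decrease = (parent - p_left * mse_impurity(left)
                    - p_right * mse_impurity(right))
        if decrease > best[1]:
            best = (threshold, decrease)
    return best


def build_tree(xs, ys, depth=0, max_depth=2, min_samples=2):
    """Recursive partitioning; every node stores the mean of its targets."""
    node = {"prediction": sum(ys) / len(ys), "n": len(ys)}
    if depth >= max_depth or len(ys) < min_samples:
        return node  # stopping criterion met: the node becomes a leaf
    threshold, _ = best_split(xs, ys)
    if threshold is None:
        return node  # no split decreases the impurity
    left = [(x, y) for x, y in zip(xs, ys) if x < threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
    node["threshold"] = threshold
    node["left"] = build_tree([x for x, _ in left], [y for _, y in left],
                              depth + 1, max_depth, min_samples)
    node["right"] = build_tree([x for x, _ in right], [y for _, y in right],
                               depth + 1, max_depth, min_samples)
    return node


def predict(node, x):
    """Follow the splits down to a leaf and return its prediction."""
    while "threshold" in node:
        node = node["left"] if x < node["threshold"] else node["right"]
    return node["prediction"]


# Hypothetical toy data with one obvious split between x = 3 and x = 10.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
tree = build_tree(xs, ys, max_depth=1)
```

The search is greedy: each split is the best one for the current node only, with no look-ahead to later splits.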

Measuring Node Impurity
for a binary classification task, figure from (Hastie et al. 2009), p. 309, Section 9.2 “Tree-Based Methods”

[Figure 9.3: Entropy, Gini index and Misclassification error plotted against p (x axis, 0.0 to 1.0); y axis from 0.0 to 0.5]

FIGURE 9.3. Node impurity measures for two-class classification, as a function of the proportion p in class 2. Cross-entropy has been scaled to pass through (0.5, 0.5).

impurity measure $Q_m(T)$ defined in (9.15), but this is not suitable for classification. In a node m, representing a region $R_m$ with $N_m$ observations, let

$$
\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k),
$$

the proportion of class k observations in node m. We classify the observations in node m to class $k(m) = \arg\max_k \hat{p}_{mk}$, the majority class in node m. Different measures $Q_m(T)$ of node impurity include the following:

Misclassification error: $\frac{1}{N_m} \sum_{i \in R_m} I(y_i \neq k(m)) = 1 - \hat{p}_{mk(m)}$.

Gini index: $\sum_{k \neq k'} \hat{p}_{mk} \hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$.

Cross-entropy or deviance: $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$. (9.17)

For two classes, if p is the proportion in the second class, these three measures are $1 - \max(p, 1 - p)$, $2p(1 - p)$ and $-p \log p - (1 - p) \log(1 - p)$, respectively. They are shown in Figure 9.3. All three are similar, but cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimization. Comparing (9.13) and (9.15), we see that we need to weight the node impurity measures by the number $N_{m_L}$ and $N_{m_R}$ of observations in the two child nodes created by splitting node m.

In addition, cross-entropy and the Gini index are more sensitive to changes in the node probabilities than the misclassification rate. For example, in a two-class problem with 400 observations in each class (denote this by (400, 400)), suppose one split created nodes (300, 100) and (100, 300), while

• If p is the proportion of samples of the other class in node t:

  Misclassification Rate: $i(t) = 1 - \max(p, 1 - p)$

  Gini Index: $i(t) = 2p(1 - p)$

  Cross-Entropy: $i(t) = \frac{-p \log(p) - (1 - p) \log(1 - p)}{2 \log(2)}$
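The three two-class impurity measures are short enough to write out. The function names below are illustrative assumptions; the cross-entropy is divided by 2 log(2) so that, like the other two measures, it equals 0.5 at p = 0.5, matching the scaling stated in the figure caption.

```python
import math


def misclassification(p):
    """Misclassification rate: 1 - max(p, 1 - p)."""
    return 1.0 - max(p, 1.0 - p)


def gini(p):
    """Gini index for two classes: 2p(1 - p)."""
    return 2.0 * p * (1.0 - p)


def scaled_entropy(p):
    """Cross-entropy divided by 2 log(2), so that the maximum is 0.5."""
    if p in (0.0, 1.0):
        return 0.0  # limit of p log(p) as p tends to 0
    entropy = -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)
    return entropy / (2.0 * math.log(2.0))


# All three measures peak at p = 0.5 and vanish for a pure node.
at_half = [f(0.5) for f in (misclassification, gini, scaled_entropy)]
at_zero = [f(0.0) for f in (misclassification, gini, scaled_entropy)]
```

A pure node (p = 0 or p = 1) has zero impurity under every measure, and a perfectly mixed node has the maximum.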

Classification And Regression Trees

By swapping impurity function and leaf model, decision trees can be used to solve classification and regression tasks:

classification:
  • y symbolic, discrete, e.g., Y = {red, blue}
  • $\hat{y} = \arg\max_{c \in Y} p(c \mid t)$, i.e. the majority class in node t
  • i(t) = entropy(t) or i(t) = gini(t)

regression:
  • y numeric, continuous
  • $\hat{y} = \mathrm{mean}(y \mid t)$, i.e. the point average in node t
  • $i(t) = \frac{1}{n_t} \sum_{(x, y) \in L_t} (y - \hat{y}_t)^2$, i.e. the mean squared error

A Simple Regression Tree

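Data from Part I: < x, y > continuous variables, n = 20 points

[Figure: scatter plot of the data (x axis from 0 to 500, y axis from 10 to 40) next to the fitted regression tree; each node shows its predicted value and the number of points n]

x < 418? (19.5, n=20)
  yes: x >= 154? (14.8, n=14)
    yes: 11.7, n=9
    no: 20.5, n=5
  no: x < 460? (30.5, n=6)
    yes: 24.4, n=3
    no: 36.5, n=3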

Model Selection on tree parameters

[Figure: 2×2 grid of panels labelled depth=1 to depth=4, each showing the same 20 data points; x axis from 0 to 500, y axis from 10 to 40]

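Stopping condition: e.g., max depth or min samples

[Figure: the regression trees obtained for depth=1 to depth=4; each node shows its predicted value and the number of points n]

depth=1

x < 418? (19.5, n=20)
  yes: 14.8, n=14
  no: 30.5, n=6

depth=2

x < 418? (19.5, n=20)
  yes: x >= 154? (14.8, n=14)
    yes: 11.7, n=9
    no: 20.5, n=5
  no: x < 460? (30.5, n=6)
    yes: 24.4, n=3
    no: 36.5, n=3

depth=3

x < 418? (19.5, n=20)
  yes: x >= 154? (14.8, n=14)
    yes: x < 366? (11.7, n=9)
      yes: 10.6, n=8
      no: 20.5, n=1
    no: x >= 37.1? (20.5, n=5)
      yes: 18.4, n=3
      no: 23.7, n=2
  no: x < 460? (30.5, n=6)
    yes: x >= 444? (24.4, n=3)
      yes: 22.3, n=2
      no: 28.5, n=1
    no: 36.5, n=3

depth=4

x < 418? (19.5, n=20)
  yes: x >= 154? (14.8, n=14)
    yes: x < 366? (11.7, n=9)
      yes: 10.6, n=8
      no: 20.5, n=1
    no: x >= 37.1? (20.5, n=5)
      yes: 18.4, n=3
      no: x < 21.9? (23.7, n=2)
        yes: 20.4, n=1
        no: 27, n=1
  no: x < 460? (30.5, n=6)
    yes: x >= 444? (24.4, n=3)
      yes: 22.3, n=2
      no: 28.5, n=1
    no: x < 474? (36.5, n=3)
      yes: 33.3, n=1
      no: x >= 478? (38.1, n=2)
        yes: 33.7, n=1
        no: 42.6, n=1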

Recall: Underfitting and Overfitting

[Figure: train and test MSE (y axis, ticks from 10 to 30) against tree maximum depth (x axis, 1 to 4)]

• Overly complex trees are likely to overfit the training data:
  • to avoid this, tune the stopping criteria (or post-hoc prune)
  • cross-validation can be used for model selection

Recall: Bias and Variance

[Figure: 2×2 grid of dot clusters illustrating the combinations of Low Bias / High Bias and Low Variance / High Variance]

Bias and Variance of a Single Tree

[Figure: three columns for depth = 1, depth = 3 and depth = 5, each with a plot of y against x and a plot of error(x), bias²(x), var(x) and noise(x) against x]

• Decision trees have, in general, low bias but high variance:
  • to reduce variance, combine the predictions of several trees!

Bootstrapping and Aggregating: Bagging
general-purpose procedure to construct an ensemble of (same model type) estimators

[Figure: Training Set → Bootstrapped Set 1, 2, ..., M → Model 1, 2, ..., M → Aggregated Model]

• Training the ensemble:
  1. by sampling with replacement, build bootstrapped training sets
  2. on each bootstrapped set, fit a separate learning model

• Testing the ensemble:
  1. submit test data to all models and aggregate their predictions:
     • by majority voting for classification tasks
     • by averaging for regression tasks

• Aggregation reduces variance if models are not correlated:
  • to de-correlate tree models, randomise tree construction!
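The training and testing steps above can be sketched generically, with the base learner passed in as a pair of functions. All names are hypothetical, and the mean-predicting base learner is a stand-in chosen to keep the example short; in the lecture's setting the base learner would be a decision tree.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) observations from data with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def fit_bagging(data, fit, n_models=10, seed=0):
    """Fit one model per bootstrapped training set."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(n_models)]


def predict_regression(models, predict, x):
    """Aggregate by averaging the individual predictions."""
    preds = [predict(model, x) for model in models]
    return sum(preds) / len(preds)


def predict_classification(models, predict, x):
    """Aggregate by majority voting over the individual predictions."""
    votes = Counter(predict(model, x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical base learner: predicts the mean target of its sample.
def fit_mean(sample):
    return sum(y for _, y in sample) / len(sample)


def predict_mean(model, x):
    return model


data = [(float(x), 4.0) for x in range(10)]
models = fit_bagging(data, fit_mean, n_models=5)
bagged = predict_regression(models, predict_mean, 3.0)
```

Because each model sees a different resample, their errors differ, and averaging them is what reduces the variance of the aggregated prediction.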

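Random Forests
slide from G. Louppe

[Figure: an input $x$ is passed to the trees $\varphi_1, \dots, \varphi_M$; their estimates $p_{\varphi_m}(Y = c \mid X = x)$ are combined (∑) into the ensemble estimate $p_{\psi}(Y = c \mid X = x)$]

Randomisation
• Bootstrap samples (Random Forests)
• Random selection of K ≤ p split variables (Random Forests and Extra-Trees)
• Random selection of the threshold (Extra-Trees)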
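The two node-level randomisations listed above are easy to illustrate in isolation. The sketch below shows only these two steps, not a full forest. The function names are hypothetical, and drawing the threshold uniformly between the smallest and largest feature value in the node is how Extra-Trees is usually described, stated here as an assumption.

```python
import random


def candidate_features(p, k, rng):
    """Pick K <= p distinct feature indices to consider at one node."""
    return sorted(rng.sample(range(p), k))


def random_threshold(values, rng):
    """Draw a split threshold uniformly within the range of the values."""
    return rng.uniform(min(values), max(values))


rng = random.Random(0)
features = candidate_features(p=5, k=2, rng=rng)
threshold = random_threshold([0.1, 0.4, 0.9], rng)
```

Smaller K makes the trees less correlated with one another at the cost of weaker individual splits, which is the tuneable randomisation the next slide refers to.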

Strengths and Weaknesses

• Decision Trees (single):
  + flexible: diverse tasks, heterogeneous features
  + very fast to train and to use
  + easy to visualise and to interpret
  + low bias
  - high variance
  - require tuning (or pruning)
  - not very accurate

• Random Forests (ensemble):
  + as flexible as decision trees
  + reasonably fast, embarrassingly parallel
  + little tuning required (bushy trees are fine!)
  + tuneable randomisation for fine bias/variance control
  + usually very accurate
  - not so easy to interpret

Machine Learning Algorithms: where to start?
map by A. Mueller, scikit-learn

References

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. Book in preparation for MIT Press.

Hastie, T. J., Tibshirani, R. J., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edition.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.

Louppe, G. (2014). Understanding Random Forests: From Theory to Practice. PhD thesis, Université de Liège, Liège, Belgique.
