
Presentation from UKSAF 2014 covering the validation of multivariate models.


Page 1: How to validate your model

"IT'S GOOD, BUT IT'S NOT RIGHT!..." HOW TO VALIDATE YOUR MODEL

ALEX HENDERSON [email protected]

University of Manchester & SurfaceSpectra

Page 2: How to validate your model

SurfaceSpectra (XPS)

XPS of Polymers Database

Surface Analysis by Auger and X-Ray Photoelectron Spectroscopy

surfacespectra.com

Page 3: How to validate your model

SurfaceSpectra (SIMS)

Static SIMS Library

TOF-SIMS: Materials Analysis by Mass Spectrometry

surfacespectra.com

Page 4: How to validate your model

Multivariate analysis of SIMS spectra

Content here is taken from my chapter “Multivariate analysis of SIMS spectra” in “TOF-SIMS: Materials Analysis by Mass Spectrometry”, Eds. John C. Vickerman and David Briggs, Second edition, 2013, SurfaceSpectra & IM Publications. ISBN: 978-1-906715-17-5

Book and individual electronic chapters available from: http://impublications.com/tof-sims/

See also: http://surfacespectra.com/books/tof-sims

Page 5: How to validate your model

Availability

Code: MATLAB (requires Statistics Toolbox for gscatter & mahal)

Data: SIMS spectra from samples of bacteria

J.S. Fletcher, A. Henderson, R.M. Jarvis, N.P. Lockyer, J.C. Vickerman and R. Goodacre, Applied Surface Science 252 (2006) 6869, DOI: 10.1016/j.apsusc.2006.02.153

S. Vaidyanathan, J.S. Fletcher, R.M. Jarvis, A. Henderson, N.P. Lockyer, R. Goodacre and J.C. Vickerman, Analyst 134 (2009) 2352, DOI: 10.1039/b907570d

Both code and data will be made available shortly on one or more online platforms

Check http://manchester.ac.uk/sarc for information

Page 6: How to validate your model

Uses of validation

Right/wrong answer

Luck

Overfitting

Mistakes

Outliers

Usefulness of result

Sensitivity

Specificity

Page 7: How to validate your model

Uses of validation we will cover

Right/wrong answer

Luck: Sampling (cross-validation, bootstrap)

Overfitting: PRESS and RSS tests

Mistakes: Visualisation

Outliers: Robust methods (LIBRA)

Usefulness of result

Sensitivity: Distance metrics

Specificity: Confusion matrices

LIBRA: Chemometrics and Intelligent Laboratory Systems 75 (2005) 127. http://wis.kuleuven.be/stat/robust/LIBRA.html

Page 8: How to validate your model

Example data

Bacterial samples related to urinary tract infection

5 bacterial species; 2 or 3 strains of each

Citrobacter freundii coded Cf (14 spectra)

Escherichia coli (E. coli) coded Ec (32 spectra)

Enterococcus spp. coded En (33 spectra)

Klebsiella pneumoniae coded Kp (15 spectra)

Proteus mirabilis coded Pm (21 spectra)

Each species/strain grown 3 times

(biological replicates)

Applied Surface Science 252 (2006) 6869-6874

Page 9: How to validate your model

Example data

Positive ion ToF-SIMS data

Samples analysed in random sequence over 3 days

Each sample analysed 3 times (technical replicates)

115 spectra in total

Mass range: 1 – 800 u

Binned to nominal mass by summing channels (±0.5 u)

Square root of data taken

Each spectrum normalised to unity

Applied Surface Science 252 (2006) 6869-6874
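
A minimal MATLAB sketch of these pre-processing steps, assuming `mz` holds the m/z axis and `counts` holds one raw spectrum per row (both names are hypothetical); the binning here is a simple channel summation to nominal mass, not necessarily the exact routine used for the published data.

```matlab
% mz     - 1 x P vector of m/z values for the spectral channels (hypothetical name)
% counts - N x P matrix of raw intensities, one spectrum per row (hypothetical name)

nominal = round(mz);                 % nominal mass of each channel (i.e. +/- 0.5 u)
bins    = 1:800;                     % mass range 1 - 800 u

binned = zeros(size(counts,1), numel(bins));
for b = bins
    binned(:,b) = sum(counts(:, nominal == b), 2);   % sum channels into each nominal mass bin
end

binned = sqrt(binned);                               % square root of the data
binned = bsxfun(@rdivide, binned, sum(binned, 2));   % normalise each spectrum to unit total
```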

Page 10: How to validate your model

PCA results

Not too useful without context…

[Scatter plot: principal component 1 (55.2%) vs principal component 2 (13.2%); points unlabelled]

Page 11: How to validate your model

PCA results

A priori knowledge indicates some correlation between principal components and the bacterial classes

Labelled as species

[Scatter plot: principal component 1 (55.2%) vs principal component 2 (13.2%); points labelled by species: Cf, Ec, En, Kp, Pm]

Page 12: How to validate your model

PCA results

A priori knowledge indicates some correlation between principal components and the bacterial classes

Labelled as strains

[Scatter plot: principal component 1 (55.2%) vs principal component 2 (13.2%); points labelled by strain: Cf102, Cf109, Ec013, Ec017, Ec041, Ec007, EnC82, EnC85, EnC90, EnC93, Kp052, Kp059, Pm065, Pm070, Pm073]

Page 13: How to validate your model

Canonical Variates Analysis (CVA)

Also known as Discriminant Function Analysis (DFA)

Often used in conjunction with PCA (PC-DFA)

Blends principal components to better match a priori classes

PCA identifies unique characteristics of the data set as a whole

CVA identifies proportions of these characteristics that best match known classes of samples

Need to define how many PCs to use – see later…
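
As a rough illustration, one way to run PC-DFA with the MATLAB Statistics Toolbox is `pca` for the principal components followed by `manova1`, whose canonical variable scores (`stats.canon`) correspond to the canonical variates. The variable names and the choice of 9 PCs are illustrative only, not the exact code behind these slides.

```matlab
% X      - N x P matrix of pre-processed spectra (hypothetical name)
% labels - N x 1 cell array of class labels, e.g. species codes (hypothetical name)

nPCs = 9;                                             % number of PCs to keep (see later slides)
[~, scores] = pca(X, 'NumComponents', nPCs);          % mean-centred PCA scores

[~, ~, stats] = manova1(scores, labels);              % CVA/DFA on the retained PC scores
cv = stats.canon;                                     % canonical variate scores

gscatter(cv(:,1), cv(:,2), labels);                   % CV1 vs CV2, coloured by class
xlabel('canonical variate 1'); ylabel('canonical variate 2');
```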

Page 14: How to validate your model

The Great UKSAF Bake Off!

David Scurr did some baking(!!), what did he make?

Page 15: How to validate your model

The Great UKSAF Bake Off!

Approach…

Go to a real baker and collect some items of different types: bread, scones, cakes, etc.

Analyse them by SIMS, XPS or Raman [insert your favourite technique here] and do multivariate analysis on the data

Pre-process

PCA

CVA

Page 16: How to validate your model

The Great UKSAF Bake Off!

PCA gives unique ingredients (characteristics):

Flour

Butter

Yeast

Eggs

Sugar

Salt

Etc

Note: it did not identify fatty acids, amino acids or other discrete chemicals

Page 17: How to validate your model

The Great UKSAF Bake Off!

CVA gives proportions of the PCA results (ingredients) that match the bakery items, which helps to identify the recipe

Page 18: How to validate your model

The Great UKSAF Bake Off!

CVA gives proportions of the PCA results (ingredients) that match the bakery items, which helps to identify the recipe

Now analyse David’s offering in the same manner

Should identify the proportions of the various

ingredients (PCs)

Page 19: How to validate your model

The Great UKSAF Bake Off!

CVA gives proportions of the PCA results (ingredients) that match the bakery items, which helps to identify the recipe

Now analyse David’s offering in the same manner

Should identify the proportions of the various ingredients (PCs)

Given one of David’s (unrecognisable?) baked items, CVA could predict what it was supposed to be

(Well we can only hope!)

Page 20: How to validate your model

Types of variance

Reduce within-class variance (W)

Increase between-class variance (B)

Page 21: How to validate your model

Fisher’s ratio

Wish to maximise the ratio of the between-class

variance to the within-class variance

Total variance = Between-class variance + Within-class variance

Between = Total − Within

Between / Within = (Total − Within) / Within

Between / Within = Total / Within − 1

Fisher's ratio = Between / Within = Total / Within − 1
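
A small sketch of this calculation for a single variable, assuming `x` is a column of scores and `labels` a matching cell array of class names (both hypothetical).

```matlab
grandMean = mean(x);
classes   = unique(labels);

within  = 0;
between = 0;
for k = 1:numel(classes)
    idx     = strcmp(labels, classes{k});                        % members of this class
    within  = within  + sum((x(idx) - mean(x(idx))).^2);         % within-class sum of squares
    between = between + sum(idx) * (mean(x(idx)) - grandMean)^2; % between-class contribution
end

total       = sum((x - grandMean).^2);   % total = between + within
fisherRatio = between / within;          % equivalently total/within - 1
```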

Page 22: How to validate your model

CVA results

Outcome of CVA using 9 principal components. Class separation is better defined.

[Scatter plot: canonical variate 1 (eig = 12.6) vs canonical variate 2 (eig = 5.71); points labelled by species: Cf, Ec, En, Kp, Pm]

Page 23: How to validate your model

PCA versus CVA

[Side-by-side scatter plots. PCA: principal component 1 (55.2%) vs principal component 2 (13.2%). CVA: canonical variate 1 (eig = 12.6) vs canonical variate 2 (eig = 5.71). Both labelled by species: Cf, Ec, En, Kp, Pm]

Page 24: How to validate your model

Data projection

Predict classification of an unknown sample

Unseen by the model – not used to train the model

Pre-treat in the same manner as the training data

Remove the mean of the TRAINING data, not the unknown or test data

Need to have the same origin for matrix rotation

Rotate the unknown data by the same amount as the training data

See where the unknown samples turn up
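
A short MATLAB sketch of this projection step, assuming `Xtrain` and `Xtest` are pre-processed spectra (hypothetical names). The key point is that the test data are centred with the training mean and rotated by the training loadings.

```matlab
% Xtrain - training spectra, rows = spectra (hypothetical name)
% Xtest  - unseen spectra, pre-processed in exactly the same way (hypothetical name)

[loadings, scoresTrain, ~, ~, ~, mu] = pca(Xtrain);   % mu is the mean of the TRAINING data

Xcentred   = bsxfun(@minus, Xtest, mu);               % remove the training mean, not the test mean
scoresTest = Xcentred * loadings;                     % rotate by the same amount as the training data
```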

Page 25: How to validate your model

CVA outcome with projected data

Outcome of CVA using 9 principal components. Model trained with bootstrap sampling (empty circles) and test data projected in (filled circles).

[Scatter plot: canonical variate 1 (eig = 15.2) vs canonical variate 2 (eig = 4.64); training data as empty circles, projected test data as filled circles; labelled by species: Cf, Ec, En, Kp, Pm]

Page 26: How to validate your model

How close is close?

Problems:

Can only visualise up to 3 dimensions

Something that looks close in 2D may not be close in

another dimension

Spread of data may not be (hyper)spherical

Need an N-D measurement system

Page 27: How to validate your model

Simple solution – Euclidean distance

Use Euclidean distance metric

Measure N-dimensional distance of projected data

from each group centroid (essentially N-dimensional

trigonometry)

Smallest distance gives assigned group

Problems with non-(hyper)spherical group distributions

Page 28: How to validate your model

Better solution – Mahalanobis distance

Developed by Prasanta Chandra Mahalanobis in India in 1936

Takes into account the spread of the data when calculating the distance
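
The two metrics can be compared with a sketch like the following, assuming `scoresTrain`/`scoresTest` are training and projected test scores and `labels` the training classes (hypothetical names); `mahal` (Statistics Toolbox) returns the squared Mahalanobis distance and needs more group members than dimensions.

```matlab
classes = unique(labels);
nTest   = size(scoresTest, 1);
dEuc    = zeros(nTest, numel(classes));
dMah    = zeros(nTest, numel(classes));

for k = 1:numel(classes)
    members   = scoresTrain(strcmp(labels, classes{k}), :);   % training spectra in this class
    centroid  = mean(members, 1);
    diffs     = bsxfun(@minus, scoresTest, centroid);
    dEuc(:,k) = sqrt(sum(diffs.^2, 2));                       % Euclidean distance to the centroid
    dMah(:,k) = mahal(scoresTest, members);                   % squared Mahalanobis distance to the class
end

[~, nearest] = min(dMah, [], 2);                              % smallest distance gives assigned group
predicted    = classes(nearest);
```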

Page 29: How to validate your model

Mahalanobis versus Euclidean

All 4 stars have the same Euclidean distance from the group centroid (green circle). Blue stars have a smaller Mahalanobis distance than the red stars.

Page 30: How to validate your model

Testing the model (Holdout)

Randomly split data into training set and test set (~2:1)

Use the training set to develop the model

Pre-process, then PCA, then CVA

Project the test data into the model

Pre-process (using the mean of the training set), then rotate

Measure distance of each test point from each

group; smallest distance is predicted class

Count how many correct answers we get
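
A condensed sketch of the holdout protocol. For brevity it classifies in PCA space by nearest Mahalanobis distance rather than reproducing the full PCA then CVA pipeline; the split fraction, variable names and 9 PCs are assumptions.

```matlab
part = cvpartition(labels, 'HoldOut', 1/3);           % stratified ~2:1 split (Statistics Toolbox)
Xtr  = X(training(part), :);   ytr = labels(training(part));
Xte  = X(test(part), :);       yte = labels(test(part));

nPCs = 9;
[loadings, scoresTr, ~, ~, ~, mu] = pca(Xtr, 'NumComponents', nPCs);
scoresTe = bsxfun(@minus, Xte, mu) * loadings;        % project the test data into the model

classes = unique(ytr);
d = zeros(size(scoresTe,1), numel(classes));
for k = 1:numel(classes)
    d(:,k) = mahal(scoresTe, scoresTr(strcmp(ytr, classes{k}), :));  % distance to each group
end
[~, nearest] = min(d, [], 2);                         % smallest distance is the predicted class
predicted    = classes(nearest);

pctCorrect = 100 * mean(strcmp(predicted, yte));      % count how many correct answers we get
```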

Page 31: How to validate your model

Caution

We are assigning the test sample to the nearest grouping; it could be very far away from that grouping

Could/should put limits on distance

Page 32: How to validate your model

Contingency table

Method of displaying results from a 2-class test

Used to assess how well the model performs

Relates to test samples not training samples

Need to define a perspective

For example: “We wish to predict class A”

Page 33: How to validate your model

Contingency table

Define a perspective, e.g. “From the point of view that I want to predict Class A…”

                     Truly Class A            Truly Class B (not Class A)   Totals
Predicted Class A    TRUE POSITIVE (TP)       FALSE POSITIVE (FP)           predicted as Class A (TP+FP)
Predicted Class B    FALSE NEGATIVE (FN)      TRUE NEGATIVE (TN)            predicted as Class B (FN+TN)
Totals               truly Class A (TP+FN)    truly Class B (FP+TN)         total number of samples

TP = correctly predicted as Class A; FP = wrongly predicted as Class A
FN = wrongly predicted as not Class A; TN = correctly predicted as not Class A

Page 34: How to validate your model

Sensitivity and specificity

Sensitivity: the proportion of things we were looking for that were found

sensitivity = TP / (TP + FN)

How good the model is at getting things right

Specificity: the proportion of things we were not looking for that were correctly rejected

specificity = TN / (FP + TN)

How good the model is at making sure we don't get things wrong

Requires a perspective

Page 35: How to validate your model

Contingency table example

sensitivity(A) = TP / (TP + FN) = 12 / (12 + 2) = 86%

specificity(A) = TN / (FP + TN) = 7 / (4 + 7) = 64%

                     True Group A   True Group B   Totals
Predicted Group A    12 (TP)         4 (FP)        16
Predicted Group B     2 (FN)         7 (TN)         9
Totals               14 (TP+FN)     11 (FP+TN)     25

Page 36: How to validate your model

Confusion matrix

Contingency table extended to more than 2

classes

Relates to test samples not training samples

Cannot simply use sensitivity and specificity

Use ‘percentage correctly classified’ instead (%CC)

Page 37: How to validate your model

Confusion matrix example

Overall %CC = (3+11+12+2+4)/44 = 73%

           True Cf   True Ec   True En   True Kp   True Pm
Pred. Cf      3         1         0         0         3
Pred. Ec      1        11         0         4         1
Pred. En      0         0        12         2         0
Pred. Kp      0         0         0         2         0
Pred. Pm      0         0         0         0         4
Total         4        12        12         8         8
%CC          75%       92%      100%       25%       50%
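
With the Statistics Toolbox, `confusionmat` builds this kind of table directly; a sketch, assuming `yTrue` and `yPred` hold the true and predicted labels for the test spectra (hypothetical names). Note that `confusionmat` puts the true classes on the rows, whereas the table above has predictions on the rows.

```matlab
[C, order] = confusionmat(yTrue, yPred);        % rows = true class, columns = predicted class

perClassCC = 100 * diag(C) ./ sum(C, 2);        % %CC for each true class
overallCC  = 100 * sum(diag(C)) / sum(C(:));    % overall %CC
```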

Page 38: How to validate your model

Is it right or were we lucky?

Randomly selected training and test data only once

Get a single answer with no feel for its likelihood

Randomly selecting a second time gives a different answer

Which one is right?

How many times do we repeat?

Page 39: How to validate your model

Cross-validation (k-fold)

Decide how many ‘folds’, k, to make (arbitrary)

Randomly allocate all data into k groupings

Define first grouping as a test set

Pool all other groupings as a training set

Perform analysis (PCA, CVA, confusion matrix)

Repeat, but use second group as test set and pool all other groups as training set

Repeat until each grouping has been a test set

Produces k outcomes, giving a distribution of results
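
A sketch of stratified k-fold cross-validation with `cvpartition`; `trainAndTest` is a hypothetical helper that wraps the PCA, CVA, projection and confusion-matrix steps shown earlier and returns the overall %CC for one fold.

```matlab
k    = 10;
part = cvpartition(labels, 'KFold', k);         % stratified by class when given the labels

ccPerFold = zeros(k, 1);
for fold = 1:k
    tr = training(part, fold);                  % pooled training groups
    te = test(part, fold);                      % held-out fold as the test set
    % trainAndTest is a hypothetical helper wrapping PCA -> CVA -> projection -> %CC
    ccPerFold(fold) = trainAndTest(X(tr,:), labels(tr), X(te,:), labels(te));
end

fprintf('%%CC: mean %.1f, std %.1f over %d folds\n', mean(ccPerFold), std(ccPerFold), k);
% Leave-one-out is the same idea with cvpartition(numel(labels), 'LeaveOut')
```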

Page 40: How to validate your model

Stratification

If training sets are chosen randomly, some classes may be missed out entirely

Solution is to randomly select the same proportion

from each class

Ensures all classes are represented in training set

and therefore the model

Page 41: How to validate your model

Cross-validation (leave-one-out)

Extreme case of k-fold cross validation

If we have N spectra, let k=N

Each spectrum is treated as a test set (single entry)

Model is trained using N-1 spectra each time

Produces N outcomes

Page 42: How to validate your model

Distribution of outcomes

k-fold CV produces k confusion matrices

E.g. each class will have k %CC values

5 classes and 10-fold CV gives 50 %CC values

Treat each class as a distribution and determine the mean, standard deviation, etc.

E.g. 10-fold CV produces a mean (of 10 values) and standard deviation of the percentage correctly classified for class A, rather than a single value

Page 43: How to validate your model

Bootstrap

Introduced by Bradley Efron in the 1970s

Sampling with replacement

I also wish to thank the many friends who suggested names more colorful than Bootstrap, including Swiss Army Knife, Meat Axe, Swan-Dive, Jack-Rabbit, and my personal favorite, the Shotgun, which, to paraphrase Tukey, “can blow the head off any problem if the statistician can stand the resulting mess.” – Bradley Efron (1977)

The Annals of Statistics 7 (1979) 1–26

Page 44: How to validate your model

Population versus sample

Population is all possible spectra from all bacteria

(in the world)

Sample is our data, a much smaller collection

How can we tell if our collection is representative?

Bootstrap attempts to assess our model

Page 45: How to validate your model

Bootstrap

‘Sampling with replacement’ means each spectrum has an equal probability of being selected at every draw, so the same spectrum can be picked more than once

If our data is anything like the true population (all

possible spectra) we would expect some spectra to

be very similar

Need to repeat many times to get suitable

distribution

Not great for small sample sets

Page 46: How to validate your model

Bootstrap

Say we have N spectra

We randomly select one of these, record its identity on a list and replace it

Repeat this N times

Our list is now N spectra long

Some spectra are repeated and some are not present (on average ~63.2% unique)

Training set is our list (including repeats)

Test set is the data not in the list (those never selected)

Perform analysis (PCA, CVA, confusion matrix)

Repeat many, many times (>50; perhaps >1000)
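
A sketch of the bootstrap loop, reusing the hypothetical `trainAndTest` helper from the k-fold sketch; `randsample` (Statistics Toolbox) does the sampling with replacement.

```matlab
N      = size(X, 1);                            % number of spectra
nReps  = 1000;                                  % repeat many, many times
ccBoot = zeros(nReps, 1);

for r = 1:nReps
    trainIdx = randsample(N, N, true);          % select N spectra WITH replacement
    testIdx  = setdiff(1:N, trainIdx);          % spectra never selected form the test set

    ccBoot(r) = trainAndTest(X(trainIdx,:), labels(trainIdx), ...
                             X(testIdx,:),  labels(testIdx));
end

fprintf('bootstrap %%CC: mean %.1f, std %.1f\n', mean(ccBoot), std(ccBoot));
```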

Page 47: How to validate your model

Sampling comparison

Holdout
Pros: simple to implement; computationally light
Cons: single answer, which may be inaccurate; should be stratified

k-fold CV
Pros: computationally light; useful for large datasets
Cons: small number of answers, so difficult to determine the distribution; should be stratified

LOOCV
Pros: relatively simple to implement; no need to stratify; good distribution of answers
Cons: somewhat biased toward the ‘best’ answer; computationally heavy

Bootstrap
Pros: little/no bias; good distribution of answers
Cons: computationally heavy; requires large datasets; test set varies in size

Page 48: How to validate your model

Really confused?!

Each test produces a confusion matrix

Holdout only gives a single confusion matrix

k-fold CV gives k matrices

LOOCV of N spectra gives N confusion matrices

Bootstrap could produce >1000

Need to assess the results as a distribution

For example, class A may be correctly classified with an average (mean) of p and a spread (standard deviation) of q

Repeating the analysis gives a better understanding of the situation

Page 49: How to validate your model

How many PCs do we use?

Complicated!

Malinowski compared 15 methods in 3 categories:

Empirical, statistical and pseudo-statistical

He didn’t like any of them!

Three common approaches are

Scree plot

95% cumulative explained variance

PRESS test

J. Chemometrics 23 (2009) 1–6

Page 50: How to validate your model

Cattell scree plot

Plot the percentage of variance explained by each PC. Stop when the curve levels out.

Rather subjective and difficult to determine

Multivariate Behavioral Research 1 (1966) 245 and Multivariate Behavioral Research 12 (1977) 289

Page 51: How to validate your model

Cattell scree plot

Plot the percentage of variance explained by each PC. Stop when the curve levels out.

Rather subjective and difficult to determine

Multivariate Behavioral Research 1 (1966) 245 and Multivariate Behavioral Research 12 (1977) 289

Page 52: How to validate your model

95% cumulative explained variance

Plot the accumulated variance explained by each PC. Stop when greater than 95%.

Page 53: How to validate your model

95% cumulative explained variance

Plot the accumulated variance explained by each PC. Stop when greater than 95%.
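
Both the scree plot and the cumulative curve fall straight out of the `explained` output of MATLAB's `pca`; a sketch, assuming `X` holds the pre-processed spectra.

```matlab
[~, ~, ~, ~, explained] = pca(X);               % percentage of variance explained by each PC

figure;
subplot(1,2,1);
plot(explained, 'o-');                          % Cattell scree plot
xlabel('principal component'); ylabel('% variance explained');

subplot(1,2,2);
plot(cumsum(explained), 'o-');                  % cumulative explained variance
xlabel('principal component'); ylabel('cumulative % explained');

nPCs = find(cumsum(explained) > 95, 1);         % first number of PCs exceeding 95%
```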

Page 54: How to validate your model

Residual Sum of Squares (RSS)

PCA is a specific rotation of the data matrix

Possible to rotate back to exactly recover the original

Usually only want to keep the PCs that correspond to real data and discard the noise

Using only the informative PCs we should be able to reconstruct the original data well

Subtract reconstructed data from original data and calculate the error

Iteratively increase number of PCs used to reconstruct data until slope of error changes
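
A sketch of the RSS curve, reconstructing the data from an increasing number of PCs; `X` and the cap of 20 PCs are assumptions.

```matlab
[loadings, scores, ~, ~, ~, mu] = pca(X);
maxPCs = 20;
rss    = zeros(maxPCs, 1);

for k = 1:maxPCs
    Xhat   = bsxfun(@plus, scores(:,1:k) * loadings(:,1:k)', mu);  % rebuild data from the first k PCs
    rss(k) = sum((X(:) - Xhat(:)).^2);                             % residual sum of squares
end

plot(1:maxPCs, rss, 'o-');                      % look for the change in slope
xlabel('number of PCs'); ylabel('RSS');
```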

Page 55: How to validate your model

Predicted Residual Error Sum of Squares (PRESS)

RSS predicts the original data using a number of PCs

PRESS uses LOOCV to give a better representation

Start with 1 PC

Reconstruct the data and compare to original

Check with LOOCV error value

Increase number of PCs and repeat

Stop when the slope changes; this gives the requisite number of PCs

Page 56: How to validate your model

PRESS/RSS

RSS interpretation involves determination of a slope

change

PRESS interpretation also involves a slope change

Brereton suggests using the PRESS/RSS ratio

The requisite number of PCs is reached when the ratio > 1

R.G. Brereton, Chemometrics for Pattern Recognition, John Wiley & Sons Ltd (2009)

Page 57: How to validate your model

Predicted Residual Error Sum of Squares (PRESS)

Start with 1 PC and slowly increase. At each step perform LOOCV. Determine when the value comparing successive steps exceeds 1

(This is actually the ratio of PRESS(n+1) to RSS(n))

R.G. Brereton, Chemometrics for Pattern Recognition, John Wiley & Sons Ltd (2009)

Page 58: How to validate your model

Overfitting

The more principal components used, the better the fit. DANGER!

Just because you can doesn’t mean you should

Use the minimum number of PCs to generate your

model

Better to err on the side of caution

Page 59: How to validate your model

Overfitting

[CVA score plots using 9 PCs, 20 PCs and 50 PCs]

See how the groups tighten and separate with an increased number of PCs

"It's good, but it's not right!..."

Page 60: How to validate your model

Any the wiser?

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

John W. Tukey

Page 61: How to validate your model

Any the wiser?

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

John W. Tukey

There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.

Donald Rumsfeld