Applied Machine Learning

Annalisa Marsico
OWL RNA Bioinformatics group
Max Planck Institute for Molecular Genetics
Free University of Berlin

22 April, SoSe 2015

Goals

• Feature selection rather than feature reduction: regularized linear models

• From regression to classification
– Logistic regression
– Regularization; partial least squares also possible

• How to reduce overfitting

• How to evaluate a classification model
– Class imbalance

The Variance-Bias Tradeoff

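The mean squared error

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

can be decomposed:

$$E[\mathrm{MSE}] = \sigma^2 + (\text{Model bias})^2 + \text{Model variance}$$

• $\sigma^2$: irreducible noise
• $(\text{Model bias})^2$: reflects how close the functional form of the model is to the real input-output relationship
• Model variance: reflects how well the model generalizes

[Figure: two example fits, one with low bias / high variance and one with low variance / high bias]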

Complex models can have high variance. Collinearity also gives rise to high-variance models, so we can try to reduce the model's variance as a way to mitigate the effects of collinearity. By reducing the variance, however, we increase the bias of the model.

The Variance-Bias Tradeoff

$$E[\mathrm{MSE}] = \sigma^2 + (\text{Model bias})^2 + \text{Model variance}$$

N.B. Ordinary linear regression produces unbiased coefficients

Ridge Regression

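$$\mathrm{SSE} = \sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

Controlling or regularizing the parameter estimates can be done by adding a penalty to the SSE:

$$\mathrm{SSE}_{L_2} = \sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{P}\beta_j^2$$

The second term is the L2-norm penalty; $\lambda$ controls the amount of shrinkage.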

Path of the regression coefficients for different values of λ.
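The shrinkage effect can be illustrated with a minimal sketch. It assumes a single predictor and no intercept (a simplification that is not in the slides), for which the penalized SSE has the closed-form minimizer β = Σxy / (Σx² + λ). The data values and the function name `ridge_coefficient` are made up for illustration.

```python
def ridge_coefficient(x, y, lam):
    """Minimise sum((y_i - beta*x_i)^2) + lam * beta^2 for one predictor, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Setting the derivative to zero gives beta = sxy / (sxx + lam)
    return sxy / (sxx + lam)

# Hypothetical toy data, roughly y = 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# The coefficient shrinks towards zero as lambda grows, but never reaches it
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_coefficient(x, y, lam), 3))
```

With λ = 0 the function returns the ordinary least-squares coefficient; larger values of λ trade a biased estimate for lower variance.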

Ridge Regression: How to choose the penalty

What's happening in this region?

Lasso (Least Absolute Shrinkage and Selection Operator) regression

$$\mathrm{SSE}_{L_1} = \sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{P}\left|\beta_j\right|$$

It seems like a small modification, but the practical implications are significant. While regression coefficients are still shrunk towards zero, by penalizing the absolute value some parameters are actually set to zero for some values of λ.
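Why an absolute-value penalty produces exact zeros can be seen in a minimal sketch. It again assumes a single predictor with no intercept (an assumption made here, not in the slides); minimizing Σ(y − βx)² + λ|β| then gives β = S(Σxy, λ/2) / Σx², where S is the soft-thresholding function. The data and function names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t, and return exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_coefficient(x, y, lam):
    """Minimise sum((y_i - beta*x_i)^2) + lam * |beta| for one predictor, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2.0) / sxx

# Hypothetical toy data, roughly y = 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 10.0, 60.0, 120.0):
    print(lam, round(lasso_coefficient(x, y, lam), 3))
```

For a large enough λ the coefficient is exactly 0.0, which is what makes the Lasso usable for feature selection.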

Questions

• Is PLSR a feature selection method?
• Is Ridge Regression a feature selection method?
• Is Lasso a feature selection method?

Elastic net

Generalization of the Lasso model. It combines two types of penalty.

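$$\mathrm{SSE}_{Enet} = \sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 + \lambda_1\sum_{j=1}^{P}\beta_j^2 + \lambda_2\sum_{j=1}^{P}\left|\beta_j\right|$$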

Advantage: it enables regularization via the ridge-type penalty and feature selection via the Lasso-type penalty. Zou and Hastie (2005) suggested that this model deals well with groups of highly correlated predictors.

Comparison between Ridge, Lasso and Elastic Net

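Lasso, subject to the constraint:

$$|\beta_1| + |\beta_2| \le t$$

Ridge, subject to the constraint:

$$\beta_1^2 + \beta_2^2 \le t$$

Elastic net penalty ($\alpha = 0.2$):

$$\sum_{j}\left(\alpha\beta_j^2 + (1-\alpha)\left|\beta_j\right|\right)$$

The elliptical contours are those of the residual sum of squares function; their center is the least-squares estimate.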

Linear models for Classification

Classification

• The process of predicting categorical / qualitative responses
– Often we predict the probability of belonging to a certain class / category
– We have a set of observations (x1,y1)...(xn,yn) to train the classifier

Example: predict whether an individual will default on their credit card payment, based on annual income and monthly balance, using a set of 10000 individuals

We want to learn a model that predicts Y (default) from X1 (balance) and X2 (income)

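Classification

• Linear regression is not suitable
– There is no natural way to convert a qualitative response into a quantitative one
– For binary classes we can use dummy variables

$$Y = \begin{cases} 1 & \text{if default} \\ 0 & \text{if not default} \end{cases}$$

Fitted values are converted to the output class G:

$$G = \begin{cases} \text{default} & \text{if } \hat{Y} > 0.5 \\ \text{no\_default} & \text{if } \hat{Y} \le 0.5 \end{cases}$$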

If we try to predict Y with linear regression we might not get a number between 0 and 1

Rather than modeling the response Y directly, we model the probability that Y belongs to a certain class: P(G=1|X) and P(G=0|X)
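The problem with fitting a 0/1 response by linear regression can be made concrete with a small sketch. The balance values and default labels below are invented for illustration (they are not the credit data from the slides), and `fit_line` is a hypothetical helper implementing ordinary least squares for one predictor.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data: monthly balance and dummy-coded default (1 = default)
balance = [0, 500, 1000, 1500, 2000, 2500]
default = [0, 0, 0, 0, 1, 1]

intercept, slope = fit_line(balance, default)

def predict(x):
    return intercept + slope * x

# Fitted values are not confined to [0, 1]
print(round(predict(0), 3), round(predict(3000), 3))
```

The fitted value is negative at a balance of 0 and exceeds 1 at a balance of 3000, so it cannot be read as a probability.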

Classification

Linear decision boundary: $\{x : \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_p x_p = 0.5\}$

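Logistic Regression

How to model the relationship between p(X) = P(Y=1|X) and X?

Logistic function:

$$p(X) = P(Y=1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}, \qquad P(Y=0 \mid X) = 1 - p(X)$$

After a bit of manipulation:

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X} \quad \text{(odds)}$$

$$\log\frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X \quad \text{(log-odds or logit)}$$

The logit is linear in X!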
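The relationship between the logistic function and the logit can be checked numerically. The sketch below uses made-up coefficients (`beta0 = -4.0`, `beta1 = 0.002` are assumptions for illustration only) and shows that applying the logit to p(X) recovers the linear predictor.

```python
import math

def logistic(beta0, beta1, x):
    """p(X) = e^z / (1 + e^z) with z = beta0 + beta1 * x."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to e^z / (1 + e^z)

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients
beta0, beta1 = -4.0, 0.002

for x in (0, 1000, 2000, 3000):
    p = logistic(beta0, beta1, x)
    # The last column equals beta0 + beta1 * x: the logit is linear in x
    print(x, round(p, 4), round(logit(p), 4))
```

The probabilities stay strictly between 0 and 1 for any x, while the log-odds change by the same amount for every unit increase in x.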

Estimating coefficients in logistic regression

Maximum likelihood is used to fit the model and learn the β parameters, i.e. we choose those β that maximize the likelihood function of the data:

$$L(\beta) = \prod_{i:\,y_i=1} p(x_i) \prod_{i:\,y_i=0} \left(1 - p(x_i)\right)$$

where

$$p(x_i) = P(Y = 1 \mid X = x_i, \beta)$$
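One way to carry out this maximization is gradient ascent on the log-likelihood; this is only a sketch of the idea, and statistical software typically uses faster methods such as iteratively reweighted least squares. The data, learning rate and iteration count below are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """log L = sum over y_i=1 of log p(x_i) + sum over y_i=0 of log(1 - p(x_i))."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_logistic(xs, ys, lr=0.1, n_iter=5000):
    """Gradient ascent on the mean log-likelihood for one predictor."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # gradient contribution of sample i
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical, non-separable toy data
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0_hat, b1_hat = fit_logistic(xs, ys)
print(round(b0_hat, 3), round(b1_hat, 3), round(log_likelihood(b0_hat, b1_hat, xs, ys), 3))
```

The classes overlap on purpose: with perfectly separable data the likelihood has no finite maximizer and the coefficients would keep growing.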

Logistic Regression interpretation

What do the coefficients represent?
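One standard reading follows directly from the odds formula for a single predictor X: increasing X by one unit multiplies the odds by a constant factor.

$$\frac{\text{odds}(X+1)}{\text{odds}(X)} = \frac{e^{\beta_0 + \beta_1 (X+1)}}{e^{\beta_0 + \beta_1 X}} = e^{\beta_1}$$

Equivalently, $\beta_1$ is the change in the log-odds per unit increase in X. A positive $\beta_1$ increases p(X) as X grows and a negative $\beta_1$ decreases it, but the change in p(X) itself is not constant: it depends on the current value of X.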

Generalized linear models

Linear model:

$$\hat{Y} = f(X) = X^T\hat\beta$$

$$Y \approx \hat{Y} = f(X)$$

We will always have an error in trying to approximate the real function Y.

Generalized linear model:

$$\hat{Y} = f(X) = g(X^T\hat\beta)$$

g = activation function
In a linear model, g = identity function
In logistic regression, g = logistic function

The RSS criterion can still be used to find f(X)! Only, f(X) is a more complicated function.

Logistic Regression vs Linear Regression

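                 Linear Regression                  Logistic Regression
Output           $\hat{y} \in \mathbb{R}$           $\hat{y} \in [0, 1]$
Model            $y(x) = \beta^T X + \beta_0$       $y(x) = \sigma(\beta^T X + \beta_0)$
Activation       g = identity function              g = sigmoid function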

Regularized Logistic Regression

Classification models can also use a penalty (Ridge, Lasso, etc.) to improve the fit. E.g. in logistic regression we can maximize a constrained (penalized) likelihood function:

$$\log L(\beta) - \lambda\sum_{j=1}^{p}\beta_j^2 \qquad \text{(e.g. Ridge-like penalty)}$$

The glmnet package in R uses a combination of Ridge and Lasso penalty

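$$\log L(\beta) - \lambda\left[(1-\alpha)\,\frac{1}{2}\sum_{j=1}^{p}\beta_j^2 + \alpha\sum_{j=1}^{p}\left|\beta_j\right|\right]$$

α = "mixing proportion" that toggles between the pure Lasso penalty (α = 1) and the pure Ridge penalty (α = 0). λ controls the total amount of penalization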
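The roles of the two parameters can be seen by evaluating the penalty term on its own. The sketch below is written in Python rather than R and only computes the penalty from the formula above; it does not call glmnet, and the coefficient vector is made up.

```python
def elastic_net_penalty(betas, lam, alpha):
    """lam * [ (1 - alpha) * 0.5 * sum(beta_j^2) + alpha * sum(|beta_j|) ]."""
    ridge_part = 0.5 * sum(b * b for b in betas)
    lasso_part = sum(abs(b) for b in betas)
    return lam * ((1.0 - alpha) * ridge_part + alpha * lasso_part)

# Hypothetical coefficient vector
betas = [0.5, -2.0, 0.0]

print(elastic_net_penalty(betas, lam=0.1, alpha=1.0))  # pure Lasso penalty
print(elastic_net_penalty(betas, lam=0.1, alpha=0.0))  # pure Ridge penalty
print(elastic_net_penalty(betas, lam=0.1, alpha=0.5))  # equal mix of both
```

Changing alpha shifts weight between the two penalty types, whereas changing lam scales the whole penalty up or down.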

Regularized Logistic Regression

Example: accuracy for different models with different α and λ parameters

Can Partial Least Squares be extended to Logistic Regression?

• Yes. It will find new variables that simultaneously reduce the dimension and correlate with the response (but Y = 0, 1)

• One tuning parameter: number of components

Over-Fitting and Model Tuning

The problem of Over-Fitting

• Tendency to over-emphasize patterns

• Need to evaluate the model to be confident that it will do well in the future (on new data)

• Problems in the data:
– Data quality
– Limited number of samples


• We want to use the existing data to find the best parameters, which give not only the best accuracy but also the most realistic performance estimate

• Originally: split the data into a training set and a test set.

• Modern approaches:
– Split the data into multiple sets for training, i.e. parameter tuning
– Split the data into one (or more) distinct sets for evaluation purposes


• When the model, in addition to learning general patterns in the data learns the noise

• This kind of model will have poor accuracy when predicting a new sample

Let's consider the following classification problem

Which of these two classifiers is likely to generalize better to new data?

The problem of over-fitting

Parameter Tuning

• Several models have at least one tuning parameter
• We want to find the best set of parameters

General strategy for parameter tuning

Data Splitting

• Given a certain amount of data, we have to decide how to 'spend' the data points, i.e. which data are used for tuning / training and which ones for evaluation

• Important: evaluation must be carried out on samples never used in model tuning

• Many data points -> test set
• Few data points -> re-sampling (cross-validation)

• Stratified sampling: random sampling within subgroups when there is a disproportion between classes

Resampling techniques

K-fold Cross Validation

Example: predicting cancer patients from gene expression & clinical data

• Patients are split into k sets of roughly the same size
• The model is fit using all patients except the first subset (first fold)
• The held-out patients are used for prediction and estimation of performance
• The first subset is returned to the training set and the procedure is repeated for all k sets
• The k estimates of performance are summarized (usually averaged)

Schema of cross-validation process with k = 3
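The k-fold procedure can be sketched in a few lines. The classifier used here is a deliberately simple stand-in (a threshold at the midpoint between the two class means of a single feature), not a model from the slides, and the data are invented; the point is the splitting and averaging logic.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k folds of roughly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_midpoint(xs, ys):
    """Toy classifier: threshold halfway between the two class means."""
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2.0

def cross_validate(xs, ys, k=3, seed=0):
    """Return one held-out accuracy per fold."""
    accuracies = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        threshold = fit_midpoint([xs[i] for i in train], [ys[i] for i in train])
        # Predict class 1 when the feature exceeds the threshold
        correct = sum((xs[i] > threshold) == (ys[i] == 1) for i in fold)
        accuracies.append(correct / len(fold))
    return accuracies

# Hypothetical data: one feature, six samples per class
xs = [1.0, 1.2, 0.8, 1.5, 2.1, 0.9, 2.0, 2.4, 1.9, 2.8, 1.4, 2.6]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

accs = cross_validate(xs, ys, k=3)
print(accs, sum(accs) / len(accs))
```

Each sample is held out exactly once, and the model is refit from scratch for every fold so that the held-out samples never influence the threshold used to predict them.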

Leave One Out Cross-Validation (LOOCV)

• k-fold cross-validation with k = number of patients (only one patient is held out at a time)

• Final performance is computed from the k individual held-out predictions

• Computationally expensive! k = 10 is more attractive

• but a larger k reduces the bias between the estimated performance and the real performance

• In practice they give similar results

Bootstrap

• Random sample of the patients taken with replacement, i.e. after a data point (patient) is selected for the sample, it can still be selected again for the same sample

• Some patients are represented multiple times in the bootstrap sample, others are not selected at all

• The samples not selected ("out-of-bag" samples) are used for prediction and performance estimation

Schema of bootstrap procedure
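A single bootstrap split can be sketched as follows; the function name `bootstrap_split` and the sample size of 10 are assumptions for illustration.

```python
import random

def bootstrap_split(n, rng):
    """Draw n indices with replacement; indices never drawn form the out-of-bag set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(42)
in_bag, out_of_bag = bootstrap_split(10, rng)

print("in bag (with repeats):", sorted(in_bag))
print("out of bag:", out_of_bag)
```

The model would be fit on the in-bag indices and evaluated on the out-of-bag ones, and the procedure repeated many times. For large n, the chance that a given sample is left out of one bootstrap sample is (1 - 1/n)^n, which approaches e^(-1), about 37%.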

Choosing final tuning parameters

Pick the parameter setting associated with the best accuracy / minimum error

Not always the best choice. Example: SVM accuracy vs. cost parameter, 5-fold cross-validation

Practical hints to choose the model

• A test set is a single evaluation: sometimes it has limited ability to characterize performance

• Small sample size:
– We might want to use all points for model building
– Resampling might be a better solution

• No resampling method is uniformly better than the others
– It depends on the situation, e.g. sample size, computational cost
– Bootstrap error-rate estimates can have lower variance than those from k-fold CV

• How to practically choose between models, e.g. SVM or logistic regression?

• How can you compare different models?

Performance in Classification Models

Class Prediction

• RMSE and R2 are not appropriate in the context of classification

• Although classification models mainly return a continuous value (e.g. prob between 0 and 1) -> we need a class prediction (discrete)

• However, sometimes the probability can be useful to gain ‘confidence’

Class Prediction - examples

• E-mail message with p=0.51 and another message with p=0.9 would both be classified as ‘spam’

• Imagine a model to classify molecules based on toxicity: molecule 1 with class probabilities 0.52, 0.48 and molecule 2 with class probabilities 0.98, 0.02 will both be classified as 'non-toxic' -> but the confidence for molecule 2 is higher

Softmax Transformation

Prediction for the l-class
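The formula for this slide did not survive extraction. The sketch below shows the standard softmax transformation, which maps arbitrary model scores to values between 0 and 1 that sum to 1; the scores used are made up.

```python
import math

def softmax(scores):
    """Map raw scores to class probabilities: exp(s_l) / sum over classes of exp(s_k)."""
    m = max(scores)  # subtracting the maximum avoids overflow and leaves the result unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes
probs = softmax([2.0, 1.0, -1.0])
print([round(p, 3) for p in probs])

# The predicted class is the one with the largest probability
predicted_class = probs.index(max(probs))
print(predicted_class)
```

The ordering of the scores is preserved, so the class with the highest raw score also receives the highest probability.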

Evaluating Predicted Classes

• Confusion matrix: example for a two-class outcome

Predicted    Observed
             Event    Nonevent
Event        TP       FP
Nonevent     FN       TN

e.g. Event: healthy, Nonevent: toxic

TP and TN: cells where classes are correctly predicted
FP and FN: cells where classes are wrongly predicted

Drawbacks of accuracy measure

• It does not make a distinction about the type of error being made. E.g. in spam filtering, the 'cost' of deleting an important e-mail is higher than that of letting a spam e-mail pass the filter.

• It does not consider the frequency of each class. E.g. in a compound screening model the molecules with biological activity are a minority

Other metrics

Predicted    Observed
             Event    Nonevent
Event        TP       FP
Nonevent     FN       TN

Potential trade-offs between sensitivity and specificity can be made while still keeping the same accuracy

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Sensitivity (true positive rate) is the rate at which the event of interest is correctly predicted among all samples having the event

Specificity is the rate at which nonevent samples are predicted correctly

False Positive Rate = 1 - Specificity
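The trade-off at constant accuracy can be checked with two invented confusion matrices; the counts are hypothetical and `classification_metrics` is an illustrative helper, not part of any library.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from the four confusion-matrix cells."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Two hypothetical models evaluated on the same 50 events and 50 nonevents
model_a = classification_metrics(tp=40, fp=10, fn=10, tn=40)
model_b = classification_metrics(tp=30, fp=0, fn=20, tn=50)

print(model_a)
print(model_b)
```

Both models reach the same accuracy, but the second misses more events while never raising a false alarm, which accuracy alone cannot reveal.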

Other metrics

Sensitivity and specificity are conditional measures -> they are conditional on the event (the true class). In theory, if the event is rare (prevalence w), this should be taken into account:

$$PPV = \frac{\text{Sensitivity}\cdot w}{\text{Sensitivity}\cdot w + (1-\text{Specificity})\,(1-w)}$$

$$NPV = \frac{\text{Specificity}\,(1-w)}{(1-\text{Sensitivity})\cdot w + \text{Specificity}\,(1-w)}$$
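The effect of prevalence is easiest to see numerically. The sketch below evaluates the two formulas for a hypothetical classifier with sensitivity and specificity of 0.9 at two prevalence values chosen for illustration.

```python
def ppv(sensitivity, specificity, w):
    """Positive predictive value for an event with prevalence w."""
    return (sensitivity * w) / (sensitivity * w + (1.0 - specificity) * (1.0 - w))

def npv(sensitivity, specificity, w):
    """Negative predictive value for an event with prevalence w."""
    return (specificity * (1.0 - w)) / ((1.0 - sensitivity) * w + specificity * (1.0 - w))

# Same hypothetical classifier, balanced vs. rare event
for w in (0.5, 0.01):
    print(w, round(ppv(0.9, 0.9, w), 4), round(npv(0.9, 0.9, w), 4))
```

With a balanced outcome a positive prediction is right 90% of the time; when the event has a prevalence of 1%, the same classifier's positive predictions are right only about 8% of the time.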

Receiver Operating Characteristic (ROC) Curves

• Given a collection of continuous predictions (e.g. class probabilities), the ROC curve plots sensitivity against the false positive rate (1 - specificity) at different thresholds

[Figure: ROC curve for a logistic regression model predicting the toxicity of a molecule]

What is the effect of altering the threshold?

AUC = Area Under the Curve, a quantitative assessment of the model
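A ROC curve and its AUC can be computed directly from predicted probabilities. The sketch below uses invented scores and labels, sweeps the threshold over every observed score and integrates with the trapezoidal rule; it assumes both classes are present.

```python
def roc_points(scores, labels):
    """(false positive rate, sensitivity) pairs, from the strictest to the loosest threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted as event
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical predicted probabilities and true classes (1 = event)
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(auc(points))
```

Lowering the threshold moves along the curve from the bottom-left corner to the top-right: sensitivity rises, but so does the false positive rate. An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.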

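Precision-Recall curve

[Figure: ROC curve (y-axis: TP/(TP+FN), x-axis: FP/(FP+TN)) next to the precision-recall curve (y-axis: precision, TP/(TP+FP); x-axis: recall, TP/(TP+FN))]

The PR curve is much more sensitive to false positives (e.g. healthy patients that were predicted to have cancer) in cases where the negative class (e.g. healthy patients) dominates.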

Class imbalance

• Imbalance: when one or more classes have a very low proportion in the training data

• Can have a significant impact on the effectiveness of the model

• E.g. pharmaceutical research: in high-throughput screening only a few molecules show activity, so the frequency of interesting compounds is low.

The effect of class imbalance - example

Three models used to model the high-throughput screening data and evaluated on a test set

The result of class imbalance (most compounds show no activity) is that the models are comparable: they have good specificity but very little sensitivity

Class imbalance – What can be done?

• Change the cutoff to increase the prediction accuracy of the minority class -> find an appropriate balance between sensitivity and specificity

How do we determine the new cutoff? Find the point on the ROC curve closest to the perfect model (sensitivity = 1, specificity = 1):

min(distance)
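The cutoff search can be sketched as a loop over candidate thresholds. The scores and labels are invented to mimic an imbalanced problem in which no event reaches a probability of 0.5; the Euclidean distance to the perfect model is one common choice and is assumed here.

```python
import math

def best_cutoff(scores, labels):
    """Return (cutoff, distance, sensitivity, specificity) closest to the perfect model."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        sens = tp / n_pos
        spec = tn / n_neg
        dist = math.hypot(1.0 - sens, 1.0 - spec)  # distance to sensitivity = specificity = 1
        if best is None or dist < best[1]:
            best = (t, dist, sens, spec)
    return best

# Hypothetical imbalanced data: 2 events, 8 nonevents
scores = [0.45, 0.30, 0.35, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.02]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

cutoff, dist, sens, spec = best_cutoff(scores, labels)
print(cutoff, dist, sens, spec)
```

With the default cutoff of 0.5 this toy model would detect no events at all; the selected cutoff recovers both events at the cost of one false positive. The cutoff should be chosen on data that are not used for the final evaluation, in line with the data-splitting rule stated earlier.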

Class Imbalance – sampling models

• Reduce the effect of imbalance during training
– Down-sampling (of the majority class)
  • Bootstrap (such that classes are balanced in the bootstrap set)
– Up-sampling (of the minority class)
  • Some samples from the minority class appear more than once in the set
– SMOTE (combination of down-sampling and up-sampling)

[Figure legend] Class 1: healthy, Class 2: cancer; Predictor A: mutation, Predictor B: patient age
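Plain down-sampling and up-sampling can be sketched as below. The patient identifiers are placeholders, and SMOTE is not shown because it additionally synthesizes new minority-class points.

```python
import random

def down_sample(majority, minority, rng):
    """Keep a random subset of the majority class as large as the minority class."""
    return rng.sample(majority, len(minority)) + list(minority)

def up_sample(majority, minority, rng):
    """Add minority samples drawn with replacement until both classes have equal size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

# Hypothetical training set: 8 healthy patients, 2 cancer patients
healthy = ["h%d" % i for i in range(8)]
cancer = ["c0", "c1"]

rng = random.Random(1)
down = down_sample(healthy, cancer, rng)
up = up_sample(healthy, cancer, rng)

print(len(down), down)
print(len(up), up)
```

Sampling is applied to the training data only; the test set keeps the original class proportions so that performance estimates reflect the real class frequencies.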

Goals

• Feature selection rather than feature reduction: regularized linear models

• From regression to classification
– Logistic regression
– Regularization; partial least squares also possible

• How to reduce overfitting

• How to evaluate a classification model
– Class imbalance
