Applied Machine Learning
Annalisa Marsico, OWL RNA Bioinformatics group
Max Planck Institute for Molecular Genetics, Free University of Berlin
22 April, SoSe 2015
Goals
• Feature selection rather than feature reduction: regularized linear models
• From regression to classification
– Logistic regression
– Regularization; partial least squares also possible
• How to reduce overfitting
• How to evaluate a classification model
– Class imbalance
The Variance-Bias Tradeoff
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$
The mean squared error can be decomposed:
$$E[\mathrm{MSE}] = \sigma^2 + (\text{Model bias})^2 + \text{Model variance}$$
• $\sigma^2$: irreducible noise
• Model bias: reflects how close the function of the model is to the real input-output relationship
• Model variance: reflects how well the model generalizes
[Figure: one fit with low bias / high variance, one fit with low variance / high bias]
Complex models can have high variance. Collinearity gives rise to high-variance models -> we can try to reduce the model's variance as a way to counter collinearity -> by reducing the variance we increase the bias in the model
The Variance-Bias Tradeoff
$$E[\mathrm{MSE}] = \sigma^2 + (\text{Model bias})^2 + \text{Model variance}$$
N.B. Ordinary linear regression produces unbiased coefficients
Ridge Regression
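$$\mathrm{SSE} = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$
Controlling or regularizing the parameter estimates can be done by adding a penalty to the SSE
$$\mathrm{SSE}_{L_2} = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \lambda\sum_{j=1}^{P}\beta_j^2$$
• $\sum_{j}\beta_j^2$: L2-norm
• The penalty $\lambda$ controls the amount of shrinkage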
Path of the regression coefficients for different values of λ.
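To make the shrinkage concrete, here is a minimal sketch for the simplest possible case: one predictor and no intercept, where the ridge criterion has the closed-form minimizer β = Σxy / (Σx² + λ). The function name `ridge_slope` and the toy data are illustrative assumptions, not part of the lecture.

```python
def ridge_slope(x, y, lam):
    """Minimize sum((y_i - b*x_i)^2) + lam * b^2 over a single slope b."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Toy data with y = 2x exactly, so the unpenalized slope is 2.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

# A small coefficient path: the slope shrinks towards zero as lambda grows,
# but it never becomes exactly zero.
ridge_path = [(lam, ridge_slope(x, y, lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
for lam, slope in ridge_path:
    print(f"lambda={lam:6.1f}  slope={slope:.4f}")
```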
Ridge Regression - How to choose the penalty
What's happening in this region?
Lasso (Least Absolute Shrinkage and Selection Operator) regression
$$\mathrm{SSE}_{L_1} = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \lambda\sum_{j=1}^{P}|\beta_j|$$
It seems like a small modification, but the practical implications are significant. While the regression coefficients are still shrunk towards zero, penalizing the absolute value means that some parameters are actually set to zero for some values of λ
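The "set to zero" behaviour can be seen in a one-predictor sketch without intercept. For that case the lasso criterion is minimized by soft-thresholding: β = S(Σxy, λ/2) / Σx², where S shrinks its argument towards zero by λ/2 and returns exactly zero once the argument is smaller than that. The helper names and the toy data below are assumptions made for illustration.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; return exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_slope(x, y, lam):
    """Minimize sum((y_i - b*x_i)^2) + lam * |b| over a single slope b."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2.0) / sxx

# Toy data with y = 2x exactly (sum of x*y is 60, sum of x^2 is 30).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

lasso_path = [(lam, lasso_slope(x, y, lam)) for lam in (0.0, 60.0, 120.0, 200.0)]
for lam, slope in lasso_path:
    print(f"lambda={lam:6.1f}  slope={slope:.4f}")
```

Unlike a squared penalty, the absolute-value penalty drives the slope to exactly zero once λ/2 reaches Σxy, which is what makes the lasso usable for feature selection.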
Questions
• Is PLSR a feature selection method?
• Is Ridge Regression a feature selection method?
• Is Lasso a feature selection method?
Elastic net
Generalization of the Lasso model. It combines two types of penalty.
$$\mathrm{SSE}_{Enet} = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \lambda_1\sum_{j=1}^{P}\beta_j^2 + \lambda_2\sum_{j=1}^{P}|\beta_j|$$
Advantage: enables regularization via the ridge-type penalty and feature selection via the Lasso-type penalty. Zou and Hastie (2005) suggested that this model deals well with groups of highly correlated predictors
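As a rough illustration of how the two penalties interact, the sketch below solves the elastic net criterion for a single predictor without intercept, where the minimizer is β = S(Σxy, λ₂/2) / (Σx² + λ₁): the absolute-value term thresholds, the squared term shrinks. Function names and data are hypothetical; `lam1` weights the squared penalty and `lam2` the absolute-value penalty.

```python
def enet_slope(x, y, lam1, lam2):
    """Minimize sum((y_i - b*x_i)^2) + lam1 * b^2 + lam2 * |b| over one slope b."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Soft-thresholding step coming from the absolute-value penalty.
    if sxy > lam2 / 2.0:
        numerator = sxy - lam2 / 2.0
    elif sxy < -lam2 / 2.0:
        numerator = sxy + lam2 / 2.0
    else:
        numerator = 0.0
    # Extra shrinkage coming from the squared penalty.
    return numerator / (sxx + lam1)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

no_penalty = enet_slope(x, y, lam1=0.0, lam2=0.0)    # ordinary least squares
ridge_only = enet_slope(x, y, lam1=10.0, lam2=0.0)   # shrunk, never zero
lasso_only = enet_slope(x, y, lam1=0.0, lam2=60.0)   # thresholded
both = enet_slope(x, y, lam1=10.0, lam2=60.0)        # thresholded and shrunk
print(no_penalty, ridge_only, lasso_only, both)
```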
Comparison between Ridge, Lasso and Elastic Net
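Lasso, subject to the constraint:
$$|\beta_1| + |\beta_2| \le t$$
Ridge, subject to the constraint:
$$\beta_1^2 + \beta_2^2 \le t$$
Elastic net penalty, with $\alpha = 0.2$:
$$\sum_{j}\left(\alpha\beta_j^2 + (1-\alpha)|\beta_j|\right)$$
The elliptical contours show the residual sum of squares function. Their center is the least squares estimate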
Linear models for Classification
Classification
• The process of predicting categorical / qualitative responses
– Often we predict the probability of belonging to a certain class / category
– We have a set of observations (x1,y1)...(xn,yn) to train the classifier
Example: predict whether an individual will default on their credit card, based on annual income and monthly balance, using a set of 10000 individuals
We want to learn a model that predicts Y (default) from X1 (balance) and X2 (income)
Classification
• Linear regression is not suitable
– No natural way to convert a qualitative response into a quantitative one
– For binary classes we can use dummy variables
$$Y = \begin{cases} 1 & \text{if default} \\ 0 & \text{if not default} \end{cases}$$
Fitted values converted to output class G:
$$G = \begin{cases} \text{default} & \text{if } \hat{Y} > 0.5 \\ \text{no\_default} & \text{if } \hat{Y} \le 0.5 \end{cases}$$
If we try to predict Y with linear regression we might not get a number between 0 and 1
Rather than modeling the response Y directly, we model the probability that Y belongs to a certain class, P(G=1|X) and P(G=0|X)
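The problem with regressing directly on a 0/1 dummy variable can be reproduced in a few lines. The sketch below fits an ordinary least-squares line to made-up balance/default data (the numbers are invented for illustration and are not the 10000-individual data set mentioned above) and then applies the 0.5 rule.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x with a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical monthly balances and dummy-coded response (1 = default).
balance = [0.0, 500.0, 1000.0, 1500.0, 2000.0, 2500.0]
default = [0, 0, 0, 1, 1, 1]

b0, b1 = fit_line(balance, default)
fitted = [b0 + b1 * b for b in balance]
labels = ["default" if f > 0.5 else "no_default" for f in fitted]

for b, f, g in zip(balance, fitted, labels):
    print(f"balance={b:7.1f}  fitted={f:6.3f}  class={g}")
```

The class labels come out as expected, but the fitted values at the extremes fall below 0 and above 1, so they cannot be read as probabilities.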
Classification
Linear decision boundary: $\{x : \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p = 0.5\}$
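Logistic Regression
How to model the relationship between p(X) = P(Y=1|X) and X?
Logistic function:
$$P(Y=1 \mid X) = p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
$$P(Y=0 \mid X) = 1 - p(X)$$
After a bit of manipulation:
$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X} \quad \text{(odds)}$$
$$\log\frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X \quad \text{(log-odds or logit)}$$
The logit is linear in X!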
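A quick numerical check of this relationship: the sketch below evaluates the logistic function for a pair of hypothetical coefficients (`b0` and `b1` are made-up values, not estimates from any data) and confirms that the log-odds recover the linear predictor β₀ + β₁X.

```python
import math

def logistic(x, b0, b1):
    """p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)), written in an equivalent form."""
    z = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))

b0, b1 = -3.0, 0.002  # hypothetical coefficients

for x in (0.0, 1000.0, 1500.0, 2500.0):
    p = logistic(x, b0, b1)
    print(f"x={x:7.1f}  p={p:.4f}  odds={p / (1 - p):.4f}  logit={logit(p):.4f}")
```

The probabilities stay strictly between 0 and 1 for any x, while the logit column changes by the same amount (b1 times the step in x) everywhere.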
Estimating coefficients in logistic regression
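Maximum likelihood is used to fit the model and learn the β parameters, i.e. we choose those β that maximize the likelihood function of the data:
$$L(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i:\, y_i = 0} \left(1 - p(x_i)\right)$$
where $p_k(x_i) = P(Y = k \mid X = x_i, \beta)$ and $p(x_i) = p_1(x_i)$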
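One way to see maximum likelihood at work is plain gradient ascent on the log-likelihood, log L = Σ over y_i=1 of log p(x_i) plus Σ over y_i=0 of log(1 − p(x_i)). This is only a sketch under simplifying assumptions (one predictor, fixed step size, made-up non-separable data); statistical software typically uses more refined optimizers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, x, y):
    """Sum of log p(x_i) over y_i = 1 and log(1 - p(x_i)) over y_i = 0."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total

def fit_logistic(x, y, rate=0.1, steps=2000):
    """Gradient ascent on the log-likelihood (averaged gradient, fixed step)."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        residuals = [yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y)]
        b0 += rate * sum(residuals) / n
        b1 += rate * sum(r * xi for r, xi in zip(residuals, x)) / n
    return b0, b1

# Made-up data; the classes overlap, so the maximum is finite.
x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(x, y)
print(f"b0={b0:.4f}  b1={b1:.4f}  logL={log_likelihood(b0, b1, x, y):.4f}")
```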
Logistic Regression interpretation
What do the coefficients represent?
Generalized linear models
Linear model:
$$\hat{Y} = f(X) = X^T\beta$$
$$Y \approx \hat{Y} = f(X)$$
We will always have an error in trying to approximate the real function Y
Generalized linear model:
$$\hat{Y} = f(X) = g(X^T\beta)$$
g = activation function
In a linear model, g = identity function
In logistic regression, g = logistic function
The RSS criterion can still be used to find f(X)! Only, f(X) is a more complicated function.
Logistic Regression vs Linear Regression
             Linear Regression                 Logistic Regression
Output       $\hat{y} \in \mathbb{R}$          $\hat{y} \in [0, 1]$
Model        $y(x) = X^T\beta + \beta_0$       $y(x) = \sigma(X^T\beta + \beta_0)$
g            identity function                 sigmoid function
Regularized Logistic Regression
Classification models can also use a penalty (Ridge, Lasso, etc.) to improve the fit
E.g. in logistic regression we can maximize a constrained (penalized) likelihood function
$$\log L(\beta) - \lambda\sum_{j=1}^{p}\beta_j^2 \quad \text{(e.g. Ridge-like penalty)}$$
The glmnet package in R uses a combination of Ridge and Lasso penalties
$$\log L(\beta) - \lambda\left[(1-\alpha)\frac{1}{2}\sum_{j=1}^{p}\beta_j^2 + \alpha\sum_{j=1}^{p}|\beta_j|\right]$$
α = "mixing proportion" that toggles between the pure Lasso penalty (α=1) and the pure Ridge penalty (α=0). λ controls the total amount of penalization
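The roles of the two tuning parameters can be checked by evaluating the penalty term directly. The sketch below computes λ[(1−α)·½·Σβ² + α·Σ|β|] for a hypothetical coefficient vector; it mirrors the formula only and is not a call into glmnet or any other package.

```python
def mixed_penalty(beta, lam, alpha):
    """lam * ((1 - alpha) * 0.5 * sum(b^2) + alpha * sum(|b|))."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * ((1.0 - alpha) * 0.5 * l2 + alpha * l1)

beta = [0.5, -2.0, 0.0]  # hypothetical coefficients

pure_lasso = mixed_penalty(beta, lam=2.0, alpha=1.0)   # only the absolute-value term
pure_ridge = mixed_penalty(beta, lam=2.0, alpha=0.0)   # only the squared term
half_mix = mixed_penalty(beta, lam=2.0, alpha=0.5)
doubled = mixed_penalty(beta, lam=4.0, alpha=0.5)      # same mix, twice the penalty
print(pure_lasso, pure_ridge, half_mix, doubled)
```

Changing `alpha` changes which kind of penalty dominates, while changing `lam` rescales the whole term.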
Regularized Logistic Regression
Example: accuracy for different models with different α and λ parameters
Can Partial Least Square be extended to Logistic Regression?
• Yes. It will find new variables that simultaneously reduce the dimension and correlate with the response (but Y = 0,1)
• One tuning parameter: the number of components
Over-Fitting and Model Tuning
The problem of Over-Fitting
• Tendency to over-emphasize patterns
• Need to evaluate the model to be confident that it will do well in the future (on new data)
• Problems in the data:
– Data quality
– Limited number of samples
The problem of Over-Fitting
• We want to use existing data to find the best parameters, i.e. those that give not only the best accuracy but also the most realistic estimate of it
• Originally: split the data into a training set and a test set.
• Modern approaches:
– Split the data into multiple sets for training, i.e. parameter tuning
– Split the data into one (or more) distinct sets for evaluation purposes
The problem of Over-Fitting
• Over-fitting occurs when the model, in addition to learning general patterns in the data, also learns the noise
• This kind of model will have poor accuracy when predicting a new sample
Let's consider the following classification problem
Which of these two classifiers is likely to generalize better to new data?
The problem of over-fitting
Parameter Tuning
Several models have at least one tuning parameter
We want to find the best set of parameters
General strategy for parameter tuning
Data Splitting
• Given a certain amount of data, we have to decide how to 'spend' the data points, i.e. which data are used for tuning / training and which for evaluation
• Important: evaluation must be carried out on samples never used in model tuning
• Many data points -> test set
• Few data points -> re-sampling (cross-validation)
• Stratified sampling: random sampling within subgroups when the classes are present in disproportionate numbers
Resampling techniques
K-fold Cross Validation
Example: predicting cancer patients from gene expression & clinical data
• Patients are split into k sets of roughly the same size
• The model is fit using all patients except the first subset (first fold)
• Held-out patients are used for predictions and estimation of performance
• The first subset is returned to the training set and the procedure is repeated for all k sets
• The k estimates of performance are summarized (usually averaged)
Schema of cross-validation process with k = 3
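The steps above can be sketched with index bookkeeping alone. In the example below the "model" is deliberately trivial (it predicts the mean of the training responses) and the data are invented; `k_fold_splits` is an illustrative helper, not a function from any particular library.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (training, held_out) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k sets of roughly the same size
    for i, held_out in enumerate(folds):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, held_out

# Hypothetical responses for 9 patients.
y = [3.0, 5.0, 4.0, 6.0, 5.5, 4.5, 3.5, 6.5, 5.0]

fold_errors = []
for training, held_out in k_fold_splits(len(y), k=3):
    mean = sum(y[j] for j in training) / len(training)                # "fit"
    mse = sum((y[j] - mean) ** 2 for j in held_out) / len(held_out)   # evaluate
    fold_errors.append(mse)

cv_estimate = sum(fold_errors) / len(fold_errors)  # average of the k estimates
print(fold_errors, cv_estimate)
```

Every patient is held out exactly once, and no held-out patient is ever part of the training set used to predict it.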
Leave One Out Cross-Validation (LOOCV)
• k-fold cross-validation with k = number of patients (only one patient is held out at a time)
• Final performance is computed from the k individual held-out predictions
• Computationally expensive! k = 10 is more attractive
• but a larger k reduces the bias between the estimated performance and the real performance
• In practice they give similar results
Bootstrap
• Random sample of the patients with replacement, i.e. after a data point (patient) is selected for a subset, it can still be selected again for the same subset
• Some patients are represented multiple times in the set, others are not selected at all
• The samples that were not selected, the "out-of-bag" samples, are used for prediction and performance estimation
Schema of bootstrap procedure
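A minimal sketch of one bootstrap round, assuming the patients are identified by the indices 0..n-1 (the helper name and the seed are arbitrary choices for illustration):

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(42)
in_bag, out_of_bag = bootstrap_split(100, rng)

print("bootstrap set size:", len(in_bag))
print("distinct patients in the bootstrap set:", len(set(in_bag)))
print("out-of-bag patients:", len(out_of_bag))
```

The bootstrap set has the same size as the original data, but because of the replacement some patients appear several times and the remaining ones form the out-of-bag set used for evaluation.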
Choosing final tuning parameters
Pick the parameter setting associated with the best accuracy / minimum error
Not always the best choice.
Example: SVM accuracy vs cost parameter, 5-fold cross-validation
Practical hints to choose the model
• A test set is a single evaluation: it sometimes has limited ability to characterize performance
• Small sample size:
– We might want to use all points for model building
– Resampling might be a better solution
• There is no resampling method that is better than all the others
– Depends on the situation, e.g. sample size, computational cost
– Bootstrap can have a lower error rate compared to k-fold CV
• How to practically choose between models, e.g. SVM or logistic regression?
• How can you compare different models?
Performance in Classification Models
Class Prediction
• RMSE and R2 are not appropriate in the context of classification
• Although classification models mainly return a continuous value (e.g. prob between 0 and 1) -> we need a class prediction (discrete)
• However, sometimes the probability can be useful to gain ‘confidence’
Class Prediction - examples
• An e-mail message with p=0.51 and another message with p=0.9 would both be classified as 'spam'
• Imagine a model to classify molecules based on toxicity: molecule 1 with class probabilities (0.52, 0.48) and molecule 2 with class probabilities (0.98, 0.02) will both be classified as 'non-toxic' -> but the confidence for molecule 2 is higher
Softmax Transformation
Prediction for the l-class
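The formula on this slide did not survive extraction. As a stand-in, the sketch below uses the standard softmax definition, p_l = exp(s_l) / Σ_k exp(s_k), which turns arbitrary real-valued class scores into values that lie between 0 and 1 and sum to 1. The scores are made up for illustration.

```python
import math

def softmax(scores):
    """Map real-valued class scores to probability-like values that sum to 1."""
    largest = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical model outputs for three classes
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)  # most probable class

print([round(p, 3) for p in probs], "-> predicted class index", predicted)
```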
Evaluating Predicted Classes
• Confusion matrix: example for a two-class outcome

                 Observed
Predicted     Event    Nonevent
Event         TP       FP
Nonevent      FN       TN

e.g. Event: healthy, Nonevent: toxic
Diagonal cells (TP, TN): where classes are correctly predicted
Off-diagonal cells (FP, FN): where classes are wrongly predicted
Drawbacks of accuracy measure
• It does not make a distinction about the type of error being made. E.g. in spam filtering, the 'cost' of deleting an important e-mail is higher than that of letting a spam e-mail pass the filter.
• It does not consider the frequency of each class. E.g. in a compound screening model the molecules with biological activity are a minority
Other metrics

                 Observed
Predicted     Event    Nonevent
Event         TP       FP
Nonevent      FN       TN

Potential trade-offs between sensitivity and specificity can be made while still keeping the same accuracy
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
Sensitivity (true positive rate) is the rate of correctly predicting the event of interest among all samples having the event
Specificity is the rate of nonevent samples predicted correctly
False Positive Rate = 1 - Specificity
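These quantities follow directly from the four cells of the confusion matrix. The sketch below counts the cells from made-up observed and predicted labels (coded 1 = event, 0 = nonevent, an assumption of this example) and derives sensitivity = TP/(TP+FN), specificity = TN/(TN+FP) and accuracy.

```python
def confusion_counts(observed, predicted):
    """Return (TP, FP, FN, TN) for labels coded 1 = event, 0 = nonevent."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    return tp, fp, fn, tn

observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # hypothetical truth
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # hypothetical class predictions

tp, fp, fn, tn = confusion_counts(observed, predicted)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / len(observed)
false_positive_rate = 1.0 - specificity

print(tp, fp, fn, tn)
print(sensitivity, specificity, accuracy, false_positive_rate)
```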
Other metrics
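Sensitivity and specificity are conditional measures -> they depend on the event
In theory, if the event is rare (prevalence w), this should be taken into account.
$$PPV = \frac{\text{Sensitivity} \cdot w}{\text{Sensitivity} \cdot w + (1 - \text{Specificity})(1 - w)}$$
$$NPV = \frac{\text{Specificity} \cdot (1 - w)}{(1 - \text{Sensitivity}) \cdot w + \text{Specificity} \cdot (1 - w)}$$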
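The effect of prevalence is easiest to appreciate with numbers. The sketch below evaluates the positive and negative predictive values, PPV = Sens·w / (Sens·w + (1−Spec)(1−w)) and NPV = Spec·(1−w) / ((1−Sens)·w + Spec·(1−w)), for a hypothetical test with 90% sensitivity and 90% specificity at two prevalences w.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: P(event | predicted event)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

def npv(sensitivity, specificity, prevalence):
    """Negative predictive value: P(nonevent | predicted nonevent)."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (false_neg + true_neg)

for w in (0.5, 0.01):  # balanced classes versus a rare event
    print(f"prevalence={w:4.2f}  PPV={ppv(0.9, 0.9, w):.3f}  NPV={npv(0.9, 0.9, w):.3f}")
```

With balanced classes both values equal 0.9, but when the event has a prevalence of 1% fewer than one in ten predicted events is a real event, even though sensitivity and specificity are unchanged.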
Receiver Operating Characteristic (ROC) Curves
• Given a collection of continuous predictions, the ROC curve plots the sensitivity against the false positive rate (1 - specificity) at different thresholds
ROC curve for a logistic regression model to predict toxicity
What is the effect of altering the threshold?
AUC = Area Under the Curve, a quantitative assessment of the model
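To see what altering the threshold does, the sketch below sweeps the threshold over a handful of made-up predicted probabilities, records one (false positive rate, sensitivity) point per threshold, and integrates the resulting curve with the trapezoidal rule. The scores, labels (1 = event) and function names are assumptions of this example.

```python
def roc_points(scores, labels):
    """(false positive rate, sensitivity) pairs, from the strictest threshold down."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called an event
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal rule over consecutive ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]  # hypothetical predicted probabilities
labels = [1, 1, 0, 1, 0, 0]              # hypothetical observed classes

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print("AUC =", auc)
```

Lowering the threshold moves along the curve towards the upper right: sensitivity rises, but so does the false positive rate.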
Precision-Recall curve
[Figure: ROC curve, TP/(TP+FN) against FP/(FP+TN), next to the precision-recall curve, TP/(TP+FP) against TP/(TP+FN)]
The PR curve is much more sensitive to false positives (e.g. healthy patients that were predicted to have cancer) in cases where the negative class (e.g. healthy patients) dominates.
Class imbalance
• Imbalance: when one or more classes have a very low proportion in the training data
• Can have a significant impact on the effectiveness of the model
• E.g. pharmaceutical research: in high-throughput screening only a few molecules show activity, so the frequency of interesting compounds is low.
The effect of class imbalance - example
Three models used to model the high-throughput screening data and evaluated on a test set
The result of class imbalance (most of the compounds show no activity) is that the models are comparable, with good specificity but very little sensitivity
Class imbalance – What can be done?
• Change the cutoff to increase the prediction accuracy of the minority class -> find an appropriate balance between sensitivity and specificity
How do we determine the new cutoff?
Find the point on the ROC curve closest to the perfect model
min(distance)
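One way to read "closest to the perfect model" is the Euclidean distance in ROC space to the corner with sensitivity 1 and specificity 1. The sketch below applies that rule to three hypothetical (cutoff, sensitivity, specificity) triples; both the numbers and the distance choice are assumptions of this example.

```python
import math

def best_cutoff(candidates):
    """Pick the (cutoff, sensitivity, specificity) triple closest to the perfect model."""
    return min(candidates, key=lambda c: math.hypot(1.0 - c[1], 1.0 - c[2]))

# Hypothetical operating points of one classifier at three cutoffs.
candidates = [
    (0.5, 0.20, 0.99),  # default cutoff: good specificity, very little sensitivity
    (0.3, 0.60, 0.90),
    (0.1, 0.90, 0.50),
]

cutoff, sensitivity, specificity = best_cutoff(candidates)
print(cutoff, sensitivity, specificity)
```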
Class Imbalance – sampling models
• Reduce effect of imbalance during training
– Down-sampling (of the majority class)
  • Bootstrap (such that classes are balanced in the bootstrap set)
– Up-sampling (of the minority class)
  • Some samples from the minority class appear more than once in the set
– SMOTE (combination of down-sampling and up-sampling)
[Figure: Class 1: healthy, Class 2: cancer; Predictor A: mutation, Predictor B: patient age]
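The two basic sampling strategies can be sketched in a few lines. The example uses plain random sub-sampling for the majority class and resampling with replacement for the minority class; it does not implement the balanced bootstrap variant or SMOTE, and the 90/10 class sizes are invented.

```python
import random

def down_sample(majority, minority, rng):
    """Keep a random subset of the majority class, as large as the minority class."""
    return rng.sample(majority, len(minority)) + list(minority)

def up_sample(majority, minority, rng):
    """Resample the minority class with replacement up to the majority class size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

rng = random.Random(1)
healthy = [("healthy", i) for i in range(90)]  # hypothetical majority class
cancer = [("cancer", i) for i in range(10)]    # hypothetical minority class

down = down_sample(healthy, cancer, rng)  # 10 healthy + 10 cancer
up = up_sample(healthy, cancer, rng)      # 90 healthy + 90 cancer (with repeats)

print(len(down), len(up))
```

Down-sampling discards data from the majority class, whereas up-sampling keeps everything but repeats minority samples, so the choice depends on how many samples can be spared.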