Penalized Regression Essentials: Ridge, Lasso & Elastic Net

http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/

The standard linear model (or ordinary least squares method) performs poorly when you have a large multivariate data set in which the number of variables exceeds the number of samples.

A better alternative is penalized regression, which creates a linear regression model that is penalized for having too many variables, by adding a constraint to the equation (James et al. 2014; P. Bruce and Bruce 2017). These approaches are also known as shrinkage or regularization methods. The consequence of imposing this penalty is to reduce (i.e., shrink) the coefficient values towards zero, which allows the less contributive variables to have coefficients close or equal to zero.

Note that the shrinkage requires the selection of a tuning parameter (lambda) that determines the amount of shrinkage.

In this chapter we’ll describe the most commonly used penalized regression methods, including ridge regression, lasso regression and elastic net regression. We’ll also provide practical examples in R. Contents:

• Shrinkage methods
  o Ridge regression
  o Lasso regression
  o Elastic Net
• Loading required R packages
• Preparing the data
• Computing penalized linear regression
  o Additional data preparation
  o R functions
  o Computing ridge regression
  o Computing lasso regression
  o Computing elastic net regression
  o Comparing the different models
  o Using caret package
• Discussion
• References

The Book:

Machine Learning Essentials: Practical Guide in R

Shrinkage methods

Ridge regression

Ridge regression shrinks the regression coefficients, so that variables with a minor contribution to the outcome have their coefficients close to zero.

The shrinkage of the coefficients is achieved by penalizing the regression model with a penalty term called the L2-norm, which is the sum of the squared coefficients. The amount of the penalty can be fine-tuned using a constant called lambda (λ). Selecting a good value for λ is critical. When λ = 0, the penalty term has no effect, and ridge regression produces the classical least squares coefficients. However, as λ increases toward infinity, the impact of the shrinkage penalty grows, and the ridge regression coefficients approach zero.

Note that, in contrast to ordinary least squares regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying ridge regression (James et al. 2014), so that all the predictors are on the same scale. The standardization of a predictor x can be achieved using the formula x' = x / sd(x), where sd(x) is the standard deviation of x. The consequence is that all standardized predictors have a standard deviation of one, so the final fit does not depend on the scale on which the predictors are measured.

One important advantage of ridge regression is that it still performs well, compared to the ordinary least squares method (Chapter @ref(linear-regression)), in a situation where you have large multivariate data with the number of predictors (p) larger than the number of observations (n).
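As a minimal sketch of that standardization step (assuming predictors is a numeric matrix of predictor values; note that glmnet(), used later in this chapter, already performs this scaling internally by default):

# Divide each predictor column by its standard deviation, so every
# standardized predictor has sd = 1 (the x' = x / sd(x) rule above).
predictors_std <- apply(predictors, 2, function(col) col / sd(col))
apply(predictors_std, 2, sd)  # should all be (approximately) 1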

One disadvantage of ridge regression is that it will include all the predictors in the final model, unlike the stepwise regression methods (Chapter @ref(stepwise-regression)), which generally select models that involve a reduced set of variables.

Ridge regression shrinks the coefficients towards zero, but it will not set any of them exactly to zero. The lasso regression is an alternative that overcomes this drawback.

Lasso regression

Lasso stands for Least Absolute Shrinkage and Selection Operator. It shrinks the regression coefficients toward zero by penalizing the regression model with a penalty term called the L1-norm, which is the sum of the absolute coefficients. In the case of lasso regression, the penalty has the effect of forcing some of the coefficient estimates, with a minor contribution to the model, to be exactly equal to zero. This means that lasso can also be seen as an alternative to the subset selection methods for performing variable selection in order to reduce the complexity of the model.

As in ridge regression, selecting a good value of λ for the lasso is critical. One obvious advantage of lasso regression over ridge regression, is that it produces simpler and more interpretable models that incorporate only a reduced set of the predictors. However, neither ridge regression nor the lasso will universally dominate the other.

Generally, lasso might perform better in a situation where some of the predictors have large coefficients, and the remaining predictors have very small coefficients.

Ridge regression will perform better when the outcome is a function of many predictors, all with coefficients of roughly equal size (James et al. 2014). Cross-validation methods can be used for identifying which of these two techniques is better on a particular data set.

Elastic Net

Elastic Net produces a regression model that is penalized with both the L1-norm and L2-norm. The consequence of this is to effectively shrink coefficients (like in ridge regression) and to set some coefficients to zero (as in LASSO).
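For reference, the three penalized least-squares criteria described above can be written as follows (a standard formulation; λ ≥ 0 controls the penalty strength and, for the elastic net, α ∈ [0, 1] mixes the two norms, following the glmnet-style parameterization):

$$\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

$$\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

$$\hat{\beta}^{enet} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p}\Big(\tfrac{1-\alpha}{2}\,\beta_j^2 + \alpha\,|\beta_j|\Big)$$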

Loading required R packages

• tidyverse, for easy data manipulation and visualization
• caret, for easy machine learning workflow
• glmnet, for computing penalized regression

library(tidyverse)

library(caret)
library(glmnet)

Preparing the data

We’ll use the Boston data set [in the MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (medv) in Boston suburbs, based on multiple predictor variables. We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set a seed for reproducibility.

# Load the data

data("Boston", package = "MASS")

# Split the data into training and test set

set.seed(123)

training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)

train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

Computing penalized linear regression

Additional data preparation

You need to create two objects:

• y for storing the outcome variable
• x for holding the predictor variables. This should be created using the function model.matrix(), which automatically transforms any qualitative variables (if any) into dummy variables (Chapter @ref(regression-with-categorical-variables)). This is important because glmnet() can only take numerical, quantitative inputs. After creating the model matrix, we remove the intercept component at index = 1.
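For illustration only (a hypothetical toy data frame, not part of the Boston analysis), this is what model.matrix() does with a factor:

# model.matrix() expands the factor 'grp' into a dummy column (grpb),
# which is the numeric input format glmnet() requires.
toy <- data.frame(y = c(1, 2, 3), grp = factor(c("a", "b", "b")))
model.matrix(y ~ ., toy)        # contains an (Intercept) column plus grpb
model.matrix(y ~ ., toy)[, -1]  # drop the intercept column, as done below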

# Predictor variables

x <- model.matrix(medv~., train.data)[,-1]

# Outcome variable
y <- train.data$medv

R functions

We’ll use the R function glmnet() [glmnet package] for computing penalized linear regression models. The simplified format is as follows:

glmnet(x, y, alpha = 1, lambda = NULL)

• x: matrix of predictor variables
• y: the response or outcome variable. For the linear regression considered here, this is a continuous, quantitative variable.
• alpha: the elasticnet mixing parameter. Allowed values include:

  o "1": for lasso regression
  o "0": for ridge regression
  o a value between 0 and 1 (say 0.3) for elastic net regression

• lambda: a numeric value defining the amount of shrinkage, to be specified by the analyst.

In penalized regression, you need to specify a constant lambda to adjust the amount of coefficient shrinkage. The best lambda for your data can be defined as the lambda that minimizes the cross-validation prediction error rate. This can be determined automatically using the function cv.glmnet(). In the following sections, we start by computing ridge, lasso and elastic net regression models. Next, we’ll compare the different models in order to choose the best one for our data.

The best model is defined as the model that has the lowest prediction error, RMSE (Chapter @ref(regression-model-accuracy-metrics)).
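As a small optional sketch (same cv.glmnet() call as used below; cv.sketch is just an illustrative object name): besides lambda.min, cv.glmnet() also reports lambda.1se, the largest lambda whose cross-validation error is within one standard error of the minimum, which some analysts prefer as a more conservative (more regularized) choice.

# Compare the two lambda values suggested by cross-validation
cv.sketch <- cv.glmnet(x, y, alpha = 0)
c(lambda.min = cv.sketch$lambda.min, lambda.1se = cv.sketch$lambda.1se)
plot(cv.sketch)  # cross-validation error curve as a function of log(lambda)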

Computing ridge regression

# Find the best lambda using cross-validation

set.seed(123)

cv <- cv.glmnet(x, y, alpha = 0)

# Display the best lambda value
cv$lambda.min
## [1] 0.758

# Fit the final model on the training data

model <- glmnet(x, y, alpha = 0, lambda = cv$lambda.min)

# Display regression coefficients
coef(model)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                    s0
## (Intercept)  28.69633
## crim         -0.07285
## zn            0.03417
## indus        -0.05745
## chas          2.49123
## nox         -11.09232
## rm            3.98132
## age          -0.00314
## dis          -1.19296
## rad           0.14068
## tax          -0.00610
## ptratio      -0.86400
## black         0.00937
## lstat        -0.47914

# Make predictions on the test data

x.test <- model.matrix(medv ~., test.data)[,-1]

predictions <- model %>% predict(x.test) %>% as.vector()

# Model performance metrics

data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
##   RMSE Rsquare
## 1 4.98   0.671

Note that by default, the function glmnet() standardizes variables so that their scales are comparable. However, the coefficients are always returned on the original scale.
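If you have already scaled the predictors yourself, the internal scaling can be turned off through the standardize argument of glmnet() (a sketch only; the rest of this chapter keeps the default, standardize = TRUE):

# Fit ridge regression without internal standardization
model.raw <- glmnet(x, y, alpha = 0, lambda = cv$lambda.min, standardize = FALSE)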

Computing lasso regression

The only difference from the R code used for ridge regression is that, for lasso regression, you need to specify the argument alpha = 1 instead of alpha = 0.

# Find the best lambda using cross-validation

set.seed(123)

cv <- cv.glmnet(x, y, alpha = 1)

# Display the best lambda value
cv$lambda.min
## [1] 0.00852

# Fit the final model on the training data

model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)

# Display regression coefficients
coef(model)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                    s0
## (Intercept)  36.90539
## crim         -0.09222
## zn            0.04842
## indus        -0.00841
## chas          2.28624
## nox         -16.79651
## rm            3.81186
## age           .
## dis          -1.59603
## rad           0.28546
## tax          -0.01240
## ptratio      -0.95041
## black         0.00965
## lstat        -0.52880

# Make predictions on the test data

x.test <- model.matrix(medv ~., test.data)[,-1]

predictions <- model %>% predict(x.test) %>% as.vector()

# Model performance metrics

data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
##   RMSE Rsquare
## 1 4.99   0.671

Computing elastic net regression

The elastic net regression can be easily computed using the caret workflow, which invokes the glmnet package. We use caret to automatically select the best tuning parameters alpha and lambda. The caret package tests a range of possible alpha and lambda values, then selects the best values, resulting in a final model that is an elastic net model.

Here, we’ll test the combination of 10 different values for alpha and lambda. This is specified using the option tuneLength. The best alpha and lambda values are those values that minimize the cross-validation error (Chapter @ref(cross-validation)).

# Build the model using the training set

set.seed(123)

model <- train(

medv ~., data = train.data, method = "glmnet",

trControl = trainControl("cv", number = 10),

tuneLength = 10

)

# Best tuning parameter
model$bestTune
##   alpha lambda
## 6   0.1   0.21

# Coefficient of the final model. You need
# to specify the best lambda
coef(model$finalModel, model$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept)  33.04083
## crim         -0.07898
## zn            0.04136
## indus        -0.03093
## chas          2.34443
## nox         -14.30442
## rm            3.90863
## age           .
## dis          -1.41783
## rad           0.20564
## tax          -0.00879
## ptratio      -0.91214
## black         0.00946
## lstat        -0.51770

# Make predictions on the test data

x.test <- model.matrix(medv ~., test.data)[,-1]

predictions <- model %>% predict(x.test)

# Model performance metrics

data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
##   RMSE Rsquare
## 1 4.98   0.672

Comparing the different models

The performance metrics of the different models are comparable. The lasso and elastic net regressions set the coefficient of the predictor variable age to zero, leading to a simpler model than ridge regression, which includes all predictor variables.

All else being equal, we should go for the simpler model. In our example, we can choose the lasso or the elastic net regression model.

Note that, we can easily compute and compare ridge, lasso and elastic net regression using the caret workflow. caret will automatically choose the best tuning parameter values, compute the final model and evaluate the model performance using cross-validation techniques.

Using caret package

Setup a grid range of lambda values:

lambda <- 10^seq(-3, 3, length = 100)

1. Compute ridge regression:

# Build the model

set.seed(123)

ridge <- train(

medv ~., data = train.data, method = "glmnet",

trControl = trainControl("cv", number = 10),

tuneGrid = expand.grid(alpha = 0, lambda = lambda)

)

# Model coefficients

coef(ridge$finalModel, ridge$bestTune$lambda)

# Make predictions

predictions <- ridge %>% predict(test.data)

# Model prediction performance

data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)

2. Compute lasso regression:

# Build the model

set.seed(123)

lasso <- train(

medv ~., data = train.data, method = "glmnet",

trControl = trainControl("cv", number = 10),

tuneGrid = expand.grid(alpha = 1, lambda = lambda)

)

# Model coefficients

coef(lasso$finalModel, lasso$bestTune$lambda)

# Make predictions

predictions <- lasso %>% predict(test.data)

# Model prediction performance

data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)

3. Elastic net regression:

# Build the model

set.seed(123)

elastic <- train(

medv ~., data = train.data, method = "glmnet",

trControl = trainControl("cv", number = 10),

tuneLength = 10

)

# Model coefficients

coef(elastic$finalModel, elastic$bestTune$lambda)

# Make predictions

predictions <- elastic %>% predict(test.data)

# Model prediction performance

data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)

4. Comparing models performance:

The performance of the different models - ridge, lasso and elastic net - can be easily compared using caret. The best model is defined as the one that minimizes the prediction error.

models <- list(ridge = ridge, lasso = lasso, elastic = elastic)
resamples(models) %>% summary(metric = "RMSE")
##
## Call:
## summary.resamples(object = ., metric = "RMSE")
##
## Models: ridge, lasso, elastic
## Number of resamples: 10
##
## RMSE
##         Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ridge   3.10    3.96   4.38 4.73    5.52 7.43    0
## lasso   3.16    4.03   4.39 4.73    5.51 7.27    0
## elastic 3.13    4.00   4.37 4.72    5.52 7.32    0

It can be seen that the elastic net model has the lowest median RMSE.

Discussion

In this chapter we described the most commonly used penalized regression methods, including ridge regression, lasso regression and elastic net regression. These methods are very useful when you have large multivariate data sets.

References

Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.

A guide to Ridge, Lasso, and Elastic Net Regression and applying it in R

https://hackernoon.com/an-introduction-to-ridge-lasso-and-elastic-net-regression-cca60b4b934f

Regression analysis is a statistical technique that models and approximates the relationship between a dependent and one or more independent variables. This article will quickly introduce three commonly used regression models using R and the Boston housing data-set: Ridge, Lasso, and Elastic Net.

First we need to understand the basics of regression and what parameters of the equation are changed when using a specific model. Simple linear regression, also known as ordinary least squares (OLS), attempts to minimize the sum of squared errors. The error in this case is the difference between the actual data point and its predicted value.

Visualization of the squared error (from Setosa.io)

The equation for this model is referred to as the cost function and is a way to find the optimal error by minimizing and measuring it. The gradient descent algorithm is used to find the optimal cost function by going over a number of iterations.

But the data we need to define and analyze is not always so easy to characterize with the base OLS model.

Equation for ordinary least squares

One situation is the data showing multi-collinearity: this is when predictor variables are correlated to each other and to the response variable. To picture this, let's say we're doing a study that looks at a response variable (patient weight), and our predictor variables would be height, sex, and diet. The problem here is that height and sex are also correlated, which can inflate the standard error of their coefficients and may make them seem statistically insignificant.

To produce a more accurate model of complex data we can add a penalty term to the OLS equation. A penalty adds a bias towards certain values. These are known as L1 regularization (lasso regression) and L2 regularization (ridge regression). The best model we can hope to come up with minimizes both the bias and the variance:

Variance/bias trade off (KDnuggets.com)

Ridge Regression

Ridge regression uses L2 regularization, which adds the following penalty term to the OLS equation.

L2 regularization penalty term

The L2 term is equal to the square of the magnitude of the coefficients. In this case if lambda (λ) is zero then the equation is the basic OLS, but if it is greater than zero then we add a constraint to the coefficients. This constraint results in minimized coefficients (aka shrinkage) that trend towards zero the larger the value of lambda. Shrinking the coefficients leads to a lower variance and in turn a lower error value. Therefore ridge regression decreases the complexity of a model but does not reduce the number of variables; it merely shrinks their effect.

Lasso regression

Lasso regression uses the L1 penalty term and stands for Least Absolute Shrinkage and Selection Operator. The penalty applied for L1 is equal to the absolute value of the magnitude of the coefficients:

L1 regularization penalty term

Similar to ridge regression, a lambda value of zero spits out the basic OLS equation; however, given a suitable lambda value, lasso regression can drive some coefficients to zero. The larger the value of lambda, the more features are shrunk to zero. This can eliminate some features entirely and give us a subset of predictors that helps mitigate multi-collinearity and model complexity. Predictors not shrunk towards zero signify that they are important, and thus L1 regularization allows for feature selection (sparse selection).
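As an illustrative sketch of that behaviour (using the x and y training objects created in the "Preparing the data" step further below), the lasso coefficient-path plot shows more coefficients hitting exactly zero as lambda grows:

# Fit the full lasso path over a grid of lambda values and plot the
# coefficient paths against log(lambda); labels identify the predictors.
library(glmnet)
fit.lasso <- glmnet(x, y, alpha = 1)
plot(fit.lasso, xvar = "lambda", label = TRUE)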

Elastic Net

A third commonly used model of regression is the Elastic Net, which incorporates penalties from both L1 and L2 regularization:

Elastic net regularization

In addition to setting and choosing a lambda value, elastic net also allows us to tune the alpha parameter, where α = 0 corresponds to ridge and α = 1 to lasso. Simply put, if you plug in 0 for alpha, the penalty function reduces to the L2 (ridge) term, and if we set alpha to 1 we get the L1 (lasso) term. Therefore we can choose an alpha value between 0 and 1 to optimize the elastic net. Effectively this will shrink some coefficients and set some to 0 for sparse selection.

Preparing the data

We will be using the following packages:

library(tidyverse)
library(caret)
library(glmnet)

We'll also be using R's built-in Boston housing market data set as it has many predictor variables.

data("Boston", package = "MASS")

# set a seed so you can reproduce the results
set.seed(1212)

# split the data into training and test data
sample_size <- floor(0.75 * nrow(Boston))
training_index <- sample(seq_len(nrow(Boston)), size = sample_size)
train <- Boston[training_index, ]
test <- Boston[-training_index, ]

We also should create two objects to store the predictor (x) and response (y, median value) variables.

# Predictor
x <- model.matrix(medv ~ ., train)[, -1]
# Response
y <- train$medv

Performing Ridge regression

As we mentioned in the previous sections, lambda values have a large effect on coefficients, so now we will compute and choose a suitable one.

Here we perform a cross validation and take a peek at the lambda value corresponding to the lowest prediction error before fitting the data to the model and viewing the coefficients.

cv.r <- cv.glmnet(x, y, alpha = 0)
cv.r$lambda.min
model.ridge <- glmnet(x, y, alpha = 0, lambda = cv.r$lambda.min)
coef(model.ridge)

We can see here that certain coefficients have been pushed towards zero and minimized while RM (number of rooms) has a significantly higher weight than the rest.

Ridge regression coefficients

We now look at how our model performs by using our test data on it.

x.test.ridge <- model.matrix(medv ~ ., test)[, -1]
predictions.ridge <- model.ridge %>% predict(x.test.ridge) %>% as.vector()
data.frame(
  RMSE.r = RMSE(predictions.ridge, test$medv),
  Rsquare.r = R2(predictions.ridge, test$medv))

RMSE = 4.8721 and R² = 0.7205

Performing Lasso regression

The steps are identical to what we did for ridge regression. The value of alpha is the only change here (remember, α = 1 denotes lasso).

cv.l <- cv.glmnet(x, y, alpha = 1)
cv.l$lambda.min
model.lasso <- glmnet(x, y, alpha = 1, lambda = cv.l$lambda.min)
coef(model.lasso)

x.test.lasso <- model.matrix(medv ~ ., test)[, -1]
predictions.lasso <- model.lasso %>% predict(x.test.lasso) %>% as.vector()
data.frame(
  RMSE.l = RMSE(predictions.lasso, test$medv),
  Rsquare.l = R2(predictions.lasso, test$medv))

RMSE = 4.8494 and R² = 0.7223

Performing Elastic Net regression

Performing Elastic Net requires us to tune parameters to identify the best alpha and lambda values, and for this we need to use the caret package. We will tune the model by iterating over a number of alpha and lambda pairs and we can see which pair has the lowest associated error.

model.net <- train(
  medv ~ ., data = train, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10)
model.net$bestTune
coef(model.net$finalModel, model.net$bestTune$lambda)
x.test.net <- model.matrix(medv ~ ., test)[, -1]
predictions.net <- model.net %>% predict(x.test.net)
data.frame(
  RMSE.net = RMSE(predictions.net, test$medv),
  Rsquare.net = R2(predictions.net, test$medv))

RMSE = 4.8523 and R² = 0.7219

Conclusion

We can see that the root-mean-squared error values for all three models were very close to each other, but lasso and elastic net both performed marginally better than ridge regression (lasso having done best). Lasso regression also showed the highest R² value.

Choosing the Right Metric for Machine Learning Models (Part 1)

https://medium.com/usf-msds/choosing-the-right-metric-for-machine-learning-models-part-1-a99d7d7414e4

Most Useful Metrics

In the first blog, we will cover metrics in regression only.

Regression Metrics

Most blogs have focused on classification metrics like precision, recall, AUC, etc. For a change, I wanted to explore all kinds of metrics, including those used in regression. MAE and RMSE are the two most popular metrics for continuous variables. Let's start with the more popular one.

RMSE (Root Mean Square Error)

It represents the sample standard deviation of the differences between predicted values and observed values (called residuals). Mathematically, it is calculated using this formula:
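In symbols (the standard definition, with y_i the observed and ŷ_i the predicted value over n observations):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$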

MAE

MAE is the average of the absolute differences between the predicted values and the observed values. MAE is a linear score, which means that all the individual differences are weighted equally in the average. For example, the difference between 10 and 0 will be twice the difference between 5 and 0. However, the same is not true for RMSE, which we will discuss in more detail further on. Mathematically, it is calculated using this formula:
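In symbols (same notation as above):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$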

So which one should you choose and why?

Well, it is easy to understand and interpret MAE because it directly takes the average of offsets, whereas RMSE penalizes the higher differences more than MAE. Let's understand the above statement with two examples:

Case 1: Actual Values = [2,4,6,8], Predicted Values = [4,6,8,10]
Case 2: Actual Values = [2,4,6,8], Predicted Values = [4,6,8,12]

MAE for case 1 = 2.0, RMSE for case 1 = 2.0
MAE for case 2 = 2.5, RMSE for case 2 = 2.65

From the above example, we can see that RMSE penalizes the last value prediction more heavily than MAE. Generally, RMSE will be higher than or equal to MAE. The only case where it equals MAE is when all the differences are equal or zero (true for case 1, where the difference between actual and predicted is 2 for all observations). However, even after being more complex and biased towards higher deviations, RMSE is still the default metric of many models, because a loss function defined in terms of RMSE is smoothly differentiable and makes it easier to perform mathematical operations.

Though this may not sound very pleasing, it is a very important reason and makes it very popular. I will try to explain the above logic mathematically.

Let's take a simple linear model in one variable: y = mx + b. Here, we are trying to find "m" and "b", and we are provided with data (x, y). If we define the loss function (J) in terms of RMSE, then we can easily differentiate J with respect to m and b and get the updated m and b (this is how gradient descent works; I won't be explaining it here).
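As a worked form of that loss (here the squared-error version; taking the square root does not change where the minimum lies) and the gradients that gradient descent follows:

$$J(m, b) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - (m x_i + b)\big)^2$$

$$\frac{\partial J}{\partial m} = -\frac{2}{n}\sum_{i=1}^{n} x_i\big(y_i - (m x_i + b)\big), \qquad \frac{\partial J}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}\big(y_i - (m x_i + b)\big)$$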

https://spin.atomicobject.com/wp-content/uploads/linear_regression_gradient1.png

The above equations are simpler to solve and the same won’t apply for MAE.

However, if you want a metric just to compare two models from an interpretation point of view, then I think MAE is a better choice. It is important to note that the units of both RMSE and MAE are the same as the y values, which is not true for R squared. The range of RMSE and MAE is from 0 to infinity.

Edit: One important distinction between MAE and RMSE that I forgot to mention earlier is that minimizing the squared error over a set of numbers results in finding its mean, and minimizing the absolute error results in finding its median. This is the reason why MAE is robust to outliers whereas RMSE is not.

R Squared (R²) and Adjusted R Squared

R Squared & Adjusted R Squared are often used for explanatory purposes and explains how well your selected independent variable(s) explain the variability in your dependent variable(s). Both these metrics are quite misunderstood and therefore I would like to clarify them first before going through their pros and cons.

Mathematically, R_Squared is given by:
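In symbols (written so that the numerator is the MSE and the denominator is the variance of y, matching the description below):

$$R^2 = 1 - \frac{\tfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\tfrac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2} = 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(y)}$$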

The numerator is the MSE (the average of the squares of the residuals) and the denominator is the variance in the Y values. The higher the MSE, the smaller the R_squared and the poorer the model.

Adjusted R²

Just like R², adjusted R² also shows how well terms fit a curve or line but adjusts for the number of terms in a model. It is given by below formula:
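In symbols (standard formula, with n and k as defined just below):

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$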

where n is the total number of observations and k is the number of predictors. Adjusted R² will always be less than or equal to R².

Why should you choose Adjusted R² over R²? There are some problems with normal R² which are solved by Adjusted R². An adjusted R² will consider the marginal improvement added by an additional term in your model. So it will increase if you add the useful terms and it will decrease if you add less useful predictors. However, R² increases with increasing terms even though the model is not actually improving. It will be easier to understand this with an example.

Here, Case 1 is the simple case where we have 5 observations of (x, y). In Case 2, we have one more variable, which is twice variable 1 (i.e., perfectly correlated with var1). In Case 3, we have produced a slight disturbance in var2 such that it is no longer perfectly correlated with var1.

So if we fit a simple ordinary least squares (OLS) model for each case, logically we are not providing any extra or useful information to Case 2 and Case 3 with respect to Case 1. So our metric value should not improve for these models. However, this is actually not true for R², which gives a higher value for models 2 and 3. But adjusted R² takes care of this problem; it actually decreases for both Cases 2 and 3. Let's give some numbers to these variables (x, y) and look at the results obtained in Python.

Note: The predicted values will be the same for both model 1 and model 2, and therefore R² will also be the same, because it depends only on the predicted and actual values.

From the above table, we can see that even though we are not adding any additional information from Case 1 to Case 2, R² is still increasing, whereas adjusted R² shows the correct trend (penalizing model 2 for having more variables).

Comparison of Adjusted R² over RMSE

For the previous example, we will see that RMSE is the same for Case 1 and Case 2, similar to R². This is a case where adjusted R² does a better job than RMSE, whose scope is limited to comparing predicted values with actual values. Also, the absolute value of RMSE does not actually tell how bad a model is; it can only be used to compare across two models, whereas adjusted R² easily does that. For example, if a model has an adjusted R² equal to 0.05 then it is definitely poor.

However, if you care only about prediction accuracy then RMSE is best. It is computationally simple, easily differentiable and present as the default metric for most models.

Common Misconception: I have often seen on the web the claim that the range of R² lies between 0 and 1, which is not actually true. The maximum value of R² is 1, but the minimum can be negative infinity. Consider the case where the model predicts a highly negative value for all the observations even though y_actual is positive. In this case, R² will be less than 0. This is a highly unlikely scenario, but the possibility still exists.

Regression Analysis: Lasso, Ridge, and Elastic Net

https://medium.com/@yongddeng/regression-analysis-lasso-ridge-and-elastic-net-9e65dc61d6d3

1. Introduction

A machine learning model can overcome underfitting by adding more parameters, although its complexity increases and it will require more effort to interpret.[1] However, a real dilemma for a data scientist is that minimising the prediction error, which decomposes into bias and variance components, can easily turn into an overfitting problem. Lasso, Ridge, and Elastic Net are popular regularised statistical modelling approaches; they are the topic of this article and will be discussed both mathematically and computationally.

1.1 Motivation

Consider a linear model Y = Xβ + ε, where εᵢ ∼ N(0, σ²) and i = 1, …, n. The parameters β₀, β₁, …, βₚ are typically estimated from a sample of n observations using the OLS criterion. If the true relationship between Y and X₁, X₂, …, Xₚ is approximately linear, the estimates will have low bias. However,

• if n >> p, they will tend to have low variance

• if n is only modestly larger than p, there can be high variability in the estimates

• if n < p, the OLS solution is not even unique. Proof: β̂ = (XᵀX)⁻¹XᵀY is valid if and only if the matrix XᵀX is invertible, which is not the case when n < p, since the p×p matrix XᵀX then has rank at most n.

The sought-after sweet spot is the point where the decrease in bias equals the increase in variance. In practice, there is no analytical way to find this point, but still, some methods are more commonly used than others.[2]

Lasso and Ridge are two distinct forms of regularisation, a.k.a. shrinkage methods, a type of regression that constrains/regularises, i.e. shrinks, the coefficient estimates towards zero. This technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. The key difference between the two techniques is that Lasso shrinks the less important features' coefficients to zero and thus does the job of feature selection, whilst Ridge will never make them exactly zero. Therefore, the final model deduced via Ridge will include all predictors.[3]

1.2 Loss Function

This function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. An objective function is either a loss function or its negative, in which case it is to be maximised.

In regression analysis, the typical loss function is the residual sum of squares, RSS = ∑ᵢ eᵢ² = ∑ᵢ (yᵢ − f(xᵢ))². The coefficients are chosen so that the expected loss E[L(Y, f(X))] is minimised.[4]

Let the squared loss be L(Y, f(X)) = (Y − f(X))²; then the optimal predictor is f*(X) = argmin E[Y − f(X)]² = E[Y|X], which is the regression function. Moreover, the expected loss is approximated by the empirical loss RSS(β)/N, where RSS(β) = ∑ᵢ (yᵢ − f(xᵢ))² = ∑ᵢ (yᵢ − β₀ − ∑ⱼ xᵢⱼβⱼ)².[5]

1.3 Lagrangian Multiplier

In addition to the loss function, the Lagrangian multiplier is a strategy for finding the local maxima and minima of a function subject to equality constraints. The method can be summarised as follows:

1. isolate any possible singular point of the solution set of the constraining equations,

2. find all the stationary points of the Lagrange function,

3. establish which of those stationary points and singular points are global minima (or maxima, in case of maximisation problems) of the objective function.

It allows the optimisation to be solved without explicit parameterisation in terms of the constraints. This strong property makes the method of Lagrange multipliers widely used for challenging constrained optimisation problems.[6]

2. Definitions

The first chapter highlighted the two most important features of the method: regularisation 1) tends to reduce the variability of the estimates, hence improving the model's stability, and 2) can go as far as to set some of the coefficients to zero, thus also allowing for variable selection.

2.1 Regularisation

A traditional feature selection method like stepwise regression works well with a small set of features, but regularisation is a great alternative when a large set of features is given. The loss function is extended with a shrinkage penalty, with the objective of regularising the coefficients through a tuning parameter lambda (λ).[7]

For every value of λ, there is a value of s that yields an equivalent constrained formulation of the overall objective function with a penalty factor:
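A standard statement of that equivalence, in symbols: the constrained problems

$$\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le s \ \text{(LASSO)} \quad \text{or} \quad \sum_{j=1}^{p}\beta_j^2 \le s \ \text{(Ridge)}$$

are, via the Lagrange multiplier method of Section 1.3, equivalent to minimising the RSS plus the penalty term λ∑ⱼ|βⱼ| (LASSO) or λ∑ⱼβⱼ² (Ridge) for a corresponding value of λ.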

Minimising a penalised version of the least squares loss function as in LASSO (L1-norm penalty) and Ridge Regression (L2-norm penalty) may be better understood by geometrical descriptions.

2.2 LASSO (L1)

Least Absolute Shrinkage and Selection Operator minimises the least squares criterion with an additional penalty/regularisation term. Due to the L1-norm, some of the coefficients are likely to be set exactly to zero, depending on the regularisation parameter λ, which needs to be chosen/tuned by cross-validation, discussed later.[8]

2.3 Ridge Regression (L2)

Ridge regression, also known as Tikhonov regularisation, seeks the coefficients that minimise the penalised (regularised) RSS for a given λ. As the L2 norm is differentiable, learning problems using this method can be solved by gradient descent.

2.4 Elastic Net (L1 + L2)

The method linearly combines the L1 and L2 penalties of the LASSO and Ridge Regression. Including the Elastic Net, these methods are especially powerful when applied to very large data where the number of variables might be in the thousands or even millions. In mathematics, sparse and dense often refer to the number of zero vs. non-zero elements in an array (e.g. a vector or matrix). A sparse array is one that contains mostly zeros and few non-zero entries, and a dense array contains mostly non-zeros. LASSO and Ridge encourage sparse and dense models, respectively, but since it is never that clear what the true model looks like, it is typical to apply both methods and determine the best one.[9]

3. Examples

Step by step: 1) build a model with a training dataset, 2) apply the shrinkage method, 3) find the optimal value of the tuning parameter by using cross-validation, 4) reduce the variability and enjoy the benefits of variable selection if needed, and 5) verify the final model with a test dataset.

Linear, LASSO, and Ridge models are presented in that order, and adjusted R² will be closely looked at as a measure of accuracy. The code below runs in R.

3.1 Dataset

## Split data
wine_quality <- read.csv("winequality-red.csv", header = TRUE, sep = ";", check.names = FALSE)
names(wine_quality) <- gsub(" ", "_", names(wine_quality))
set.seed(123)
numrow <- nrow(wine_quality)
train_idx <- sample(1:numrow, size = as.integer(0.7*numrow))
train_data <- wine_quality[train_idx, ]
test_data <- wine_quality[-train_idx, ]

xvars <- c("fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar", "chlorides",
           "free_sulfur_dioxide", "total_sulfur_dioxide", "density")
yvar <- c("quality")
x_train <- as.matrix(train_data[, xvars])
x_test <- as.matrix(test_data[, xvars])
y_train <- as.double(as.matrix(train_data[, yvar]))

3.2 Building Model

## Linear Regression
frmla <- paste(yvar, "~", paste(xvars, collapse = "+"))
lm_fit <- lm(as.formula(frmla), data = train_data)
print(summary(lm_fit))

wine_v2 <- train_data[, xvars]
# VIF
print(vif(wine_v2))

pred_y <- predict(lm_fit, newdata = test_data)
# test accuracy
R2 <- 1 - (sum((test_data[, yvar] - pred_y)^2) / sum((test_data[, yvar] - mean(test_data[, yvar]))^2))
print(paste("Test Adjusted R-squared :", R2))

3.3 L1-Regularisation

## LASSO
print(paste("LASSO"))
lambdas <- c(1e-4, 1e-3, 1e-2, 0.1, 0.5, 1.0, 5.0, 10.0)
initrsq <- 0
for(lmbd in lambdas){
  lasso_fit = glmnet(x_train, y_train, alpha = 1, lambda = lmbd)
  pred_y = predict(lasso_fit, x_test)
  R2 = 1 - (sum((test_data[, yvar] - pred_y)^2) / sum((test_data[, yvar] - mean(test_data[, yvar]))^2))
  if (R2 > initrsq){
    print(paste("Lambda:", lmbd, "Test Adjusted R-squared :", round(R2, 4)))
    initrsq = R2
  }
}

3.4 L2-Regularisation

## Ridge Regression
print(paste("Ridge Regression"))
lambdas <- c(1e-4, 1e-3, 1e-2, 0.1, 0.5, 1.0, 5.0, 10.0)
initrsq <- 0
for(lmbd in lambdas){
  ridge_fit = glmnet(x_train, y_train, alpha = 0, lambda = lmbd)
  pred_y = predict(ridge_fit, x_test)
  R2 = 1 - (sum((test_data[, yvar] - pred_y)^2) / sum((test_data[, yvar] - mean(test_data[, yvar]))^2))
  if (R2 > initrsq){
    print(paste("Lambda:", lmbd, "Test Adjusted R-squared :", round(R2, 4)))
    initrsq = R2
  }
}

3.5 Tuning Parameter

There exist several methods of cross-validation for choosing the optimal value of λ.

1. Leave-One-Out

2. K-Fold

Logic of K-Fold

The K-Fold has been chosen and applied to L2-Regularisation as an example.

ridge_fit <- glmnet(x_train, y_train, alpha = 0)
ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_cv$lambda.min
head(coef(ridge_cv, s = "lambda.min"))

3.6 Visualisations

More details about interpretations of the plots can be found in the following reference.[10]

par(mfrow = c(1,2))
plot(ridge_fit, xvar = "lambda", label = TRUE)
plot(ridge_cv)

4. Pros and Cons

To wrap up what has been discussed, important facts are again listed in bullet points.[11]

4.1 Summary

• Regularisation works well when there is high variance in the least squares estimates.

• Recall the Lagrange multiplier method for finding a global minimum among many local minima.

• Ridge only shrinks a variable's coefficient, depending on its importance to model accuracy, rather than setting it to 0, so unnecessary variables still remain.

• LASSO, for the above reason, is far cheaper computationally than best subset selection, which requires fitting 2^p models.

4.2 Conclusions

I have tried to give at least a glance at regression analysis across the last three articles. I have felt the limits of my own academic knowledge in these writings, but I hope those of you who have reached this point find the work useful in some way.

A Complete Tutorial on Ridge and Lasso Regression in Python

https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/

Introduction

When we talk about regression, we often end up discussing linear and logistic regression. But that's not the end. Do you know there are 7 types of regression? Linear and logistic regression are just the most loved members of the family of regressions. Last week, I saw a recorded talk at NYC Data Science Academy from Owen Zhang, Chief Product Officer at DataRobot. He said, 'if you are using regression without regularization, you have to be very special!'. I hope you get what a person of his stature referred to. I understood it very well and decided to explore regularization techniques in detail.

In this article, I have explained the complex science behind 'Ridge Regression' and 'Lasso Regression', which are the most fundamental regularization techniques used in data science, sadly still not used by many. The overall idea of regression remains the same. It's the way in which the model coefficients are determined which makes all the difference. I strongly encourage you to go through multiple regression before reading this. You can take help from this article or any other preferred material.

Table of Contents

1. Brief Overview – How are Ridge and Lasso regression different?
2. Why Penalize the Magnitude of Coefficients – Why should they work?
3. Ridge Regression – How ridge works?
4. Lasso Regression – How lasso works?
5. Sneak Peak into Mathematics (Optional) – Some underlying mathematical principles.
6. Conclusion – Comparing Ridge and Lasso Regression

1. Brief Overview

Ridge and Lasso regression are powerful techniques generally used for creating parsimonious models in the presence of a 'large' number of features. Here 'large' can typically mean either of two things:

1. Large enough to enhance the tendency of a model to overfit (as few as 10 variables might cause overfitting)
2. Large enough to cause computational challenges. With modern systems, this situation might arise in the case of millions or billions of features.

Though Ridge and Lasso might appear to work towards a common goal, the inherent properties and practical use cases differ substantially. If you've heard of them before, you must know that they work by penalizing the magnitude of coefficients of features along with minimizing the error between predicted and actual observations. These are called 'regularization' techniques. The key difference is in how they assign a penalty to the coefficients:

1. Ridge Regression:
   o Performs L2 regularization, i.e. adds penalty equivalent to square of the magnitude of coefficients
   o Minimization objective = LS Obj + α * (sum of square of coefficients)
2. Lasso Regression:
   o Performs L1 regularization, i.e. adds penalty equivalent to absolute value of the magnitude of coefficients
   o Minimization objective = LS Obj + α * (sum of absolute value of coefficients)

Note that here ‘LS Obj’ refers to ‘least squares objective’, i.e. the linear regression objective without regularization. If terms like ‘penalty’ and ‘regularization’ seem very unfamiliar to you, don’t worry we’ll talk about these in more detail through the course of this article. Before digging further into how they work, lets try to get some intuition into why penalizing the magnitude of coefficients should work in the first place.

2. Why Penalize the Magnitude of Coefficients?

Let's try to understand the impact of model complexity on the magnitude of coefficients. As an example, I have simulated a sine curve (between 60° and 300°) and added some random noise using the following code:

#Importing libraries. The same will be used throughout the article.

import numpy as np

import pandas as pd

import random

import matplotlib.pyplot as plt

%matplotlib inline

from matplotlib.pylab import rcParams

rcParams['figure.figsize'] = 12, 10

#Define input array with angles from 60deg to 300deg converted to radians

x = np.array([i*np.pi/180 for i in range(60,300,4)])

np.random.seed(10)  #Setting seed for reproducibility

y = np.sin(x) + np.random.normal(0,0.15,len(x))

data = pd.DataFrame(np.column_stack([x,y]),columns=['x','y'])

plt.plot(data['x'],data['y'],'.')

The input-output looks like:

This resembles a sine curve but not exactly because of the noise. We’ll use this as an example to test different scenarios in this article. Let’s try to estimate the sine function using polynomial regression with powers of x from 1 to 15. Let’s add a column for each power up to 15 in our dataframe. This can be accomplished using the following code:

for i in range(2,16):  #power of 1 is already there
    colname = 'x_%d'%i  #new var will be x_power
    data[colname] = data['x']**i

print(data.head())

The dataframe looks like:

Now that we have all 15 powers, let's make 15 different linear regression models, with each model containing variables with powers of x from 1 up to the particular model number. For example, the feature set of model 8 will be {x, x_2, x_3, …, x_8}. First, we'll define a generic function which takes in the required maximum power of x as an input and returns a list containing [model RSS, intercept, coef_x, coef_x2, … up to the entered power]. Here RSS refers to 'Residual Sum of Squares', which is nothing but the sum of squared errors between the predicted and actual values in the training data set. The Python code defining the function is:

#Import Linear Regression model from scikit-learn.

from sklearn.linear_model import LinearRegression

def linear_regression(data, power, models_to_plot):
    #initialize predictors:
    predictors=['x']
    if power>=2:
        predictors.extend(['x_%d'%i for i in range(2,power+1)])

    #Fit the model
    linreg = LinearRegression(normalize=True)
    linreg.fit(data[predictors],data['y'])
    y_pred = linreg.predict(data[predictors])

    #Check if a plot is to be made for the entered power
    if power in models_to_plot:
        plt.subplot(models_to_plot[power])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for power: %d'%power)

    #Return the result in pre-defined format
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([linreg.intercept_])
    ret.extend(linreg.coef_)
    return ret

Note that this function will not plot the model fit for all the powers but will return the RSS and coefficients for all the models. I’ll skip the details of the code for now to maintain brevity. I’ll be happy to discuss the same through comments below if required. Now, we can make all 15 models and compare the results. For ease of analysis, we’ll store all the results in a Pandas dataframe and plot 6 models to get an idea of the trend. Consider the following code:

#Initialize a dataframe to store the results:

col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]

ind = ['model_pow_%d'%i for i in range(1,16)]

coef_matrix_simple = pd.DataFrame(index=ind, columns=col)

#Define the powers for which a plot is required:

models_to_plot = {1:231,3:232,6:233,9:234,12:235,15:236}

#Iterate through all powers and assimilate results

for i in range(1,16):
    coef_matrix_simple.iloc[i-1,0:i+2] = linear_regression(data, power=i, models_to_plot=models_to_plot)

We would expect the models with increasing complexity to better fit the data and result in lower RSS values. This can be verified by looking at the plots generated for 6 models:

This clearly aligns with our initial understanding. As the model complexity increases, the models tend to fit even smaller deviations in the training data set. Though this leads to overfitting, let's keep this issue aside for some time and come to our main objective, i.e. the impact on the magnitude of the coefficients. This can be analysed by looking at the data frame created above. Python code:

#Set the display format to be scientific for ease of analysis

pd.options.display.float_format = '{:,.2g}'.format

coef_matrix_simple

The output looks like:

It is clearly evident that the size of the coefficients increases exponentially with increasing model complexity. I hope this gives some intuition into why putting a constraint on the magnitude of coefficients can be a good idea to reduce model complexity. Let's try to understand this even better. What does a large coefficient signify? It means that we're putting a lot of emphasis on that feature, i.e. the particular feature is a good predictor for the outcome. When it becomes too large, the algorithm starts modelling intricate relations to estimate the output and ends up overfitting to the particular training data. I hope the concept is clear. I'll be happy to discuss further in comments if needed. Now, let's understand ridge and lasso regression in detail and see how well they work for the same problem.

3. Ridge Regression

As mentioned before, ridge regression performs 'L2 regularization', i.e. it adds a factor of the sum of squares of the coefficients to the optimization objective. Thus, ridge regression optimizes the following:

Objective = RSS + α * (sum of squares of coefficients)

Here, α (alpha) is the parameter which balances the amount of emphasis given to minimizing the RSS vs. minimizing the sum of squares of the coefficients. α can take various values:

1. α = 0:
   o The objective becomes the same as simple linear regression.
   o We'll get the same coefficients as simple linear regression.
2. α = ∞:
   o The coefficients will be zero. Why? Because of the infinite weight on the squares of the coefficients, any non-zero coefficient makes the objective infinite.
3. 0 < α < ∞:
   o The magnitude of α decides the weight given to the different parts of the objective.
   o The coefficients will lie somewhere between 0 and those of simple linear regression.

I hope this gives some sense of how α impacts the magnitude of the coefficients. One thing is for sure: any non-zero value will give coefficient values smaller than those of simple linear regression. By how much? We'll find out soon. Leaving the mathematical details for later, let's see ridge regression in action on the same problem as above.
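As a quick, self-contained illustration of these α cases, the following sketch fits ordinary least squares and two ridge models on toy data (the data, alpha values and variable names here are purely illustrative, not the dataset used in this article):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

#Toy data for illustration only
rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = X.dot(np.array([3.0, -2.0, 0.5, 0.0, 1.0])) + 0.1 * rng.randn(50)

ols = LinearRegression().fit(X, y)
ridge_tiny = Ridge(alpha=1e-8).fit(X, y)   #alpha ~ 0: essentially the OLS coefficients
ridge_big = Ridge(alpha=1e4).fit(X, y)     #large alpha: coefficients shrunk heavily towards zero

print(ols.coef_)
print(ridge_tiny.coef_)
print(ridge_big.coef_)

With α close to zero the ridge coefficients match ordinary least squares almost exactly, while a very large α pushes them towards (but not exactly to) zero.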


First, let's define a generic function for ridge regression, similar to the one defined for simple linear regression. The Python code is:

from sklearn.linear_model import Ridge

def ridge_regression(data, predictors, alpha, models_to_plot={}):
    #Fit the model (note: 'normalize' was removed in scikit-learn 1.2;
    #on newer versions, standardize the predictors beforehand instead)
    ridgereg = Ridge(alpha=alpha, normalize=True)
    ridgereg.fit(data[predictors], data['y'])
    y_pred = ridgereg.predict(data[predictors])

    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'], y_pred)
        plt.plot(data['x'], data['y'], '.')
        plt.title('Plot for alpha: %.3g' % alpha)

    #Return the result in pre-defined format
    rss = sum((y_pred - data['y'])**2)
    ret = [rss]
    ret.extend([ridgereg.intercept_])
    ret.extend(ridgereg.coef_)
    return ret

Note the 'Ridge' function used here. It takes 'alpha' as a parameter on initialization. Also, keep in mind that normalizing the inputs is generally a good idea in every type of regression and should be used in the case of ridge regression as well (on recent scikit-learn versions this means standardizing the features yourself, since the normalize argument has been removed). Now, let's analyze the result of ridge regression for 10 different values of α ranging from 1e-15 to 20. These values have been chosen so that we can easily analyze the trend with changing values of α; they would, however, differ from case to case. Note that each of these 10 models will contain all 15 variables and only the value of alpha will differ. This is different from the simple linear regression case, where each model had a subset of features. Python Code:

#Initialize predictors to be the set of 15 powers of x
predictors = ['x']
predictors.extend(['x_%d'%i for i in range(2,16)])


#Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20]

#Initialize the dataframe for storing coefficients.
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_ridge[i] for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)

models_to_plot = {1e-15:231, 1e-10:232, 1e-4:233, 1e-3:234, 1e-2:235, 5:236}

for i in range(10):
    coef_matrix_ridge.iloc[i,] = ridge_regression(data, predictors, alpha_ridge[i], models_to_plot)

This would generate the following plot:


Here we can clearly observe that as the value of alpha increases, the model complexity reduces. Though higher values of alpha reduce overfitting, significantly high values can cause underfitting as well (e.g. alpha = 5). Thus alpha should be chosen wisely. A widely accepted technique is cross-validation: alpha is iterated over a range of values and the one giving the highest cross-validation score is chosen. Let's have a look at the values of the coefficients in the above models: Python Code:

#Set the display format to be scientific for ease of analysis

pd.options.display.float_format = '{:,.2g}'.format

coef_matrix_ridge

The table looks like:

This straight away gives us the following inferences:

1. The RSS increases with increasing alpha, as model complexity reduces.
2. An alpha as small as 1e-15 already gives a significant reduction in the magnitude of the coefficients. How? Compare the coefficients in the first row of this table to the last row of the simple linear regression table.
3. High alpha values can lead to significant underfitting. Note the rapid increase in RSS for values of alpha greater than 1.
4. Though the coefficients are very, very small, they are NOT zero.

The first 3 are very intuitive. But #4 is also a crucial observation. Let’s reconfirm the same by determining the number of zeros in each row of the coefficients data set: Python Code:

coef_matrix_ridge.apply(lambda x: sum(x.values==0),axis=1)

Output:


This confirms that all 15 coefficients are greater than zero in magnitude (they can be +ve or -ve). Remember this observation and have another look until it's clear; it will play an important role later when comparing ridge with lasso regression.
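As noted above, in practice alpha is usually tuned by cross-validation rather than read off a table by eye. A minimal sketch using scikit-learn's RidgeCV is shown below; it assumes the data and predictors objects defined earlier, and standardizes the features explicitly since the normalize argument is no longer available in recent scikit-learn versions:

from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#Candidate alphas, scanned on a roughly logarithmic grid
alphas = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20]

#Standardize the predictors, then let RidgeCV pick alpha by (leave-one-out) cross-validation
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
ridge_cv.fit(data[predictors], data['y'])

print(ridge_cv.named_steps['ridgecv'].alpha_)   #the selected alpha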

4. Lasso Regression

LASSO stands for Least Absolute Shrinkage and Selection Operator. I know it doesn't give much of an idea, but there are 2 key words here – 'absolute' and 'selection'. Let's consider the former first and worry about the latter later. Lasso regression performs L1 regularization, i.e. it adds a factor of the sum of the absolute values of the coefficients to the optimization objective. Thus, lasso regression optimizes the following:

Objective = RSS + α * (sum of absolute values of coefficients)

Here, α (alpha) works similarly to that of ridge and provides a trade-off between balancing the RSS and the magnitude of the coefficients. Like that of ridge, α can take various values. Let's run through them briefly:

1. α = 0: same coefficients as simple linear regression
2. α = ∞: all coefficients zero (same logic as before)
3. 0 < α < ∞: coefficients between 0 and those of simple linear regression

Yes, it's appearing to be very similar to ridge so far. But just hang on with me and you'll know the difference by the time we finish. Like before, let's run lasso regression on the same problem as above. First we'll define a generic function:

from sklearn.linear_model import Lasso

def lasso_regression(data, predictors, alpha, models_to_plot={}):
    #Fit the model (as with Ridge, 'normalize' was removed in scikit-learn 1.2;
    #on newer versions, standardize the predictors beforehand instead)
    lassoreg = Lasso(alpha=alpha, normalize=True, max_iter=100000)
    lassoreg.fit(data[predictors], data['y'])
    y_pred = lassoreg.predict(data[predictors])

    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'], y_pred)
        plt.plot(data['x'], data['y'], '.')
        plt.title('Plot for alpha: %.3g' % alpha)

    #Return the result in pre-defined format
    rss = sum((y_pred - data['y'])**2)
    ret = [rss]
    ret.extend([lassoreg.intercept_])
    ret.extend(lassoreg.coef_)
    return ret


Notice the additional parameter passed to the Lasso function – 'max_iter'. This is the maximum number of iterations for which we want the model to run if it doesn't converge before then. It exists for Ridge as well, but setting it to a higher-than-default value was required in this case. Why? I'll come to this in the next section; just keep it in the back of your mind. Let's check the output for 10 different values of alpha using the following code:

#Initialize predictors to all 15 powers of x
predictors = ['x']
predictors.extend(['x_%d'%i for i in range(2,16)])

#Define the alpha values to test
alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10]

#Initialize the dataframe to store coefficients
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_lasso[i] for i in range(0,10)]
coef_matrix_lasso = pd.DataFrame(index=ind, columns=col)

#Define the models to plot
models_to_plot = {1e-10:231, 1e-5:232, 1e-4:233, 1e-3:234, 1e-2:235, 1:236}

#Iterate over the 10 alpha values:
for i in range(10):
    coef_matrix_lasso.iloc[i,] = lasso_regression(data, predictors, alpha_lasso[i], models_to_plot)


This gives us the following plots:

This again tells us that model complexity decreases as the value of alpha increases. But notice the straight line at alpha = 1; that appears a bit strange. Let's explore this further by looking at the coefficients:

Apart from the expected inference of higher RSS for higher alphas, we can see the following:

1. For the same values of alpha, the coefficients of lasso regression are much smaller than those of ridge regression (compare row 1 of the two tables).


2. For the same alpha, lasso has a higher RSS (a poorer fit) compared to ridge regression.
3. Many of the coefficients are zero even for very small values of alpha.

Inferences #1 and #2 might not always generalize, but they will hold for many cases. The real difference from ridge comes out in the last inference. Let's check the number of coefficients which are zero in each model using the following code:

coef_matrix_lasso.apply(lambda x: sum(x.values==0),axis=1)

Output:

We can observe that even for a small value of alpha, a significant number of coefficients are zero. This also explains the horizontal line fit for alpha = 1 in the lasso plots: it's just a baseline model! This phenomenon of most of the coefficients being zero is called 'sparsity'. Although lasso performs feature selection, this level of sparsity is achieved in special cases only, which we'll discuss towards the end. This has some really interesting implications for the use cases of lasso regression as compared to those of ridge regression. But before coming to the final comparison, let's take a bird's-eye view of the mathematics behind why coefficients become zero in the case of lasso but not ridge.
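Before that, here is a minimal sketch of how this sparsity is typically used for feature selection in practice. It assumes the same data and predictors objects as above; the alpha value is illustrative, and on recent scikit-learn versions (where normalize is no longer available) the features should be standardized beforehand:

import numpy as np
from sklearn.linear_model import Lasso

#Fit lasso with an illustrative alpha; in practice alpha would be tuned by cross-validation
lassoreg = Lasso(alpha=1e-3, max_iter=100000)
lassoreg.fit(data[predictors], data['y'])

#Keep only the features whose coefficients were not driven exactly to zero
selected = np.array(predictors)[lassoreg.coef_ != 0]
print(selected)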

5. Sneak Peek into Statistics (Optional)

I personally love statistics, but many of you might not. That's why I have specifically marked this section as 'OPTIONAL'. If you feel you can handle the algorithms without going into the maths behind them, I totally respect that decision and you can feel free to skip this section. But I personally feel that getting some elementary understanding of how these things work can be helpful in the long run. As promised, I'll keep it to a bird's-eye view. If you wish to get into the details, I recommend picking up a good statistics textbook. One of my favorites is The Elements of Statistical Learning, and the best part is that it has been made available for free by the authors. Let's start by reviewing the basic structure of data in a regression problem.


There are 4 data elements in a regression problem:

1. X: the matrix of input features (nrow: N, ncol: M+1)
2. Y: the actual outcome variable (length: N)
3. Yhat: the predicted values of Y (length: N)
4. W: the weights or coefficients (length: M+1)

Here, N is the total number of data points available and M is the total number of features. X has M+1 columns because of the M features plus 1 intercept column. The predicted outcome for any data point i is:

Yhat(i) = w0*xi0 + w1*xi1 + ... + wM*xiM    (with xi0 = 1 for the intercept)

It is simply the weighted sum of the features of that data point, with the coefficients as the weights. This prediction is achieved by finding the optimum values of the weights based on certain criteria, which depend on the type of regression algorithm being used. Let's consider all 3 cases:

1. Simple Linear Regression

The objective function (also called the cost) to be minimized is just the RSS (Residual Sum of Squares), i.e. the sum of the squared errors of the predicted outcome as compared to the actual outcome. This can be depicted mathematically as:

Cost = RSS = Σi ( yi − Yhat(i) )²

In order to minimize this cost, we generally use a 'gradient descent' algorithm. I won't go into its details right now. The overall algorithm works as:


1. initialize weights (say w=0)
2. iterate till not converged
    2.1 iterate over all features (j=0,1...M)
        2.1.1 determine the gradient
        2.1.2 update the jth weight by subtracting learning rate times the gradient
              w(t+1) = w(t) - learning rate * gradient

Here the important step is #2.1.1, where we compute the gradient. The gradient is nothing but the partial derivative of the cost with respect to a particular weight (denoted wj). The gradient for the jth weight will be:

∂Cost/∂wj = Σi 2*( yi − Yhat(i) )*( −xij ) = −2*Σi xij*( yi − Yhat(i) )

This is formed from 2 parts:

1. 2*{..}: this comes from differentiating the square of the term in {..}.
2. −xij: this is the derivative of the part inside {..} with respect to wj. Since Yhat(i) is a summation over the features, the derivatives of all the other terms become 0 and only xij remains (with a minus sign).

Step #2.1.2 involves updating the weights using the gradient. The update step for simple linear regression then looks like:

wj(t+1) = wj(t) + 2*η*Σi xij*( yi − Yhat(i) )    (where η is the learning rate)

I hope you are able to follow along. Note that the +ve sign on the RHS comes from the multiplication of two −ve signs. I would also like to explain point #2 of the gradient descent algorithm mentioned above: 'iterate till not converged'. Here convergence refers to attaining the optimum solution within a pre-defined limit. It is checked using the value of the gradient: if the gradient is small enough, we are very close to the optimum and further iterations won't have a substantial impact on the coefficients. The lower limit on the gradient can be changed using the 'tol' parameter. Let's consider the case of ridge regression now.
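Before that, here is a minimal, self-contained sketch of the gradient-descent procedure just described. It is a vectorized variant that updates all the weights at once rather than one feature at a time, and the learning rate and tolerance are illustrative values that would need to suit the scale of the data:

import numpy as np

def gradient_descent_ols(X, y, learning_rate=0.001, tol=1e-6, max_iter=10000):
    #1. initialize weights
    w = np.zeros(X.shape[1])
    #2. iterate till not converged (or max_iter is reached)
    for _ in range(max_iter):
        y_hat = X.dot(w)
        #2.1.1 gradient of the RSS cost with respect to the weights
        gradient = -2 * X.T.dot(y - y_hat)
        #2.1.2 update the weights by subtracting learning rate times the gradient
        w = w - learning_rate * gradient
        #convergence check on the size of the gradient (the role played by 'tol')
        if np.max(np.abs(gradient)) < tol:
            break
    return w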

2. Ridge Regression

The objective function (also called the cost) to be minimized is the RSS plus the sum of the squares of the weights. This can be depicted mathematically as:

Cost = RSS + λ*Σj wj²

In this case, the gradient would be:

∂Cost/∂wj = −2*Σi xij*( yi − Yhat(i) ) + 2*λ*wj


Again, in the regularization part of the gradient only wj remains; all the other terms become zero. The corresponding update rule is:

wj(t+1) = wj(t)*(1 − 2*λ*η) + 2*η*Σi xij*( yi − Yhat(i) )

Here we can see that the second part of the RHS is the same as for simple linear regression. Thus, ridge regression is equivalent to first reducing the weight by a factor of (1 − 2λη) and then applying the same update rule as simple linear regression. I hope this gives some intuition into why the coefficients get reduced to small numbers but never become zero. Note that the criterion for convergence in this case remains similar to simple linear regression, i.e. checking the value of the gradients. Let's discuss lasso regression now.
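Before that, the ridge update just described can be written as a single step function. This is only a sketch: lam stands for λ, lr for the learning rate η, and the shrink-then-update structure mirrors the rule above:

import numpy as np

def ridge_gradient_step(w, X, y, lam, lr):
    y_hat = X.dot(w)
    #shrink the weights by (1 - 2*lam*lr), then apply the plain RSS update
    return w * (1 - 2 * lam * lr) + 2 * lr * X.T.dot(y - y_hat)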

3. Lasso Regression

The objective function (also called the cost) to be minimized is the RSS plus the sum of the absolute values of the weights. This can be depicted mathematically as:

Cost = RSS + λ*Σj |wj|

In this case, the gradient is not defined, as the absolute-value function is not differentiable at x = 0. This can be illustrated as:


The parts to the left and right of 0 are straight lines with well-defined derivatives, but the function cannot be differentiated at x = 0. In this case, we have to use a different technique called coordinate descent, which is based on the concept of sub-gradients. One coordinate descent algorithm (the one used by default in sklearn) works as follows:

1. initialize weights (say w=0)
2. iterate till not converged
    2.1 iterate over all features (j=0,1...M)
        2.1.1 update the jth weight with a value which minimizes the cost

#2.1.1 might look too generalized, but I'm intentionally leaving out the details and jumping straight to the update rule. With normalized features, it takes the soft-thresholding form:

wj = g(w-j) + λ/2    if g(w-j) < −λ/2
wj = 0               if −λ/2 ≤ g(w-j) ≤ λ/2
wj = g(w-j) − λ/2    if g(w-j) > λ/2

Here g(w-j) represents (though not exactly) the difference between the actual outcome and the outcome predicted using all features EXCEPT the jth one. If this value is small, it means the algorithm can predict the outcome fairly well even without the jth variable, and thus that variable can be removed from the equation by setting its coefficient exactly to zero. This gives us some intuition into why the coefficients become zero in the case of lasso regression. In coordinate descent, checking convergence is another issue. Since gradients are not defined, we need an alternative method; many exist, but the simplest is to check the step size of the algorithm. We can check the maximum change in the weights over any particular cycle through all the feature weights (#2.1 of the algorithm above); if this is lower than the 'tol' specified, the algorithm stops. Convergence is not as fast as with gradient descent, and we might have to raise the 'max_iter' parameter if a warning appears saying that the algorithm stopped before converging. This is why I specified this parameter in the generic lasso function. Let's summarize our understanding by comparing the coefficients in all three cases using the following visual, which shows how the ridge and lasso coefficients behave in comparison to the simple linear regression case.

Apologies for the lack of visual appeal, but I think it is good enough to reinforce the following facts:

1. The ridge coefficients are the simple linear regression coefficients reduced by a factor, and thus never attain zero, only very small values.
2. The lasso coefficients become exactly zero within a certain range and are otherwise reduced by a constant amount, which explains their lower magnitude in comparison to ridge.
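To make the lasso update concrete, here is a minimal sketch of the coordinate-descent soft-thresholding rule described above, written for the objective RSS + λ*Σj |wj|. It is illustrative only: a real implementation (such as sklearn's) also tracks the maximum weight change against 'tol' to decide convergence:

import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):                          #2. iterate till (approximately) converged
        for j in range(n_features):                  #2.1 iterate over all features
            r_j = y - X.dot(w) + X[:, j] * w[j]      #residual ignoring the jth feature
            rho = X[:, j].dot(r_j)                   #plays the role of g(w-j) in the text
            z = X[:, j].dot(X[:, j])
            #2.1.1 soft thresholding: if |rho| <= lam/2 the coefficient becomes exactly zero
            w[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / z
    return w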


Before going further, one important issue in the case of both ridge and lasso regression is intercept handling. Generally, regularizing the intercept is not a good idea, and it should be left out of the regularization. This requires slight changes in the implementation, which I'll leave for you to explore. If you're still confused and things are a bit fuzzy, I recommend taking the Regression course that is part of the Machine Learning Specialization by the University of Washington on Coursera. Now, let's come to the concluding part, where we compare the ridge and lasso techniques and see where they can be used.
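Before that, here is one minimal way to keep the intercept out of the penalty with scikit-learn, sketched under the assumption that the data and predictors objects from earlier are available. With fit_intercept=True (the default) the intercept is fitted separately and is not penalized, and standardizing the features keeps the penalty comparable across them:

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#alpha here is illustrative; the intercept itself is left unpenalized
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0, fit_intercept=True))
model.fit(data[predictors], data['y'])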

6. Conclusion

Now that we have a fair idea of how ridge and lasso regression work, let's try to consolidate our understanding by comparing them and appreciating their specific use cases. I will also compare them with some alternative approaches. Let's analyze these under three buckets:

1. Key Difference

• Ridge: It includes all (or none) of the features in the model. Thus, the major advantage of ridge regression is coefficient shrinkage and reduced model complexity.
• Lasso: Along with shrinking coefficients, lasso performs feature selection as well. (Remember the 'selection' in the lasso full form?) As we observed earlier, some of the coefficients become exactly zero, which is equivalent to the particular feature being excluded from the model.

Traditionally, techniques like stepwise regression were used to perform feature selection and build parsimonious models. But with advancements in machine learning, ridge and lasso regression provide very good alternatives, as they give much better output, require fewer tuning parameters and can be automated to a large extent.

2. Typical Use Cases

• Ridge: It is mainly used to prevent overfitting. Since it includes all the features, it is not very useful in the case of an exorbitantly high number of features, say in the millions, as it will pose computational challenges.
• Lasso: Since it provides sparse solutions, it is generally the model of choice (or some variant of this concept) for modelling cases where the number of features is in the millions or more. In such a case, getting a sparse solution is of great computational advantage, as the features with zero coefficients can simply be ignored.

It's not hard to see why stepwise selection techniques become practically very cumbersome to implement in high-dimensional cases. Thus, lasso provides a significant advantage.

3. Presence of Highly Correlated Features

• Ridge: It generally works well even in the presence of highly correlated features, as it will include all of them in the model, but the coefficients will be distributed among them depending on the correlation.
• Lasso: It arbitrarily selects any one feature among the highly correlated ones and reduces the coefficients of the rest to zero. Also, the chosen variable changes randomly with changes in the model parameters. This generally doesn't work as well as ridge regression.

This disadvantage of lasso can be observed in the example we discussed above. Since we used a polynomial regression, the variables were highly correlated. (Not sure why? Check the output of data.corr().) Thus, we saw that even small values of alpha were giving significant sparsity (i.e. many coefficients driven exactly to zero). Along with ridge and lasso, elastic net is another useful technique which combines both L1 and L2 regularization. It can be used to balance out the pros and cons of ridge and lasso regression. I encourage you to explore it further.
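As a starting point for that exploration, here is a minimal sketch using scikit-learn's ElasticNet, again assuming the data and predictors objects from earlier. The alpha and l1_ratio values are purely illustrative and would normally be tuned by cross-validation (for example with ElasticNetCV):

from sklearn.linear_model import ElasticNet

#alpha sets the overall penalty strength; l1_ratio balances the L1 (lasso) and L2 (ridge) parts
enet = ElasticNet(alpha=1e-3, l1_ratio=0.5, max_iter=100000)
enet.fit(data[predictors], data['y'])

#Some coefficients can still be exactly zero, though usually fewer than with pure lasso
print(sum(enet.coef_ != 0))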

End Notes

In this article, I gave an overview of regularization using ridge and lasso regression, and then focused on the reasons why penalizing the magnitude of the coefficients gives us parsimonious models. Next, we went into the details of ridge and lasso regression and saw their advantages over simple linear regression. We got some intuition into why they should work and also how they work. If you read the optional mathematical part, you probably understood the underlying fundamentals. Regularization techniques are really useful, and I encourage you to implement them. If you're ready to take the challenge, why not try them on the BigMart Sales Prediction problem and share your results in the discussion forum.