Ridge and Lasso Regression

Srijan Bhushan
Jun 18, 2022

A way to deal with overfitting

To understand ridge and lasso regression, we first have to recap and refresh our understanding of what a model is and what a linear regression (OLS) model is.

What is a model:

A model is a construction of logic that automates predictions. We can have a model X, send it any input I, and get an output O (the prediction).

Input Data ----> Statistical Logic ----> Output Data (predictions)

Linear Regression Recap:

Linear regression is one such way to construct the logic. In linear regression, the logic is a mathematical “straight line”, like the one below. A line can tell a lot. The slope of the line is chosen such that it has the best fit, i.e. the least sum of squared distances from the data points. So its goal is to fit the training data as well as possible! A small code sketch of such a fit follows the diagram.

y-axis
^
| * /
| / * *
| * /
|* / *
|_/_______________> x-axis
* marks the data points around which the line is fitted
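For concreteness, here is a minimal sketch of fitting such a line with ordinary least squares in NumPy. The data points are made up purely for illustration:

```python
import numpy as np

# Made-up data points: y is roughly 2*x + 1 plus some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8, 13.2])

# Ordinary least squares picks the slope and intercept that minimize the
# sum of squared distances between the line and the data points
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
```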

The Problem:

It turns out that, many times, the fitted line is so well fitted to the training data (it explains nearly all of the variance in the training data) that it is not able to explain the variance in the test data. This is a problem, since the goal of a model is to predict on new data it has never seen, i.e. the test data. The fitted line (the model) suffers from what is called overfitting, aka high variance. The problem is that the fitted line is too good on the training data.
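Here is a small sketch of what overfitting can look like in practice, using scikit-learn. The data, the noise level, and the polynomial degree are all made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: y is roughly 2*x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = 2 * X.ravel() + rng.normal(scale=0.3, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A very flexible model (high-degree polynomial features) can chase the
# noise in the training data instead of the underlying trend
poly = PolynomialFeatures(degree=15)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

model = LinearRegression().fit(X_train_poly, y_train)
print("train R^2:", model.score(X_train_poly, y_train))  # typically close to 1
print("test  R^2:", model.score(X_test_poly, y_test))    # typically much lower
```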

The Solution:

In order to solve this overfitting of the fitted line, one approach is to move the line around, i.e. change its slope. This can be thought of as adding noise and making the line ‘not as good as it is right now’. In other words, we add a penalty (or think of it as noise) to the fitted line and reconstruct it. This is what Ridge and Lasso regression do: they reconstruct the fitted line (specifically its slope). So how do we choose which new slope to pick? We do that by creating a new cost function and finding the slope that minimizes it.

Linear Regression (OLS)
Original cost function = sum of squared errors
Based on the above, the best fitted line is the one for which this cost is the least.

Ridge or Lasso Regression
Ridge - New cost function = sum of squared errors + (penalty x squared(slope of line))
Lasso - New cost function = sum of squared errors + (penalty x absolute(slope of line))
Based on the above, the best fitted line is the one for which the new cost is the least. Notice how this cost function includes a penalty value (usually something like 0.1) multiplied by the squared or absolute slope of the line as extra cost.
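Spelled out as code, the three cost functions above might look like the following minimal sketch for a single predictor. Here `penalty` stands in for the penalty value, and the default of 0.1 is just a placeholder:

```python
import numpy as np

def ols_cost(slope, intercept, x, y):
    """Ordinary least squares: sum of squared errors only."""
    errors = y - (slope * x + intercept)
    return np.sum(errors ** 2)

def ridge_cost(slope, intercept, x, y, penalty=0.1):
    """Ridge: sum of squared errors + penalty * squared slope."""
    return ols_cost(slope, intercept, x, y) + penalty * slope ** 2

def lasso_cost(slope, intercept, x, y, penalty=0.1):
    """Lasso: sum of squared errors + penalty * absolute value of slope."""
    return ols_cost(slope, intercept, x, y) + penalty * abs(slope)
```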

Now, the new cost function is recalculated for different values of the slope. For Ridge regression, the best fitted slope, i.e. the slope with the least new cost, ends up close to 0, but not exactly 0.

It is interesting to see that what Ridge regression is doing is pushing the slope of the fitted line toward 0, so that the coefficients of the features (predictor variables) become so small that no single one has a big effect on the prediction. This is exactly the job of regularization via Ridge regression, to reduce overfitting: the variables now explain less of the variability in the training data and hence generalize better to the test data.

y-axis (new cost function - Ridge)
^
|\              /
| \            /
|  \          /
|   \__    __/
|      \__/
|________________> x-axis (slope values)
        0.1

y-axis (new cost function - Lasso)
^
|\              /
| \            /
|  \          /
|   \__    __/
|      \__/
|________________> x-axis (slope values)
       0.0

Similarly for Lasso, the new cost function is recalculated for different values of the slope. For Lasso regression, the best fitted slope, i.e. the slope with the least new cost, can end up at exactly 0. This effectively removes some of the features (predictor variables) completely. So Lasso not only accomplishes the job of regularization (reducing overfitting by shrinking the coefficient of each variable), it also goes one step further and performs feature reduction by removing some of the features.
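To see this difference numerically, here is a small sketch that sweeps both new cost functions over a grid of candidate slopes and reports where each one is minimized. The data and the penalty value are made up (the penalty is deliberately large so the effect is obvious), and the intercept is held at 0 to keep the sketch one-dimensional:

```python
import numpy as np

# Tiny made-up dataset with only a weak relationship between x and y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.03, 0.12, 0.14, 0.25, 0.21])
penalty = 10.0  # deliberately large so the difference is easy to see

# Evaluate each new cost function over a grid of candidate slope values
slopes = np.linspace(-1.0, 1.0, 2001)
sse = np.array([np.sum((y - s * x) ** 2) for s in slopes])
ridge_cost = sse + penalty * slopes ** 2
lasso_cost = sse + penalty * np.abs(slopes)

print("ridge best slope:", slopes[np.argmin(ridge_cost)])  # small, but not 0
print("lasso best slope:", slopes[np.argmin(lasso_cost)])  # exactly 0
```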

Ridge (L2) is not robust to outliers, since we are squaring the weights. Lasso (L1) is more robust to outliers, since we take the absolute value of the weights rather than their squares.

About the penalty term, Lambda (λ)

As we saw previously, λ is part of the new cost function. The thing to keep in mind is that the higher the λ, the heavier the penalty, which leads to underfitting and potentially high bias. This is because the larger penalty pulls the fit away from the best fit on the training data.

On the other hand, the lower the λ, the lighter the penalty, which leads to a fit very close to the original one and potentially high variance. This is because not enough noise is being added, so the overfitting problem remains.
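In scikit-learn this penalty is exposed as the `alpha` argument of `Ridge` and `Lasso`. Here is a quick sketch of the trade-off on made-up data (the exact coefficient values will vary, and the alpha values are just illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Made-up data: only the first two of five features actually matter
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

for alpha in (0.01, 1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>6}: ridge coefs {np.round(ridge.coef_, 2)}, "
          f"lasso coefs {np.round(lasso.coef_, 2)}")

# Higher alpha -> coefficients pushed toward 0 (Lasso sets some to exactly 0),
# lower alpha -> fit close to the ordinary least squares solution
```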

Conclusion:

A lot of machine learning models can suffer from overfitting. This happens when the model is overly good on the training data. One way to solve this, for a linear regression model, is to use Ridge or Lasso regression. These techniques move the slope of the fitted line, adding noise and making the model more generalizable to the test data.
