## L1 and L2 Regularization

Regularisation is a modified version of regression designed to reduce the danger of overfitting, especially when there is multicollinearity in the data's feature set. A high degree of multicollinearity within the feature set increases the variance of the coefficient estimates in a conventional linear regression model, producing estimates that can be quite sensitive to modest changes in the model. By restricting, reducing, or "regularising" the regression coefficient estimates towards zero, this strategy discourages the model from pursuing a more complicated or flexible fit in favor of a more stable fit with lower coefficient variance. For a standard linear regression model fit by ordinary least squares, this is accomplished by altering the usual loss function (the residual sum of squares, RSS) to include a penalty on larger-magnitude coefficient values.

Regularisation involves tradeoffs, just like any other modeling choice. We must carefully balance bias and variance by tuning the hyperparameter that scales the strength of the regularisation penalty. The more we "regularise" the data, the less variance we will have, but at the risk of introducing more bias.

## L1 Norms versus L2 Norms

Ridge regression and lasso regression are two strategies for improving ordinary least squares regression's resilience against collinearity. Both methods minimize a cost function composed of two terms: the residual sum of squares (RSS) from conventional least squares, and an additional regularizer penalty. In ridge regression, the penalty is an L2 norm, whereas in lasso regression it is an L1 norm. Let us look at the equations. In ordinary least squares, we minimize the following cost function:

$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$$

This is referred to as the residual sum of squares (RSS). Ridge regression instead minimizes:

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

The term $\lambda \sum_{j=1}^{p} \beta_j^2$ is an L2 penalty (a squared L2 norm). In lasso regression, we minimize:

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$

The term $\lambda \sum_{j=1}^{p} |\beta_j|$ is an L1 norm.
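The three cost functions above can be sketched directly in code. This is a minimal illustration, not the notebook's own implementation; `X`, `y`, `beta`, and `lam` are illustrative names for the design matrix, target vector, coefficient vector, and penalty strength:

```python
import numpy as np

def rss(X, y, beta):
    # Residual sum of squares: the ordinary least squares cost
    residuals = y - X @ beta
    return np.sum(residuals ** 2)

def ridge_cost(X, y, beta, lam):
    # RSS plus the L2 penalty: lam times the sum of squared coefficients
    return rss(X, y, beta) + lam * np.sum(beta ** 2)

def lasso_cost(X, y, beta, lam):
    # RSS plus the L1 penalty: lam times the sum of absolute coefficients
    return rss(X, y, beta) + lam * np.sum(np.abs(beta))
```

Note that for the same coefficients, the two penalties differ only in how each coefficient's magnitude is charged: squared for ridge, absolute for lasso.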
The L2 term is proportional to the square of the β values, whereas the L1 term is proportional to the absolute values of the entries of β. This key distinction explains the entire difference in how lasso regression and ridge regression "work". The L1-versus-L2 distinction appears elsewhere in machine learning, so it's vital to grasp what's going on here.

## L1-L2 Norm Differences

**Robustness:** L1 > L2. Robustness refers to a model's resilience to outliers in a dataset. The better a model can disregard extreme values in the data, the more robust it is. The L1 norm is more robust than the L2 norm for an obvious reason: the L2 norm squares values, so the cost of an outlier grows quadratically, whereas the L1 norm takes the absolute value, treating outliers linearly.

**Stability:** L2 > L1. Stability is defined as resistance to small horizontal shifts in the data. This can be thought of as the counterpart to robustness. The L2 norm is more stable than the L1 norm; a subsequent notebook will look into why.
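The robustness point can be seen with three illustrative residuals, two small ones and one outlier (the numbers here are made up for demonstration):

```python
import numpy as np

# Two small residuals and one outlier
errors = np.array([1.0, 1.0, 10.0])

l1_cost = np.sum(np.abs(errors))  # outlier contributes 10 of 12
l2_cost = np.sum(errors ** 2)     # outlier contributes 100 of 102
```

Under the L1 cost the outlier accounts for about 83% of the total; under the L2 cost, about 98%. Squaring lets a single extreme value dominate the fit.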
## L1-L2 Regularizer Differences

**Computational difficulty:** L2 < L1. L2 has a closed-form solution because it is a square of something, and hence differentiable everywhere. L1 does not have a closed-form solution because it involves an absolute value, a non-differentiable piecewise function. As a result, L1 is computationally more expensive: we cannot solve it using matrix algebra and must rely on iterative approximations (in the lasso case, coordinate descent).

**Sparsity:** L1 > L2. Sparsity is the property of having many coefficients that are exactly zero or very close to zero, with only the significant coefficients far from zero. Coefficients that are driven to zero correspond to features that can be deleted later.
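The closed-form claim for L2 can be demonstrated directly. The sketch below, on synthetic data with true coefficients [2, -1, 0.5] and the intercept omitted for simplicity, solves the ridge problem with the normal-equations formula beta = (XᵀX + λI)⁻¹Xᵀy; no analogous formula exists for the L1 problem:

```python
import numpy as np

# Synthetic data: 200 rows, 3 features, true coefficients [2, -1, 0.5]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

lam = 1.0
# Ridge closed-form solution: (X^T X + lam * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
# The absolute value in the L1 penalty has no such matrix-algebra solution,
# which is why lasso is fit iteratively (e.g. by coordinate descent).
```

With this much data and a modest λ, the closed-form ridge estimate lands very close to the true coefficients, only slightly shrunk towards zero.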
Now we will look at L1 and L2 regularization through an implementation.
## Importing Libraries

## Reading the Dataset

## EDA

We will now explore the dataset.
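A sketch of the setup steps follows. The notebook's actual file path is not shown, so the `read_csv` call is commented out and a tiny stand-in frame is used instead; the column names are assumptions based on the housing features mentioned later:

```python
import numpy as np
import pandas as pd

# The original notebook reads the housing training data, e.g.:
# train = pd.read_csv("train.csv")   # hypothetical path
# Stand-in frame for illustration:
train = pd.DataFrame({
    "GrLivArea": [1500, 2100, 1800],
    "SalePrice": [200000, 280000, 230000],
})

# Typical first-look EDA calls
print(train.head())
print(train.describe())
```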
## Data Preprocessing

We'll go with simplicity over more intricate procedures here because our focus is on the models, not fancy preprocessing approaches. We'll do just enough to ensure that our regression models can be used and produce accurate results. Our steps will contain the following:

## Outliers

Here we will remove outliers. We must be cautious when removing outliers, since we risk losing useful information. We find two apparent outliers at the bottom right of the plot that reflect "bad" deals for sellers (a low price for a wide area).
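The removal step might look like the sketch below. The data is a stand-in, `GrLivArea` is an assumption about which area feature the plot shows, and the thresholds are illustrative rather than the notebook's:

```python
import pandas as pd

# Stand-in data; two rows mimic the "bad deal" outliers (huge area, low price)
train = pd.DataFrame({
    "GrLivArea": [1200, 1800, 4600, 5600, 2000],
    "SalePrice": [150000, 210000, 160000, 184750, 250000],
})

# Drop the points with very large area sold at a low price
mask = (train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)
train = train.drop(train[mask].index)
```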
We chose to delete these two observations since they do not correspond with the rest of the data, and we do not want these blatantly "bad" trades to introduce more bias into our prediction models.

## Numerical to Categorical Conversions

MSSubClass, OverallCond, YrSold, and MoSold, while numerical, are categorical-type features, so we'll convert them to strings before encoding them.

## Encoding Categorical Labels

We will now encode all categorical feature labels with values ranging from 0 to n_classes-1. By plotting the distribution of our target feature, we quickly notice that the distribution appears to be right-skewed.
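The conversion and encoding steps can be sketched as follows. The frame is a stand-in with two of the features named above; the notebook likely uses sklearn's `LabelEncoder`, while pandas category codes shown here produce the same style of 0..n_classes-1 mapping:

```python
import pandas as pd

# Stand-in frame with two numerical-looking categorical features
train = pd.DataFrame({"MSSubClass": [20, 60, 20, 70],
                      "YrSold": [2008, 2007, 2008, 2006]})

# Convert to strings first, so the values are treated as labels...
for col in ["MSSubClass", "YrSold"]:
    train[col] = train[col].astype(str)

# ...then encode each column's labels as integers 0..n_classes-1
for col in ["MSSubClass", "YrSold"]:
    train[col] = train[col].astype("category").cat.codes
```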
Typically, our regression models work best when the data is normally distributed. Thus, for the best outcomes, we'll try to normalize the feature using a log transform. (For right-skewed data, a log transform shifts the distribution to appear more "normal"; for left-skewed data, however, a log transform only makes the distribution more left-skewed.)
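A minimal sketch of the transform, using synthetic right-skewed values standing in for "SalePrice" (the real notebook applies this to the actual target column):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (lognormal), standing in for "SalePrice"
rng = np.random.default_rng(42)
prices = pd.Series(np.exp(rng.normal(12.0, 0.4, size=1000)))

skew_before = prices.skew()
prices_log = np.log1p(prices)   # log(1 + x): safe even at zero
skew_after = prices_log.skew()
# skew_after sits much closer to 0 than skew_before
```

`np.log1p` is a common choice here because it remains defined when a value is exactly zero.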
Cool, we notice our log transform performed very well and had the desired effect: the new distribution appears much more "normal". Let's apply the log transformation of "SalePrice" to our training data. As we'll see, some of the non-target numerical features are strongly skewed, to the right or to the left. This time, we'll apply a blanket "yeo-johnson" power transform to try to "normalize" each of them, because this transform "normalizes" both right- and left-skewed data. (Here, any feature with a skewness magnitude greater than 0.75 is considered "heavily" skewed.)
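The blanket transform might be done as below with sklearn's `PowerTransformer`. The data is synthetic, with one right-skewed, one left-skewed, and one roughly symmetric column; the 0.75 cutoff mirrors the text:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "right_skewed": np.exp(rng.normal(size=500)),
    "left_skewed": -np.exp(rng.normal(size=500)),
    "symmetric": rng.normal(size=500),
})

# Flag features whose skewness magnitude exceeds 0.75
skewed_cols = df.columns[df.skew().abs() > 0.75]

# Yeo-Johnson handles both positive and negative values, unlike Box-Cox
pt = PowerTransformer(method="yeo-johnson")
df[skewed_cols] = pt.fit_transform(df[skewed_cols])
```

Only the two heavily skewed columns get transformed; the symmetric one passes through untouched.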
Next, we must create dummy/indicator variables for all of the categorical features so that they may be used appropriately in our regression models.
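This is typically one call to `pd.get_dummies`; the tiny frame below is illustrative (a "Street" column does exist in the housing data, but the values here are made up):

```python
import pandas as pd

df = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                   "LotArea": [8450, 9600, 11250]})

# One indicator column per category level; numeric columns pass through
df = pd.get_dummies(df)
```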
Let's now look for missing values and replace them with the mean of the relevant feature.
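A minimal sketch of the mean-imputation step, on a stand-in frame with a couple of gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"LotFrontage": [80.0, np.nan, 60.0, np.nan],
                   "MasVnrArea": [0.0, 200.0, np.nan, 100.0]})

# Replace each missing value with the mean of its own column
df = df.fillna(df.mean())
```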
Finally, let us set up the matrices required for sklearn. This concludes our preprocessing procedures, and we can start with the standard ordinary least squares linear regression model.

## Linear Regression
Good, now we have a number to compare future models against: RMSE = 0.12178. If we now fit this model, we can examine the highest-magnitude coefficient values obtained. We'll eventually compare these results to those generated by our regularisation models.
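The cross-validated RMSE score used here can be sketched as below. The data is a synthetic stand-in (via `make_regression`) for the preprocessed training matrices, so the resulting number will not match the notebook's 0.12178:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed (X, y) training matrices
X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=0)

def rmse_cv(model, X, y):
    # Average cross-validated RMSE, the score used to compare models
    mse = -cross_val_score(model, X, y,
                           scoring="neg_mean_squared_error", cv=5)
    return np.sqrt(mse).mean()

baseline_rmse = rmse_cv(LinearRegression(), X, y)
```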
We don't observe any very high coefficient values here because we performed a decent job of preparing our data. If we had not eliminated outliers and normalized the skewed numerical features, for example, there would have been more variability and a greater likelihood of the model selecting some noticeably high coefficient values. Even so, we'll observe how the regularisation models compress these values towards zero.

## L2 Regularization

Both L1 and L2 regularisation seek to minimize the residual sum of squares (RSS) plus a regularisation term. The regularisation term for ridge regression (L2) is the sum of the squared coefficients multiplied by a non-negative scaling factor lambda (called alpha in our sklearn model). To compare, we will estimate the average RMSE of this model in the same way as we did for the standard linear regression model. First, we will do this with alpha = 0.1, and then we will use cross-validation to find the best alpha, the one that yields the lowest RMSE. It's worth noting that 0.1 was picked at random, with no specific motive.
We have already seen an improvement over the standard least squares linear regression model. We now get an RMSE of 0.12046 for ridge regression with alpha = 0.1. Remember, we selected 0.1 at random, so it is most certainly not the ideal value. As a result, we can potentially enhance our RMSE even more by adjusting alpha. Let's plot the RMSE as alpha scales to see how the value of alpha affects the RMSE.
Notice the U shape. The figure shows that the minimum RMSE occurs at alpha values around 10-15. To be more exact, we will zoom in on alpha values closer to this range.
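Sweeping alpha and scoring each setting can be sketched as below. The alpha grid is illustrative (the notebook's exact grid is not shown), and on this synthetic, non-collinear stand-in data the curve's exact shape may differ from the notebook's U:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=0)

# Illustrative alpha grid spanning several orders of magnitude
alphas = [0.05, 0.1, 0.5, 1, 5, 10, 15, 30, 50, 75]
rmses = []
for a in alphas:
    mse = -cross_val_score(Ridge(alpha=a), X, y,
                           scoring="neg_mean_squared_error", cv=5)
    rmses.append(np.sqrt(mse).mean())
# plt.plot(alphas, rmses) would then draw RMSE as a function of alpha
```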
The RMSE appears to be minimized at alpha = 10.62. This appears adequate for our needs, so let's calculate our revised RMSE estimate using this newly discovered optimum alpha value.
We have again improved our RMSE. We now get RMSE = 0.11320 for a ridge regression model with an ideal alpha of around 10.62, representing a 7.04% improvement over the linear regression model. This appears to be the best RMSE we can achieve using this training data and a single ridge regression without any additional complex preprocessing or feature engineering. Before going on to Lasso Regression, let us review the highest magnitudes of the selected coefficients and compare them to those chosen by the linear regression model.
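Inspecting the highest-magnitude coefficients might look like the sketch below, again on synthetic stand-in data with made-up feature names (so the magnitudes will not match the notebook's):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=0)

# Fit at the tuned alpha and rank coefficients by absolute value
model = Ridge(alpha=10.62).fit(X, y)
coefs = pd.Series(model.coef_, index=[f"feat_{i}" for i in range(10)])
top5 = coefs.abs().nlargest(5)   # highest-magnitude coefficients
```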
As predicted, the regularisation procedure has significantly shrunk the greatest-magnitude coefficient values towards zero when compared to the original linear regression model.

## L1 Regularization
When evaluated using RMSE, we find that a lasso regression model with alpha = 0.1 produced the least accurate model thus far. Before we abandon lasso regression, let's apply cross-validation to fine-tune alpha; perhaps our choice of 0.1 was significantly off. Let's proceed in the same way we did with ridge regression.
The optimal alpha appears to be rather small, but we know it must be greater than 0, so let's use sklearn's built-in LassoCV function, which uses cross-validation to choose the best alpha from a list of candidates.

Note: there is a RidgeCV function that works similarly and could have been used for the Ridge model earlier.
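A sketch of the LassoCV call follows; the data is synthetic and the candidate alpha list is illustrative, not the notebook's grid:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data: 50 features, only 10 of which carry signal
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

# LassoCV cross-validates each candidate alpha and keeps the best one
candidate_alphas = [0.0001, 0.001, 0.01, 0.1, 1.0]
model = LassoCV(alphas=candidate_alphas, cv=5, max_iter=50000).fit(X, y)
best_alpha = model.alpha_
```

After fitting, `model.alpha_` holds the winning alpha and the model is already refit on the full training data at that value.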
Alpha equals 0.0004. This appears to be near enough to ideal for our needs, so we'll calculate our revised RMSE estimate using this newly discovered optimal alpha.
Good, so at an ideal alpha of around 0.0004, the lasso regression model appears to outperform the optimal ridge regression model for this data set. We now have an RMSE of 0.11182, representing an 8.17% improvement over our linear regression model. This appears to be the best RMSE we can achieve with this training data and a single lasso regression, without any more complex preprocessing or feature engineering. Let's take a short look at the features that the lasso regression model considers relevant. Note that the lasso approach will execute feature selection for you, setting the coefficients of features it considers irrelevant to zero.
Again as expected, the values seem to have been compressed toward 0 when compared to those chosen by the original linear regression model. This is a significant distinction to highlight between ridge regression and lasso regression. Ridge regression punishes high coefficient values, but it does not eliminate unnecessary features by reducing their coefficients to zero. It will simply attempt to mitigate their influence. Lasso regression, on the other hand, punishes high coefficient values while also eliminating unimportant features by setting their coefficients to zero. Thus, when training data sets with a large number of irrelevant features, the lasso model can help with feature selection.
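This shrink-versus-eliminate distinction is easy to demonstrate on synthetic data with a handful of informative features and many pure-noise ones (the alpha values here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 5 informative features out of 20; the other 15 are pure noise
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

lasso_zeros = np.sum(Lasso(alpha=1.0, max_iter=50000).fit(X, y).coef_ == 0)
ridge_zeros = np.sum(Ridge(alpha=1.0).fit(X, y).coef_ == 0)
# lasso sets irrelevant coefficients exactly to zero;
# ridge only shrinks them, leaving small non-zero values
```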
This lasso model appears to have picked 107 of the features in this case, the most relevant of which are shown in the figure above, while zeroing out the remaining 112. We won't go into much more depth concerning the specific features at this time, but note that the selected features are not necessarily the "correct" features and should be examined, especially when multicollinearity exists within the feature set.

## L0 Norm

Finally, to see how the strength of alpha influences the number of features picked, we plot the number of non-zero coefficients produced by lasso as the regularisation parameter alpha is varied. This count is also known as the L0 norm of the coefficients.
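Computing the L0 norm across a range of alphas can be sketched as below, again on synthetic stand-in data (so the counts will not match the 134-to-4 numbers from the notebook's plot):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 5 informative features out of 50; the rest are noise
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# Count non-zero coefficients (the L0 norm) at each alpha
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
l0_norms = [int(np.sum(Lasso(alpha=a, max_iter=100000).fit(X, y).coef_ != 0))
            for a in alphas]
# plotting alphas against l0_norms shows the feature count
# falling as alpha grows
```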
As the strength of the regularisation parameter alpha increases, the number of chosen features rapidly decreases from a maximum of 134 to a plateau of 4 features once alpha grows somewhat larger than 0.25. The higher the strength of alpha, the more restrictive the lasso model becomes in the number of features it picks. Keep this in mind while working with data sets that have a significant number of irrelevant attributes.