There are several regression methods that go beyond standard Linear and Logistic Regression. As discussed in Linear Regression, we use the Ordinary Least Squares (OLS) method to estimate the unknown parameters. OLS is a classical technique, however, and more sophisticated approaches such as regularization can be used to build Linear and Logistic Regression models.
It is important to realize from the outset that regularization methods do not improve our ability to learn these parameters; rather, they help in the generalization phase, i.e., they improve the accuracy of the model on unseen data by penalizing the model’s complexity (complexity that stems from overfitting, where the model memorizes patterns in the training data, including patterns that exist only by chance).
When a model becomes overly complex, typically because it depends on too many features or works with very high-dimensional data, regularized regression models can be used to keep it from fitting noise and irrelevant characteristics when generalizing.
The Limits of OLS
The formula for a simple linear regression model is Y = B0 + B1X1 + E, where B0 is the intercept, X1 is the independent variable, B1 is its coefficient, and E is the error term. From this equation we obtain Y, the predicted value, and from Y we get E, the prediction error, i.e., the difference between the observed and predicted values, which has to be reduced. The error term is crucial because no prediction is perfectly accurate; each may differ from the actual value, and the error term captures that difference. This is where OLS helps: it minimizes this difference by identifying the regression line with the smallest squared error (below you can see how, behind the scenes, it effectively considers a variety of candidate regression lines (red lines) and selects the one with the smallest sum of squared differences between the predicted and actual values (black lines)).
Visually, OLS can be described as the sum of the squared vertical distances between each data point and its predicted point on the regression line, and the line is chosen so that this sum is as small as possible.
Under OLS, the best-fitting line is the one that yields the smallest total squared separation between the actual and predicted data points.
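As a minimal sketch of the idea, assuming scikit-learn is available and using synthetic data whose values are purely illustrative, we can fit an OLS line and inspect the quantities described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2 + 3*x plus random noise (the error term E)
rng = np.random.RandomState(42)
X = rng.rand(100, 1) * 10                  # single independent variable X1
y = 2 + 3 * X[:, 0] + rng.randn(100)       # observed values with noise

# OLS picks the intercept (B0) and coefficient (B1) that minimize the
# sum of squared differences between observed and predicted values
ols = LinearRegression().fit(X, y)
print("B0 (intercept):  ", ols.intercept_)
print("B1 (coefficient):", ols.coef_[0])
print("Sum of squared errors:", np.sum((y - ols.predict(X)) ** 2))
```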
So far, we have described the error of our OLS regression model in terms of squared errors. However, it is essential to look at the error more closely. Prediction error can be decomposed into two kinds: error due to bias and error due to variance.
To understand these components of error, we need to identify where they come from. Their root lies in the concept of generalization, which means applying what we have learned from previously observed data to a brand-new set of data. When we build a model, two goals matter: how accurate the model is and how well it performs on data it has never seen before. These two goals are fundamentally in tension. When we push for higher accuracy on the training data, the model’s ability to generalize decreases; likewise, making the model generalize better typically costs some training accuracy.
For context, suppose we have a data set in which the Y variable has two categories, poor and rich, and 100 independent demographic features ranging from where people live to whether they wear glasses. If we push hard to improve accuracy, the model starts to memorize patterns in the data, eventually including patterns that, in real situations, are useless for determining a person’s financial status, such as whether men have a mustache or what foods they prefer. This creates the problem of overfitting, where we detect and fit patterns that exist only because of noise in the data. A model of such high complexity has very high variance.
Now suppose that, to fix this, we use only a single feature. This produces high bias and very low variance, which causes problems of its own, because generalizing from a single feature amounts to a huge assumption about the data. Saying, for example, that we can predict financial status purely from a person’s region of the country is an enormous simplification, and it leads to very poor accuracy: the model underfits.
To improve the model’s accuracy, we then increase its complexity by adding another feature, such as academic credentials. Adding a feature decreases the bias but increases the variance; if we follow this trend too far and keep adding features, we eventually end up back where we started, with very high variance and low bias. In other words, there is a tradeoff between the two, and we need to find the “sweet spot” where the model is neither too biased (causing underfitting, where the model settles on a solution too simple to be useful) nor too variable (causing overfitting, where it memorizes patterns in the data set that may exist only by chance).
Mathematically, the mean squared error of an estimate is the sum of the squared bias and the variance (plus irreducible noise). Looking at the picture above, we can see that as model complexity increases, bias decreases and variance increases; the optimal point lies between the two, where the total error is minimized and any further decrease in bias is matched by an equal increase in variance.
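Written out as an equation (the standard decomposition, with the irreducible noise term included for completeness), the expected squared error of an estimate $\hat{f}(x)$ of the true function $f(x)$ is:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$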
The question now is: why isn’t OLS always the most effective method? According to the Gauss-Markov theorem, ‘among all linear unbiased estimators, OLS has the smallest variance.’ This means OLS gives the lowest mean squared error among unbiased estimators. However, a slightly biased estimator can have an even smaller mean squared error. To obtain one, we use shrinkage estimators, which start from the regression equation and replace each coefficient Beta-k with Beta’-k, a value smaller in magnitude than the original coefficient.
The shrunken coefficient is the original coefficient multiplied by a factor of 1 / (1 + lambda), i.e., Beta’-k = Beta-k / (1 + lambda). If lambda is zero we keep the original coefficient, but as lambda becomes large the output shrinks toward its minimum value of zero. The lambda parameter is the shrinkage estimator’s tuning knob; chosen correctly, it yields an improved (lower) mean squared error. Without going into the mathematical details, the formula for choosing lambda essentially measures how large a coefficient is relative to its variance: the larger the coefficient relative to its variance, the smaller the resulting lambda and the less it is shrunk. Shrinkage can therefore eliminate a coefficient entirely (driving it to zero) or preserve it while reducing its magnitude.
Below we can see how the model’s complexity grows when we build an OLS polynomial regression model with 15 features (right) instead of just 1 feature (left). The additional features add complexity: the model starts to be influenced by less significant features and tries to account for ever-smaller variations, and the magnitudes of the coefficients grow along with the model’s complexity.
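A rough sketch of this comparison, assuming scikit-learn and a small synthetic data set (the degrees and sample size here are only illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = np.sort(rng.rand(20, 1), axis=0)
y = np.sin(2 * np.pi * X[:, 0]) + rng.randn(20) * 0.1

for degree in (1, 15):
    # OLS on polynomial features: degree 1 underfits, degree 15 overfits
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    coefs = model.named_steps["linearregression"].coef_
    print(f"degree={degree:>2}  largest |coefficient| = {np.abs(coefs).max():.2f}")
```

On data like this, the degree-15 coefficients typically come out orders of magnitude larger than the degree-1 coefficient, which is exactly the kind of growth the penalties below are designed to rein in.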
To reduce the model’s complexity and counter multicollinearity, we use regularized regression models. This article discusses three regularized regression techniques: Ridge Regression, Lasso Regression, and Elastic Net Regression.
Ridge Regression
Ridge Regression builds regularized models by adding a constraint to the objective that penalizes coefficients for growing too large, preventing the model from becoming overly complex and overfitting. In simple terms, it adds an extra penalty to the equation, where w is the model’s coefficient vector, $\lVert w \rVert_2$ is its L2 norm, and lambda is a free, tunable parameter. This makes the objective look like this:

$$\min_{w}\ \sum_{i=1}^{n}\big(y_i - x_i^{\top}w\big)^2 + \lambda \lVert w \rVert_2^2$$
The first component is the least-squares term (the loss function), and the second is the penalty. Ridge Regression performs L2 regularization, adding a penalty proportional to the squared magnitude of the coefficients: the L2-norm penalty $\lambda \sum_{i=1}^{n} w_i^2$ is added to the loss function, penalizing the betas. Because the coefficients enter the penalty through their squares, the effect differs from that of the L1 norm used in Lasso Regression (discussed below). The strength parameter (lambda, often called alpha in software libraries) is crucial, since it controls how heavily the coefficients are penalized. In the elastic-net formulation there is also a mixing parameter that determines which penalty is used: a value of 0 gives Ridge, a value of 1 gives Lasso, and anything between 0 and 1 gives an Elastic Net.
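As a hedged sketch of the L2 penalty’s effect, assuming scikit-learn (where the strength parameter is called alpha) and synthetic data with partly redundant features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic, partly collinear data purely for illustration
X, y = make_regression(n_samples=50, n_features=10, effective_rank=3,
                       noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # alpha here plays the role of lambda

# The L2 penalty shrinks coefficients toward zero but does not zero them out
print("OLS   largest |coef|:", np.abs(ols.coef_).max().round(2))
print("Ridge largest |coef|:", np.abs(ridge.coef_).max().round(2))
print("Ridge coefficients set to zero:", int(np.sum(ridge.coef_ == 0)))
```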
The next, and most important, step is choosing a value for the tunable parameter lambda. If we choose a lambda that is too low, the result is essentially the same as OLS regression, and no regularization takes place. If we choose a lambda that is too high, the model generalizes too aggressively, pulling the coefficients of too many features toward extremely small values (i.e., toward 0). Statisticians use techniques such as the ridge trace, a plot of the ridge regression coefficients as a function of lambda, and choose the lambda value at which the coefficients stabilize. Methods like cross-validation and grid search are also used to identify the model’s optimal lambda.
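For instance, a minimal cross-validation sketch using scikit-learn’s RidgeCV (the candidate alpha grid below is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic data standing in for a real data set
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Try a grid of candidate penalties and keep the one with the
# best cross-validated score
alphas = np.logspace(-3, 3, 13)
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("selected alpha (lambda):", model.alpha_)
```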
Ridge uses L2 regularization to force the coefficient values to be spread out more evenly; it does not shrink any coefficient all the way to zero. L1 regularization, by contrast, drives the coefficients of less important features exactly to zero, and this is what leads to Lasso regression.
Note that Ridge Regression may or may not be considered a proper feature selection method, since it only reduces the magnitude of the coefficients and, unlike Lasso, does not eliminate any variable outright, which is what makes Lasso a legitimate feature selection technique.
Lasso Regression
Lasso is similar to Ridge, but it uses L1 regularization, which adds the penalty $\lambda \sum_{i=1}^{n} \lvert w_i \rvert$ to the loss function. The penalty is therefore proportional to the absolute values of the coefficients rather than their squares (as in L2), which drives the coefficients of weaker features to zero. By using L1 regularization, Lasso performs automatic feature selection: features whose coefficients become 0 are effectively eliminated. The lambda value is again vital, because if it is too high the model drops too many variables (by driving even moderately useful coefficients to zero), making the model too general and causing it to underfit.
With the right value of lambda, Lasso produces a sparse solution in which unimportant features receive zero coefficients. Those variables can be dropped and the ones with non-zero coefficients retained, which makes feature selection for the model straightforward.
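A minimal sketch of this sparsity, assuming scikit-learn and synthetic data in which only a few features are actually informative (the alpha value is illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 3 of the 10 features actually drive the target in this synthetic data
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # alpha is the L1 penalty strength

# Features whose coefficients are driven exactly to zero are effectively dropped
print("coefficients:     ", np.round(lasso.coef_, 2))
print("selected features:", np.flatnonzero(lasso.coef_ != 0))
```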
ElasticNet Regression
Elastic Net is a blend of Ridge and Lasso Regression that uses both the L1 and L2 penalties for regularization. It is particularly useful when several features are correlated with one another.
Its objective combines both penalties:

$$\min_{w}\ \sum_{i=1}^{n}\big(y_i - x_i^{\top}w\big)^2 + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2$$

As the equation shows, both the L1 and L2 norms ($\lVert \cdot \rVert_1$ and $\lVert \cdot \rVert_2$) are used to regularize the model.
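A short sketch with scikit-learn’s ElasticNet, where alpha sets the overall penalty strength and l1_ratio mixes the two penalties (the values below are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic data with several informative, partly correlated features
X, y = make_regression(n_samples=100, n_features=10, n_informative=5,
                       noise=5.0, random_state=0)

# l1_ratio=0 behaves like Ridge (pure L2), l1_ratio=1 like Lasso (pure L1)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("coefficients:", enet.coef_.round(2))
```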
These regularized regression models, then, identify the proper relationship between the dependent and independent variables by regularizing the independent variables’ coefficients, and they help reduce the danger of multicollinearity. Applying them directly can be problematic when the independent variables are not linearly related to the dependent variable. To address this, the features can be transformed so that linear models like Ridge and Lasso still apply, for example by using polynomial basis functions in conjunction with linear regression. The transformed features allow a linear regression model to capture polynomial and other non-linear relationships in the data.
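As an illustration of that last point, a sketch (assuming scikit-learn; the degree and alpha are arbitrary) that combines polynomial basis functions with a Ridge penalty so a linear model can fit a non-linear relationship:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = np.sort(rng.rand(50, 1) * 6, axis=0)
y = np.sin(X[:, 0]) + rng.randn(50) * 0.1   # non-linear relationship

# Polynomial basis functions make the problem linear in the new features,
# while the Ridge penalty keeps the many resulting coefficients under control
model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1.0))
model.fit(X, y)
print("training R^2:", round(model.score(X, y), 3))
```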