Regularization Techniques: L1 and L2 Regularization Example
- Alaxo Joy
- Sep 19, 2024
- 3 min read
GOAL OF REGULARIZATION: Find a balance such that the model is simple and fits the data very well.
Penalty-based regularization is the most common approach for reducing overfitting. In order to understand this point, let us revisit the example of the polynomial with degree d. In this case, the prediction $\hat{y}$ for a given value of x is as follows:

$$\hat{y} = \sum_{i=0}^{d} w_i x^i$$

It is possible to use a single-layer network with d inputs and a single bias neuron with weight $w_0$ in order to model this prediction. This neural network uses linear activations, and the squared loss function for a set of training instances (x, y) from data set D can be defined as follows:

$$L = \sum_{(x, y) \in D} (y - \hat{y})^2$$
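As a minimal NumPy sketch of this setup (the toy data, the degree d = 5 and the helper names predict and squared_loss are assumptions used only for illustration):

import numpy as np

d = 5                                   # polynomial degree
rng = np.random.default_rng(0)
w = rng.normal(size=d + 1)              # parameters w_0 ... w_d (w_0 plays the role of the bias)

def predict(x):
    # y_hat = sum_{i=0}^{d} w_i * x^i, i.e. a single linear neuron over the powers of x
    powers = np.stack([x ** i for i in range(d + 1)], axis=-1)
    return powers @ w

def squared_loss(x, y):
    # L = sum over (x, y) in D of (y - y_hat)^2
    return np.sum((y - predict(x)) ** 2)

x_train = np.linspace(-1, 1, 20)        # toy data set D
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=20)
print(squared_loss(x_train, y_train))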
A large value of d tends to increase overfitting. One possible solution to this problem is to reduce the value of d. Instead of reducing the number of parameters in a hard way, one can use a soft penalty on the use of parameters.
The most common choice is L2-regularization, which is also referred to as Tikhonov regularization. In such a case, the additional penalty is defined by the sum of squares of the values of the parameters. Then, for the regularization parameter λ > 0, one can define the objective function as follows:

$$L = \sum_{(x, y) \in D} (y - \hat{y})^2 + \lambda \sum_{i=0}^{d} w_i^2$$
L2-regularization decreases the complexity of the model but does not reduce the number of parameters. L2 regularization tends to shrink the weights towards zero (but not exactly to zero), leading to a model that considers all features.
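A minimal NumPy sketch of this objective and its gradient (the design matrix X, the toy data and λ = 0.1 are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))            # design matrix, e.g. the powers of x for 20 instances and d + 1 = 6 parameters
y = rng.normal(size=20)
w = rng.normal(size=6)
lam = 0.1                               # regularization parameter lambda

# L = sum (y - y_hat)^2 + lambda * sum w_i^2
loss = np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

# The gradient picks up an extra 2 * lambda * w term, which pulls every weight
# towards zero (shrinkage) without ever forcing a weight to be exactly zero.
grad = -2 * X.T @ (y - X @ w) + 2 * lam * w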
The regularization parameter (λ) controls the strength of the penalty. A larger λ increases the penalty on large weights, making the penalty harder, while a smaller λ makes it softer. One advantage of this type of parameterized penalty is that one can tune this parameter for optimum performance on a portion of the training data set that is not used for learning the parameters. This type of approach is referred to as model validation.
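A sketch of this kind of model validation, using the closed-form ridge (L2-regularized least-squares) solution for simplicity; the λ grid, split sizes and toy data are arbitrary illustration choices:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.3, 0.2]) + 0.1 * rng.normal(size=100)

# hold out part of the training data purely for validation
X_tr, y_tr = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

best_lam, best_err = None, np.inf
for lam in [0.001, 0.01, 0.1, 1.0, 10.0]:
    # closed-form L2-regularized solution: w = (X^T X + lambda * I)^-1 X^T y
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1]), X_tr.T @ y_tr)
    err = np.mean((y_val - X_val @ w) ** 2)
    if err < best_err:
        best_lam, best_err = lam, err
print("best lambda on the validation split:", best_lam)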
However, it is possible to use other types of penalties on the parameters. A common approach is L1-regularization (Lasso: Least Absolute Shrinkage and Selection Operator), in which the squared penalty is replaced with a penalty on the sum of the absolute magnitudes of the coefficients. Therefore, the new objective function is as follows:

$$L = \sum_{(x, y) \in D} (y - \hat{y})^2 + \lambda \sum_{i=0}^{d} |w_i|$$
A problem with L1-regularization is that the absolute value function $|w_i|$ is not differentiable at zero, and hence the gradient is undefined there.
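In practice this is handled with a subgradient: the derivative of $|w_i|$ is taken as the sign of $w_i$, with the value 0 chosen at $w_i = 0$. A minimal sketch of that convention (the helper name is an assumption; frameworks may treat the point zero slightly differently):

import numpy as np

def l1_subgradient(w, lam):
    # subgradient of lambda * sum |w_i|: lambda * sign(w_i), with 0 chosen at w_i = 0
    return lam * np.sign(w)

w = np.array([0.5, -0.2, 0.0])
print(l1_subgradient(w, lam=0.01))      # [ 0.005 -0.002  0.   ]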
A question arises as to whether L1- or L2-regularization is desirable. From an accuracy point of view, L2-regularization usually outperforms L1-regularization. This is the reason that L2-regularization is almost always preferred over L1-regularization in most implementations.
Why Is L2 Regularization Preferred Over L1 in Deep Networks?
• Smooth Differentiability: Unlike L1 regularization, which is not differentiable at zero, L2 regularization is differentiable everywhere. This makes it easier to implement and optimize using gradient-based methods.
• Weight Shrinking vs. Sparsity
Weight Shrinking: L2 regularization tends to shrink weights uniformly, leading to models that consider all features, but with reduced impact from less important ones. L2 regularization adds a penalty to the loss function based on the size of the weights. Larger weights are penalized more, encouraging the model to keep weights smaller.
Sparsity: By driving some weights to zero, L1 regularization effectively removes the corresponding features from the model, leading to a sparse model (a model with fewer parameters).
L1 regularization performs a soft thresholding operation in which small weights are driven towards zero. When updating the weights during optimization, L1 applies a constant-magnitude penalty, which pushes a weight exactly to zero once it falls below a certain threshold (see the sketch after this list). This becomes advantageous when the weights of irrelevant features become zero, allowing the model to focus on the important features.
• General Usefulness:
L2 regularization is more generally applicable across different types of models and datasets, making it a default choice in many machine learning frameworks.
• Numerical Stability:
In deep networks, large weights can lead to numerical instability, causing exploding gradients and making optimization difficult. By keeping weights smaller, L2 regularization helps maintain numerical stability, facilitating smoother training and convergence.
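As referenced above, the contrast between L1's soft thresholding and L2's proportional shrinkage can be seen in a single update step. The sketch below applies one proximal/shrinkage step to a small weight vector; the step size and λ are arbitrary illustration values:

import numpy as np

w = np.array([0.8, 0.05, -0.03, -0.6])
lam, eta = 0.1, 1.0                     # penalty strength and step size (illustrative values)

# L1: soft thresholding -- subtract a constant amount and clip to zero when it overshoots
w_l1 = np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)

# L2: proportional shrinkage -- every weight is scaled towards zero, none becomes exactly zero
w_l2 = w / (1.0 + 2.0 * eta * lam)

print(w_l1)   # the small weights become exactly 0
print(w_l2)   # all weights shrink but stay nonzero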
L1 vs. L2 Regularization: Key Differences
L1 and L2 regularization differ in several key aspects:
Penalty Type: L1 regularization penalizes the absolute value of weights, while L2 regularization penalizes the squared values of weights.
Sparsity: L1 regularization induces sparsity, while L2 regularization does not set weights exactly to zero.
Feature Importance: L1 regularization performs feature selection, prioritizing important features, while L2 regularization retains all features.
Computational Cost: L1 regularization is computationally more expensive to optimize because the penalty is non-differentiable at zero.
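To see these differences on data, here is a small sketch using scikit-learn's Lasso and Ridge estimators; the synthetic dataset and the alpha (λ) values are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# toy problem: 10 features, only 3 of them actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)      # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)      # L2 penalty

print("Lasso coefficients set to zero:", np.sum(lasso.coef_ == 0))   # typically several of the uninformative ones
print("Ridge coefficients set to zero:", np.sum(ridge.coef_ == 0))   # typically 0 -- weights shrink but stay nonzero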
Code in Python:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1, l2

model = Sequential()
# Dense layer with an L1 penalty (lambda = 0.01) on its kernel weights
model.add(Dense(64, activation='relu', kernel_regularizer=l1(0.01)))
# Dense layer with an L2 penalty (lambda = 0.01) on its kernel weights
model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))