
Regularization Techniques: L1 and L2 Regularization with Examples

GOAL OF REGULARIZATION: Find a balance such that the model is simple and fits the data very well.


Penalty-based regularization is the most common approach for reducing overfitting. To understand this point, let us revisit the example of the polynomial with degree d. In this case, the prediction \hat{y} for a given value of x is as follows:

\hat{y} = \sum_{i=0}^{d} w_i x^i


It is possible to use a single-layer network with d inputs and a single bias neuron with weight w_0 in order to model this prediction. This neural network uses linear activations, and the squared loss function for a set of training instances (x, y) from data set D can be defined as follows:

L = \sum_{(x,y) \in D} (y - \hat{y})^2
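As a concrete illustration of the prediction and loss above, here is a minimal NumPy sketch that builds the feature vector (1, x, x^2, ..., x^d), computes \hat{y} as the weighted sum of these features, and evaluates the squared loss over a toy data set. The variable names, the degree d = 5, and the toy data are illustrative assumptions, not taken from the original text.

import numpy as np

degree = 5                                   # d in the formulas above
rng = np.random.default_rng(0)
weights = rng.normal(size=degree + 1)        # w_0, ..., w_d (w_0 acts as the bias)

xs = np.linspace(-1.0, 1.0, 20)              # toy inputs x
ys = np.sin(2.0 * xs)                        # toy targets y

# One row per training instance, with columns (1, x, x^2, ..., x^d)
features = np.vander(xs, degree + 1, increasing=True)
y_hat = features @ weights                   # predictions \hat{y}
squared_loss = np.sum((ys - y_hat) ** 2)     # L = sum over (x, y) of (y - \hat{y})^2
print(squared_loss)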


A large value of d tends to increase overfitting. One possible solution to this problem is to  reduce the value of d. Instead of reducing the number of parameters in a hard way, one can use  a soft penalty on the use of parameters.  


The most common choice is L2-regularization, which is also referred to as Tikhonov regularization. In such a case, the additional penalty is defined by the sum of squares of the values of the parameters. Then, for the regularization parameter λ > 0, one can define the objective function as follows:

L = \sum_{(x,y) \in D} (y - \hat{y})^2 + \lambda \sum_{i=0}^{d} w_i^2
L2-regularization decreases the complexity of the model but does not reduce the number of parameters. It tends to shrink the weights toward zero (but not exactly to zero), leading to a model that considers all features.
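To see why L2-regularization shrinks every weight without forcing it to zero, consider a single gradient-descent step on the regularized objective above (a short derivation added here for illustration; the learning rate α is a symbol introduced for this sketch, not used in the original text):

w_i \leftarrow w_i - \alpha \frac{\partial L}{\partial w_i}
    = (1 - 2\alpha\lambda)\, w_i - \alpha \frac{\partial}{\partial w_i} \sum_{(x,y) \in D} (y - \hat{y})^2

Each step multiplies w_i by the factor (1 - 2αλ), which is slightly less than 1, so all weights are pulled toward zero by the same proportional amount but are generally never set exactly to zero.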


The regularization parameter λ controls the strength of the penalty: a larger λ places a heavier penalty on large weights and makes the constraint effectively harder, while a smaller λ makes it softer. One advantage of this type of parameterized penalty is that one can tune λ for optimum performance on a portion of the training data set that is not used for learning the parameters. This type of approach is referred to as model validation.
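A minimal sketch of this validation-based tuning of λ, assuming scikit-learn is available and using Ridge regression as the L2-regularized model (its alpha argument plays the role of λ; the synthetic data, the candidate values, and the 80/20 split are illustrative choices, not from the original text):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

# Hold out a portion of the training data that is not used for learning the weights
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_lam, best_score = None, -np.inf
for lam in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=lam).fit(X_train, y_train)
    score = model.score(X_val, y_val)        # R^2 on the held-out validation split
    if score > best_score:
        best_lam, best_score = lam, score
print(best_lam, best_score)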


However, it is possible to use other types of penalties on the parameters. A common approach is L1-regularization (Lasso, for Least Absolute Shrinkage and Selection Operator), in which the squared penalty is replaced with a penalty on the sum of the absolute magnitudes of the coefficients. Therefore, the new objective function is as follows:

L = \sum_{(x,y) \in D} (y - \hat{y})^2 + \lambda \sum_{i=0}^{d} |w_i|
A problem with L1-regularization is that the absolute value function |w_i| is not differentiable at zero, and hence the gradient is undefined at that point.
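In practice, implementations sidestep this by using a subgradient of |w_i|; a common convention (stated here as a standard choice, not taken from the original text) is

\frac{\partial |w_i|}{\partial w_i} \approx \operatorname{sign}(w_i) =
\begin{cases}
+1 & \text{if } w_i > 0 \\
0 & \text{if } w_i = 0 \\
-1 & \text{if } w_i < 0
\end{cases}

With this convention, the L1 penalty exerts a constant-magnitude pull of λ toward zero on every nonzero weight, which leads to the soft-thresholding behaviour discussed below.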


A question arises as to whether L1- or L2-regularization is desirable. From an accuracy point of view, L2-regularization usually outperforms L1-regularization. This is the reason that L2-regularization is almost always preferred over L1-regularization in most implementations.


Why Is L2 Regularization Preferred Over L1 in Deep Networks?



Smooth Differentiability: Unlike L1 regularization, which is not differentiable at zero, L2  regularization is differentiable everywhere. This makes it easier to implement and optimize  using gradient-based methods. 


Weight Shrinking vs. Sparsity 


Weight Shrinking: L2 regularization adds a penalty to the loss function based on the size of the weights; larger weights are penalized more, which encourages the model to keep all weights small. It therefore tends to shrink weights uniformly, leading to models that consider all features but with reduced impact from the less important ones.


Sparsity: By driving some weights to zero, L1 regularization effectively removes the corresponding features from the model, leading to a sparse model (a model with fewer nonzero parameters).


L1 regularization performs a soft-thresholding operation in which small weights are driven toward zero. When updating the weights during optimization, L1 applies a constant-magnitude penalty, which sets a weight exactly to zero once its magnitude falls below a certain threshold.
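A minimal NumPy sketch of this soft-thresholding step (the proximal update commonly used for an L1 penalty; the example weights, learning rate, and λ value are illustrative assumptions, not from the original text):

import numpy as np

def soft_threshold(w, threshold):
    # Shrink every weight toward zero by `threshold`; weights whose magnitude
    # is below the threshold are set exactly to zero.
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

weights = np.array([0.8, -0.05, 0.002, -1.3, 0.04])
lam, lr = 0.1, 0.5                 # illustrative penalty strength and learning rate
print(soft_threshold(weights, lr * lam))
# Small weights become exactly 0; larger weights are shrunk by lr * lam = 0.05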


However, this becomes advantageous when the weights of irrelevant features are driven to zero, because L1 regularization then lets the model focus on the important features.


• General Usefulness: 

L2 regularization is more generally applicable across different types of models and datasets,  making it a default choice in many machine learning frameworks. 


• Numerical Stability: 

In deep networks, large weights can lead to numerical instability, causing exploding gradients  and making optimization difficult. By keeping weights smaller, L2 regularization helps  maintain numerical stability, facilitating smoother training and convergence. 


L1 and L2 regularization


L1 vs. L2 Regularization: Key Differences


L1 and L2 regularization differ in several key aspects:


  1. Penalty Type: L1 regularization penalizes the absolute value of weights, while L2 regularization penalizes the squared values of weights.


  2. Sparsity: L1 regularization induces sparsity by setting some weights exactly to zero, while L2 regularization does not set weights exactly to zero (see the comparison sketch after this list).


  3. Feature Importance: L1 regularization performs feature selection, prioritizing important features, while L2 regularization retains all features.


  4. Computational Cost: L1 regularization is computationally more expensive due to the non-differentiability at zero weights.
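To make the sparsity and feature-importance differences concrete, here is a short comparison sketch, assuming scikit-learn's Lasso and Ridge are used as stand-ins for L1- and L2-regularized linear models (the synthetic data and the alpha value are illustrative assumptions, not from the original text):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)   # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print("L1 coefficients set exactly to zero:", int(np.sum(lasso.coef_ == 0)))   # most of the 20
print("L2 coefficients set exactly to zero:", int(np.sum(ridge.coef_ == 0)))   # typically none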



Code in Python:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1, l2

model = Sequential()
# L1 penalty on this layer's weights (regularization strength 0.01)
model.add(Dense(64, activation='relu', kernel_regularizer=l1(0.01)))
# L2 penalty (weight decay) on the output layer's weights
model.add(Dense(1, kernel_regularizer=l2(0.01)))
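When this model is later compiled and trained, for example with model.compile(optimizer='adam', loss='mse') followed by model.fit(...), Keras adds the kernel_regularizer penalties to the training loss automatically, so the loss function itself does not need to be modified by hand; the optimizer and loss named here are only illustrative choices.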





