Regularization and Early Stopping

Regularization takes model complexity into account when calculating the error, penalizing overly complex models. It is a major field of ML research, but here we focus on L1 and L2 regularization.

L2 vs L1 Regularization

L1 and L2 regularization are so-called parameter norm penalties. Both add a penalty term to the loss function, but the penalties have different properties. Intuitively, L1 tends to push weights exactly to 0, while L2 only shrinks them towards 0 without eliminating them.

L1 regularization penalizes the loss function by adding the absolute value of each coefficient as the penalty term. This is useful for dropping poor features, which makes the model sparser: it reduces the feature space and, in turn, improves the efficiency of the model. L1 regularization is also called Lasso regression, where Lasso is an acronym for (L)east (A)bsolute (S)hrinkage and (S)election (O)perator.

L2 regularization, also called Ridge regression, keeps weight values small by adding the squared magnitude of each coefficient as the penalty term to the loss function. It does not reduce the feature space the way L1 does; however, by preserving all features, L2 can often retain accuracy and yield an overall better estimator.
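To make the contrast concrete, here is a minimal sketch using scikit-learn's Lasso (L1) and Ridge (L2) on synthetic data; the dataset and regularization strength are illustrative assumptions, not values from this article:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which are actually informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)

# alpha plays the role of the lambda tuning parameter in the formulas below.
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# L1 drives many coefficients exactly to zero (sparser model);
# L2 only shrinks them, so all features are usually kept.
print("Lasso non-zero coefficients:", np.count_nonzero(lasso.coef_), "of", X.shape[1])
print("Ridge non-zero coefficients:", np.count_nonzero(ridge.coef_), "of", X.shape[1])
```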

The difference between the L1 and L2 regularization is easily observed in their mathematical definitions, i.e.:

L1:

\[ loss = Error + \lambda\sum_{i=1}^{N}\lvert w_i\rvert\]

L2:

\[ loss = Error + \lambda\sum_{i=1}^{N} w_i^2 \]
Here, \( Error \) can be replaced with a loss function such as MSE, RMSE, etc. \( \lambda \) denotes a tuning parameter, which controls the bias-variance trade-off and is commonly assigned a value via cross-validation. \( w_i \) are the coefficients (or weights) of the regression model.
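As a small worked example of the two penalty terms (the weights, \( \lambda \), and error value below are made-up numbers, purely for illustration):

```python
import numpy as np

w = np.array([0.5, -1.2, 0.0, 2.0])  # model coefficients (illustrative)
lam = 0.1                            # lambda, the tuning parameter
error = 3.4                          # e.g. an MSE computed elsewhere (illustrative)

l1_loss = error + lam * np.sum(np.abs(w))   # Error + lambda * sum |w_i|
l2_loss = error + lam * np.sum(w ** 2)      # Error + lambda * sum w_i^2

print(l1_loss)  # 3.4 + 0.1 * 3.7  = 3.77
print(l2_loss)  # 3.4 + 0.1 * 5.69 = 3.969
```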

The optimal amount of L1 or L2 regularization is found by searching for the value of \( \lambda \) that minimizes the validation loss. Near this minimum, less regularization increases variance and the model starts overfitting; more regularization increases bias and the model starts underfitting. Both hurt model accuracy.
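One simple way to perform this search is to sweep \( \lambda \) over a grid and keep the value with the lowest validation error. The sketch below does this with a train/validation split and Ridge regression; the grid, split, and data are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=15.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate lambda (alpha) values spanning several orders of magnitude.
lambdas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
val_losses = []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X_train, y_train)
    val_losses.append(mean_squared_error(y_val, model.predict(X_val)))

# Keep the lambda with the lowest validation loss.
best = lambdas[int(np.argmin(val_losses))]
print("Best lambda by validation loss:", best)
```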

Effects of zeroing coefficients using L1:

| Action | Impact |
| --- | --- |
| Fewer coefficients to load/store | Reduced memory and model size |
| Fewer multiplications needed | Increased prediction speed |

Zeroing coefficients can help with performance.
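As a rough sketch of the first two rows, the features whose coefficients L1 zeroes out can simply be dropped, leaving fewer numbers to store and fewer multiplications per prediction (the data and alpha below are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=1)
lasso = Lasso(alpha=1.0).fit(X, y)

# Keep only the features with non-zero coefficients: fewer numbers to store
# and fewer multiplications at prediction time.
keep = lasso.coef_ != 0
compact_coef = lasso.coef_[keep]
print("Stored coefficients:", compact_coef.size, "instead of", lasso.coef_.size)

# Prediction with the pruned coefficients matches the full model.
x_new = X[:1]
full_pred = lasso.predict(x_new)
compact_pred = x_new[:, keep] @ compact_coef + lasso.intercept_
print(np.allclose(full_pred, compact_pred))  # True
```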

For reference, this is a great introduction to help explain L1 vs. L2 Regularization in more detail.

Early Stopping

Early stopping halts training once the validation loss stops improving, before the model starts to overfit the training data. Combine early stopping with L1 and L2 regularization to find the best trade-off for model performance.

See more: https://cloud.google.com/bigquery-ml/docs/preventing-overfitting
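For a concrete picture outside of BigQuery ML, here is a minimal sketch of early stopping in Keras combined with an L2 penalty; the model architecture, synthetic data, and patience value are all illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

# Illustrative synthetic binary classification data.
X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,),
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop as soon as the validation loss has not improved for 5 epochs,
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)
```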

Dropout Layers

By dropping a random subset of unit activations / nodes between layers during training (and only during training!), one can achieve an effect similar to ensemble models (e.g. random forests, XGBoost and so on), in the sense that the model is forced to test out different paths. This helps reveal which nodes or features are more or less relevant for the output of the model.

The more you drop out, the stronger the regularization! The dropout rate is a value between 0 and 1 that determines the fraction of nodes to drop: a rate of 0 means nothing is dropped (no regularization), and a rate of 1 means everything is dropped. To achieve a good balance, a typical dropout rate is around 0.2.
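A minimal sketch of dropout layers in Keras, using the typical 0.2 rate mentioned above; the rest of the architecture is an illustrative assumption:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.2),   # randomly zero 20% of activations, training only
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```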
