ML Activation and Loss Functions

Common Activation Functions


  • ReLU: outputs 0 for any negative input, so a unit whose inputs stay in the negative domain produces an activation of 0 and stops learning ("dies").
  • Sigmoid: squashes its input into the range (0, 1); commonly used for binary classification outputs.
  • Softmax: normalizes a vector of scores into a probability distribution; used for multi-class classification outputs.
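The three activations above can be sketched in plain Python (a minimal reference implementation, not how a framework would vectorize them):

```python
import math

def relu(x):
    # ReLU: zero for negative inputs, identity otherwise
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real input into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    # Normalizes a list of scores into a probability distribution
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Note that `relu(-3.0)` is exactly `0.0`, which is the "dying ReLU" failure mode discussed below, and that the outputs of `softmax` always sum to 1.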

Failure modes for Gradient Descent

Three common failure modes for gradient descent:

  • Gradients can vanish. Insight: each additional layer can reduce the signal vs. noise. Solution: using ReLU instead of sigmoid/tanh can help.
  • Gradients can explode. Insight: learning rates are important here. Solution: batch normalization (a useful knob) can help.
  • ReLU layers can die. Insight: monitor the fraction of zero weights in TensorBoard. Solution: lower your learning rates.
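As a rough stand-in for the TensorBoard monitoring mentioned above, you can compute the fraction of ReLU activations stuck at zero yourself (a simple diagnostic sketch, not a framework API):

```python
def dead_relu_fraction(activations):
    # activations: list of per-example ReLU output lists.
    # Returns the fraction of outputs that are exactly zero --
    # a rough proxy for how many units may be "dying".
    flat = [a for row in activations for a in row]
    return sum(1 for a in flat if a == 0.0) / len(flat)
```

If this fraction creeps toward 1.0 during training, lowering the learning rate is the usual first remedy.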



Using mean squared error, our loss is

\begin{equation} \mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}(\hat{Y}_i-Y_i)^2 \end{equation}
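The formula translates directly into code (a minimal sketch over plain Python lists):

```python
def mse(y_pred, y_true):
    # Mean of the squared residuals over the m examples
    m = len(y_true)
    return sum((yp - yt) ** 2 for yp, yt in zip(y_pred, y_true)) / m

# Example: residuals are 0 and 2, so MSE = (0 + 4) / 2 = 2.0
print(mse([1.0, 2.0], [1.0, 4.0]))  # -> 2.0
```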


The main loss function considered here is cross-entropy loss (log loss). It applies to all kinds of classification problems: binary, multi-class with a single label per sample, and multi-class with multiple labels per sample.

For the sake of simplicity, let's divide classification problems into binary and multi-class.


Binary Classification problems

For binary classification, a simple sigmoid activation function will do. The outcome is either yay or nay: the sigmoid output lies in the range (0, 1) and is thresholded at 0.5, so a prediction above 0.5 counts as a yay and one below 0.5 as a nay.
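The thresholding step can be sketched like this (the function name and the 0.5 cutoff are illustrative defaults, not a fixed API):

```python
import math

def predict_binary(logit, threshold=0.5):
    # Sigmoid maps the raw model score into (0, 1),
    # then the threshold turns it into a hard yes/no decision.
    p = 1.0 / (1.0 + math.exp(-logit))
    return 1 if p > threshold else 0
```

A strongly positive score lands above the threshold (yay, class 1); a strongly negative one lands below it (nay, class 0).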

Multi-class classification problems

Say we want to determine the outcome for a one-hot encoded target, meaning each input has exactly one label attached to it, but more than one class exists (hence multi-class). For example, we have an image dataset containing passports, driver's licenses, and credit cards. Our task is to correctly assign each input to one of these three classes.

Suitable loss functions here could be the following:

  • categorical_crossentropy (CCE)
  • sparse_categorical_crossentropy (SCCE)

CCE requires the target labels to be one-hot encoded. It calculates the average difference between the actual and predicted probability distributions over all classes in the problem. The score is minimized, and a perfect cross-entropy value is 0.

The output layer in CCE has n nodes, one for each class, and a softmax activation function is used to produce a probability for each class.

SCCE doesn't require the target variable to be one-hot encoded prior to training. The reasoning: when the number of classes grows large, one-hot label vectors require significant memory, which slows down training. Using SCCE with plain integer labels therefore saves significant resources (memory and compute) during training. Definitely helps with scaling.

In summary, if you use CCE you need one-hot-encoded labels, and if you use SCCE you encode labels as plain integers. Both assume mutually exclusive classes (each sample belongs to exactly one class); the only difference is the label format. If one sample can carry multiple labels at once, use binary cross-entropy with sigmoid outputs instead.
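A small sketch makes the equivalence concrete (hand-rolled single-sample versions of the two losses, not the Keras implementations):

```python
import math

def categorical_crossentropy(one_hot, probs):
    # CCE: the label is a one-hot vector, e.g. [1, 0, 0]
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(class_index, probs):
    # SCCE: the label is a plain integer class index, e.g. 0
    return -math.log(probs[class_index])

probs = [0.7, 0.2, 0.1]  # softmax output over 3 classes
# Same class, two label formats -- the loss value is identical:
cce = categorical_crossentropy([1, 0, 0], probs)
scce = sparse_categorical_crossentropy(0, probs)
```

Only the label encoding differs; the computed loss is the same number either way, which is why the choice between the two comes down to memory and convenience.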
