ML Activation and Loss Functions

Common Activation Functions

Activation functions introduce non-linearity into a network. A few common ones:

  • ReLU: outputs 0 whenever the input is negative, so units whose inputs stay in the negative domain stop contributing (their activation is always 0).
  • Sigmoid: squashes its input into the range (0, 1); typically used for binary classification outputs.
  • Softmax: turns a vector of scores into a probability distribution over classes; used for multi-class classification problems (a small NumPy sketch of all three follows below).
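
To make these concrete, here is a minimal, framework-agnostic NumPy sketch of the three activations (the input values are made up for illustration):

```python
import numpy as np

def relu(x):
    # ReLU: zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid: squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(logits):
    # Softmax: exponentiate and normalize so the outputs sum to 1
    # (subtracting the max first for numerical stability)
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

print(relu(np.array([-2.0, 0.5])))         # [0.  0.5]
print(sigmoid(np.array([0.0])))            # [0.5]
print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```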

Failure modes for Gradient Descent

Three common failure modes for gradient descent:

  • Gradients can vanish. Insight: each additional layer can successively reduce the signal relative to the noise. Solution: using ReLU instead of sigmoid/tanh can help.
  • Gradients can explode. Insight: learning rates are important here. Solution: batch normalization (a useful knob) can help.
  • ReLU layers can die. Insight: monitor the fraction of zero weights in TensorBoard. Solution: lower your learning rates.
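
As a rough, framework-agnostic sketch of the monitoring idea for the last failure mode, one can periodically check how many ReLU units are zero for an entire batch (the batch shape and negative shift below are made up just to provoke dead units):

```python
import numpy as np

def dead_relu_fraction(pre_activations):
    # Fraction of units whose ReLU output is zero for every example
    # in the batch; a persistently high value suggests dying ReLUs.
    activations = np.maximum(0.0, pre_activations)    # ReLU
    dead_units = np.all(activations == 0.0, axis=0)   # per-unit check across the batch
    return dead_units.mean()

# Hypothetical batch of pre-activations: shape (batch_size, num_units)
batch = np.random.randn(64, 128) - 3.0  # shifted negative on purpose
print(f"fraction of dead units: {dead_relu_fraction(batch):.2f}")
```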

Regression

In regression problems the model predicts a continuous value, so the loss measures how far the predictions fall from the true values.

Using mean squared error (MSE), where m is the number of examples and Ŷ_i and Y_i are the predicted and true values for example i, our loss is

\begin{equation} MSE = \frac{1}{m}\sum_{i=1}^{m}(\hat{Y}_i-Y_i)^2 \end{equation}
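
A minimal NumPy version of this formula (the sample values are made up for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of the squared residuals over m examples
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 3.0])
print(mse(y_true, y_pred))  # 0.1666...
```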

Classification

The main loss function considered here is cross-entropy loss (also called log loss). It applies to all kinds of classification problems: binary, single-label multi-class (each sample belongs to exactly one of several classes), and multi-label multi-class (each sample can carry several of several classes).

For the sake of simplicity, let's divide classification problems into binary and multi-class.


Binary Classification problems

For binary classification, a single sigmoid output will do. The outcome is either yes or no: the sigmoid maps the model's output into the range (0, 1), and a threshold (typically 0.5) decides whether the prediction counts as a yes (> 0.5) or a no (< 0.5). The corresponding loss is binary cross-entropy.
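
A minimal NumPy sketch of this setup (the logits and labels below are made-up values for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Log loss for binary labels; eps guards against log(0)
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

logits = np.array([2.3, -1.1, 0.4])   # raw model outputs
probs = sigmoid(logits)               # mapped into (0, 1)
preds = (probs > 0.5).astype(int)     # yes/no decision at the 0.5 threshold
labels = np.array([1, 0, 0])
print(preds, binary_cross_entropy(labels, probs))
```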

Multi-class classification problems

Say we want to predict the class for inputs whose labels are one-hot encoded, meaning each input has exactly one label attached to it, but more than one possible label exists (hence multi-class). For example, we have an image dataset containing passports, driver's licenses, and credit cards, and the task is to assign each image to exactly one of these three classes.

Suitable loss functions here could be the following:

  • categorical_crossentropy (CCE)
  • sparse_categorical_crossentropy (SCCE)

CCE requires the labels (not the input features) to be one-hot encoded. It calculates the average difference between the actual and predicted probability distributions across all classes in the problem. The score is minimized; a perfect cross-entropy value is 0.

With CCE, the output layer has n nodes, one for each class, and a softmax activation turns them into a probability for each class.

SCCE doesn't require the target variable to be one-hot encoded before training; the labels are simply passed as integer class indices. The reasoning is that when the number of classes grows large, one-hot encoded targets require significant memory and slow down training. Using SCCE therefore saves resources (memory and compute) during training, which helps with scaling.

In summary: with CCE the targets must be one-hot encoded, while with SCCE they are encoded as plain integers. Both assume mutually exclusive classes (each sample belongs to exactly one class), so the choice between them is about label format and memory rather than problem type; for multi-label problems, where one sample can carry several labels, the usual choice is sigmoid outputs with binary cross-entropy instead.
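
As a hedged illustration (assuming TensorFlow/Keras is available; the image shape, model size, and labels are made up for demonstration), the only difference between the two losses is how the targets are encoded:

```python
import numpy as np
import tensorflow as tf

num_classes = 3  # e.g. passport, driver's license, credit card

# Toy model: flatten a small image and end in a softmax layer with one node per class
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

images = np.random.rand(8, 32, 32, 3).astype("float32")  # made-up batch
int_labels = np.array([0, 2, 1, 1, 0, 2, 2, 1])           # integer class indices

# Option 1: sparse categorical cross-entropy, targets stay as integers
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(images, int_labels, epochs=1, verbose=0)

# Option 2: categorical cross-entropy, targets must be one-hot encoded
one_hot_labels = tf.keras.utils.to_categorical(int_labels, num_classes)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(images, one_hot_labels, epochs=1, verbose=0)
```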
