## Common Activation Functions

Placeholder

- ReLU: stops learning when its inputs fall into the negative domain, since the activation (and therefore the gradient) is 0 there.
- Sigmoid: squashes any input into the range (0, 1); common for binary outputs.
- Softmax: turns a vector of scores into a probability distribution; used for classification problems.
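
The three activations above can be sketched in a few lines of NumPy (a minimal illustration, not a production implementation):

```python
import numpy as np

def relu(x):
    # Zero for all negative inputs -- the source of "dying ReLU"
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the max for numerical stability; output sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))     # negative inputs map to 0
print(sigmoid(x))  # each value lies in (0, 1)
print(softmax(x))  # a probability distribution over 3 classes
```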

## Failure modes for Gradient Descent

| Problem | Insight | Solution |
| --- | --- | --- |
| Gradients can vanish | Each additional layer can reduce the signal vs. noise | Using ReLU instead of sigmoid/tanh can help |
| Gradients can explode | Learning rates are important here | Batch normalization (a useful knob) can help |
| ReLU layers can die | Monitor the fraction of zero weights in TensorBoard | Lower your learning rates |
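
The dead-ReLU check in the table amounts to watching how many activations are exactly zero. A minimal NumPy sketch of that diagnostic (the layer and its negatively-biased pre-activations are hypothetical, chosen to make the problem visible):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations for one layer across a batch of 256,
# deliberately biased negative so many units land in ReLU's dead zone
pre_acts = rng.normal(loc=-1.0, scale=1.0, size=(256, 64))
acts = np.maximum(0.0, pre_acts)  # ReLU

# Fraction of activations that are exactly zero -- the quantity you
# would watch in TensorBoard; values near 1.0 suggest dying ReLUs
zero_fraction = np.mean(acts == 0.0)
print(f"fraction of zero activations: {zero_fraction:.2f}")
```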

## Regression

Using mean squared error, our loss is

$$L = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ is the true value and $\hat{y}_i$ the prediction for example $i$.
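
A minimal NumPy sketch of the MSE computation (the sample values are made up for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared residuals over all N examples
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y_true, y_pred))  # (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
```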

## Classification

The main loss function considered here is *cross-entropy loss* (also called log loss). It applies to all kinds of classification problems: binary, multi-class with a single label per example, and multi-label with multiple classes per example.

For the sake of simplicity, let's divide classification problems into binary and multi-class.

Placeholder.

### Binary Classification problems

For binary classification, a simple sigmoid activation function will do. The output falls in the range (0, 1), and a threshold (typically 0.5) determines whether the prediction is a yay (> 0.5) or a nay (< 0.5).
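
A minimal sketch of that decision rule (the function name `predict_binary` and the 0.5 threshold are illustrative choices, not from any particular library):

```python
import numpy as np

def predict_binary(logit, threshold=0.5):
    # Sigmoid maps the raw model score into (0, 1), then we threshold
    prob = 1.0 / (1.0 + np.exp(-logit))
    return prob, ("yay" if prob > threshold else "nay")

print(predict_binary(2.0))   # positive score -> probability above 0.5 -> yay
print(predict_binary(-1.5))  # negative score -> probability below 0.5 -> nay
```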

### Multi-class classification problems

Say we want to determine the outcome for a one-hot encoded input, meaning each input has exactly one label attached to it, but more than two labels exist (hence multi-class). For example, we have an image dataset containing passports, driver's licenses, and credit cards. Our task is to correctly classify each input into one of these three classes.

Suitable loss functions here could be the following:

- categorical_crossentropy (CCE)
- sparse_categorical_crossentropy (SCCE)

CCE requires that the targets are one-hot encoded. It calculates the average difference between the actual and predicted probability distributions across all classes. The score is minimized, and a perfect cross-entropy value is 0.

The output layer for CCE has *n* nodes, one for each class, and a softmax activation function produces a probability for each class.
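
A minimal NumPy sketch of CCE over the hypothetical three-class document example (the logits and labels are made up; the math matches the softmax + negative-log-likelihood description above):

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax with the usual max-subtraction for stability
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_onehot, logits):
    # Average negative log-probability assigned to the true class
    probs = softmax(logits)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=-1))

# Two examples, three classes: passport, driver's license, credit card
logits = np.array([[4.0, 1.0, 0.5],
                   [0.2, 3.5, 0.1]])
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])
print(categorical_crossentropy(y_onehot, logits))  # near 0: confident, correct predictions
```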

SCCE doesn't require the target variable to be one-hot encoded prior to training. The reasoning: as the number of classes grows, one-hot label vectors require significant memory and slow down training. Using SCCE therefore saves resources (memory, and some compute) during training, which definitely helps with scaling.

In summary: with CCE you one-hot encode the labels, and with SCCE you encode them as plain integers. Both losses assume **mutually exclusive** classes (each sample belongs to exactly one class); they compute the same quantity and differ only in label format. For multi-label problems, where one sample can carry several labels, use a sigmoid output with binary cross-entropy instead.
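
A minimal sketch of SCCE that makes the equivalence concrete: it indexes the true-class probability directly from integer labels, instead of multiplying against one-hot vectors (logits and labels are illustrative):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(y_int, logits):
    # Same loss as CCE, but the true-class probability is picked out
    # by integer index -- no one-hot vectors held in memory
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(y_int)), y_int]))

logits = np.array([[4.0, 1.0, 0.5],
                   [0.2, 3.5, 0.1]])
y_int = np.array([0, 1])  # integer labels instead of one-hot rows
print(sparse_categorical_crossentropy(y_int, logits))
```

Given the same predictions, this returns exactly the value a one-hot CCE would: only the label encoding differs.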