18.1 Activation Functions
Activation functions are essential components of neural networks: they introduce the non-linearities that enable a model to learn complex patterns. Without them, a stack of layers collapses into a single linear (affine) transformation, so the network is no more expressive than a linear model, however many layers it has. This section explores four widely used activation functions, Sigmoid, ReLU, Tanh, and Softmax, and discusses their properties, advantages, and limitations. The right choice depends on the architecture and the goal of the model.
| Function | Formula | Use Case |
|---|---|---|
| Sigmoid | \(\sigma(x) = \frac{1}{1 + e^{-x}}\) | Binary classification (outputs between 0 and 1) |
| ReLU | \(f(x) = \max(0, x)\) | Default choice for hidden layers (fast computation) |
| Tanh | \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\) | Similar to sigmoid but outputs between -1 and 1 |
| Softmax | \(\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\) | Multi-class classification (outputs probabilities) |
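To see why the non-linearity matters, consider two stacked layers with no activation between them. As a sketch, with weight matrices \(W_1, W_2\) and bias vectors \(b_1, b_2\) (symbols introduced here for illustration only):

\[ W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2) = W' x + b' \]

The composition is again a single affine map, so depth alone adds no expressive power. Placing a non-linear function between the layers is what prevents this collapse.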
Sigmoid Function: The sigmoid function, \(\sigma(x) = \frac{1}{1 + e^{-x}}\), maps any real-valued input to a value between 0 and 1. Because the output can be read as a probability, sigmoid is well suited to binary classification and to output layers where a probabilistic interpretation is needed.
Advantages:
Useful when outputs need to be interpreted as probabilities.
Differentiable, allowing gradient-based optimization.
Disadvantages:
Vanishing Gradients: For inputs of large magnitude (strongly positive or strongly negative), the function saturates and gradients become nearly zero, slowing down learning.
Not Zero-Centered: Outputs are always positive, leading to inefficient weight updates.
Computationally Expensive: Involves exponentiation operations.
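The saturation behind the vanishing-gradient problem is easy to check numerically. The following is a minimal sketch using only the Python standard library; the helper names `sigmoid` and `sigmoid_grad` are chosen here for illustration, and the branch on the sign of `x` is one common way to avoid overflow, not the only one.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(-x) could overflow, so use the equivalent form
    # e^x / (1 + e^x), where e^x is at most 1.
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks rapidly as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.6f}  grad={sigmoid_grad(x):.6f}")
```

The largest gradient the sigmoid can pass back is 0.25 (at \(x = 0\)), so even in the best case each sigmoid layer scales the backpropagated signal down.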
ReLU (Rectified Linear Unit) Function: ReLU is one of the most popular activation functions due to its simplicity and effectiveness. It outputs the input directly if the input is positive and zero otherwise: \(f(x) = \max(0, x)\). Although each piece is linear, the function as a whole is non-linear, and it is cheap to compute.
Advantages:
Avoids Vanishing Gradient (for positive inputs): Unlike sigmoid, gradients remain strong for active neurons.
Fast Computation: No complex exponentials.
Sparsity: Can deactivate neurons (output zero), making the network more efficient.
Disadvantages:
Dying ReLU Problem: Neurons whose inputs are consistently negative output zero and receive zero gradient, so they stop learning entirely.
Not Zero-Centered: Like sigmoid, can lead to slower convergence.
What is the Dying ReLU Problem? If a neuron consistently receives negative inputs, its output is zero and, because the gradient is also zero in that region, its weights stop updating. Over time, some neurons can “die” this way and never activate again, reducing the model’s capacity to learn.
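The effect can be reproduced with a toy example. The sketch below assumes a hypothetical single neuron \(y = \text{ReLU}(wx + b)\) trained with a squared-error loss; the weight, bias, inputs, target, and learning rate are made-up values chosen so that every pre-activation is negative.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def relu_grad(x: float) -> float:
    # Convention assumed here: the gradient at exactly x == 0 is taken as 0.
    return 1.0 if x > 0 else 0.0

# Hypothetical neuron whose bias has been pushed far into the negative range.
w, b = 0.5, -10.0
inputs = [0.5, 1.0, 2.0, 3.0]
target = 1.0
lr = 0.1

grad_w = 0.0
grad_b = 0.0
for x in inputs:
    z = w * x + b                   # pre-activation, negative for every input
    y = relu(z)                     # so the output is always 0
    dloss_dy = 2.0 * (y - target)   # derivative of (y - target)^2
    dloss_dz = dloss_dy * relu_grad(z)  # chain rule through ReLU: times 0
    grad_w += dloss_dz * x
    grad_b += dloss_dz

# One gradient-descent step leaves the parameters unchanged.
w_new = w - lr * grad_w
b_new = b - lr * grad_b
print("gradients:", grad_w, grad_b)
print("parameters before:", (w, b), "after:", (w_new, b_new))
```

Even though the loss is non-zero, both gradients are zero, so no amount of further training on these inputs will revive the neuron.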
Solutions to Dying ReLU:
Leaky ReLU: Allows a small negative slope (e.g., 0.01) for negative inputs:
\[ \text{Leaky ReLU}(x) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha x, & \text{if } x < 0 \end{cases} \]
where \(\alpha\) is a small positive constant (e.g., 0.01).
Parametric ReLU (PReLU): Learns the negative slope during training.
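Exponential Linear Unit (ELU): Replaces the flat zero region with a smooth exponential curve, \(\alpha(e^x - 1)\) for \(x < 0\), so negative inputs still produce a non-zero gradient.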
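As a rough illustration of these variants, the sketch below implements Leaky ReLU and ELU with their gradients in plain Python. The default values `alpha=0.01` and `alpha=1.0` are common choices assumed here, not values prescribed by this section. PReLU uses the same formula as Leaky ReLU, except that `alpha` would be a trainable parameter rather than a constant.

```python
import math

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    # The slope is alpha (not 0) for negative inputs, so learning continues.
    return 1.0 if x >= 0 else alpha

def elu(x: float, alpha: float = 1.0) -> float:
    # Smooth exponential branch that saturates at -alpha for very negative x.
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x: float, alpha: float = 1.0) -> float:
    return 1.0 if x >= 0 else alpha * math.exp(x)

for x in (-5.0, -1.0, 0.0, 1.0):
    print(f"x={x:5.1f}  leaky={leaky_relu(x):8.4f} (grad {leaky_relu_grad(x):.2f})"
          f"  elu={elu(x):8.4f} (grad {elu_grad(x):.4f})")
```

In both cases the gradient for negative inputs is small but non-zero, which is what lets a neuron recover instead of dying.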
Tanh (Hyperbolic Tangent) Function: The tanh function, \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\), is similar to sigmoid but maps inputs to a range between -1 and 1, which makes its output zero-centered.
In neural networks, an activation function is zero-centered if its output values are symmetrically distributed around zero (i.e., they have a mean of zero). This property helps in maintaining stable and efficient training by preventing systematic weight updates in a single direction.
When activation outputs are not zero-centered (e.g., sigmoid outputs lie between 0 and 1), every input passed to the next layer is positive. During backpropagation, the gradients for all weights feeding a given neuron then share the same sign, so in any one step those weights can only all increase or all decrease. The resulting zig-zag path toward the optimum slows down convergence.
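One way to see this is to write out the gradient for a single neuron. As a sketch, let the pre-activation be \(z = \sum_i w_i a_i + b\), where \(a_i\) are the previous layer's activations and \(L\) is the loss (notation assumed here for illustration):

\[ \frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z} \cdot a_i \]

If every \(a_i > 0\), as with sigmoid outputs, each \(\partial L / \partial w_i\) has the same sign as the shared factor \(\partial L / \partial z\). With zero-centered activations the \(a_i\) take both signs, so individual weights are free to move in different directions.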
Advantages:
Zero-centered output allows for better convergence during gradient descent.
Stronger gradients than sigmoid for inputs near 0.
Disadvantages:
Vanishing Gradients: Like sigmoid, tanh saturates for inputs of large magnitude, where gradients become very small.
Slightly More Computationally Expensive: Due to exponential operations.
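The two claims above, zero-centered output and stronger gradients near 0, can be checked side by side with sigmoid. This is a small sketch using the standard library; the grid of inputs from -3 to 3 is an arbitrary choice for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x: float) -> float:
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

# Inputs spread symmetrically around zero.
xs = [i / 10.0 for i in range(-30, 31)]
mean_tanh = sum(math.tanh(x) for x in xs) / len(xs)
mean_sigmoid = sum(sigmoid(x) for x in xs) / len(xs)

print(f"mean tanh output:    {mean_tanh:.6f}")
print(f"mean sigmoid output: {mean_sigmoid:.6f}")
print(f"gradient at 0: tanh={tanh_grad(0.0)}, sigmoid={sigmoid_grad(0.0)}")
print(f"gradient at 5: tanh={tanh_grad(5.0):.6f}, sigmoid={sigmoid_grad(5.0):.6f}")
```

For symmetric inputs the tanh outputs average to about 0 while the sigmoid outputs average to about 0.5, and the tanh gradient at 0 is four times the sigmoid gradient. Both gradients still shrink toward zero for inputs far from the origin.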
Softmax Function: The softmax function is typically used in the output layer of a multi-class classification model. It converts a vector of raw scores (logits) \(z\) into a probability distribution over the \(K\) output classes: \(\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\).
Advantages:
Ensures that the sum of the outputs is 1, making them interpretable as probabilities.
Highlights the highest-valued input while suppressing the rest, which helps in clear class predictions.
Disadvantages:
Exponentially sensitive to input scale, which can cause numerical instability (overflow in the exponentials) if logits are too large.
When classes are not mutually exclusive, Softmax is not ideal (use sigmoid instead for multi-label classification).
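The numerical-instability issue is usually handled by subtracting the largest logit before exponentiating, which leaves the result unchanged because softmax is invariant to adding a constant to every logit. The following is a minimal sketch of that trick in plain Python; the function name and the example logits are illustrative.

```python
import math

def softmax(logits):
    """Softmax with the max-subtraction trick for numerical stability."""
    m = max(logits)
    # After shifting, every exponent is <= 0, so exp() cannot overflow.
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, "sum =", sum(probs))

# A naive implementation would fail on logits this large.
try:
    math.exp(1000.0)
    naive_overflows = False
except OverflowError:
    naive_overflows = True
print("naive exp(1000) overflows:", naive_overflows)

# The shifted version handles them and matches softmax([0, 1, 2]).
big = softmax([1000.0, 1001.0, 1002.0])
small = softmax([0.0, 1.0, 2.0])
print(big)
print(small)
```

Deep learning frameworks typically apply an equivalent stabilization internally, often by fusing softmax with the cross-entropy loss.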
Loss Functions
Loss functions (or cost functions) measure how well a neural network’s predictions match the true target values. During training, the goal is to minimize the loss by adjusting the model’s parameters. The choice of loss function depends on the type of task:
Mean Squared Error (MSE): MSE is widely used in regression problems, such as predicting house prices or temperature, where the output is continuous and the goal is to minimize the average squared difference between predicted and actual values:
\[L_{MSE} = \frac{1}{N}\sum_{i=1}^N (y_i - \hat{y}_i)^2\]
where \(y_i\) is the true value, \(\hat{y}_i\) is the prediction, and \(N\) is the number of samples.
Because the errors are squared, larger errors are penalized more heavily. MSE is straightforward and differentiable, which makes it compatible with gradient descent, but it has notable drawbacks: the squaring makes it highly sensitive to outliers, and it performs poorly in classification tasks, often leading to slow convergence.
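The sensitivity to outliers can be seen with a small worked example. The sketch below computes MSE in plain Python; the target and prediction values are made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]

# Small errors: squared errors are 0.25, 0, 0, 1.
y_pred = [2.5, 5.0, 2.0, 8.0]
print(mse(y_true, y_pred))          # 0.3125

# Same predictions, but the last one is off by 10 instead of 1.
y_pred_outlier = [2.5, 5.0, 2.0, 17.0]
print(mse(y_true, y_pred_outlier))  # 25.0625
```

A single prediction that is 10 times further off contributes 100 times as much squared error, so it dominates the average.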
Cross-Entropy Loss: Cross-entropy loss is widely used for classification tasks, both binary and multi-class. It measures the difference between the predicted probability distribution and the true label distribution.
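For binary classification, with labels \(y_i \in \{0, 1\}\) and predicted probabilities \(\hat{y}_i\):
\[L_{BCE} = -\frac{1}{N}\sum_{i=1}^N \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]\]
For multi-class classification, with one-hot labels \(y_{i,k}\) and predicted probabilities \(\hat{y}_{i,k}\) over \(K\) classes:
\[L_{CE} = -\frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K y_{i,k} \log \hat{y}_{i,k}\]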
For binary classification, it penalizes the model when it confidently predicts the wrong class, encouraging outputs closer to the true labels. In multi-class settings, cross-entropy works with softmax outputs to handle multiple classes simultaneously. One disadvantage of cross-entropy loss is that it can become very large when the model assigns near-zero probabilities to the true class, which may cause instability during training if not handled properly.
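A common way to handle the near-zero-probability case is to clip predicted probabilities away from 0 and 1 before taking the logarithm. The sketch below shows this for both the binary and the multi-class case in plain Python. The function names and the clipping constant `EPS = 1e-12` are assumptions made for illustration, and the multi-class version takes integer class indices rather than one-hot vectors, which selects the same term of the sum.

```python
import math

EPS = 1e-12  # assumed clipping constant; frameworks use different values

def binary_cross_entropy(y_true, y_prob, eps=EPS):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def categorical_cross_entropy(true_classes, probs, eps=EPS):
    """Mean of -log(probability assigned to the true class)."""
    total = 0.0
    for k, row in zip(true_classes, probs):
        total += -math.log(max(row[k], eps))
    return total / len(true_classes)

# Confident and right vs. confident and wrong (true label is 1).
print(binary_cross_entropy([1], [0.99]))
print(binary_cross_entropy([1], [0.01]))

# Without clipping, a predicted probability of exactly 0 would give log(0).
print(binary_cross_entropy([1], [0.0]))

# Three-class example: true classes 0 and 2.
print(categorical_cross_entropy([0, 2], [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]))
```

The loss for the confidently wrong prediction is far larger than for the confidently right one, and clipping keeps the loss finite even when the predicted probability of the true class is exactly zero.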