🔹 Sigmoid Function
“A smooth gate from 0 to 1 — where learning becomes probability.”
📍 Definition
The sigmoid function maps any real-valued number to the range (0, 1):
$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
- Input: \( x \in \mathbb{R} \)
- Output: \( \sigma(x) \in (0, 1) \)
- Shape: S-curve (logistic)
📘 Derivative
$$ \sigma'(x) = \sigma(x)(1 - \sigma(x)) $$
- Used in backpropagation
- Maximum gradient of 0.25 at \( x = 0 \), where \( \sigma(x) = 0.5 \)
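A minimal PyTorch sketch (the sample points are arbitrary) that checks this closed-form derivative against autograd:

```python
import torch

x = torch.linspace(-6, 6, 7, requires_grad=True)
y = torch.sigmoid(x)

# Backpropagate a vector of ones to get dσ/dx at every sample point
y.backward(torch.ones_like(y))

# Autograd gradient matches the closed form σ(x)(1 − σ(x)); the peak is 0.25 at x = 0
print(torch.allclose(x.grad, (y * (1 - y)).detach()))  # True
print(x.grad.max())                                    # tensor(0.2500)
```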
🧠 Intuition
- Compresses inputs into probabilities
- Smooth and differentiable → essential for gradient-based learning
- Common in logistic regression and binary classifiers
🧮 Use Cases
Application | Why Sigmoid? |
---|---|
Binary classification | Outputs probability \( P(y=1 \mid x) \) |
Logistic regression | Direct mapping from logits to probability |
Single output neuron | Paired with binary cross-entropy (BCE) loss for yes/no predictions |
Gating (LSTM/GRU) | Soft control over memory update (0 to 1) |
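A short sketch of the binary-classification setup; the logits and targets below are made up to stand in for a real model's output, and nn.BCEWithLogitsLoss fuses the sigmoid and BCE steps for numerical stability:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 1)                        # stand-in for a model's raw scores
targets = torch.tensor([[1.], [0.], [1.], [0.]])

probs = torch.sigmoid(logits)                     # P(y = 1 | x), each value in (0, 1)
loss = nn.BCEWithLogitsLoss()(logits, targets)    # sigmoid + BCE fused, numerically stable
print(probs, loss)
```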
🔄 Behavior
- \( \sigma(0) = 0.5 \)
- \( \lim_{x \to -\infty} \sigma(x) = 0 \)
- \( \lim_{x \to \infty} \sigma(x) = 1 \)
⚠️ Limitations
Issue | Effect |
---|---|
Vanishing gradients | \( \sigma'(x) \to 0 \) for large \( \lvert x \rvert \) → slow learning |
Not zero-centered | Positive-only outputs → bias in weight updates |
Slow convergence | Inefficient in deep nets |
📊 Visualization Tip
- Plot \( \sigma(x) \) with derivative \( \sigma'(x) \)
- Highlight regions of gradient saturation
🔸 Tanh Activation Function
“Tanh doesn’t just scale — it centers. From input noise to clean signal.”
📍 Definition
$$ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$
- Maps real numbers to the range \( (-1, 1) \)
- It's a rescaled sigmoid: \( \tanh(x) = 2 \cdot \sigma(2x) - 1 \)
📘 Derivative
$$ \frac{d}{dx} \tanh(x) = 1 - \tanh^2(x) $$
- Used during backpropagation
- Derivative peaks at \( x = 0 \)
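A quick PyTorch check (arbitrary sample points) of the rescaled-sigmoid identity and the closed-form derivative:

```python
import torch

x = torch.linspace(-3, 3, 7)

# tanh is a rescaled, recentred sigmoid: tanh(x) = 2·σ(2x) − 1
print(torch.allclose(torch.tanh(x), 2 * torch.sigmoid(2 * x) - 1))  # True

# Closed-form derivative 1 − tanh²(x); it peaks at 1 when x = 0
print(1 - torch.tanh(x) ** 2)
```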
🧠 Intuition
- Zero-centered → better gradient dynamics
- Balanced output: positive and negative values
- Great for signal symmetry and internal state representation
🧮 Use Cases
Application | Why Use Tanh? |
---|---|
RNN hidden states | Centered & bounded activations |
LSTM gates & activations | Core component for memory update |
Deep MLPs (historical) | Prevent bias toward only positive/negative flows |
Encoder-decoder models | Neutral latent representations |
🔄 Behavior
- \( \tanh(0) = 0 \)
- \( \lim_{x \to -\infty} \tanh(x) = -1 \)
- \( \lim_{x \to \infty} \tanh(x) = 1 \)
⚠️ Limitations
Issue | Effect |
---|---|
Vanishing gradients | For large \( \lvert x \rvert \), derivative → 0 |
Slower than ReLU | Due to exponentials in computation |
Not sparse | All neurons activate → less efficient |
🔍 Visualize Tanh
- Overlay \( \tanh(x) \) vs \( \sigma(x) \)
- Plot \( \tanh'(x) = 1 - \tanh^2(x) \)
- Highlight gradient saturation zones
🔬 Comparison Snapshot
Function | Range | Zero-Centered | Derivative Peak | Vanishing Gradients |
---|---|---|---|---|
Sigmoid | (0, 1) | ❌ No | 0.25 at \( x = 0 \) | Yes |
Tanh | (−1, 1) | ✅ Yes | 1 at \( x = 0 \) | Yes |
ReLU | [0, ∞) | ❌ No | Constant or 0 | No (mostly) |
⚡ ReLU: Rectified Linear Unit
“Turn on if positive, stay off if negative — ReLU is a neural switch.”
📍 Definition
$$ \text{ReLU}(x) = \max(0, x) $$
- Outputs the input directly if it’s positive, else zero
- Nonlinear, yet computationally efficient
📘 Derivative
$$ \frac{d}{dx} \text{ReLU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} $$
- Enables fast and sparse gradients
- But zero gradient when \( x \leq 0 \) → risk of "dying ReLUs"
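A small PyTorch sketch (the input values are arbitrary) showing both the sparse outputs and the zero gradient for non-positive inputs:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0], requires_grad=True)
y = F.relu(x)
y.backward(torch.ones_like(y))

print(y)       # tensor([0.0, 0.0, 0.0, 0.5, 2.0]): negative inputs are clipped to zero
print(x.grad)  # tensor([0.0, 0.0, 0.0, 1.0, 1.0]): no gradient where x <= 0 ("dying ReLU" risk)
```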
🧠 Intuition
- Encourages sparse activations
- Breaks linearity while preserving simplicity
- Ideal for deep architectures due to non-vanishing gradients
🧮 Use Cases
Application | Why ReLU? |
---|---|
CNNs, ResNets | Fast convergence, robust gradients |
MLPs, Transformers | Default activation in feedforward layers |
Autoencoders | Sharp thresholding in encoders |
🔄 Behavior
- \( \text{ReLU}(x) = x \) if \( x > 0 \)
- \( \text{ReLU}(x) = 0 \) if \( x \leq 0 \)
- Fast to compute, differentiable almost everywhere
⚠️ Limitations
Issue | Effect |
---|---|
Dying ReLU | Neuron stuck at 0, never activates |
No negative values | May cause bias in some tasks |
Non-zero mean | Can destabilize weight updates |
🔬 Visual Comparison
Activation | Range | Gradient | Zero-Centered | Sparsity |
---|---|---|---|---|
Sigmoid | (0, 1) | Vanishes | ❌ | ✖️ |
Tanh | (−1, 1) | Vanishes | ✅ | ✖️ |
ReLU | [0, ∞) | Stable | ❌ | ✅ |
🧪 Variants to Handle ReLU’s Issues
Variant | Formula | Benefit |
---|---|---|
Leaky ReLU | \( \max(0.01x, x) \) | Prevents dying neurons |
ELU | \( x \) if \( x > 0 \), else \( \alpha(e^x - 1) \) | Smooth transition below zero |
GELU | \( x \cdot \Phi(x) \) | Smooth, probabilistic gating |
📊 Visualization Tip
- Plot \( \text{ReLU}(x) \) and its derivative
- Overlay input distributions before/after ReLU to show sparsity
🌊 Leaky ReLU: A Gentle Slope Below Zero
“If you’re down, keep learning — even if it’s just a little.”
📍 Definition
$$ \text{Leaky ReLU}(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha x & \text{if } x < 0 \end{cases} $$
- \( \alpha \): small positive slope (e.g., 0.01)
- Allows non-zero output and gradient for negative inputs
📘 Derivative
$$ \frac{d}{dx} \text{Leaky ReLU}(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ \alpha & \text{if } x < 0 \end{cases} $$
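A minimal PyTorch comparison (arbitrary inputs) showing that, unlike ReLU, the gradient for negative inputs is \( \alpha \) rather than zero:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 2.0], requires_grad=True)
y = F.leaky_relu(x, negative_slope=0.01)
y.backward(torch.ones_like(y))

print(y)       # tensor([-0.0200, 2.0000]): negative inputs keep a small signal
print(x.grad)  # tensor([0.0100, 1.0000]): gradient is alpha, not zero, for x < 0
```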
🧠 Intuition
- Acts like ReLU for positive inputs
- Prevents dying neurons when input stays negative
- Enables more robust gradient flow in deep networks
🧮 Use Cases
Application | Why Leaky ReLU? |
---|---|
Deep MLPs | Keeps all neurons trainable |
GANs | Helps avoid collapse in generators |
Edge-sensitive tasks | Retains slight negative signal |
🔄 Behavior
- Piecewise-linear variant of ReLU (still has a kink at 0, but no flat region)
- Output is linear on both sides of zero, with a much smaller slope (\( \alpha \)) for negative inputs
- Maintains gradient diversity
⚠️ Trade-offs
Pro | Con |
---|---|
Prevents dying ReLUs | Still not zero-centered |
Simple and fast | Adds one hyperparameter \( \alpha \) |
🔍 Visualization
- Plot \( \text{Leaky ReLU}(x) \) and compare to ReLU:
  - ReLU: sharp cutoff at 0
  - Leaky ReLU: small slope to the left of 0
🧪 Code Example (PyTorch)
```python
import torch.nn as nn

activation = nn.LeakyReLU(negative_slope=0.01)
```
🔬 Related Variants
Function | Feature |
---|---|
ReLU | Simple, but can “die” |
Leaky ReLU | Resilient to zero-gradient problem |
PReLU | Learnable \( \alpha \) |
ELU, GELU | Smooth, differentiable alternatives |
🧮 Softmax Activation
“From raw scores to a probability distribution — softmax makes the model speak clearly.”
📍 Definition
For a vector of scores \( \mathbf{z} = [z_1, z_2, ..., z_K] \), the softmax function maps them to probabilities:
$$ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $$
- Outputs a K-dimensional vector of real numbers in the range \( (0, 1) \)
- All values sum to 1, forming a valid probability distribution
📘 Gradient (Jacobian Matrix)
The derivative of softmax is:
$$ \frac{\partial \text{Softmax}_i}{\partial z_j} = \text{Softmax}_i (\delta_{ij} - \text{Softmax}_j) $$
- \( \delta_{ij} \): Kronecker delta
- This forms a Jacobian matrix — used in backpropagation
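A short PyTorch check (three arbitrary logits) that the closed-form Jacobian \( \text{diag}(\mathbf{s}) - \mathbf{s}\mathbf{s}^\top \) matches autograd:

```python
import torch

z = torch.tensor([2.0, 1.0, 0.1])
s = torch.softmax(z, dim=0)

# Jacobian via autograd: J[i, j] = d Softmax_i / d z_j
J_auto = torch.autograd.functional.jacobian(lambda v: torch.softmax(v, dim=0), z)

# Closed form from above: Softmax_i (δ_ij − Softmax_j) = diag(s) − s sᵀ
J_formula = torch.diag(s) - torch.outer(s, s)

print(torch.allclose(J_auto, J_formula, atol=1e-6))  # True
```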
🧠 Intuition
- Converts logits into normalized probabilities
- Exponentiation amplifies differences between scores
- Used with Cross-Entropy Loss for classification tasks
🧮 Use Cases
Application | Why Softmax? |
---|---|
Multiclass classification | Gives per-class probabilities |
Language models | Probabilities over vocabulary |
Attention mechanisms | Normalize attention weights |
Policy learning in RL | Probabilistic action distributions |
🔄 Behavior
- Input: arbitrary real numbers
- Output: non-negative values that sum to 1
- Sensitive to relative differences between inputs
⚠️ Numerical Stability
To prevent overflow in the exponentials:
$$ \text{Softmax}(z_i) = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_j e^{z_j - \max(\mathbf{z})}} $$
This subtraction doesn’t change the result — just stabilizes the computation.
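A small demonstration (with deliberately huge logits) of why the max-subtraction trick matters; torch.softmax already applies it internally:

```python
import torch

z = torch.tensor([1000.0, 1001.0, 1002.0])             # large logits overflow exp() in float32

naive = torch.exp(z) / torch.exp(z).sum()               # exp(1000) = inf, so this is all NaN
stable = torch.exp(z - z.max()) / torch.exp(z - z.max()).sum()

print(naive)                                            # tensor([nan, nan, nan])
print(stable)                                           # tensor([0.0900, 0.2447, 0.6652])
print(torch.allclose(stable, torch.softmax(z, dim=0)))  # True: the built-in is already stabilized
```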
🔬 Softmax + Cross-Entropy = Fast Training
In practice, we fuse softmax + log + loss into one operation:
$$ \text{Loss} = -\log \left( \text{Softmax}_{\text{true class}} \right) $$
This is what PyTorch's nn.CrossEntropyLoss() does internally: it combines log-softmax and negative log-likelihood in one numerically stable step.
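A quick sketch (random logits and targets) verifying that the fused loss equals the manual "-log(softmax) at the true class" computation:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(3, 5)                 # batch of 3 examples, 5 classes
target = torch.tensor([0, 2, 4])           # true class index per example

fused = F.cross_entropy(logits, target)    # softmax + log + NLL in a single call

# Manual version: take -log(softmax) at the true class, then average over the batch
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(3), target].mean()

print(torch.allclose(fused, manual))       # True
```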
📦 PyTorch Example
```python
import torch.nn as nn

logits = model(x)
probs = nn.Softmax(dim=1)(logits)  # Apply along the class dimension
```
📊 Visualization Tip
- Plot logits → softmax → output probabilities
- Show how increasing one score suppresses others
🧠 GELU: Gaussian Error Linear Unit
“Let your neurons fire with probability — not just thresholds.”
📍 Definition
GELU blends the ideas of ReLU and a Gaussian distribution. It outputs the input scaled by the probability that a standard normal variable is less than the input.
Mathematically:
$$ \text{GELU}(x) = x \cdot \Phi(x) $$
Where:
- \( \Phi(x) \) is the CDF of the standard normal distribution
Approximation (faster to compute):
$$ \text{GELU}(x) \approx 0.5x \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \left(x + 0.044715x^3\right)\right]\right) $$
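A PyTorch sketch (arbitrary sample points) comparing the exact erf-based form with the tanh approximation above:

```python
import math
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 9)

exact = x * 0.5 * (1 + torch.erf(x / math.sqrt(2)))   # x · Φ(x), with Φ via the error function
approx = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

print(torch.allclose(exact, F.gelu(x), atol=1e-6))    # True: matches PyTorch's default GELU
print((exact - approx).abs().max())                   # the tanh approximation error is tiny
```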
📘 Derivative
- The derivative is smooth and, unlike ReLU, has no hard zero region for negative inputs, which helps gradient flow
- No sharp switch like ReLU
- Harder to compute by hand, but handled easily via auto-diff frameworks
🧠 Intuition
- Unlike ReLU (hard threshold) or sigmoid (squash), GELU weighs the input based on how “likely” it is to pass — as if each neuron makes a probabilistic decision
- Used in Transformer blocks (BERT, GPT, T5)
- Introduces non-linearity with softness
🧮 Use Cases
Application | Why GELU? |
---|---|
Transformers (BERT, GPT) | Stable training, better performance |
MLPs in vision/language | Smooth gradient flow |
Modern deep nets | Often outperforms ReLU in benchmarks |
🔄 Behavior
- Smooth, differentiable everywhere
- Slightly saturates for very negative values
- Non-monotonic: for moderately negative inputs the output dips slightly below zero before returning toward 0
⚠️ Limitations
Issue | Effect |
---|---|
Heavier compute | Requires the Gaussian CDF (or its tanh approximation) |
No zero mean | Can still bias activations |
Nonlinear but not sparse | Unlike ReLU |
🔬 Visual Comparison
Function | Range | Zero-Centered | Gradient Stability | Smooth? |
---|---|---|---|---|
ReLU | [0, ∞) | ❌ | ✅ | ❌ |
Tanh | (−1, 1) | ✅ | ❌ (vanishes) | ✅ |
GELU | ≈ [−0.17, ∞) | ❌ (mostly positive) | ✅ | ✅ |
📦 PyTorch Example
```python
import torch.nn as nn

activation = nn.GELU()
```
📊 Visualization Tip
- Plot GELU vs ReLU and sigmoid
- Overlay their gradients
- Highlight smooth “tail” behavior of GELU vs hard ReLU cutoff