🔹 Sigmoid Function

“A smooth gate from 0 to 1 — where learning becomes probability.”

📍 Definition

The sigmoid function maps any real-valued number to the range (0, 1):

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$

  • Input: \( x \in \mathbb{R} \)
  • Output: \( \sigma(x) \in (0, 1) \)
  • Shape: S-curve (logistic)

📘 Derivative

$$ \sigma'(x) = \sigma(x)(1 - \sigma(x)) $$

  • Used in backpropagation
  • Maximum gradient of 0.25 at \( x = 0 \), where \( \sigma(x) = 0.5 \)
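
A minimal sketch, assuming PyTorch: it evaluates \( \sigma(x) \) with torch.sigmoid and confirms via autograd that the gradient matches \( \sigma(x)(1 - \sigma(x)) \), peaking at 0.25 for \( x = 0 \).

import torch

x = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0], requires_grad=True)
y = torch.sigmoid(x)

y.sum().backward()                 # element-wise derivative of sigmoid w.r.t. x
analytic = (y * (1 - y)).detach()  # closed form: sigma(x) * (1 - sigma(x))

print(x.grad)                              # peaks at 0.25 for x = 0
print(torch.allclose(x.grad, analytic))    # True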

🧠 Intuition

  • Compresses inputs into probabilities
  • Smooth and differentiable → essential for gradient-based learning
  • Common in logistic regression and binary classifiers

🧮 Use Cases

| Application | Why Sigmoid? |
| --- | --- |
| Binary classification | Outputs probability \( P(y=1 \mid x) \) |
| Logistic regression | Direct mapping from logits to probability |
| Output neuron (1-class) | Used with sigmoid + BCE loss (see the sketch below) |
| Gating (LSTM/GRU) | Soft control over memory update (0 to 1) |
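
A hedged sketch of that sigmoid + BCE pairing, assuming PyTorch; the random logits and targets below are placeholders for the outputs of a real model. In practice the sigmoid is folded into the loss via nn.BCEWithLogitsLoss for numerical stability.

import torch
import torch.nn as nn

logits = torch.randn(8, 1)                      # stand-in for raw scores from model(x)
targets = torch.randint(0, 2, (8, 1)).float()   # binary labels

probs = torch.sigmoid(logits)                   # P(y = 1 | x), each value in (0, 1)
loss = nn.BCEWithLogitsLoss()(logits, targets)  # fused sigmoid + binary cross-entropy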

🔄 Behavior

  • \( \sigma(0) = 0.5 \)
  • \( \lim_{x \to -\infty} \sigma(x) = 0 \)
  • \( \lim_{x \to \infty} \sigma(x) = 1 \)

⚠️ Limitations

| Issue | Effect |
| --- | --- |
| Vanishing gradients | \( \sigma'(x) \to 0 \) for large \( \lvert x \rvert \) → slow learning |
| Not zero-centered | Positive-only outputs → bias in weight updates |
| Slow convergence | Inefficient in deep nets |

📊 Visualization Tip

  • Plot \( \sigma(x) \) with derivative \( \sigma'(x) \)
  • Highlight regions of gradient saturation
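
A rough plotting sketch for this tip, assuming numpy and matplotlib; the shaded bands for \( \lvert x \rvert > 4 \) mark where \( \sigma'(x) \) is effectively zero.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-8, 8, 400)
s = 1 / (1 + np.exp(-x))

plt.plot(x, s, label="sigmoid(x)")
plt.plot(x, s * (1 - s), label="sigmoid'(x)")
plt.axvspan(-8, -4, alpha=0.15)   # saturation zone: gradient ~ 0
plt.axvspan(4, 8, alpha=0.15)
plt.legend()
plt.show()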

🔸 Tanh Activation Function

“Tanh doesn’t just scale — it centers. From input noise to clean signal.”

📍 Definition

$$ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$

  • Maps real numbers to the range \( (-1, 1) \)
  • It's a rescaled sigmoid: \( \tanh(x) = 2 \cdot \sigma(2x) - 1 \)

📘 Derivative

$$ \frac{d}{dx} \tanh(x) = 1 - \tanh^2(x) $$

  • Used during backpropagation
  • Derivative peaks at \( x = 0 \)
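
A quick numerical check, assuming PyTorch: it verifies the rescaled-sigmoid identity and that the autograd gradient matches \( 1 - \tanh^2(x) \).

import torch

x = torch.linspace(-3, 3, 7, requires_grad=True)
t = torch.tanh(x)

print(torch.allclose(t, 2 * torch.sigmoid(2 * x) - 1))   # identity: tanh(x) = 2*sigmoid(2x) - 1

t.sum().backward()
print(torch.allclose(x.grad, 1 - t.detach() ** 2))        # derivative: 1 - tanh^2(x)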

🧠 Intuition

  • Zero-centered → better gradient dynamics
  • Balanced output: positive and negative values
  • Great for signal symmetry and internal state representation

🧮 Use Cases

| Application | Why Use Tanh? |
| --- | --- |
| RNN hidden states | Centered & bounded activations |
| LSTM gates & activations | Core component for memory update |
| Deep MLPs (historical) | Prevent bias toward only positive/negative flows |
| Encoder-decoder models | Neutral latent representations |

🔄 Behavior

  • \( \tanh(0) = 0 \)
  • \( \lim_{x \to -\infty} \tanh(x) = -1 \)
  • \( \lim_{x \to \infty} \tanh(x) = 1 \)

⚠️ Limitations

| Issue | Effect |
| --- | --- |
| Vanishing gradients | For large \( \lvert x \rvert \), derivative → 0 |
| Slower than ReLU | Due to exponentials in computation |
| Not sparse | All neurons activate → less efficient |

🔍 Visualize Tanh

  • Overlay \( \tanh(x) \) vs \( \sigma(x) \)
  • Plot \( \tanh'(x) = 1 - \tanh^2(x) \)
  • Highlight gradient saturation zones

🔬 Comparison Snapshot

| Function | Range | Zero-Centered | Derivative Peak | Vanishing Gradients |
| --- | --- | --- | --- | --- |
| Sigmoid | (0, 1) | ❌ No | 0.25 at \( x = 0 \) (where \( \sigma(x) = 0.5 \)) | Yes |
| Tanh | (−1, 1) | ✅ Yes | 1 at \( x = 0 \) | Yes |
| ReLU | [0, ∞) | ❌ No | Constant (1) for \( x > 0 \), else 0 | No (mostly) |

⚡ ReLU: Rectified Linear Unit

“Turn on if positive, stay off if negative — ReLU is a neural switch.”

📍 Definition

$$ \text{ReLU}(x) = \max(0, x) $$

  • Outputs the input directly if it’s positive, else zero
  • Nonlinear, yet computationally efficient

📘 Derivative

$$ \frac{d}{dx} \text{ReLU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} $$

  • Enables fast and sparse gradients
  • But zero gradient when \( x \leq 0 \) → risk of "dying ReLUs"
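
A small sketch, assuming PyTorch, showing the piecewise gradient and the sparsity ReLU induces on negative inputs.

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
y = F.relu(x)
y.sum().backward()

print(y)        # tensor([0.0000, 0.0000, 0.5000, 2.0000])  -> negatives are zeroed (sparse)
print(x.grad)   # tensor([0., 0., 1., 1.])                  -> gradient is 0 for x <= 0, 1 for x > 0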

🧠 Intuition

  • Encourages sparse activations
  • Breaks linearity while preserving simplicity
  • Ideal for deep architectures due to non-vanishing gradients

🧮 Use Cases

| Application | Why ReLU? |
| --- | --- |
| CNNs, ResNets | Fast convergence, robust gradients |
| MLPs, Transformers | Default activation in feedforward layers |
| Autoencoders | Sharp thresholding in encoders |

🔄 Behavior

  • \( \text{ReLU}(x) = x \) if \( x > 0 \)
  • \( \text{ReLU}(x) = 0 \) if \( x \leq 0 \)
  • Fast to compute, differentiable almost everywhere

⚠️ Limitations

| Issue | Effect |
| --- | --- |
| Dying ReLU | Neuron stuck at 0, never activates |
| No negative values | May cause bias in some tasks |
| Non-zero mean | Can destabilize weight updates |

🔬 Visual Comparison

| Activation | Range | Gradient | Zero-Centered | Sparsity |
| --- | --- | --- | --- | --- |
| Sigmoid | (0, 1) | Vanishes | ✖️ | ✖️ |
| Tanh | (−1, 1) | Vanishes | ✔️ | ✖️ |
| ReLU | [0, ∞) | Stable for \( x > 0 \) | ✖️ | ✔️ |

🧪 Variants to Handle ReLU’s Issues

| Variant | Formula | Benefit |
| --- | --- | --- |
| Leaky ReLU | \( \max(0.01x, x) \) | Prevents dying neurons |
| ELU | \( x \) if \( x > 0 \), else \( \alpha(e^x - 1) \) | Smooth transition below zero |
| GELU | \( x \cdot \Phi(x) \) | Smooth, probabilistic gating |

📊 Visualization Tip

  • Plot \( \text{ReLU}(x) \) and its derivative
  • Overlay input distributions before/after ReLU to show sparsity

🌊 Leaky ReLU: A Gentle Slope Below Zero

“If you’re down, keep learning — even if it’s just a little.”

📍 Definition

$$ \text{Leaky ReLU}(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha x & \text{if } x < 0 \end{cases} $$

  • \( \alpha \): small positive slope (e.g., 0.01)
  • Allows non-zero output and gradient for negative inputs

📘 Derivative

$$ \frac{d}{dx} \text{Leaky ReLU}(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ \alpha & \text{if } x < 0 \end{cases} $$
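
A minimal autograd check, assuming PyTorch: the slope below zero is \( \alpha \) rather than 0, which is what keeps negative-side neurons trainable.

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 3.0], requires_grad=True)
y = F.leaky_relu(x, negative_slope=0.01)
y.sum().backward()

print(x.grad)   # tensor([0.0100, 1.0000]) -> alpha for x < 0, 1 for x >= 0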

🧠 Intuition

  • Acts like ReLU for positive inputs
  • Prevents dying neurons when input stays negative
  • Enables more robust gradient flow in deep networks

🧮 Use Cases

| Application | Why Leaky ReLU? |
| --- | --- |
| Deep MLPs | Keeps all neurons trainable |
| GANs | Helps avoid collapse in generators |
| Edge-sensitive tasks | Retains slight negative signal |

🔄 Behavior

  • Piecewise-linear variant of ReLU (not smooth: there is still a kink at 0)
  • Output grows linearly for both positive and negative \( x \), but with the much smaller slope \( \alpha \) on the negative side
  • Keeps a non-zero gradient everywhere, so no neuron is permanently switched off

⚠️ Trade-offs

| Pro | Con |
| --- | --- |
| Prevents dying ReLUs | Still not zero-centered |
| Simple and fast | Adds one hyperparameter \( \alpha \) |

🔍 Visualization

  • Plot \( \text{Leaky ReLU}(x) \) and compare to ReLU:
    • ReLU = sharp cutoff at 0
    • Leaky = small slope to the left

🧪 Code Example (PyTorch)


import torch
import torch.nn as nn

activation = nn.LeakyReLU(negative_slope=0.01)   # small negative slope: alpha = 0.01
print(activation(torch.tensor([-2.0, 3.0])))     # tensor([-0.0200, 3.0000])

🔬 Related Variants

| Function | Feature |
| --- | --- |
| ReLU | Simple, but can “die” |
| Leaky ReLU | Resilient to the zero-gradient problem |
| PReLU | Learnable \( \alpha \) |
| ELU, GELU | Smooth, differentiable alternatives |

🧮 Softmax Activation

“From raw scores to a probability distribution — softmax makes the model speak clearly.”

📍 Definition

For a vector of scores \( \mathbf{z} = [z_1, z_2, ..., z_K] \), the softmax function maps them to probabilities:

$$ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $$

  • Outputs a K-dimensional vector of real numbers in the range \( (0, 1) \)
  • All values sum to 1, forming a valid probability distribution

📘 Gradient (Jacobian Matrix)

The derivative of softmax is:

$$ \frac{\partial \text{Softmax}_i}{\partial z_j} = \text{Softmax}_i (\delta_{ij} - \text{Softmax}_j) $$

  • \( \delta_{ij} \): Kronecker delta
  • This forms a Jacobian matrix — used in backpropagation
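
A sketch, assuming PyTorch, that builds this Jacobian with autograd and checks it against the closed form \( \operatorname{diag}(s) - s s^\top \).

import torch
from torch.autograd.functional import jacobian

z = torch.tensor([1.0, 2.0, 0.5])
s = torch.softmax(z, dim=0)

J = jacobian(lambda v: torch.softmax(v, dim=0), z)   # autograd Jacobian w.r.t. the logits
closed_form = torch.diag(s) - torch.outer(s, s)      # J_ij = s_i * (delta_ij - s_j)

print(torch.allclose(J, closed_form))   # True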

🧠 Intuition

  • Converts logits into normalized probabilities
  • Exponentiation amplifies differences between scores
  • Used with Cross-Entropy Loss for classification tasks

🧮 Use Cases

| Application | Why Softmax? |
| --- | --- |
| Multiclass classification | Gives per-class probabilities |
| Language models | Probabilities over vocabulary |
| Attention mechanisms | Normalize attention weights |
| Policy learning in RL | Probabilistic action distributions |

🔄 Behavior

  • Input: arbitrary real numbers
  • Output: non-negative values that sum to 1
  • Sensitive to relative differences between inputs

⚠️ Numerical Stability

To prevent overflow in the exponentials:

$$ \text{Softmax}(z_i) = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_j e^{z_j - \max(\mathbf{z})}} $$

This subtraction doesn’t change the result — just stabilizes the computation.
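
A hedged sketch of the max-subtraction trick in plain PyTorch; torch.softmax already applies it internally, so this is only to make the idea concrete.

import torch

def stable_softmax(z: torch.Tensor) -> torch.Tensor:
    # Shift so the largest exponent is e^0 = 1; the ratios are unchanged.
    shifted = z - z.max(dim=-1, keepdim=True).values
    exps = shifted.exp()
    return exps / exps.sum(dim=-1, keepdim=True)

z = torch.tensor([[1000.0, 1001.0, 1002.0]])   # naive exp(z) would overflow to inf
print(stable_softmax(z))                       # tensor([[0.0900, 0.2447, 0.6652]])
print(torch.softmax(z, dim=-1))                # same values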

🔬 Softmax + Cross-Entropy = Fast Training

In practice, we fuse softmax + log + loss into one operation:

$$ \text{Loss} = -\log \left( \text{Softmax}_{\text{true class}} \right) $$

This is what PyTorch’s nn.CrossEntropyLoss() does internally.
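
A small check, assuming PyTorch and toy logits/targets, that the fused loss equals log-softmax followed by negative log-likelihood.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)                # 4 samples, 3 classes, raw scores (no softmax applied)
targets = torch.tensor([0, 2, 1, 2])

fused = F.cross_entropy(logits, targets)                      # softmax + log + NLL in one op
two_step = F.nll_loss(F.log_softmax(logits, dim=1), targets)  # equivalent two-step version

print(torch.allclose(fused, two_step))    # True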

📦 PyTorch Example


import torch.nn as nn

logits = model(x)                  # raw, unnormalized class scores (model and x defined elsewhere)
probs = nn.Softmax(dim=1)(logits)  # normalize along the class dimension so each row sums to 1

📊 Visualization Tip

  • Plot logits → softmax → output probabilities
  • Show how increasing one score suppresses others

🧠 GELU: Gaussian Error Linear Unit

“Let your neurons fire with probability — not just thresholds.”

📍 Definition

GELU blends the ideas of ReLU and a Gaussian distribution. It outputs the input scaled by the probability that a standard normal variable is less than the input.

Mathematically:

$$ \text{GELU}(x) = x \cdot \Phi(x) $$

Where:

  • \( \Phi(x) \) is the CDF of the standard normal distribution

Approximation (faster to compute):

$$ \text{GELU}(x) \approx 0.5x \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \left(x + 0.044715x^3\right)\right]\right) $$
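
A sketch comparing the exact erf-based form with the tanh approximation above, assuming PyTorch; nn.functional.gelu uses the exact form by default.

import math
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 9)

exact = x * 0.5 * (1 + torch.erf(x / math.sqrt(2)))   # x * Phi(x) via the error function
approx = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

print(torch.allclose(exact, F.gelu(x)))    # True
print((exact - approx).abs().max())        # tiny (< 1e-3)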

📘 Derivative

  • The derivative is smooth and non-zero everywhere — great for gradient flow
  • No sharp switch like ReLU
  • Harder to compute by hand, but handled easily via auto-diff frameworks

🧠 Intuition

  • Unlike ReLU (hard threshold) or sigmoid (squash), GELU weighs the input based on how “likely” it is to pass — as if each neuron makes a probabilistic decision
  • Used in Transformer blocks (BERT, GPT, T5)
  • Introduces non-linearity with softness

🧮 Use Cases

| Application | Why GELU? |
| --- | --- |
| Transformers (BERT, GPT) | Stable training, better performance |
| MLPs in vision/language | Smooth gradient flow |
| Modern deep nets | Often outperforms ReLU in benchmarks |

🔄 Behavior

  • Smooth, differentiable everywhere
  • Slightly saturates for very negative values
  • Non-monotonic for negative inputs: the output dips slightly below zero (to about −0.17) before approaching 0

⚠️ Limitations

| Issue | Effect |
| --- | --- |
| Heavier compute | Requires the Gaussian CDF (erf) or a tanh-based approximation |
| No zero mean | Can still bias activations |
| Nonlinear but not sparse | Unlike ReLU, negative inputs are not zeroed exactly |

🔬 Visual Comparison

| Function | Range | Zero-Centered | Gradient Stability | Smooth? |
| --- | --- | --- | --- | --- |
| ReLU | [0, ∞) | ❌ | Stable for \( x > 0 \), zero otherwise | ❌ (kink at 0) |
| Tanh | (−1, 1) | ✅ | ❌ (vanishes) | ✅ |
| GELU | ≈ (−0.17, ∞) | ❌ | ✅ Stable | ✅ |

📦 PyTorch Example


import torch
import torch.nn as nn

activation = nn.GELU()                             # exact (erf-based) form by default
print(activation(torch.tensor([-1.0, 0.0, 1.0])))  # tensor([-0.1587, 0.0000, 0.8413])

📊 Visualization Tip

  • Plot GELU vs ReLU and sigmoid
  • Overlay their gradients
  • Highlight smooth “tail” behavior of GELU vs hard ReLU cutoff
