🔹 Sigmoid Function

“A smooth gate from 0 to 1 — where learning becomes probability.”

📍 Definition

The sigmoid function maps any real-valued number to the range (0, 1):

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$

  • Input: \( x \in \mathbb{R} \)
  • Output: \( \sigma(x) \in (0, 1) \)
  • Shape: S-curve (logistic)

📘 Derivative

$$ \sigma'(x) = \sigma(x)(1 - \sigma(x)) $$

  • Used in backpropagation
  • Maximum gradient of 0.25 at \( x = 0 \), where \( \sigma(x) = 0.5 \)
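
A minimal sketch, assuming PyTorch: it evaluates \( \sigma(x) \) with torch.sigmoid and confirms via autograd that the gradient matches \( \sigma(x)(1 - \sigma(x)) \), peaking at 0.25 for \( x = 0 \).

import torch

x = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0], requires_grad=True)
y = torch.sigmoid(x)

y.sum().backward()                 # element-wise derivative of sigmoid w.r.t. x
analytic = (y * (1 - y)).detach()  # closed form: sigma(x) * (1 - sigma(x))

print(x.grad)                              # peaks at 0.25 for x = 0
print(torch.allclose(x.grad, analytic))    # True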

🧠 Intuition

  • Compresses inputs into probabilities
  • Smooth and differentiable → essential for gradient-based learning
  • Common in logistic regression and binary classifiers

🧮 Use Cases

| Application | Why Sigmoid? |
| --- | --- |
| Binary classification | Outputs probability \( P(y=1 \mid x) \) |
| Logistic regression | Direct mapping from logits to probability |
| Output neuron (1-class) | Used with sigmoid + BCE loss (see the sketch below) |
| Gating (LSTM/GRU) | Soft control over memory update (0 to 1) |
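
A hedged sketch of that sigmoid + BCE pairing, assuming PyTorch; the random logits and targets below are placeholders for the outputs of a real model. In practice the sigmoid is folded into the loss via nn.BCEWithLogitsLoss for numerical stability.

import torch
import torch.nn as nn

logits = torch.randn(8, 1)                      # stand-in for raw scores from model(x)
targets = torch.randint(0, 2, (8, 1)).float()   # binary labels

probs = torch.sigmoid(logits)                   # P(y = 1 | x), each value in (0, 1)
loss = nn.BCEWithLogitsLoss()(logits, targets)  # fused sigmoid + binary cross-entropy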

🔄 Behavior

  • \( \sigma(0) = 0.5 \)
  • \( \lim_{x \to -\infty} \sigma(x) = 0 \)
  • \( \lim_{x \to \infty} \sigma(x) = 1 \)

⚠️ Limitations

| Issue | Effect |
| --- | --- |
| Vanishing gradients | \( \sigma'(x) \to 0 \) for large \( \lvert x \rvert \) → slow learning |
| Not zero-centered | Positive-only outputs → bias in weight updates |
| Slow convergence | Inefficient in deep nets |

📊 Visualization Tip

  • Plot \( \sigma(x) \) with derivative \( \sigma'(x) \)
  • Highlight regions of gradient saturation
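
A rough plotting sketch for this tip, assuming numpy and matplotlib; the shaded bands for \( \lvert x \rvert > 4 \) mark where \( \sigma'(x) \) is effectively zero.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-8, 8, 400)
s = 1 / (1 + np.exp(-x))

plt.plot(x, s, label="sigmoid(x)")
plt.plot(x, s * (1 - s), label="sigmoid'(x)")
plt.axvspan(-8, -4, alpha=0.15)   # saturation zone: gradient ~ 0
plt.axvspan(4, 8, alpha=0.15)
plt.legend()
plt.show()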

🔸 Tanh Activation Function

“Tanh doesn’t just scale — it centers. From input noise to clean signal.”

📍 Definition

$$ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$

  • Maps real numbers to the range \( (-1, 1) \)
  • It's a rescaled sigmoid: \( \tanh(x) = 2 \cdot \sigma(2x) - 1 \)

📘 Derivative

$$ \frac{d}{dx} \tanh(x) = 1 - \tanh^2(x) $$

  • Used during backpropagation
  • Derivative peaks at \( x = 0 \)
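
A quick numerical check, assuming PyTorch: it verifies the rescaled-sigmoid identity and that the autograd gradient matches \( 1 - \tanh^2(x) \).

import torch

x = torch.linspace(-3, 3, 7, requires_grad=True)
t = torch.tanh(x)

print(torch.allclose(t, 2 * torch.sigmoid(2 * x) - 1))   # identity: tanh(x) = 2*sigmoid(2x) - 1

t.sum().backward()
print(torch.allclose(x.grad, 1 - t.detach() ** 2))        # derivative: 1 - tanh^2(x)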

🧠 Intuition

  • Zero-centered → better gradient dynamics
  • Balanced output: positive and negative values
  • Great for signal symmetry and internal state representation

🧮 Use Cases

| Application | Why Use Tanh? |
| --- | --- |
| RNN hidden states | Centered & bounded activations |
| LSTM gates & activations | Core component for memory update |
| Deep MLPs (historical) | Prevent bias toward only positive/negative flows |
| Encoder-decoder models | Neutral latent representations |

🔄 Behavior

  • \( \tanh(0) = 0 \)
  • \( \lim_{x \to -\infty} \tanh(x) = -1 \)
  • \( \lim_{x \to \infty} \tanh(x) = 1 \)

⚠️ Limitations

| Issue | Effect |
| --- | --- |
| Vanishing gradients | For large \( \lvert x \rvert \), derivative → 0 |
| Slower than ReLU | Due to exponentials in computation |
| Not sparse | All neurons activate → less efficient |

🔍 Visualize Tanh

  • Overlay \( \tanh(x) \) vs \( \sigma(x) \)
  • Plot \( \tanh'(x) = 1 - \tanh^2(x) \)
  • Highlight gradient saturation zones

🔬 Comparison Snapshot

| Function | Range | Zero-Centered | Derivative Peak | Vanishing Gradients |
| --- | --- | --- | --- | --- |
| Sigmoid | (0, 1) | ❌ No | 0.25 at \( x = 0 \) (where \( \sigma(x) = 0.5 \)) | Yes |
| Tanh | (−1, 1) | ✅ Yes | 1 at \( x = 0 \) | Yes |
| ReLU | [0, ∞) | ❌ No | Constant (1) for \( x > 0 \), else 0 | No (mostly) |

⚡ ReLU: Rectified Linear Unit

“Turn on if positive, stay off if negative — ReLU is a neural switch.”

📍 Definition

$$ \text{ReLU}(x) = \max(0, x) $$

  • Outputs the input directly if it’s positive, else zero
  • Nonlinear, yet computationally efficient

📘 Derivative

$$ \frac{d}{dx} \text{ReLU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} $$

  • Enables fast and sparse gradients
  • But zero gradient when \( x \leq 0 \) → risk of "dying ReLUs"
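
A small sketch, assuming PyTorch, showing the piecewise gradient and the sparsity ReLU induces on negative inputs.

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
y = F.relu(x)
y.sum().backward()

print(y)        # tensor([0.0000, 0.0000, 0.5000, 2.0000])  -> negatives are zeroed (sparse)
print(x.grad)   # tensor([0., 0., 1., 1.])                  -> gradient is 0 for x <= 0, 1 for x > 0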

🧠 Intuition

  • Encourages sparse activations
  • Breaks linearity while preserving simplicity
  • Ideal for deep architectures due to non-vanishing gradients

🧮 Use Cases

| Application | Why ReLU? |
| --- | --- |
| CNNs, ResNets | Fast convergence, robust gradients |
| MLPs, Transformers | Default activation in feedforward layers |
| Autoencoders | Sharp thresholding in encoders |

🔄 Behavior

  • \( \text{ReLU}(x) = x \) if \( x > 0 \)
  • \( \text{ReLU}(x) = 0 \) if \( x \leq 0 \)
  • Fast to compute, differentiable almost everywhere

⚠️ Limitations

| Issue | Effect |
| --- | --- |
| Dying ReLU | Neuron stuck at 0, never activates |
| No negative values | May cause bias in some tasks |
| Non-zero mean | Can destabilize weight updates |

🔬 Visual Comparison

| Activation | Range | Gradient | Zero-Centered | Sparsity |
| --- | --- | --- | --- | --- |
| Sigmoid | (0, 1) | Vanishes | ✖️ | ✖️ |
| Tanh | (−1, 1) | Vanishes | ✔️ | ✖️ |
| ReLU | [0, ∞) | Stable for \( x > 0 \) | ✖️ | ✔️ |

🧪 Variants to Handle ReLU’s Issues

| Variant | Formula | Benefit |
| --- | --- | --- |
| Leaky ReLU | \( \max(0.01x, x) \) | Prevents dying neurons |
| ELU | \( x \) if \( x > 0 \), else \( \alpha(e^x - 1) \) | Smooth transition below zero |
| GELU | \( x \cdot \Phi(x) \) | Smooth, probabilistic gating |

📊 Visualization Tip

  • Plot \( \text{ReLU}(x) \) and its derivative
  • Overlay input distributions before/after ReLU to show sparsity

🌊 Leaky ReLU: A Gentle Slope Below Zero

“If you’re down, keep learning — even if it’s just a little.”

📍 Definition

$$ \text{Leaky ReLU}(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha x & \text{if } x < 0 \end{cases} $$

  • \( \alpha \): small positive slope (e.g., 0.01)
  • Allows non-zero output and gradient for negative inputs

📘 Derivative

$$ \frac{d}{dx} \text{Leaky ReLU}(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ \alpha & \text{if } x < 0 \end{cases} $$
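
A minimal autograd check, assuming PyTorch: the slope below zero is \( \alpha \) rather than 0, which is what keeps negative-side neurons trainable.

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 3.0], requires_grad=True)
y = F.leaky_relu(x, negative_slope=0.01)
y.sum().backward()

print(x.grad)   # tensor([0.0100, 1.0000]) -> alpha for x < 0, 1 for x >= 0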

🧠 Intuition

  • Acts like ReLU for positive inputs
  • Prevents dying neurons when input stays negative
  • Enables more robust gradient flow in deep networks

🧮 Use Cases

| Application | Why Leaky ReLU? |
| --- | --- |
| Deep MLPs | Keeps all neurons trainable |
| GANs | Helps avoid collapse in generators |
| Edge-sensitive tasks | Retains slight negative signal |

🔄 Behavior

  • Piecewise-linear variant of ReLU (not smooth: there is still a kink at 0)
  • Output grows linearly for both positive and negative \( x \), but with the much smaller slope \( \alpha \) on the negative side
  • Keeps a non-zero gradient everywhere, so no neuron is permanently switched off

⚠️ Trade-offs

| Pro | Con |
| --- | --- |
| Prevents dying ReLUs | Still not zero-centered |
| Simple and fast | Adds one hyperparameter \( \alpha \) |

🔍 Visualization

  • Plot \( \text{Leaky ReLU}(x) \) and compare to ReLU:
    • ReLU = sharp cutoff at 0
    • Leaky = small slope to the left

🧪 Code Example (PyTorch)


import torch
import torch.nn as nn

activation = nn.LeakyReLU(negative_slope=0.01)   # small negative slope: alpha = 0.01
print(activation(torch.tensor([-2.0, 3.0])))     # tensor([-0.0200, 3.0000])

🔬 Related Variants

| Function | Feature |
| --- | --- |
| ReLU | Simple, but can “die” |
| Leaky ReLU | Resilient to the zero-gradient problem |
| PReLU | Learnable \( \alpha \) |
| ELU, GELU | Smooth, differentiable alternatives |

🧮 Softmax Activation

“From raw scores to a probability distribution — softmax makes the model speak clearly.”

📍 Definition

For a vector of scores \( \mathbf{z} = [z_1, z_2, ..., z_K] \), the softmax function maps them to probabilities:

$$ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $$

  • Outputs a K-dimensional vector of real numbers in the range \( (0, 1) \)
  • All values sum to 1, forming a valid probability distribution

📘 Gradient (Jacobian Matrix)

The derivative of softmax is:

$$ \frac{\partial \text{Softmax}_i}{\partial z_j} = \text{Softmax}_i (\delta_{ij} - \text{Softmax}_j) $$

  • \( \delta_{ij} \): Kronecker delta
  • This forms a Jacobian matrix — used in backpropagation
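
A sketch, assuming PyTorch, that builds this Jacobian with autograd and checks it against the closed form \( \operatorname{diag}(s) - s s^\top \).

import torch
from torch.autograd.functional import jacobian

z = torch.tensor([1.0, 2.0, 0.5])
s = torch.softmax(z, dim=0)

J = jacobian(lambda v: torch.softmax(v, dim=0), z)   # autograd Jacobian w.r.t. the logits
closed_form = torch.diag(s) - torch.outer(s, s)      # J_ij = s_i * (delta_ij - s_j)

print(torch.allclose(J, closed_form))   # True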

🧠 Intuition

  • Converts logits into normalized probabilities
  • Exponentiation amplifies differences between scores
  • Used with Cross-Entropy Loss for classification tasks

🧮 Use Cases

| Application | Why Softmax? |
| --- | --- |
| Multiclass classification | Gives per-class probabilities |
| Language models | Probabilities over vocabulary |
| Attention mechanisms | Normalize attention weights |
| Policy learning in RL | Probabilistic action distributions |

🔄 Behavior

  • Input: arbitrary real numbers
  • Output: non-negative values that sum to 1
  • Sensitive to relative differences between inputs

⚠️ Numerical Stability

To prevent overflow in the exponentials:

$$ \text{Softmax}(z_i) = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_j e^{z_j - \max(\mathbf{z})}} $$

This subtraction doesn’t change the result — just stabilizes the computation.
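
A hedged sketch of the max-subtraction trick in plain PyTorch; torch.softmax already applies it internally, so this is only to make the idea concrete.

import torch

def stable_softmax(z: torch.Tensor) -> torch.Tensor:
    # Shift so the largest exponent is e^0 = 1; the ratios are unchanged.
    shifted = z - z.max(dim=-1, keepdim=True).values
    exps = shifted.exp()
    return exps / exps.sum(dim=-1, keepdim=True)

z = torch.tensor([[1000.0, 1001.0, 1002.0]])   # naive exp(z) would overflow to inf
print(stable_softmax(z))                       # tensor([[0.0900, 0.2447, 0.6652]])
print(torch.softmax(z, dim=-1))                # same values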

🔬 Softmax + Cross-Entropy = Fast Training

In practice, we fuse softmax + log + loss into one operation:

$$ \text{Loss} = -\log \left( \text{Softmax}_{\text{true class}} \right) $$

This is what PyTorch’s nn.CrossEntropyLoss() does internally.
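
A small check, assuming PyTorch and toy logits/targets, that the fused loss equals log-softmax followed by negative log-likelihood.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)                # 4 samples, 3 classes, raw scores (no softmax applied)
targets = torch.tensor([0, 2, 1, 2])

fused = F.cross_entropy(logits, targets)                      # softmax + log + NLL in one op
two_step = F.nll_loss(F.log_softmax(logits, dim=1), targets)  # equivalent two-step version

print(torch.allclose(fused, two_step))    # True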

📦 PyTorch Example


import torch.nn as nn

logits = model(x)                  # raw, unnormalized class scores (model and x defined elsewhere)
probs = nn.Softmax(dim=1)(logits)  # normalize along the class dimension so each row sums to 1

📊 Visualization Tip

  • Plot logits → softmax → output probabilities
  • Show how increasing one score suppresses others

🧠 GELU: Gaussian Error Linear Unit

“Let your neurons fire with probability — not just thresholds.”

📍 Definition

GELU blends the ideas of ReLU and a Gaussian distribution. It outputs the input scaled by the probability that a standard normal variable is less than the input.

Mathematically:

$$ \text{GELU}(x) = x \cdot \Phi(x) $$

Where:

  • \( \Phi(x) \) is the CDF of the standard normal distribution

Approximation (faster to compute):

$$ \text{GELU}(x) \approx 0.5x \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \left(x + 0.044715x^3\right)\right]\right) $$
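
A sketch comparing the exact erf-based form with the tanh approximation above, assuming PyTorch; nn.functional.gelu uses the exact form by default.

import math
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 9)

exact = x * 0.5 * (1 + torch.erf(x / math.sqrt(2)))   # x * Phi(x) via the error function
approx = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

print(torch.allclose(exact, F.gelu(x)))    # True
print((exact - approx).abs().max())        # tiny (< 1e-3)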

📘 Derivative

  • The derivative is smooth and non-zero everywhere — great for gradient flow
  • No sharp switch like ReLU
  • Harder to compute by hand, but handled easily via auto-diff frameworks

🧠 Intuition

  • Unlike ReLU (hard threshold) or sigmoid (squash), GELU weighs the input based on how “likely” it is to pass — as if each neuron makes a probabilistic decision
  • Used in Transformer blocks (BERT, GPT, T5)
  • Introduces non-linearity with softness

🧮 Use Cases

| Application | Why GELU? |
| --- | --- |
| Transformers (BERT, GPT) | Stable training, better performance |
| MLPs in vision/language | Smooth gradient flow |
| Modern deep nets | Often outperforms ReLU in benchmarks |

🔄 Behavior

  • Smooth, differentiable everywhere
  • Slightly saturates for very negative values
  • Non-monotonic for negative inputs: the output dips slightly below zero (to about −0.17) before approaching 0

⚠️ Limitations

| Issue | Effect |
| --- | --- |
| Heavier compute | Requires the Gaussian CDF (erf) or a tanh-based approximation |
| No zero mean | Can still bias activations |
| Nonlinear but not sparse | Unlike ReLU, negative inputs are not zeroed exactly |

🔬 Visual Comparison

| Function | Range | Zero-Centered | Gradient Stability | Smooth? |
| --- | --- | --- | --- | --- |
| ReLU | [0, ∞) | ❌ | Stable for \( x > 0 \), zero otherwise | ❌ (kink at 0) |
| Tanh | (−1, 1) | ✅ | ❌ (vanishes) | ✅ |
| GELU | ≈ (−0.17, ∞) | ❌ | ✅ Stable | ✅ |

📦 PyTorch Example


import torch
import torch.nn as nn

activation = nn.GELU()                             # exact (erf-based) form by default
print(activation(torch.tensor([-1.0, 0.0, 1.0])))  # tensor([-0.1587, 0.0000, 0.8413])

📊 Visualization Tip

  • Plot GELU vs ReLU and sigmoid
  • Overlay their gradients
  • Highlight smooth “tail” behavior of GELU vs hard ReLU cutoff
