CS5720 - Week 1
Slide 14 of 20

Common Activation Functions

Sigmoid

σ(z) = 1/(1+e⁻ᶻ)
Range: (0, 1) · Smooth · Probabilistic · Vanishing gradient
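The formula above can be sketched in NumPy (a minimal illustration, not library code; the function name is our own):

```python
import numpy as np

def sigmoid(z):
    # σ(z) = 1 / (1 + e^{-z}); squashes any real input into (0, 1),
    # which is why the output can be read as a probability
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5
```

For large |z| the curve saturates, so the gradient σ'(z) = σ(z)(1 − σ(z)) approaches zero: the vanishing-gradient issue noted above.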

ReLU

f(z) = max(0, z)
Range: [0, ∞) · Fast · Sparse · Dead neurons
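A minimal NumPy sketch of ReLU (the function name is our own):

```python
import numpy as np

def relu(z):
    # f(z) = max(0, z): negative inputs are zeroed out (sparse activations),
    # positive inputs pass through with gradient exactly 1
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))  # zeros for the negative inputs; 1.5 passes through
```

A unit whose pre-activation stays negative for every input gets zero gradient and stops learning: the "dead neuron" problem listed above.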

Tanh

tanh(z) = (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ)
Range: (-1, 1) · Zero-centered · Symmetric · Vanishing gradient
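NumPy ships tanh directly; a tiny demonstration of the zero-centered, symmetric behavior noted above:

```python
import numpy as np

z = np.array([-2.0, 0.0, 2.0])
# tanh is odd-symmetric (tanh(-z) = -tanh(z)) and zero-centered (tanh(0) = 0),
# which helps keep layer outputs balanced around zero
print(np.tanh(z))
```

Like sigmoid, tanh saturates for large |z|, so it still suffers from vanishing gradients in deep stacks.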

Leaky ReLU

f(z) = max(0.01z, z)
Range: (-∞, ∞) · No dead neurons · Fast · Non-zero gradient
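A minimal sketch of Leaky ReLU, using the slide's slope of 0.01 (the function name and the `alpha` parameter are our own):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # f(z) = max(alpha*z, z): a small slope alpha for z < 0 keeps the
    # gradient nonzero everywhere, so units cannot permanently "die"
    return np.maximum(alpha * z, z)

z = np.array([-100.0, 0.0, 3.0])
# negative inputs are scaled by alpha instead of being zeroed out
print(leaky_relu(z))
```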

ELU

f(z) = z if z>0 else α(eᶻ-1)
Range: (-α, ∞) · Smooth negative values · Self-normalizing · Robust

Softmax

σ(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
Multi-class · Probability distribution · Sums to 1 · Output layer
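A minimal sketch of softmax for a single logit vector. Subtracting max(z) before exponentiating is a standard numerical-stability trick (not shown on the slide); it avoids overflow without changing the result, since softmax is shift-invariant:

```python
import numpy as np

def softmax(z):
    # exp(z_i - max(z)) / sum_j exp(z_j - max(z)): same values as the
    # textbook formula, but safe for large logits
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())  # probabilities that sum to 1
```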

Quick Comparison Table

| Function | Range | Advantages | Disadvantages | Best Use Case |
|----------|-------|------------|---------------|---------------|
| Sigmoid | (0, 1) | Probability interpretation | Vanishing gradient | Binary classification output |
| ReLU | [0, ∞) | Fast, no vanishing gradient | Dead neurons | Hidden layers (default) |
| Tanh | (-1, 1) | Zero-centered | Vanishing gradient | RNN/LSTM gates |
| Leaky ReLU | (-∞, ∞) | No dead neurons | Not zero-centered | Deep networks |
| Softmax | (0, 1) per class | Multi-class probability | Computationally expensive | Multi-class output |

Best Practices & Guidelines

🏗️ Hidden Layers

Start with ReLU, experiment with variants if needed

🎯 Output Layer

Match activation to your task type

⚡ Initialization

Proper weight init for each activation
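The usual pairings can be sketched as follows. He initialization is commonly used with ReLU-family activations and Xavier/Glorot with sigmoid/tanh; the scheme names and function signatures here are standard conventions, not taken from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # He (Kaiming) init: variance 2/fan_in compensates for ReLU
    # zeroing out roughly half of the activations
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot init: variance 2/(fan_in + fan_out) keeps signal
    # magnitude roughly constant through sigmoid/tanh layers
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

W = he_init(128, 64)
print(W.shape, W.std())
```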

🔍 Debugging

Monitor activations during training
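One simple monitoring check, sketched here with made-up layer sizes and random data (purely illustrative), is the fraction of ReLU units that never fire on a batch — a direct probe for the dead-neuron problem:

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(256, 128))       # a batch of 256 inputs
W = rng.normal(size=(128, 64)) * 0.1  # one hidden layer's weights
acts = np.maximum(0.0, X @ W)         # ReLU activations for the batch

# a unit is "dead" on this batch if its activation is zero for every input
dead_fraction = (acts.max(axis=0) == 0.0).mean()
print(f"dead units: {dead_fraction:.1%}")
```

A rising dead fraction during training suggests switching to Leaky ReLU/ELU or lowering the learning rate.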

Prepared by Dr. Gorkem Kar