CS5720 - Week 1
Slide 14 of 20

Common Activation Functions

Sigmoid

σ(z) = 1/(1+e⁻ᶻ)
Range: (0, 1) · Smooth · Probabilistic · Vanishing gradient
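The formula above can be sketched in NumPy (a minimal illustration, not library code; the function name is our own):

```python
import numpy as np

def sigmoid(z):
    # σ(z) = 1 / (1 + e^{-z}); squashes any real input into (0, 1),
    # which is why the output can be read as a probability
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5
```

For large |z| the curve saturates, so the gradient σ'(z) = σ(z)(1 − σ(z)) approaches zero: the vanishing-gradient issue noted above.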

ReLU

f(z) = max(0, z)
Range: [0, ∞) · Fast · Sparse · Dead neurons
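A minimal NumPy sketch of ReLU (the function name is our own):

```python
import numpy as np

def relu(z):
    # f(z) = max(0, z): negative inputs are zeroed out (sparse activations),
    # positive inputs pass through with gradient exactly 1
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))  # zeros for the negative inputs; 1.5 passes through
```

A unit whose pre-activation stays negative for every input gets zero gradient and stops learning: the "dead neuron" problem listed above.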

Tanh

tanh(z) = (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ)
Range: (-1, 1) · Zero-centered · Symmetric · Vanishing gradient
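NumPy ships tanh directly; a tiny demonstration of the zero-centered, symmetric behavior noted above:

```python
import numpy as np

z = np.array([-2.0, 0.0, 2.0])
# tanh is odd-symmetric (tanh(-z) = -tanh(z)) and zero-centered (tanh(0) = 0),
# which helps keep layer outputs balanced around zero
print(np.tanh(z))
```

Like sigmoid, tanh saturates for large |z|, so it still suffers from vanishing gradients in deep stacks.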

Leaky ReLU

f(z) = max(0.01z, z)
Range: (-∞, ∞) · No dead neurons · Fast · Non-zero gradient
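A minimal sketch of Leaky ReLU, using the slide's slope of 0.01 (the function name and the `alpha` parameter are our own):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # f(z) = max(alpha*z, z): a small slope alpha for z < 0 keeps the
    # gradient nonzero everywhere, so units cannot permanently "die"
    return np.maximum(alpha * z, z)

z = np.array([-100.0, 0.0, 3.0])
# negative inputs are scaled by alpha instead of being zeroed out
print(leaky_relu(z))
```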

ELU

f(z) = z if z>0 else α(eᶻ-1)
Range: (-α, ∞) · Smooth negative values · Self-normalizing · Robust

Softmax

σ(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
Multi-class · Probability distribution · Sums to 1 · Output layer
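A minimal sketch of softmax for a single logit vector. Subtracting max(z) before exponentiating is a standard numerical-stability trick (not shown on the slide); it avoids overflow without changing the result, since softmax is shift-invariant:

```python
import numpy as np

def softmax(z):
    # exp(z_i - max(z)) / sum_j exp(z_j - max(z)): same values as the
    # textbook formula, but safe for large logits
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())  # probabilities that sum to 1
```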

Quick Comparison Table

| Function | Range | Advantages | Disadvantages | Best Use Case |
|----------|-------|------------|---------------|---------------|
| Sigmoid | (0, 1) | Probability interpretation | Vanishing gradient | Binary classification output |
| ReLU | [0, ∞) | Fast, no vanishing gradient | Dead neurons | Hidden layers (default) |
| Tanh | (-1, 1) | Zero-centered | Vanishing gradient | RNN/LSTM gates |
| Leaky ReLU | (-∞, ∞) | No dead neurons | Not zero-centered | Deep networks |
| Softmax | (0, 1) per class | Multi-class probability | Computationally expensive | Multi-class output |

Best Practices & Guidelines

🏗️ Hidden Layers

Start with ReLU, experiment with variants if needed

🎯 Output Layer

Match activation to your task type

⚡ Initialization

Proper weight init for each activation
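The usual pairings can be sketched as follows. He initialization is commonly used with ReLU-family activations and Xavier/Glorot with sigmoid/tanh; the scheme names and function signatures here are standard conventions, not taken from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # He (Kaiming) init: variance 2/fan_in compensates for ReLU
    # zeroing out roughly half of the activations
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot init: variance 2/(fan_in + fan_out) keeps signal
    # magnitude roughly constant through sigmoid/tanh layers
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

W = he_init(128, 64)
print(W.shape, W.std())
```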

🔍 Debugging

Monitor activations during training
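One simple monitoring check, sketched here with made-up layer sizes and random data (purely illustrative), is the fraction of ReLU units that never fire on a batch — a direct probe for the dead-neuron problem:

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(256, 128))       # a batch of 256 inputs
W = rng.normal(size=(128, 64)) * 0.1  # one hidden layer's weights
acts = np.maximum(0.0, X @ W)         # ReLU activations for the batch

# a unit is "dead" on this batch if its activation is zero for every input
dead_fraction = (acts.max(axis=0) == 0.0).mean()
print(f"dead units: {dead_fraction:.1%}")
```

A rising dead fraction during training suggests switching to Leaky ReLU/ELU or lowering the learning rate.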

Prepared by Dr. Gorkem Kar