Activation Functions
Understand ReLU, Sigmoid, Tanh, and other activation functions that give neural networks their power.
Why Activation Functions?#
Without activation functions, neural networks would just be linear transformations - no matter how many layers you stack.
Key Insight
Activation functions introduce non-linearity, allowing neural networks to learn complex patterns.
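To see why, here is a minimal NumPy sketch (toy sizes, made-up random weights) showing that two stacked linear layers without an activation collapse into a single linear layer:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # toy input vector
W1 = rng.normal(size=(4, 3))    # "layer 1" weights
W2 = rng.normal(size=(2, 4))    # "layer 2" weights

# Two linear layers stacked...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer with combined weights W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True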
Common Activation Functions#
| Function | Range | Use Case | Notes |
|---|---|---|---|
| ReLU | [0, ∞) | Hidden layers (default) | Fast, sparse |
| Sigmoid | (0, 1) | Binary classification output | Probability |
| Tanh | (-1, 1) | Hidden layers (RNNs) | Zero-centered |
| Softmax | (0, 1), sums to 1 | Multi-class output | Probabilities |
| Leaky ReLU | (-∞, ∞) | Hidden layers | Prevents dead neurons |
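As a quick check of the ranges above, a small sketch (assuming PyTorch is installed) applying each function to the same tensor:

import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(torch.relu(x))            # zeros for negative inputs, unbounded above
print(torch.sigmoid(x))         # squashed into (0, 1)
print(torch.tanh(x))            # squashed into (-1, 1), zero-centered
print(torch.softmax(x, dim=0))  # positive values that sum to 1
print(torch.nn.functional.leaky_relu(x, negative_slope=0.01))  # small slope for negatives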
ReLU (Rectified Linear Unit)#
The most popular activation function for hidden layers:
def relu(x):
    return max(0, x)

# PyTorch
import torch
import torch.nn as nn

layer = nn.ReLU()
# Or functionally, on a tensor
x = torch.relu(torch.tensor([-1.0, 0.5, 2.0]))
Formula: f(x) = max(0, x)
Pros:
- Computationally efficient
- No vanishing gradient for positive inputs
- Sparse activation (many zeros; see the sketch below)
- Works well in practice

Cons:
- Dead neurons (if a unit's input is always negative, it stops learning)
- Not zero-centered
- Unbounded output
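A rough illustration of the sparsity point (a sketch assuming PyTorch): for zero-mean random pre-activations, about half the ReLU outputs are exactly zero.

import torch

x = torch.randn(10000)            # zero-mean random pre-activations
out = torch.relu(x)
print((out == 0).float().mean())  # ~0.5: roughly half the outputs are exactly zero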
Sigmoid#
Maps any value to (0, 1) - perfect for probability outputs:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# PyTorch
layer = nn.Sigmoid()
Formula: σ(x) = 1 / (1 + e^(-x))
Vanishing Gradient
Sigmoid saturates for large positive/negative inputs, causing gradients near zero. Avoid in deep hidden layers.
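A small sketch of that saturation effect (PyTorch assumed): the gradient of sigmoid shrinks rapidly as the input moves away from zero.

import torch

x = torch.tensor([0.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # ~[0.25, 0.0066, 0.000045]: gradients vanish for large |x|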
Tanh (Hyperbolic Tangent)#
Zero-centered version of sigmoid:
def tanh(x):
    return np.tanh(x)

# PyTorch
layer = nn.Tanh()
Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Good for RNNs and when you need zero-centered outputs.
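A quick sketch (NumPy assumed) of what "zero-centered" means in practice: for zero-mean inputs, tanh outputs average near 0 while sigmoid outputs cluster around 0.5.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10000)            # zero-mean inputs

print(np.tanh(x).mean())              # ~0.0: zero-centered
print((1 / (1 + np.exp(-x))).mean())  # ~0.5: always positive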
Leaky ReLU#
Fixes the "dying ReLU" problem:
def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

# PyTorch
layer = nn.LeakyReLU(negative_slope=0.01)
Formula: f(x) = x if x > 0, else αx (typically α = 0.01)
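A short sketch (PyTorch assumed) of why this helps: for a negative input, ReLU passes back a zero gradient, while Leaky ReLU keeps a small nonzero one so the neuron can still update.

import torch
import torch.nn as nn

x = torch.tensor([-3.0], requires_grad=True)
nn.ReLU()(x).backward()
print(x.grad)   # tensor([0.]) - no learning signal

x.grad = None   # reset the gradient
nn.LeakyReLU(negative_slope=0.01)(x).backward()
print(x.grad)   # tensor([0.0100]) - small but nonzero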
Softmax#
For multi-class classification (outputs sum to 1):
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract max for numerical stability
    return exp_x / exp_x.sum()

# PyTorch (usually combined with CrossEntropyLoss)
layer = nn.Softmax(dim=-1)

# Example
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)  # [0.659, 0.242, 0.099]
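The `x - np.max(x)` step matters. A sketch (made-up large logits) showing that the naive form overflows while the shifted form gives the same, finite answer, since softmax is invariant to adding a constant to every logit:

import numpy as np

big = np.array([1000.0, 999.0])

naive = np.exp(big) / np.exp(big).sum()                    # overflow: inf / inf -> nan
print(naive)                                               # [nan nan]

stable = np.exp(big - big.max()) / np.exp(big - big.max()).sum()
print(stable)                                              # [0.731 0.269]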
Choosing Activation Functions#
Use ReLU
Start with ReLU. Try Leaky ReLU if you see dead neurons.
Use Sigmoid
Single output neuron for yes/no classification.
Use Softmax
N output neurons for N classes. Probabilities sum to 1.
Use Linear (None)
No activation for continuous value prediction.
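Putting those rules together, a sketch (PyTorch assumed, with made-up layer sizes) of the three common output heads:

import torch.nn as nn

binary_head = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())  # yes/no probability
multiclass_head = nn.Linear(128, 10)   # raw logits; softmax handled by CrossEntropyLoss
regression_head = nn.Linear(128, 1)    # no activation for continuous targets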
Modern Alternatives#
Smooth variants such as GELU (used in transformer models like BERT and GPT) and SiLU/Swish often match or slightly outperform ReLU in deep networks, at a small extra compute cost.
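Both are available out of the box in PyTorch; a minimal sketch:

import torch.nn as nn

gelu_layer = nn.GELU()  # smooth, gates inputs by their approximate probability of being kept
silu_layer = nn.SiLU()  # x * sigmoid(x), also known as Swish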
Implementation Example#
import torch
import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 256),
            nn.ReLU(),                    # Hidden layer 1
            nn.Linear(256, 128),
            nn.ReLU(),                    # Hidden layer 2
            nn.Linear(128, num_classes),
            # No activation here - handled by the loss function
        )

    def forward(self, x):
        return self.layers(x)

# For training with CrossEntropyLoss (which includes softmax)
model = NeuralNetwork(784, 10)
criterion = nn.CrossEntropyLoss()

# For inference, apply softmax to the logits
with torch.no_grad():
    logits = model(torch.randn(1, 784))  # example input batch
    probabilities = torch.softmax(logits, dim=-1)
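A minimal training-step sketch continuing from the model and criterion above (dummy data, made-up batch size), showing that the loss is applied to raw logits:

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(32, 784)          # dummy batch of flattened 28x28 images
labels = torch.randint(0, 10, (32,))   # dummy class indices

optimizer.zero_grad()
logits = model(images)                 # raw scores, no softmax
loss = criterion(logits, labels)       # CrossEntropyLoss applies log-softmax internally
loss.backward()
optimizer.step()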
Key Takeaways#
ReLU for Hidden
Default choice for hidden layers. Fast and effective.
Sigmoid for Binary
Output layer for binary classification.
Softmax for Multi-class
Output layer when predicting multiple classes.
No Activation for Regression
Linear output for continuous predictions.
Remember
The choice of activation function affects learning speed, gradient flow, and model capacity. Start with standard choices (ReLU + Softmax/Sigmoid) and experiment only when needed.