Activation Functions
Understand ReLU, Sigmoid, Tanh, and other activation functions that give neural networks their power.
Why Activation Functions?#
Without activation functions, neural networks would just be linear transformations - no matter how many layers you stack.
Key Insight
Activation functions introduce non-linearity, allowing neural networks to learn complex patterns.
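To see why, here is a minimal NumPy sketch (toy sizes, made-up random weights) showing that two stacked linear layers without an activation collapse into a single linear layer:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # toy input vector
W1 = rng.normal(size=(4, 3))    # "layer 1" weights
W2 = rng.normal(size=(2, 4))    # "layer 2" weights

# Two linear layers stacked...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer with combined weights W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True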
Common Activation Functions#
| Function | Range | Use Case | Notes |
|---|---|---|---|
| ReLU | [0, ∞) | Hidden layers (default) | Fast, sparse |
| Sigmoid | (0, 1) | Binary classification output | Probability |
| Tanh | (-1, 1) | Hidden layers (RNNs) | Zero-centered |
| Softmax | (0, 1), sums to 1 | Multi-class output | Probabilities |
| Leaky ReLU | (-∞, ∞) | Hidden layers | Prevents dead neurons |
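As a quick check of the ranges above, a small sketch (assuming PyTorch is installed) applying each function to the same tensor:

import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(torch.relu(x))            # zeros for negative inputs, unbounded above
print(torch.sigmoid(x))         # squashed into (0, 1)
print(torch.tanh(x))            # squashed into (-1, 1), zero-centered
print(torch.softmax(x, dim=0))  # positive values that sum to 1
print(torch.nn.functional.leaky_relu(x, negative_slope=0.01))  # small slope for negatives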
ReLU (Rectified Linear Unit)#
The most popular activation function for hidden layers:
def relu(x):
    return max(0, x)

# PyTorch
import torch
import torch.nn as nn

layer = nn.ReLU()
# Or functionally, on a tensor
x = torch.relu(torch.tensor([-1.0, 0.5, 2.0]))
Formula: f(x) = max(0, x)
Pros:
- Computationally efficient
- No vanishing gradient for positive inputs
- Sparse activation (many zeros; see the sketch below)
- Works well in practice

Cons:
- Dead neurons (if a unit's input is always negative, it stops learning)
- Not zero-centered
- Unbounded output
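A rough illustration of the sparsity point (a sketch assuming PyTorch): for zero-mean random pre-activations, about half the ReLU outputs are exactly zero.

import torch

x = torch.randn(10000)            # zero-mean random pre-activations
out = torch.relu(x)
print((out == 0).float().mean())  # ~0.5: roughly half the outputs are exactly zero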
Sigmoid#
Maps any value to (0, 1) - perfect for probability outputs:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# PyTorch
layer = nn.Sigmoid()
Formula: σ(x) = 1 / (1 + e^(-x))
Vanishing Gradient
Sigmoid saturates for large positive/negative inputs, causing gradients near zero. Avoid in deep hidden layers.
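A small sketch of that saturation effect (PyTorch assumed): the gradient of sigmoid shrinks rapidly as the input moves away from zero.

import torch

x = torch.tensor([0.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # ~[0.25, 0.0066, 0.000045]: gradients vanish for large |x|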
Tanh (Hyperbolic Tangent)#
Zero-centered version of sigmoid:
def tanh(x):
    return np.tanh(x)

# PyTorch
layer = nn.Tanh()
Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Good for RNNs and when you need zero-centered outputs.
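A quick sketch (NumPy assumed) of what "zero-centered" means in practice: for zero-mean inputs, tanh outputs average near 0 while sigmoid outputs cluster around 0.5.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10000)            # zero-mean inputs

print(np.tanh(x).mean())              # ~0.0: zero-centered
print((1 / (1 + np.exp(-x))).mean())  # ~0.5: always positive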
Leaky ReLU#
Fixes the "dying ReLU" problem:
def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

# PyTorch
layer = nn.LeakyReLU(negative_slope=0.01)
Formula: f(x) = x if x > 0, else αx (typically α = 0.01)
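A short sketch (PyTorch assumed) of why this helps: for a negative input, ReLU passes back a zero gradient, while Leaky ReLU keeps a small nonzero one so the neuron can still update.

import torch
import torch.nn as nn

x = torch.tensor([-3.0], requires_grad=True)
nn.ReLU()(x).backward()
print(x.grad)   # tensor([0.]) - no learning signal

x.grad = None   # reset the gradient
nn.LeakyReLU(negative_slope=0.01)(x).backward()
print(x.grad)   # tensor([0.0100]) - small but nonzero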
Softmax#
For multi-class classification (outputs sum to 1):
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract max for numerical stability
    return exp_x / exp_x.sum()

# PyTorch (usually combined with CrossEntropyLoss)
layer = nn.Softmax(dim=-1)

# Example
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)  # [0.659, 0.242, 0.099]
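The `x - np.max(x)` step matters. A sketch (made-up large logits) showing that the naive form overflows while the shifted form gives the same, finite answer, since softmax is invariant to adding a constant to every logit:

import numpy as np

big = np.array([1000.0, 999.0])

naive = np.exp(big) / np.exp(big).sum()                    # overflow: inf / inf -> nan
print(naive)                                               # [nan nan]

stable = np.exp(big - big.max()) / np.exp(big - big.max()).sum()
print(stable)                                              # [0.731 0.269]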
Choosing Activation Functions#
Use ReLU
Start with ReLU. Try Leaky ReLU if you see dead neurons.
Use Sigmoid
Single output neuron for yes/no classification.
Use Softmax
N output neurons for N classes. Probabilities sum to 1.
Use Linear (None)
No activation for continuous value prediction.
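Putting those rules together, a sketch (PyTorch assumed, with made-up layer sizes) of the three common output heads:

import torch.nn as nn

binary_head = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())  # yes/no probability
multiclass_head = nn.Linear(128, 10)   # raw logits; softmax handled by CrossEntropyLoss
regression_head = nn.Linear(128, 1)    # no activation for continuous targets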
Modern Alternatives#
Smooth variants such as GELU (used in transformer models like BERT and GPT) and SiLU/Swish often match or slightly outperform ReLU in deep networks, at a small extra compute cost.
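Both are available out of the box in PyTorch; a minimal sketch:

import torch.nn as nn

gelu_layer = nn.GELU()  # smooth, gates inputs by their approximate probability of being kept
silu_layer = nn.SiLU()  # x * sigmoid(x), also known as Swish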
Implementation Example#
import torch
import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 256),
            nn.ReLU(),                    # Hidden layer 1
            nn.Linear(256, 128),
            nn.ReLU(),                    # Hidden layer 2
            nn.Linear(128, num_classes),
            # No activation here - handled by the loss function
        )

    def forward(self, x):
        return self.layers(x)

# For training with CrossEntropyLoss (which includes softmax)
model = NeuralNetwork(784, 10)
criterion = nn.CrossEntropyLoss()

# For inference, apply softmax to the logits
with torch.no_grad():
    logits = model(torch.randn(1, 784))  # example input batch
    probabilities = torch.softmax(logits, dim=-1)
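A minimal training-step sketch continuing from the model and criterion above (dummy data, made-up batch size), showing that the loss is applied to raw logits:

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(32, 784)          # dummy batch of flattened 28x28 images
labels = torch.randint(0, 10, (32,))   # dummy class indices

optimizer.zero_grad()
logits = model(images)                 # raw scores, no softmax
loss = criterion(logits, labels)       # CrossEntropyLoss applies log-softmax internally
loss.backward()
optimizer.step()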
Key Takeaways#
ReLU for Hidden
Default choice for hidden layers. Fast and effective.
Sigmoid for Binary
Output layer for binary classification.
Softmax for Multi-class
Output layer when predicting multiple classes.
No Activation for Regression
Linear output for continuous predictions.
Remember
The choice of activation function affects learning speed, gradient flow, and model capacity. Start with standard choices (ReLU + Softmax/Sigmoid) and experiment only when needed.