CNN Architecture
Master Convolutional Neural Networks for image recognition and computer vision.
What are CNNs?#
Convolutional Neural Networks are specialized for processing grid-like data, especially images.
Key Insight
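CNNs learn a hierarchy of visual features. Early layers respond to edges, middle layers to textures and shapes, and deeper layers to object parts and whole objects, loosely analogous to processing in the human visual system.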
Why CNNs for Images?#
Fully connected networks don't scale to image-sized inputs:
# A 224x224 RGB image
pixels = 224 * 224 * 3 # 150,528 input neurons!
# Fully connected to 1000 neurons
parameters = 150528 * 1000 # 150 million parameters!
CNNs exploit image structure (local connectivity and weight sharing) to use far fewer parameters.
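To make the saving concrete, here is a rough sketch that compares the parameter count of one fully connected layer with one convolutional layer. The helper names `dense_params` and `conv_params` are illustrative, and the 32-filter 3x3 layer is an assumed example rather than a figure from this guide.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus one bias per output neuron."""
    return in_features * out_features + out_features


def conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Each filter has kernel_size x kernel_size weights per input channel, plus one bias."""
    return in_channels * kernel_size * kernel_size * out_channels + out_channels


# Fully connected: every value of a 224x224 RGB image wired to 1000 neurons
fc = dense_params(224 * 224 * 3, 1000)

# Convolutional: 32 filters of size 3x3 over 3 input channels (assumed example)
conv = conv_params(3, 32, 3)

print(f"dense: {fc:,} parameters")   # dense: 150,529,000 parameters
print(f"conv:  {conv:,} parameters")  # conv:  896 parameters
```

The convolutional count does not depend on the image size, because the same filter weights are reused at every position.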
Core Concepts#
Convolution
Small filters slide across the image, detecting patterns
Pooling
Reduces spatial dimensions, keeps important features
Feature Maps
Each layer produces maps highlighting detected features
Classification
Final layers flatten features and classify
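As a small illustration of the pooling step, the sketch below applies 2x2 max pooling with stride 2 to a hand-written 4x4 feature map. It uses plain Python lists for readability, and the function name `max_pool_2x2` and the sample values are made up for this example.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a list-of-lists feature map.

    Assumes the height and width are both even.
    """
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            row.append(max(window))  # keep only the strongest response
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4, 2], [2, 7]]
```

Each 2x2 window collapses to its largest value, so the map shrinks from 4x4 to 2x2 while the strongest activations survive.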
Convolution Operation#
Input Image      Filter (3x3)    Output
┌─────────┐      ┌───────┐       ┌─────┐
│ 1 0 1 0 │      │ 1 0 1 │       │ 5 0 │
│ 0 1 0 1 │  *   │ 0 1 0 │   =   │ 0 5 │
│ 1 0 1 0 │      │ 1 0 1 │       └─────┘
│ 0 1 0 1 │      └───────┘
└─────────┘
# The filter slides across the image; each output value is the dot product of the filter with one 3x3 patch
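The sliding dot product can be checked by hand with a few loops. This is a minimal pure-Python sketch, not how PyTorch implements convolution. It applies the 3x3 filter to the 4x4 checkerboard input from the diagram, with stride 1 and no padding, and the function name `conv2d_valid` is illustrative.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over the image with stride 1 and no padding.

    Like deep learning frameworks, this does not flip the kernel
    (strictly speaking, the operation is cross-correlation).
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
result = conv2d_valid(image, kernel)
print(result)  # [[5, 0], [0, 5]]
```

The filter responds strongly (5) where the patch matches its X-shaped pattern and not at all (0) where the patch is the inverse pattern.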
Building a CNN in PyTorch#
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        # Pooling
        self.pool = nn.MaxPool2d(2, 2)
        # Fully connected layers
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, num_classes)
        # Dropout for regularization
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # Conv block 1: 32x32 -> 16x16
        x = self.pool(F.relu(self.conv1(x)))
        # Conv block 2: 16x16 -> 8x8
        x = self.pool(F.relu(self.conv2(x)))
        # Conv block 3: 8x8 -> 4x4
        x = self.pool(F.relu(self.conv3(x)))
        # Flatten
        x = x.view(-1, 128 * 4 * 4)
        # Fully connected
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x
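The `128 * 4 * 4` in `fc1` follows from the input size. The sketch below traces the spatial size through the three blocks using the standard output-size formula, assuming 32x32 inputs as the comments in `forward` suggest. The helper name `conv_output_size` is illustrative.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32  # assumed input resolution (e.g. CIFAR-10 sized images)
sizes = []
for block in range(1, 4):
    # 3x3 conv with padding=1 keeps the size unchanged
    size = conv_output_size(size, kernel=3, stride=1, padding=1)
    # 2x2 max pool with stride 2 halves it
    size = conv_output_size(size, kernel=2, stride=2)
    sizes.append(size)
    print(f"after block {block}: {size}x{size}")

flattened = 128 * size * size  # 128 channels from conv3
print(flattened)  # 2048
```

With a different input resolution the flattened size changes. For example, a 224x224 input would leave a 28x28 map, so `fc1` would need `128 * 28 * 28` input features.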
Layer Types#
Famous CNN Architectures#
| Architecture | Year | Key Innovation | Depth / Notes |
|---|---|---|---|
| LeNet | 1998 | Pioneering CNN | 7 layers |
| AlexNet | 2012 | ReLU, Dropout, GPU training | 8 layers |
| VGG | 2014 | 3x3 filters only | 16-19 layers |
| ResNet | 2015 | Skip connections | 18-152 layers |
| EfficientNet | 2019 | Compound scaling | Varies (B0-B7 variants) |
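The table credits ResNet with skip connections. As a rough sketch of the idea, and not the exact block from torchvision's ResNet, a residual block adds its input back to the output of a pair of convolutions. The class name `ResidualBlock` and the channel and image sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions whose output is added back to the input."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x  # the skip connection carries the input forward unchanged
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + identity)


block = ResidualBlock(channels=16)
x = torch.randn(2, 16, 8, 8)
y = block(x)
print(y.shape)  # torch.Size([2, 16, 8, 8])
```

Because the input is added directly to the output, gradients have a short path back through the addition, which is what makes very deep stacks of such blocks trainable.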
Transfer Learning#
Don't train from scratch - use pretrained models:
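import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # set to the number of classes in your task

# Load ResNet-50 with pretrained ImageNet weights
# (torchvision >= 0.13; older versions use pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for your task (new layers are trainable by default)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only train the final layer
optimizer = torch.optim.Adam(model.fc.parameters())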
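To see what freezing does without downloading any weights, the sketch below repeats the same steps on a tiny stand-in network. The `backbone` and `head` here are made up and are not a real pretrained model. It counts trainable versus frozen parameters and runs one optimizer step.

```python
import torch
import torch.nn as nn

num_classes = 5  # hypothetical number of target classes

# Stand-in for a pretrained network: a small "backbone" plus a classifier "head"
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(8, 1000)
model = nn.Sequential(backbone, head)

# Freeze everything, then swap in a fresh head (new layers are trainable by default)
for param in model.parameters():
    param.requires_grad = False
model[1] = nn.Linear(8, num_classes)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(trainable, frozen)  # 45 224

# One training step: only the new head receives gradients
optimizer = torch.optim.Adam(model[1].parameters())
x = torch.randn(4, 3, 32, 32)
loss = model(x).sum()
loss.backward()
optimizer.step()
print(backbone[0].weight.grad is None)  # True
```

If accuracy plateaus with only the head trained, a common next step is to unfreeze some of the later backbone layers and continue with a smaller learning rate.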
Data Augmentation#
Increase dataset diversity:
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
    # ImageNet channel means and standard deviations
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
Training Tips#
Start Pretrained
Use transfer learning unless you have millions of images.
Augment Data
Flip, rotate, crop. More variety = better generalization.
Use BatchNorm
Stabilizes training, allows higher learning rates.
Monitor Overfitting
If training accuracy is much higher than validation accuracy, the model is overfitting: add dropout, use more augmentation, or reduce model size.
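As one way to apply the BatchNorm tip, a convolutional block can place `nn.BatchNorm2d` between the convolution and the activation. This is a minimal sketch with assumed channel counts and input size, not a prescription for where normalization must go.

```python
import torch
import torch.nn as nn

conv_block = nn.Sequential(
    # The conv bias is redundant here because BatchNorm adds its own shift
    nn.Conv2d(3, 32, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),
)

x = torch.randn(8, 3, 32, 32)
y = conv_block(x)
print(y.shape)  # torch.Size([8, 32, 16, 16])
```

Remember to call `model.eval()` before validation so that BatchNorm uses its running statistics and dropout is disabled, and `model.train()` before resuming training.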
What CNNs Learn#
Feature Hierarchy in Deep CNNs
Key Takeaways#
Remember
CNNs are the foundation of computer vision. Start with pretrained models for most tasks. Understand convolution, pooling, and feature maps. Modern architectures like ResNet and EfficientNet handle most use cases.