Supervised Learning
Master classification and regression - the most common types of machine learning tasks.
What is Supervised Learning?#
Supervised learning is like learning with a teacher. You have input-output pairs, and the goal is to learn the mapping between them.
The Concept
Given examples of (input, correct output), learn to predict outputs for new inputs.
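To make the idea concrete, here is a minimal sketch of the usual scikit-learn workflow (the dataset and model here are just illustrative): fit on labeled training pairs, then predict outputs for inputs the model has never seen.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Labeled (input, output) pairs: features X and known answers y
X, y = load_iris(return_X_y=True)

# Hold out some pairs to check the learned mapping on unseen inputs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)        # learn the input -> output mapping
y_pred = model.predict(X_test)     # predict outputs for new inputs
print("Held-out accuracy:", accuracy_score(y_test, y_pred))
```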
Classification vs Regression#
| Aspect | Classification | Regression |
|---|---|---|
| Output Type | Discrete categories | Continuous values |
| Example | Spam or Not Spam | House Price |
| Metrics | Accuracy, F1 | MSE, R² |
| Algorithms | Logistic Reg, SVM, Trees | Linear Reg, Trees, NN |
Classification#
Predicting categories:
```python
from sklearn.linear_model import LogisticRegression

# Email spam classification
X = [[100, 0.8],   # [word_count, spam_word_ratio]
     [50, 0.1],
     [200, 0.9],
     [80, 0.05]]
y = ['spam', 'not_spam', 'spam', 'not_spam']

model = LogisticRegression()
model.fit(X, y)

# Predict a new email
new_email = [[150, 0.7]]
prediction = model.predict(new_email)         # e.g. ['spam']
probability = model.predict_proba(new_email)  # e.g. [[0.15, 0.85]] for classes ['not_spam', 'spam']
```
Common Classification Algorithms#
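Commonly used classifiers include Logistic Regression, Support Vector Machines, Decision Trees, and Random Forests. Because scikit-learn estimators share the same fit/predict interface, swapping one for another is usually a one-line change. A rough sketch comparing them on a built-in toy dataset (the dataset choice and settings are illustrative, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Feature scaling matters for Logistic Regression and SVM; tree-based models don't need it
classifiers = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Same API for every estimator; cross_val_score handles fit/predict internally
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name:20s} mean accuracy: {scores.mean():.3f}")
```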
Regression#
Predicting continuous values:
```python
from sklearn.linear_model import LinearRegression

# House price prediction
X = [[1500, 3, 2],   # [sqft, bedrooms, bathrooms]
     [2000, 4, 2],
     [1200, 2, 1]]
y = [300000, 450000, 200000]

model = LinearRegression()
model.fit(X, y)

# Predict the price of a new house
new_house = [[1800, 3, 2]]
predicted_price = model.predict(new_house)  # ~$375,000 on this toy data

# Model interpretation: one coefficient per feature, plus an intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```
Common Regression Algorithms#
Linear Regression
Assumes a linear relationship. Fast, interpretable, and a good baseline model.
Polynomial Regression
Captures non-linear relationships. Risk of overfitting.
Random Forest Regressor
Handles non-linearity. Less sensitive to outliers.
Gradient Boosting
Often best performance. XGBoost, LightGBM, CatBoost.
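The same uniform interface applies to regressors. A sketch comparing the models above on synthetic non-linear data (the data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data: y = x^2 plus noise
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

regressors = {
    "Linear Regression": LinearRegression(),
    "Polynomial (degree 2)": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# R² via 5-fold cross-validation; higher is better
for name, reg in regressors.items():
    scores = cross_val_score(reg, X, y, cv=5, scoring="r2")
    print(f"{name:22s} R²: {scores.mean():.3f}")
```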
Evaluation Metrics#
Classification Metrics#
```python
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix
)

# Illustrative ground truth and predictions (binary: 1 = positive class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: % of correct predictions
accuracy = accuracy_score(y_true, y_pred)

# Precision: of predicted positives, how many are correct?
precision = precision_score(y_true, y_pred)

# Recall: of actual positives, how many did we find?
recall = recall_score(y_true, y_pred)

# F1: harmonic mean of precision and recall
f1 = f1_score(y_true, y_pred)

# Confusion matrix:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
```
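If you want all of these at once, classification_report prints precision, recall, F1, and support for each class (reusing the illustrative labels from above; the class names are just an example):

```python
from sklearn.metrics import classification_report

# Same illustrative labels as above
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Per-class precision, recall, F1, and support in one table
print(classification_report(y_true, y_pred, target_names=["not_spam", "spam"]))
```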
Regression Metrics#
```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score
)

# Illustrative true values and predictions
y_true = [300000, 450000, 200000, 350000]
y_pred = [310000, 430000, 220000, 340000]

# MSE: average squared error (penalizes large errors)
mse = mean_squared_error(y_true, y_pred)

# RMSE: square root of MSE (same unit as the target)
rmse = np.sqrt(mse)

# MAE: average absolute error
mae = mean_absolute_error(y_true, y_pred)

# R²: proportion of variance explained (1 is perfect; can be negative for bad models)
r2 = r2_score(y_true, y_pred)
```
The Bias-Variance Tradeoff#
Model Complexity vs Error
Underfitting (High Bias)
Model is too simple. Performs poorly on both training and test data.
Overfitting (High Variance)
Model is too complex. Performs great on training data but poorly on test data.
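One way to see the tradeoff in practice (a sketch on a hypothetical synthetic dataset): fit polynomials of increasing degree and compare training scores against cross-validated scores. An underfit model scores poorly on both; an overfit model scores well only on data it has already seen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine wave: the true relationship is non-linear
rng = np.random.RandomState(1)
X = rng.uniform(0, 6, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=80)

for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    train_r2 = model.fit(X, y).score(X, y)              # score on data the model has seen
    cv_r2 = cross_val_score(model, X, y, cv=5).mean()   # score on held-out data
    print(f"degree {degree:2d}: train R² = {train_r2:.2f}, CV R² = {cv_r2:.2f}")
```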
Cross-Validation#
Don't rely on a single train-test split:
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)          # toy data so the example runs as-is
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
```
Key Takeaways#
Remember
Start with simple models (linear/logistic regression), understand the baseline, then try more complex models. Always use cross-validation. Focus on the right metric for your problem - accuracy isn't always the answer.