Supervised Learning
Master classification and regression - the most common types of machine learning tasks.
What is Supervised Learning?#
Supervised learning is like learning with a teacher. You have input-output pairs, and the goal is to learn the mapping between them.
The Concept
Given examples of (input, correct output), learn to predict outputs for new inputs.
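To make the idea concrete, here is a minimal sketch of the usual scikit-learn workflow (the dataset and model here are just illustrative): fit on labeled training pairs, then predict outputs for inputs the model has never seen.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Labeled (input, output) pairs: features X and known answers y
X, y = load_iris(return_X_y=True)

# Hold out some pairs to check the learned mapping on unseen inputs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)        # learn the input -> output mapping
y_pred = model.predict(X_test)     # predict outputs for new inputs
print("Held-out accuracy:", accuracy_score(y_test, y_pred))
```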
Classification vs Regression#
| Aspect | Classification | Regression |
|---|---|---|
| Output Type | Discrete categories | Continuous values |
| Example | Spam or Not Spam | House Price |
| Metrics | Accuracy, F1 | MSE, R² |
| Algorithms | Logistic Reg, SVM, Trees | Linear Reg, Trees, NN |
Classification#
Predicting categories:
```python
from sklearn.linear_model import LogisticRegression

# Email spam classification
X = [[100, 0.8],   # [word_count, spam_word_ratio]
     [50, 0.1],
     [200, 0.9],
     [80, 0.05]]
y = ['spam', 'not_spam', 'spam', 'not_spam']

model = LogisticRegression()
model.fit(X, y)

# Predict a new email
new_email = [[150, 0.7]]
prediction = model.predict(new_email)         # e.g. ['spam']
probability = model.predict_proba(new_email)  # e.g. [[0.15, 0.85]] for classes ['not_spam', 'spam']
```
Common Classification Algorithms#
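Commonly used classifiers include Logistic Regression, Support Vector Machines, Decision Trees, and Random Forests. Because scikit-learn estimators share the same fit/predict interface, swapping one for another is usually a one-line change. A rough sketch comparing them on a built-in toy dataset (the dataset choice and settings are illustrative, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Feature scaling matters for Logistic Regression and SVM; tree-based models don't need it
classifiers = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Same API for every estimator; cross_val_score handles fit/predict internally
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name:20s} mean accuracy: {scores.mean():.3f}")
```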
Regression#
Predicting continuous values:
```python
from sklearn.linear_model import LinearRegression

# House price prediction
X = [[1500, 3, 2],   # [sqft, bedrooms, bathrooms]
     [2000, 4, 2],
     [1200, 2, 1]]
y = [300000, 450000, 200000]

model = LinearRegression()
model.fit(X, y)

# Predict the price of a new house
new_house = [[1800, 3, 2]]
predicted_price = model.predict(new_house)  # ~$375,000 on this toy data

# Model interpretation: one coefficient per feature, plus an intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```
Common Regression Algorithms#
Linear Regression
Assumes a linear relationship. Fast, interpretable, and a good baseline model.
Polynomial Regression
Captures non-linear relationships. Risk of overfitting.
Random Forest Regressor
Handles non-linearity. Less sensitive to outliers.
Gradient Boosting
Often best performance. XGBoost, LightGBM, CatBoost.
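The same uniform interface applies to regressors. A sketch comparing the models above on synthetic non-linear data (the data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data: y = x^2 plus noise
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

regressors = {
    "Linear Regression": LinearRegression(),
    "Polynomial (degree 2)": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# R² via 5-fold cross-validation; higher is better
for name, reg in regressors.items():
    scores = cross_val_score(reg, X, y, cv=5, scoring="r2")
    print(f"{name:22s} R²: {scores.mean():.3f}")
```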
Evaluation Metrics#
Classification Metrics#
```python
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix
)

# Illustrative ground truth and predictions (binary: 1 = positive class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: % of correct predictions
accuracy = accuracy_score(y_true, y_pred)

# Precision: of predicted positives, how many are correct?
precision = precision_score(y_true, y_pred)

# Recall: of actual positives, how many did we find?
recall = recall_score(y_true, y_pred)

# F1: harmonic mean of precision and recall
f1 = f1_score(y_true, y_pred)

# Confusion matrix:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
```
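If you want all of these at once, classification_report prints precision, recall, F1, and support for each class (reusing the illustrative labels from above; the class names are just an example):

```python
from sklearn.metrics import classification_report

# Same illustrative labels as above
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Per-class precision, recall, F1, and support in one table
print(classification_report(y_true, y_pred, target_names=["not_spam", "spam"]))
```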
Regression Metrics#
```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score
)

# Illustrative true values and predictions
y_true = [300000, 450000, 200000, 350000]
y_pred = [310000, 430000, 220000, 340000]

# MSE: average squared error (penalizes large errors)
mse = mean_squared_error(y_true, y_pred)

# RMSE: square root of MSE (same unit as the target)
rmse = np.sqrt(mse)

# MAE: average absolute error
mae = mean_absolute_error(y_true, y_pred)

# R²: proportion of variance explained (1 is perfect; can be negative for bad models)
r2 = r2_score(y_true, y_pred)
```
The Bias-Variance Tradeoff#
Model Complexity vs Error
Underfitting (High Bias)
Model is too simple. Performs poorly on both training and test data.
Overfitting (High Variance)
Model is too complex. Performs great on training data but poorly on test data.
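One way to see the tradeoff in practice (a sketch on a hypothetical synthetic dataset): fit polynomials of increasing degree and compare training scores against cross-validated scores. An underfit model scores poorly on both; an overfit model scores well only on data it has already seen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine wave: the true relationship is non-linear
rng = np.random.RandomState(1)
X = rng.uniform(0, 6, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=80)

for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    train_r2 = model.fit(X, y).score(X, y)              # score on data the model has seen
    cv_r2 = cross_val_score(model, X, y, cv=5).mean()   # score on held-out data
    print(f"degree {degree:2d}: train R² = {train_r2:.2f}, CV R² = {cv_r2:.2f}")
```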
Cross-Validation#
Don't rely on a single train-test split:
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)          # toy data so the example runs as-is
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
```
Key Takeaways#
Remember
Start with simple models (linear/logistic regression), understand the baseline, then try more complex models. Always use cross-validation. Focus on the right metric for your problem - accuracy isn't always the answer.