Thursday, November 20, 2025

Machine Learning

Introduction to Machine Learning

What is Machine Learning?

Machine learning is a field of artificial intelligence (AI) in which computers learn from data to perform tasks without being explicitly programmed. Instead of being given explicit rules, an algorithm identifies patterns in a large dataset to make predictions or decisions.

Fig 1: How Machine Learning differs from standard coding.

"A computer program is said to learn from experience (E) with respect to some class of tasks (T) and performance measure (P), if its performance at tasks in T, as measured by P, improves with experience E."

- Tom Mitchell

Breaking Down the Definition (T, E, P)

To make this definition easier to understand, let's apply it to a real-world example: Building a Spam Filter.

  • 🎯 Task (T): The problem the computer needs to solve.
    Example: Identify and flag spam emails.
  • 📚 Experience (E): The historical data the model learns from.
    Example: A dataset of 10,000 emails labeled as "Spam" or "Not Spam."
  • 📈 Performance (P): How we check if the model is working.
    Example: The percentage of emails correctly classified (e.g., 98% accuracy).
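
To see T, E, and P together in code, here is a minimal sketch of the spam-filter example, assuming Python with scikit-learn; the four toy emails below stand in for the 10,000 labeled messages.

```python
# Spam filter sketch: Task (T) = classify emails, Experience (E) = labeled
# examples, Performance (P) = fraction classified correctly.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "meeting at 10am tomorrow",
          "claim your free offer", "project report attached"]   # Experience (E)
labels = ["Spam", "Not Spam", "Spam", "Not Spam"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)                      # learn word patterns from E

print(model.score(emails, labels))             # Performance (P): accuracy
print(model.predict(["free prize waiting"]))   # Task (T): flag a new email
```

Ideally, P would be measured on emails the model has never seen, which is exactly what the train-test split discussed later in this article is for.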

Why do we need Machine Learning?

Machine Learning shines on problems that would otherwise require complex rules and extensive hand-tuning.
It is ideal for replacing traditional systems that rely on a long, complicated list of manually crafted rules.

Example: Spam Filtering
Traditional (Rule-Based) Approach:
  • Manual Rules: Developers create rules to block specific keywords (e.g., "free," "viagra").
  • Complexity: The list of rules becomes extremely long and hard to manage.
  • Result: Brittle and difficult to maintain.

Machine Learning Approach:
  • Automatic Learning: The algorithm learns from examples of spam and legitimate emails.
  • Pattern Recognition: It automatically detects subtle patterns beyond just keywords.
  • Result: More accurate, robust, and easier to maintain.

How do these systems learn?

Machine Learning systems are typically classified into four major categories based on the amount and type of supervision they receive during their training.

1. Supervised Learning

  • 🧠 How it learns: The algorithm is trained on data that has been labeled with the correct answers. It's like learning with a teacher who provides an answer key.
  • 🎯 Goal: To predict a target value or classify data.
  • 💡 Common Examples: Spam detection, predicting house prices.
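
Here is a minimal sketch of the house-price example using scikit-learn; the sizes and prices are invented toy values.

```python
# Supervised learning: every training example comes with the correct answer.
from sklearn.linear_model import LinearRegression

sizes = [[50], [80], [120], [200]]                  # feature: size in m²
prices = [150_000, 240_000, 350_000, 560_000]       # label: known sale price

reg = LinearRegression().fit(sizes, prices)         # learn from labeled pairs
print(reg.predict([[100]]))                         # predict an unseen 100 m² house
```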

2. Unsupervised Learning

  • ๐Ÿ” How it learns: The algorithm is given unlabeled data and must find hidden patterns and structures on its own, without any answers provided.
  • ๐ŸŽฏ Goal: To discover underlying groupings or data structures.
  • ๐Ÿ’ก Common Examples: Customer segmentation, anomaly detection.
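
A rough sketch of customer segmentation with scikit-learn; the (age, annual spend) pairs are invented, and the algorithm receives no labels at all.

```python
# Unsupervised learning: find groups in unlabeled data.
from sklearn.cluster import KMeans

customers = [[25, 500], [27, 550], [45, 3000], [48, 2800], [62, 150], [65, 200]]
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)   # cluster assignment per customer, discovered from structure alone
```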

3. Semi-supervised Learning

  • ๐Ÿค How it learns: A combination of the two above. It trains on a dataset that contains a small amount of labeled data and a large amount of unlabeled data.
  • ๐ŸŽฏ Goal: To improve learning accuracy when labeling is expensive.
  • ๐Ÿ’ก Common Examples: Google Photos face recognition.

4. Reinforcement Learning

  • 🎮 How it learns: An "agent" learns by performing actions and receiving rewards or penalties. It learns the best strategy through trial and error.
  • 🎯 Goal: To make a sequence of optimal decisions.
  • 💡 Common Examples: Chess AI, robotics, resource management.
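
The sketch below shows the trial-and-error idea on the simplest possible setup, a two-armed bandit with invented payout probabilities; real reinforcement learning problems such as chess or robotics add states and long-term planning on top of this.

```python
# Reinforcement learning in miniature: act, observe the reward, update the strategy.
import random

payout_prob = [0.3, 0.7]     # hidden reward probability of each action (unknown to the agent)
value = [0.0, 0.0]           # the agent's running estimate of each action's reward
counts = [0, 0]

random.seed(0)
for step in range(1000):
    # epsilon-greedy: usually exploit the best-looking action, sometimes explore
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = value.index(max(value))
    reward = 1 if random.random() < payout_prob[action] else 0   # reward = 1, penalty = 0
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]   # incremental average

print(value)   # estimates should end up near [0.3, 0.7], so action 1 is preferred
```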

Batch Learning vs. Online Learning

📦 Batch Learning

Often called "Offline Learning." The model is trained on all of the available data at once.

  • Static: Predicts based only on what it already knows.
  • Updates: Must be retrained from scratch to learn new things.
  • ✅ Pros: Stable and simple.
  • ❌ Cons: Slow to train, cannot adapt in real-time.

⚡ Online Learning

The model is trained incrementally, "on the fly," as data arrives in small batches.

  • Dynamic: Learns continuously as new data arrives.
  • Updates: Adapts immediately to changes.
  • ✅ Pros: Fast and resource-efficient.
  • ❌ Cons: Risk of bad data affecting performance instantly.
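
The sketch below contrasts the two modes with scikit-learn on a simulated data stream: the online model updates itself one mini-batch at a time via partial_fit, while the batch model is refit from scratch on everything it has ever seen.

```python
# Batch vs. online learning on a simulated data stream.
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

rng = np.random.default_rng(0)
online = SGDClassifier(random_state=0)
seen_X, seen_y = [], []

for _ in range(50):                                   # new data keeps arriving
    X = rng.normal(size=(20, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Online: incorporate just the new mini-batch, cheap and immediate.
    online.partial_fit(X, y, classes=np.array([0, 1]))

    # Batch: store everything and retrain from scratch on the full history.
    seen_X.append(X)
    seen_y.append(y)

batch = LogisticRegression().fit(np.vstack(seen_X), np.concatenate(seen_y))
print(online.predict([[1, 1]]), batch.predict([[1, 1]]))
```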

Instance-Based vs. Model-Based Learning

📖 Instance-Based

"Like an open-book exam. You look for similar examples in the book to solve the problem."
  • Stores training examples in memory.
  • Predicts by finding "similar" past examples.
🔹 Algorithm: k-Nearest Neighbors (k-NN)

⚙️ Model-Based

"Like a closed-book exam. You study the rules and principles, then use them without the book."
  • Builds a model (rules) from data.
  • Original data is not needed after training.
🔹 Algorithms: Linear Regression, Neural Networks
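
A small sketch of both styles on the same toy data; the numbers are invented, and logistic regression stands in here for the model-based family alongside the linear models and neural networks named above.

```python
# Instance-based vs. model-based learning on the same toy problem.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# Instance-based: keeps the training points and compares new inputs to them ("open book").
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Model-based: compresses the data into learned parameters, then discards it ("closed book").
logreg = LogisticRegression().fit(X, y)

print(knn.predict([[4]]), logreg.predict([[4]]))
print(logreg.coef_, logreg.intercept_)   # the learned "rules" kept after training
```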

Main Challenges in Machine Learning

⚠️ "Bad Data" (Data Quality Issues)

The performance of any ML model is fundamentally limited by the quality of its data.

  • 📉 Insufficient Quantity:
    Complex models (like image recognition) need thousands or millions of examples.
  • 📊 Nonrepresentative Data (Bias):
    If data doesn't reflect reality, predictions will be wrong. (e.g., training a voice assistant only on adult voices makes it fail with children).
  • 🗑️ Poor-Quality Data:
    Data full of errors, outliers, and noise makes it hard to find patterns. *Note: Data cleaning is often 80% of the work.*
  • 🚫 Irrelevant Features:
    "Garbage in, garbage out." The model needs relevant features to learn successfully.

"Bad Model" (Algorithm Issues)

🔥 Overfitting

The model learns the training data too well, including the noise. It works great on past data but fails on new data.

"Like a student who memorizes the answers to a practice test but fails the real exam."
  • Cause: The model is too complex for the amount of data.
  • Solution: Simplify the model or gather more data.

❄️ Underfitting

The model is too simple to capture the underlying structure. It performs poorly on everything.

"Like trying to fit a straight line to complex, curved data."
  • Cause: The model is not powerful enough.
  • Solution: Use a more complex model or better features.
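
Both failure modes can be seen by fitting polynomials of different complexity to noisy, curved data. This sketch uses scikit-learn and an invented sine-shaped dataset; the exact scores will vary.

```python
# Underfitting vs. overfitting: compare training and test scores as complexity grows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=100)   # curved data + noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 20):   # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, round(model.score(X_train, y_train), 2),
          round(model.score(X_test, y_test), 2))

# Typically: degree 1 scores poorly on both sets (underfitting), while the
# high-degree fit scores best on training data but worse on test data (overfitting).
```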

🌟 The Golden Rule: The Train-Test Split

You never test your model on the same data it was trained on. That's like giving a student an exam with the exact same questions they used to study. Instead, we typically use three sets:

STEP 1

📚 Training Set

The largest portion of the data. The model learns the underlying patterns from this set.

"This is the textbook and practice problems a student uses to learn the subject."
STEP 2

🛠️ Validation Set

Used to tune "hyperparameters" (complexity) and select the best version of the model. Prevents overfitting.

"This is the mock exam. The student takes it to check their progress and adjust study strategies before the real test."
STEP 3

🎓 Test Set

Kept completely separate. Used only once at the very end to evaluate final performance.

⚠️ CRITICAL RULE:
You must never tune the model based on the test set results.
"This is the final, proctored exam. The score here represents how well the student will do in the real world."

Hyperparameters & Tuning

🎛️ What is a Hyperparameter?

A hyperparameter is a configuration setting for a model that is set before the training process begins. Unlike standard parameters, it is not learned from the data itself; rather, it controls how the model learns.

💡 Examples:
  • The k in k-Nearest Neighbors.
  • The maximum depth of a Decision Tree.
  • The learning rate in a Neural Network.
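
In scikit-learn, for instance, these settings are passed to the constructor before fitting, while ordinary parameters are learned during fit; the specific values below are arbitrary.

```python
# Hyperparameters are fixed before training and control how learning happens.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

knn = KNeighborsClassifier(n_neighbors=5)     # the "k" in k-NN
tree = DecisionTreeClassifier(max_depth=3)    # maximum depth of the decision tree
# By contrast, the tree's split thresholds (its parameters) are only
# determined once .fit(X, y) is called on actual training data.
```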

The Tuning & Selection Workflow

1. 🧪 Experiment

Train several different models (or one model with various hyperparameter settings) on the Training Set.

2. 📏 Evaluate

Measure the performance of each trained model using the Validation Set.

3. 🏆 Select

Choose the model and hyperparameter combination that achieved the best score on the validation set.

4. 🏁 Final Test

This single, best model is evaluated one last time on the Test Set to get a realistic measure of its performance.
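
The four steps map directly onto a short script. This sketch tunes k for a k-NN classifier on the built-in iris dataset, with the candidate values chosen just for illustration.

```python
# Experiment -> Evaluate -> Select -> Final Test, done by hand for one hyperparameter.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

best_k, best_score = None, -1.0
for k in (1, 3, 5, 7, 9):                                          # 1. Experiment
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)                              # 2. Evaluate on the validation set
    if score > best_score:
        best_k, best_score = k, score                              # 3. Select the best setting

final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final_model.score(X_test, y_test))                   # 4. Final Test, used only once
```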

Common Tuning Strategies

🧊 Grid Search

You define a "grid" of all hyperparameter combinations you want to test.

  • Exhaustively tries every single combination to find the absolute best.
  • Pros: Very thorough.
  • Cons: Can be very slow computationally.

🎲 Randomized Search

You specify a range for each hyperparameter and the system tests random combinations.

  • Tests a fixed number of random combinations.
  • Pros: Much faster and often finds a result that is just as good.
  • Cons: Might miss the absolute "perfect" combination (but rarely matters).
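
scikit-learn ships helpers for both strategies; the sketch below applies them to a random forest on the iris dataset, with parameter ranges picked purely for illustration.

```python
# Grid search vs. randomized search with scikit-learn's built-in helpers.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(random_state=0)

# Grid search: exhaustively tries all 3 x 3 = 9 combinations (times the CV folds).
grid = GridSearchCV(forest, {"n_estimators": [50, 100, 200],
                             "max_depth": [3, 5, None]}, cv=3)
grid.fit(X, y)

# Randomized search: samples only n_iter combinations from the candidate lists.
rand = RandomizedSearchCV(forest, {"n_estimators": [10, 50, 100, 200, 300],
                                   "max_depth": [3, 5, 10, None]},
                          n_iter=5, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_)
print(rand.best_params_)
```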

Data Mismatch: When Models Fail in the Real World

Even with perfect training, a model can fail spectacularly if the data it encounters in production has a different distribution than the data it was trained on.

🚗 The Self-Driving Car Analogy

"It's like training a self-driving car in sunny, daytime conditions and then deploying it for the first time in a snowy blizzard at night. The training environment does not match the real world."

⚠️ Common Causes

  • Training on Outdated Data (Data Drift)

    The world changes constantly. A model trained on pre-pandemic shopping behavior will likely fail on post-pandemic data.

  • Lab vs. Production Environment

    Data collected in a controlled setting is often too "clean."

    Example: A plant app trained on professional HD photos, but users upload blurry, dark phone photos.
  • Sampling Bias

    The data collection method may have accidentally missed certain groups or scenarios.

🛠️ How to Address It

  • ✅ Create Representative Test Sets

    This is the most critical step. Your validation/test sets must reflect real-world data to act as an early warning system.

  • 📊 Monitor Live Performance

    Track accuracy after deployment. If it drops, something has changed in the environment.

  • 🔄 Regular Retraining

    Implement a pipeline to periodically retrain your model on fresh, recent data.

  • 🎨 Data Augmentation

    Artificially increase diversity. For the plant app, add blur and noise to your clean training images to simulate bad phone cameras (a rough sketch follows this list).
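
For the plant-app example, here is one possible augmentation sketch, assuming the images are NumPy arrays scaled to [0, 1] and that SciPy is available.

```python
# Degrade clean training images so they resemble blurry, dark phone photos.
import numpy as np
from scipy.ndimage import gaussian_filter

def augment(image, rng):
    sigma = rng.uniform(0.5, 2.0)
    blurred = gaussian_filter(image, sigma=(sigma, sigma, 0))   # blur height/width, not channels
    darker = blurred * rng.uniform(0.4, 0.9)                    # simulate low light
    noisy = darker + rng.normal(scale=0.02, size=image.shape)   # simulate sensor noise
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = rng.uniform(size=(64, 64, 3))      # stand-in for one clean HD photo
augmented = augment(clean, rng)
print(clean.mean(), augmented.mean())      # the augmented copy is darker and noisier
```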

The "No Free Lunch" (NFL) Theorem

"There is no single machine learning algorithm that is universally the best for every problem."

🧰 The Toolbox Analogy

Think of algorithms as tools. A hammer is perfect for nails but useless for screws. A screwdriver is great for screws but can't handle a bolt. The best tool always depends entirely on the specific job.

🚫 There is No "Master Algorithm"

Deep Learning is excellent for images but can be a poor choice for simple tabular (spreadsheet) data. Sometimes, a simpler model works better.

🧩 Assumptions are Key

Every algorithm makes assumptions (e.g., Linear Regression assumes the relationship between inputs and output is roughly a straight line). It only works well if those assumptions match your data.

🧪 Experimentation is Essential

This is the theoretical reason why we tune models. You can't know which tool is right until you try a few promising ones.

🎯 Focus on the Problem

Don't obsess over the algorithm first. Understanding your data and business problem will guide you to the right set of tools to test.


Thanks for reading! If you found this guide helpful, feel free to share it or leave a comment below.


📚 References & Bibliography

Key Textbooks

  • Machine Learning
    Tom M. Mitchell (1997), McGraw-Hill.

    *Source of the formal definition of Machine Learning (T, E, P) used in this article.

  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
    Aurรฉlien Gรฉron (3rd Edition, 2022), O'Reilly Media.

    *Excellent practical guide for the concepts of Overfitting, Underfitting, and Validation.

Research Papers & Theorems

  • "No Free Lunch Theorems for Optimization"
    D.H. Wolpert & W.G. Macready (1997), IEEE Transactions on Evolutionary Computation.

    *The mathematical foundation for the "No Master Algorithm" concept.

  • "A Few Useful Things to Know About Machine Learning"
    Pedro Domingos (2012), Communications of the ACM.

    *A classic paper discussing feature engineering, overfitting, and data intuition.


About the Author

L.C. Sankalpa Lokuliyanage

I am currently pursuing a Master's in Software at Kyungpook National University, South Korea.

My background includes a BSc. (Hons) in Computer Science (First Class) from the University of Wolverhampton, UK, and a PGD in Cybersecurity (Class Top) from the London School of Business and Finance, Singapore.


