Thursday, November 20, 2025

Machine Learning

Introduction to Machine Learning

What is Machine Learning?

Machine learning is a field of artificial intelligence (AI) in which computers learn from data to perform tasks without being explicitly programmed. Instead of being given explicit rules, an algorithm identifies patterns in a large dataset to make predictions or decisions.

Fig 1: How Machine Learning differs from standard coding.

"A computer program is said to learn from experience (E) with respect to some class of tasks (T) and performance measure (P), if its performance at tasks in T, as measured by P, improves with experience E."

- Tom Mitchell

Breaking Down the Definition (T, E, P)

To make this definition easier to understand, let's apply it to a real-world example: Building a Spam Filter.

  • 🎯 Task (T): The problem the computer needs to solve.
    Example: Identify and flag spam emails.
  • 📚 Experience (E): The historical data the model learns from.
    Example: A dataset of 10,000 emails labeled as "Spam" or "Not Spam."
  • 📈 Performance (P): How we check if the model is working.
    Example: The percentage of emails correctly classified (e.g., 98% accuracy).
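
To see T, E, and P together in code, here is a minimal sketch of the spam-filter example, assuming Python with scikit-learn; the four toy emails below stand in for the 10,000 labeled messages.

```python
# Spam filter sketch: Task (T) = classify emails, Experience (E) = labeled
# examples, Performance (P) = fraction classified correctly.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "meeting at 10am tomorrow",
          "claim your free offer", "project report attached"]   # Experience (E)
labels = ["Spam", "Not Spam", "Spam", "Not Spam"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)                      # learn word patterns from E

print(model.score(emails, labels))             # Performance (P): accuracy
print(model.predict(["free prize waiting"]))   # Task (T): flag a new email
```

Ideally, P would be measured on emails the model has never seen, which is exactly what the train-test split discussed later in this article is for.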

Why do we need Machine Learning?

Machine Learning shines on problems that would otherwise require complex rules and extensive hand-tuning.
It is ideal for replacing traditional systems that rely on a long, complicated list of manually crafted rules.

Example: Spam Filtering
Traditional (Rule-Based) Approach:
  • Manual Rules: Developers create rules to block specific keywords (e.g., "free," "viagra").
  • Complexity: The list of rules becomes extremely long and hard to manage.
  • Result: Brittle and difficult to maintain.

Machine Learning Approach:
  • Automatic Learning: The algorithm learns from examples of spam and legitimate emails.
  • Pattern Recognition: It automatically detects subtle patterns beyond just keywords.
  • Result: More accurate, robust, and easier to maintain.

How do these systems learn?

Machine Learning systems are typically classified into four major categories based on the amount and type of supervision they receive during their training.

1. Supervised Learning

  • 🧠 How it learns: The algorithm is trained on data that has been labeled with the correct answers. It's like learning with a teacher who provides an answer key.
  • 🎯 Goal: To predict a target value or classify data.
  • 💡 Common Examples: Spam detection, predicting house prices.
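
Here is a minimal sketch of the house-price example using scikit-learn; the sizes and prices are invented toy values.

```python
# Supervised learning: every training example comes with the correct answer.
from sklearn.linear_model import LinearRegression

sizes = [[50], [80], [120], [200]]                  # feature: size in m²
prices = [150_000, 240_000, 350_000, 560_000]       # label: known sale price

reg = LinearRegression().fit(sizes, prices)         # learn from labeled pairs
print(reg.predict([[100]]))                         # predict an unseen 100 m² house
```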

2. Unsupervised Learning

  • ๐Ÿ” How it learns: The algorithm is given unlabeled data and must find hidden patterns and structures on its own, without any answers provided.
  • ๐ŸŽฏ Goal: To discover underlying groupings or data structures.
  • ๐Ÿ’ก Common Examples: Customer segmentation, anomaly detection.
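
A rough sketch of customer segmentation with scikit-learn; the (age, annual spend) pairs are invented, and the algorithm receives no labels at all.

```python
# Unsupervised learning: find groups in unlabeled data.
from sklearn.cluster import KMeans

customers = [[25, 500], [27, 550], [45, 3000], [48, 2800], [62, 150], [65, 200]]
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)   # cluster assignment per customer, discovered from structure alone
```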

3. Semi-supervised Learning

  • ๐Ÿค How it learns: A combination of the two above. It trains on a dataset that contains a small amount of labeled data and a large amount of unlabeled data.
  • ๐ŸŽฏ Goal: To improve learning accuracy when labeling is expensive.
  • ๐Ÿ’ก Common Examples: Google Photos face recognition.

4. Reinforcement Learning

  • 🎮 How it learns: An "agent" learns by performing actions and receiving rewards or penalties. It learns the best strategy through trial and error.
  • 🎯 Goal: To make a sequence of optimal decisions.
  • 💡 Common Examples: Chess AI, robotics, resource management.
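
The sketch below shows the trial-and-error idea on the simplest possible setup, a two-armed bandit with invented payout probabilities; real reinforcement learning problems such as chess or robotics add states and long-term planning on top of this.

```python
# Reinforcement learning in miniature: act, observe the reward, update the strategy.
import random

payout_prob = [0.3, 0.7]     # hidden reward probability of each action (unknown to the agent)
value = [0.0, 0.0]           # the agent's running estimate of each action's reward
counts = [0, 0]

random.seed(0)
for step in range(1000):
    # epsilon-greedy: usually exploit the best-looking action, sometimes explore
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = value.index(max(value))
    reward = 1 if random.random() < payout_prob[action] else 0   # reward = 1, penalty = 0
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]   # incremental average

print(value)   # estimates should end up near [0.3, 0.7], so action 1 is preferred
```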

Batch Learning vs. Online Learning

📦 Batch Learning

Often called "Offline Learning." The model is trained on all of the available data at once.

  • Static: Predicts based only on what it already knows.
  • Updates: Must be retrained from scratch to learn new things.
  • ✅ Pros: Stable and simple.
  • ❌ Cons: Slow to train, cannot adapt in real-time.

⚡ Online Learning

The model is trained incrementally, "on the fly," as data arrives in small batches.

  • Dynamic: Learns continuously as new data arrives.
  • Updates: Adapts immediately to changes.
  • ✅ Pros: Fast and resource-efficient.
  • ❌ Cons: Risk of bad data affecting performance instantly.
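
The sketch below contrasts the two modes with scikit-learn on a simulated data stream: the online model updates itself one mini-batch at a time via partial_fit, while the batch model is refit from scratch on everything it has ever seen.

```python
# Batch vs. online learning on a simulated data stream.
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

rng = np.random.default_rng(0)
online = SGDClassifier(random_state=0)
seen_X, seen_y = [], []

for _ in range(50):                                   # new data keeps arriving
    X = rng.normal(size=(20, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Online: incorporate just the new mini-batch, cheap and immediate.
    online.partial_fit(X, y, classes=np.array([0, 1]))

    # Batch: store everything and retrain from scratch on the full history.
    seen_X.append(X)
    seen_y.append(y)

batch = LogisticRegression().fit(np.vstack(seen_X), np.concatenate(seen_y))
print(online.predict([[1, 1]]), batch.predict([[1, 1]]))
```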

Instance-Based vs. Model-Based Learning

📖 Instance-Based

"Like an open-book exam. You look for similar examples in the book to solve the problem."
  • Stores training examples in memory.
  • Predicts by finding "similar" past examples.
🔹 Algorithm: k-Nearest Neighbors (k-NN)

⚙️ Model-Based

"Like a closed-book exam. You study the rules and principles, then use them without the book."
  • Builds a model (rules) from data.
  • Original data is not needed after training.
🔹 Algorithms: Linear Regression, Neural Networks
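
A small sketch of both styles on the same toy data; the numbers are invented, and logistic regression stands in here for the model-based family alongside the linear models and neural networks named above.

```python
# Instance-based vs. model-based learning on the same toy problem.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# Instance-based: keeps the training points and compares new inputs to them ("open book").
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Model-based: compresses the data into learned parameters, then discards it ("closed book").
logreg = LogisticRegression().fit(X, y)

print(knn.predict([[4]]), logreg.predict([[4]]))
print(logreg.coef_, logreg.intercept_)   # the learned "rules" kept after training
```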

Main Challenges in Machine Learning

⚠️ "Bad Data" (Data Quality Issues)

The performance of any ML model is fundamentally limited by the quality of its data.

  • 📉 Insufficient Quantity:
    Complex models (like image recognition) need thousands or millions of examples.
  • 📊 Nonrepresentative Data (Bias):
    If data doesn't reflect reality, predictions will be wrong. (e.g., training a voice assistant only on adult voices makes it fail with children).
  • 🗑️ Poor-Quality Data:
    Data full of errors, outliers, and noise makes it hard to find patterns. *Note: Data cleaning is often 80% of the work.*
  • 🚫 Irrelevant Features:
    "Garbage in, garbage out." The model needs relevant features to learn successfully.

"Bad Model" (Algorithm Issues)

🔥 Overfitting

The model learns the training data too well, including the noise. It works great on past data but fails on new data.

"Like a student who memorizes the answers to a practice test but fails the real exam."
  • Cause: The model is too complex for the amount of data.
  • Solution: Simplify the model or gather more data.

❄️ Underfitting

The model is too simple to capture the underlying structure. It performs poorly on everything.

"Like trying to fit a straight line to complex, curved data."
  • Cause: The model is not powerful enough.
  • Solution: Use a more complex model or better features.
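
Both failure modes can be seen by fitting polynomials of different complexity to noisy, curved data. This sketch uses scikit-learn and an invented sine-shaped dataset; the exact scores will vary.

```python
# Underfitting vs. overfitting: compare training and test scores as complexity grows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=100)   # curved data + noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 20):   # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, round(model.score(X_train, y_train), 2),
          round(model.score(X_test, y_test), 2))

# Typically: degree 1 scores poorly on both sets (underfitting), while the
# high-degree fit scores best on training data but worse on test data (overfitting).
```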

🌟 The Golden Rule: The Train-Test Split

You never test your model on the same data it was trained on. That's like giving a student an exam with the exact same questions they used to study. Instead, we typically use three sets:

STEP 1

📚 Training Set

The largest portion of the data. The model learns the underlying patterns from this set.

"This is the textbook and practice problems a student uses to learn the subject."
STEP 2

🛠️ Validation Set

Used to tune "hyperparameters" (complexity) and select the best version of the model. Prevents overfitting.

"This is the mock exam. The student takes it to check their progress and adjust study strategies before the real test."
STEP 3

🎓 Test Set

Kept completely separate. Used only once at the very end to evaluate final performance.

⚠️ CRITICAL RULE:
You must never tune the model based on the test set results.
"This is the final, proctored exam. The score here represents how well the student will do in the real world."

Hyperparameters & Tuning

🎛️ What is a Hyperparameter?

A hyperparameter is a configuration setting for a model that is set before the training process begins. Unlike standard parameters, it is not learned from the data itself; rather, it controls how the model learns.

💡 Examples:
  • The k in k-Nearest Neighbors.
  • The maximum depth of a Decision Tree.
  • The learning rate in a Neural Network.
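
In scikit-learn, for instance, these settings are passed to the constructor before fitting, while ordinary parameters are learned during fit; the specific values below are arbitrary.

```python
# Hyperparameters are fixed before training and control how learning happens.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

knn = KNeighborsClassifier(n_neighbors=5)     # the "k" in k-NN
tree = DecisionTreeClassifier(max_depth=3)    # maximum depth of the decision tree
# By contrast, the tree's split thresholds (its parameters) are only
# determined once .fit(X, y) is called on actual training data.
```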

The Tuning & Selection Workflow

1. 🧪 Experiment

Train several different models (or one model with various hyperparameter settings) on the Training Set.

2. 📏 Evaluate

Measure the performance of each trained model using the Validation Set.

3. 🏆 Select

Choose the model and hyperparameter combination that achieved the best score on the validation set.

4. 🏁 Final Test

This single, best model is evaluated one last time on the Test Set to get a realistic measure of its performance.
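
The four steps map directly onto a short script. This sketch tunes k for a k-NN classifier on the built-in iris dataset, with the candidate values chosen just for illustration.

```python
# Experiment -> Evaluate -> Select -> Final Test, done by hand for one hyperparameter.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

best_k, best_score = None, -1.0
for k in (1, 3, 5, 7, 9):                                          # 1. Experiment
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)                              # 2. Evaluate on the validation set
    if score > best_score:
        best_k, best_score = k, score                              # 3. Select the best setting

final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final_model.score(X_test, y_test))                   # 4. Final Test, used only once
```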

Common Tuning Strategies

🧊 Grid Search

You define a "grid" of all hyperparameter combinations you want to test.

  • Exhaustively tries every single combination to find the absolute best.
  • Pros: Very thorough.
  • Cons: Can be very slow computationally.

🎲 Randomized Search

You specify a range for each hyperparameter and the system tests random combinations.

  • Tests a fixed number of random combinations.
  • Pros: Much faster and often finds a result that is just as good.
  • Cons: Might miss the absolute "perfect" combination (but rarely matters).
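
scikit-learn ships helpers for both strategies; the sketch below applies them to a random forest on the iris dataset, with parameter ranges picked purely for illustration.

```python
# Grid search vs. randomized search with scikit-learn's built-in helpers.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(random_state=0)

# Grid search: exhaustively tries all 3 x 3 = 9 combinations (times the CV folds).
grid = GridSearchCV(forest, {"n_estimators": [50, 100, 200],
                             "max_depth": [3, 5, None]}, cv=3)
grid.fit(X, y)

# Randomized search: samples only n_iter combinations from the candidate lists.
rand = RandomizedSearchCV(forest, {"n_estimators": [10, 50, 100, 200, 300],
                                   "max_depth": [3, 5, 10, None]},
                          n_iter=5, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_)
print(rand.best_params_)
```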

Data Mismatch: When Models Fail in the Real World

Even with perfect training, a model can fail spectacularly if the data it encounters in production has a different distribution than the data it was trained on.

🚗 The Self-Driving Car Analogy

"It's like training a self-driving car in sunny, daytime conditions and then deploying it for the first time in a snowy blizzard at night. The training environment does not match the real world."

⚠️ Common Causes

  • Training on Outdated Data (Data Drift)

    The world changes constantly. A model trained on pre-pandemic shopping behavior will likely fail on post-pandemic data.

  • Lab vs. Production Environment

    Data collected in a controlled setting is often too "clean."

    Example: A plant app trained on professional HD photos, but users upload blurry, dark phone photos.
  • Sampling Bias

    The data collection method may have accidentally missed certain groups or scenarios.

🛠️ How to Address It

  • ✅ Create Representative Test Sets

    This is the most critical step. Your validation/test sets must reflect real-world data to act as an early warning system.

  • 📊 Monitor Live Performance

    Track accuracy after deployment. If it drops, something has changed in the environment.

  • 🔄 Regular Retraining

    Implement a pipeline to periodically retrain your model on fresh, recent data.

  • 🎨 Data Augmentation

    Artificially increase diversity. For the plant app, add blur and noise to your clean training images to simulate bad phone cameras (a rough sketch follows this list).
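
For the plant-app example, here is one possible augmentation sketch, assuming the images are NumPy arrays scaled to [0, 1] and that SciPy is available.

```python
# Degrade clean training images so they resemble blurry, dark phone photos.
import numpy as np
from scipy.ndimage import gaussian_filter

def augment(image, rng):
    sigma = rng.uniform(0.5, 2.0)
    blurred = gaussian_filter(image, sigma=(sigma, sigma, 0))   # blur height/width, not channels
    darker = blurred * rng.uniform(0.4, 0.9)                    # simulate low light
    noisy = darker + rng.normal(scale=0.02, size=image.shape)   # simulate sensor noise
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = rng.uniform(size=(64, 64, 3))      # stand-in for one clean HD photo
augmented = augment(clean, rng)
print(clean.mean(), augmented.mean())      # the augmented copy is darker and noisier
```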

The "No Free Lunch" (NFL) Theorem

"There is no single machine learning algorithm that is universally the best for every problem."

🧰 The Toolbox Analogy

Think of algorithms as tools. A hammer is perfect for nails but useless for screws. A screwdriver is great for screws but can't handle a bolt. The best tool always depends entirely on the specific job.

🚫 There is No "Master Algorithm"

Deep Learning is excellent for images but can be a poor choice for simple tabular (spreadsheet) data. Sometimes, a simpler model works better.

🧩 Assumptions are Key

Every algorithm makes assumptions (e.g., Linear Regression assumes the relationship between inputs and output is roughly a straight line). It only works well if those assumptions match your data.

🧪 Experimentation is Essential

This is the theoretical reason why we tune models. You can't know which tool is right until you try a few promising ones.

🎯 Focus on the Problem

Don't obsess over the algorithm first. Understanding your data and business problem will guide you to the right set of tools to test.


Thanks for reading! If you found this guide helpful, feel free to share it or leave a comment below.


📚 References & Bibliography

Key Textbooks

  • Machine Learning
    Tom M. Mitchell (1997), McGraw-Hill.

    *Source of the formal definition of Machine Learning (T, E, P) used in this article.

  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
    Aurรฉlien Gรฉron (3rd Edition, 2022), O'Reilly Media.

    *Excellent practical guide for the concepts of Overfitting, Underfitting, and Validation.

Research Papers & Theorems

  • "No Free Lunch Theorems for Optimization"
    D.H. Wolpert & W.G. Macready (1997), IEEE Transactions on Evolutionary Computation.

    *The mathematical foundation for the "No Master Algorithm" concept.

  • "A Few Useful Things to Know About Machine Learning"
    Pedro Domingos (2012), Communications of the ACM.

    *A classic paper discussing feature engineering, overfitting, and data intuition.


About the Author

L.C. Sankalpa Lokuliyanage

I am currently pursuing a Master's in Software at Kyungpook National University, South Korea.

My background includes a BSc. (Hons) in Computer Science (First Class) from the University of Wolverhampton, UK, and a PGD in Cybersecurity (Class Top) from the London School of Business and Finance, Singapore.


