Machine learning - Training Theory and Overfitting
Understand generalization vs. overfitting, computational learning theory basics, and key strategies to mitigate overfitting and bias.
Summary
Theory of Machine Learning: Generalization, Overfitting, and the Feasibility of Learning
Introduction
When we train a machine learning model, our ultimate goal isn't just to perform well on training data—it's to make accurate predictions on new, unseen examples. This document explores the fundamental concepts that determine whether a model will succeed or fail at this goal, covering the theory behind why some models generalize better than others and what determines whether a learning problem is even feasible to solve.
Generalization: The Core Goal
Generalization is the ability of a machine learning model to perform accurately on new examples it has never seen before. This is the ultimate measure of success for any machine learning system. A model that memorizes the training data perfectly but fails on new data is useless in practice.
Think of generalization like learning to drive. You don't practice driving only on the exact same stretch of highway you'll use forever. You practice on various roads so you can handle new roads you've never seen. The same principle applies to ML models.
Understanding Underfitting and Overfitting
When we train a model, there are two opposite ways it can fail to generalize well:
Underfitting
Underfitting occurs when the model is too simple to capture the underlying relationship in the data. The hypothesis (the function the model learns) is not flexible enough, so it performs poorly on both training data and new data.
For example, if you try to fit a straight line to data that actually follows a curved pattern, your line will make systematic errors everywhere. The model hasn't learned the true structure of the problem.
Underfitting is typically solved by using a more complex model.
Overfitting
Overfitting is the opposite problem: the model is too complex and learns to fit the noise and peculiarities of the training data rather than the underlying pattern. While an overfit model performs very well on training data, it performs poorly on new data because it has essentially memorized specific training examples rather than learning the general pattern.
Consider fitting a wiggly curve through every single data point, including the measurement errors. This curve might perfectly connect your training points, but when you encounter new data with its own measurement errors in different places, the curve will make terrible predictions.
Picture noisy data with a smooth trend line through it: a reasonable model captures the general trend, while an overfit model zigzags through every point, capturing noise rather than the true relationship.
The key insight: there's a sweet spot between underfitting and overfitting—we want a model complex enough to capture the true pattern, but not so complex that it fits noise.
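This trade-off can be sketched numerically. The following is an illustrative example (synthetic data and arbitrary degree choices, not a definitive recipe) that fits polynomials of degree 1, 2, and 15 to noisy quadratic data: degree 1 underfits, degree 2 roughly matches the true pattern, and degree 15 chases the noise.

```python
import numpy as np

# Synthetic illustration of underfitting vs. overfitting: fit polynomials
# of increasing degree and compare training error with error on noise-free
# held-out points.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
x_test = np.linspace(0, 1, 100)

def true_fn(x):
    return 1.5 * x**2 - x            # the underlying pattern we hope to recover

y_train = true_fn(x_train) + rng.normal(0, 0.05, x_train.shape)
y_test = true_fn(x_test)             # noise-free targets for evaluation

results = {}
for degree in (1, 2, 15):            # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.5f}, test MSE {test_mse:.5f}")
```

The degree-15 model achieves the lowest training error but not the lowest held-out error, which is exactly the gap generalization is about.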
Computational Learning Theory: Is Learning Feasible?
Beyond the underfitting-overfitting trade-off, we must ask a more fundamental question: is a learning problem even feasible to solve? This is what computational learning theory addresses.
Computational learning theory studies whether learning problems can be solved efficiently—specifically, whether they can be solved in polynomial time (reasonable computational time).
What Makes a Problem Feasible?
A learning problem is considered feasible if we can find an accurate hypothesis in polynomial time. This means the number of operations needed grows reasonably with the input size (not exponentially).
Positive and Negative Results
Learning theory produces two types of results:
Positive results prove that certain function classes can be learned in polynomial time. For example, we know that linear classifiers can be learned efficiently using algorithms like perceptron learning or logistic regression.
Negative results prove that other function classes cannot be learned in polynomial time (unless P=NP, a major unsolved problem in computer science). For instance, learning arbitrary Boolean formulas is NP-hard, meaning there's likely no efficient algorithm for it.
Sample and Computational Complexity
Learning theory analyzes two complementary aspects of feasibility:
Sample Complexity
Sample complexity measures how many training examples you need to learn an accurate hypothesis. This answers: "How much data do I need?"
Even if an algorithm is computationally efficient, it might require an impractical number of training examples. Learning theory provides bounds on this, often showing that sample complexity grows with:
The complexity of the hypothesis class (more complex models need more data)
The desired accuracy level (higher accuracy requires more data)
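Both dependencies appear in the standard PAC sample-complexity bound for a finite hypothesis class: $m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$ examples suffice for a consistent learner to have true error at most $\epsilon$ with probability at least $1 - \delta$. A small sketch of that bound:

```python
import math

def pac_sample_bound(h_size: int, epsilon: float, delta: float) -> int:
    """PAC sample-complexity bound for a finite hypothesis class:
    m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice so that,
    with probability at least 1 - delta, any hypothesis consistent with
    the training data has true error at most epsilon."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# A richer hypothesis class and a tighter accuracy target both demand more data:
print(pac_sample_bound(h_size=1_000, epsilon=0.05, delta=0.05))      # baseline
print(pac_sample_bound(h_size=1_000_000, epsilon=0.05, delta=0.05))  # larger |H|
print(pac_sample_bound(h_size=1_000, epsilon=0.01, delta=0.05))      # smaller eps
```

Note the logarithmic dependence on $|H|$ and $1/\delta$ but the linear dependence on $1/\epsilon$: tightening the accuracy target is the expensive part.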
Computational Complexity
Computational complexity measures how much computation time the learning algorithm requires. This answers: "How fast can I learn?"
A learning algorithm might theoretically work but be impractically slow. Computational complexity analysis identifies whether algorithms run in polynomial time or exponential time.
Important: Learning theory typically provides probabilistic performance bounds rather than absolute guarantees. These bounds say something like: "With high probability, this algorithm will find an accurate hypothesis if it has enough data and computation time." The bounds often aren't tight in practice, but they're essential for understanding fundamental limits.
Data Requirements: The Foundation
Before learning theory applies, there's a practical requirement: you need appropriate training data.
Machine learning models require:
Large datasets: Most modern models need thousands or millions of examples to learn effectively. The exact amount depends on model complexity and problem difficulty.
Reliable data: Measurement errors, mislabeled examples, or corrupted data directly harm model quality. You cannot train a good model from bad data.
Representative data: The training data must represent the distribution of examples the model will encounter in practice. If training data is skewed toward certain types of examples, the model will perform poorly on other types.
This last point is critical and leads directly to the problem of bias.
Bias and Ethical Risks
Algorithmic bias occurs when models trained on biased or unrepresentative data produce skewed predictions that are unfair or discriminatory.
Consider a model trained mostly on data from one demographic group. It may learn patterns specific to that group that don't generalize. More problematically, if historical data reflects past discrimination, the model will perpetuate that discrimination.
For example, if a hiring algorithm is trained on historical hiring decisions that were biased against certain groups, the algorithm will learn and amplify that bias.
The connection to overfitting is subtle: an overfit model amplifies whatever patterns (even biased ones) exist in the training data, rather than learning robust general principles.
<extrainfo>
Federated Learning: Privacy-Preserving Training
Federated learning is a distributed learning approach that decentralizes the training process. Instead of collecting all data in a central location, training happens on users' local devices (like phones or computers), and only model updates are shared.
This approach combines two benefits: privacy (user data never leaves their device) and model quality (the global model still improves by learning from diverse data sources).
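A minimal sketch of the federated-averaging idea, assuming a shared linear model and synthetic per-client data (the clients, step size, and round count here are all hypothetical; real systems add client sampling, secure aggregation, and much more):

```python
import numpy as np

# Federated-averaging sketch: each client takes a local gradient step on
# its own data, and the server only ever sees the resulting model updates.
rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])       # ground-truth weights for the synthetic task

def make_client(n):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(0, 0.1, n)
    return X, y

clients = [make_client(50) for _ in range(4)]   # data stays "on device"
w = np.zeros(2)                                  # shared global model

for _ in range(100):                             # communication rounds
    local_updates = []
    for X, y in clients:
        w_local = w.copy()
        grad = 2 * X.T @ (X @ w_local - y) / len(y)   # local squared-error gradient
        w_local -= 0.1 * grad
        local_updates.append(w_local)
    w = np.mean(local_updates, axis=0)           # server averages the updates

print("global model:", np.round(w, 2))
```

Only `w_local` (the update) crosses the network; the raw `(X, y)` pairs never leave their client.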
</extrainfo>
Mitigation Strategies: Preventing Overfitting
The most practical solution to overfitting is regularization—incorporating a penalty for model complexity into the learning algorithm.
The core idea: penalize the algorithm for creating complex models. Formally, instead of minimizing just the training error, we minimize:
$$\text{Training Error} + \lambda \cdot \text{Model Complexity}$$
where $\lambda$ is a tuning parameter that controls how much we penalize complexity.
How this works:
If $\lambda$ is large, the algorithm avoids complex models (risk of underfitting)
If $\lambda$ is small, complexity isn't penalized much (risk of overfitting)
The right value of $\lambda$ balances these risks
Common complexity penalties include:
L2 regularization (Ridge regression): Penalizes the sum of squared weights
L1 regularization (Lasso): Penalizes the sum of absolute weights (also performs feature selection)
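For L2 regularization in linear regression (ridge), the penalized objective has a closed-form solution, $w = (X^\top X + \lambda I)^{-1} X^\top y$, which makes the shrinkage effect easy to see. A sketch with synthetic data:

```python
import numpy as np

# Ridge regression via its closed-form solution: larger lambda shrinks
# the learned weight vector toward zero.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0]) + rng.normal(0, 0.5, 30)

def ridge(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

norms = []
for lam in (0.0, 1.0, 100.0):
    w = ridge(X, y, lam)
    norms.append(np.linalg.norm(w))
    print(f"lambda={lam:6.1f}  ||w|| = {norms[-1]:.3f}")
```

Larger $\lambda$ pulls the weight vector toward zero, trading a little training fit for robustness to noise.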
Regularization smooths model predictions: a regularized model avoids the wiggles a fully unregularized one would use to fit noise, while still tracking the overall trend.
Other overfitting prevention techniques include:
Using simpler models
Early stopping during training
Cross-validation to tune hyperparameters
Data augmentation to increase training data diversity
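Cross-validation, for instance, picks the $\lambda$ that minimizes average held-out error across folds. A self-contained sketch (synthetic data and hypothetical candidate values; libraries such as scikit-learn provide this out of the box):

```python
import numpy as np

# k-fold cross-validation to choose the ridge regularization strength.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + rng.normal(0, 0.3, 60)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, k=5):
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for fold in folds:
        train = np.ones(len(y), dtype=bool)
        train[fold] = False                    # hold this fold out
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return np.mean(errs)

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(candidates, key=lambda lam: cv_error(X, y, lam))
print("selected lambda:", best)
```

Because each candidate is scored on data it was not trained on, the procedure directly estimates generalization rather than training fit.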
Flashcards
What is the definition of generalization in the context of machine learning?
The ability of a learning machine to perform accurately on unseen examples after training.
When does underfitting occur in a machine learning model?
When the hypothesis is too simple to capture the underlying function.
When does overfitting occur in a machine learning model?
When the hypothesis is too complex, causing it to fit noise and reducing performance on new data.
When is a learning problem considered feasible within computational learning theory?
If it can be solved in polynomial time.
What do positive and negative results in learning theory indicate about function classes?
Positive results show they can be learned in polynomial time; negative results prove they cannot (unless P=NP).
What type of performance bounds does learning theory typically provide?
Probabilistic performance bounds (rather than deterministic guarantees).
Which two types of complexity are evaluated in the theoretical analysis of learning?
Sample complexity (number of training examples needed)
Computational complexity (time required for learning)
What three characteristics must training datasets have for machine learning models to achieve accurate predictions?
Large
Reliable
Representative
What is the primary cause of algorithmic bias in machine learning models?
Training models on biased or unrepresentative data.
How does federated learning preserve user privacy during model training?
By decentralizing training and keeping user data on local devices.
Quiz
Machine learning - Training Theory and Overfitting Quiz Question 1: Which strategy helps avoid overfitting according to mitigation techniques?
- Penalising model complexity while rewarding good fit (correct)
- Increasing the number of parameters without regularisation
- Training longer on the same data without any constraints
- Removing regularisation methods to allow full learning
Question 2: What is a likely outcome of training a model on biased or unrepresentative data?
- The model may produce skewed or undesirable predictions (correct)
- The model will automatically correct the bias during training
- The model will become invariant to all input variations
- The training process will speed up significantly
Question 3: What primary aspect does computational learning theory investigate?
- The time complexity and feasibility of learning algorithms. (correct)
- The hardware requirements for neural networks.
- The ethical implications of AI deployment.
- The visual design of user interfaces for ML tools.
Question 4: When is a learning problem classified as feasible?
- When it can be solved in polynomial time. (correct)
- When it requires exponential memory.
- When it needs an infinite amount of data.
- When it can only be solved by quantum computers.
Question 5: What type of guarantees does learning theory usually provide?
- Probabilistic performance bounds. (correct)
- Deterministic guarantees of zero error.
- Exact predictions for any future example.
- Unlimited scalability without data constraints.
Question 6: Which two complexities are evaluated in theoretical analysis of learning?
- Sample complexity and computational complexity. (correct)
- Energy consumption and storage complexity.
- Network latency and bandwidth complexity.
- User interface complexity and design complexity.
Question 7: What is a key characteristic of federated learning?
- Training occurs on local devices, keeping user data private. (correct)
- All data must be centralized in a single server.
- Models are trained without any communication between devices.
- It requires only a single training example to build a model.
Question 8: Which of the following is a typical sign that a learning model is underfitting?
- Low accuracy on both training and validation data (correct)
- High accuracy on training data but low accuracy on validation data
- Very high accuracy on both training and validation data
- Training loss does not decrease during training
Question 9: What is the main consequence of overfitting a machine-learning model?
- Its ability to generalise to new, unseen data is reduced (correct)
- It achieves perfect accuracy on both training and test data
- Training time is significantly shortened
- The model becomes simpler and more interpretable
Question 10: Which set of attributes best describes the training data required for a machine-learning model to achieve accurate predictions?
- Large size, reliable quality, and representativeness of the problem domain (correct)
- Small size, high variability, and synthetic generation
- Highly imbalanced classes, noisy labels, and limited features
- Random sampling, low dimensionality, and exclusive focus on outliers
Key Concepts
Model Performance Issues
Generalization (machine learning)
Overfitting
Underfitting
Regularization (machine learning)
Learning Theory and Complexity
Computational learning theory
Sample complexity
Data and Ethics in Learning
Federated learning
Algorithmic bias
Definitions
Generalization (machine learning)
The ability of a model to perform accurately on unseen data after training.
Overfitting
When a model is too complex and captures noise in the training data, harming performance on new data.
Underfitting
When a model is too simple to capture the underlying pattern, resulting in poor performance even on training data.
Computational learning theory
The study of the time and resource feasibility of learning algorithms, including polynomial‑time learnability.
Sample complexity
The number of training examples required for a learning algorithm to achieve a desired level of performance.
Federated learning
A decentralized training approach that keeps data on local devices while collaboratively improving a global model.
Algorithmic bias
Systematic and unfair discrimination in model predictions caused by biased or unrepresentative training data.
Regularization (machine learning)
Techniques that penalize model complexity to prevent overfitting while maintaining good fit.