RemNote Community

Machine learning - Training Theory and Overfitting

Understand generalization vs. overfitting, computational learning theory basics, and key strategies to mitigate overfitting and bias.

Summary

Theory of Machine Learning: Generalization, Overfitting, and Learning Theory

Introduction

When we train a machine learning model, our ultimate goal isn't just to perform well on the training data: it's to make accurate predictions on new, unseen examples. This document explores the fundamental concepts that determine whether a model will succeed or fail at this goal, covering the theory behind why some models generalize better than others and what determines whether a learning problem is even feasible to solve.

Generalization: The Core Goal

Generalization is the ability of a machine learning model to perform accurately on new examples it has never seen before. This is the ultimate measure of success for any machine learning system: a model that memorizes the training data perfectly but fails on new data is useless in practice.

Think of generalization like learning to drive. You don't practice only on the exact stretch of highway you'll use forever; you practice on a variety of roads so you can handle roads you've never seen. The same principle applies to ML models.

Understanding Underfitting and Overfitting

When we train a model, there are two opposite ways it can fail to generalize well.

Underfitting

Underfitting occurs when the model is too simple to capture the underlying relationship in the data. The hypothesis (the function the model learns) is not flexible enough, so the model performs poorly on both the training data and new data. For example, if you fit a straight line to data that actually follows a curved pattern, the line will make systematic errors everywhere; the model hasn't learned the true structure of the problem. Underfitting is typically addressed by using a more complex model.

Overfitting

Overfitting is the opposite problem: the model is too complex and learns to fit the noise and peculiarities of the training data rather than the underlying pattern.
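To make this distinction concrete, here is a minimal sketch (the synthetic data, seed, and degrees are illustrative assumptions, not from the original notes): it fits polynomials of increasing degree to noisy samples of a quadratic and compares error on the training points with error on held-out points.

```python
# Illustrative sketch: fit polynomials of increasing degree to noisy
# samples of a quadratic, then compare training error with held-out error.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 15)
x_test = np.linspace(-1, 1, 50)
true_fn = lambda x: x ** 2                                      # underlying pattern
y_train = true_fn(x_train) + rng.normal(0, 0.1, x_train.size)   # noisy labels
y_test = true_fn(x_test)                                        # noise-free targets

results = {}
for degree in (1, 2, 12):  # too simple, roughly right, too flexible
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The signature of overfitting appears in the numbers: training error falls monotonically as the degree grows, while held-out error is smallest near the true complexity and rises for the highly flexible fit, which bends through the noise.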
While an overfit model performs very well on training data, it performs poorly on new data because it has essentially memorized specific training examples rather than learning the general pattern. Consider fitting a wiggly curve through every single data point, including the measurement errors. This curve might connect your training points perfectly, but when you encounter new data with its own measurement errors in different places, the curve will make terrible predictions.

The visualization above shows this trade-off: the red line represents a reasonable model that captures the general trend, while overfitting would mean creating a model that zigzags through every point, capturing noise rather than the true relationship.

The key insight: there is a sweet spot between underfitting and overfitting. We want a model complex enough to capture the true pattern, but not so complex that it fits noise.

Computational Learning Theory: Is Learning Feasible?

Beyond the bias-variance trade-off, we must ask a more fundamental question: is a learning problem even feasible to solve? This is what computational learning theory addresses. Computational learning theory studies whether learning problems can be solved efficiently; specifically, whether they can be solved in polynomial time (a reasonable amount of computation).

What Makes a Problem Feasible?

A learning problem is considered feasible if we can find an accurate hypothesis in polynomial time. This means the number of operations needed grows reasonably with the input size, not exponentially.

Positive and Negative Results

Learning theory produces two types of results. Positive results prove that certain function classes can be learned in polynomial time. For example, we know that linear classifiers can be learned efficiently using algorithms like perceptron learning or logistic regression.
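As a concrete instance of such a positive result, here is a minimal sketch of the classic perceptron update rule (the toy dataset and epoch limit are assumptions for illustration). For linearly separable data, the perceptron is guaranteed to converge after a bounded number of mistakes.

```python
# Illustrative sketch: the perceptron update rule on a small,
# linearly separable toy dataset (labels follow sign(x1 + x2 - 1)).
import numpy as np

X = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)
b = 0.0
for _ in range(100):                      # epochs (far more than needed here)
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:        # misclassified (or on the boundary)
            w += yi * xi                  # perceptron update
            b += yi
            mistakes += 1
    if mistakes == 0:                     # converged: every point classified
        break

predictions = np.sign(X @ w + b)
print("weights:", w, "bias:", b, "predictions:", predictions)
```

Each pass touches every example once and each update is constant-time, which is why perceptron learning on separable data sits on the "efficiently learnable" side of the theory.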
Negative results prove that other function classes cannot be learned in polynomial time (unless P = NP, a major unsolved problem in computer science). For instance, learning arbitrary Boolean formulas is NP-hard, meaning there is likely no efficient algorithm for it.

Sample and Computational Complexity

Learning theory analyzes two complementary aspects of feasibility.

Sample Complexity

Sample complexity measures how many training examples you need to learn an accurate hypothesis. It answers: "How much data do I need?" Even if an algorithm can learn efficiently in time, it might require an impractical number of training examples. Learning theory provides bounds on this, often showing that sample complexity grows with:
- The complexity of the hypothesis class (more complex models need more data)
- The desired accuracy level (higher accuracy requires more data)

Computational Complexity

Computational complexity measures how much computation time the learning algorithm requires. It answers: "How fast can I learn?" A learning algorithm might work in theory but be impractically slow. Computational complexity analysis identifies whether an algorithm runs in polynomial time or exponential time.

Important: learning theory typically provides probabilistic performance bounds rather than absolute guarantees. These bounds say something like: "With high probability, this algorithm will find an accurate hypothesis if it has enough data and computation time." The bounds often aren't tight in practice, but they are essential for understanding fundamental limits.

Data Requirements: The Foundation

Before learning theory applies, there is a practical requirement: you need appropriate training data. Machine learning models require:

Large datasets: Most modern models need thousands or millions of examples to learn effectively. The exact amount depends on model complexity and problem difficulty.
Reliable data: Measurement errors, mislabeled examples, or corrupted data directly harm model quality. You cannot train a good model from bad data.

Representative data: The training data must represent the distribution of examples the model will encounter in practice. If the training data is skewed toward certain types of examples, the model will perform poorly on the others.

This last point is critical and leads directly to the problem of bias.

Bias and Ethical Risks

Algorithmic bias occurs when models trained on biased or unrepresentative data produce skewed predictions that are unfair or discriminatory. Consider a model trained mostly on data from one demographic group: it may learn patterns specific to that group that don't generalize. More problematically, if historical data reflects past discrimination, the model will perpetuate that discrimination. For example, if a hiring algorithm is trained on historical hiring decisions that were biased against certain groups, the algorithm will learn and amplify that bias.

The connection to overfitting is subtle: an overfit model amplifies whatever patterns (even biased ones) exist in the training data, rather than learning robust general principles.

<extrainfo>
Federated Learning: Privacy-Preserving Training

Federated learning is a distributed learning approach that decentralizes the training process. Instead of collecting all data in a central location, training happens on users' local devices (such as phones or computers), and only model updates are shared. This approach combines two benefits: privacy (user data never leaves the device) and model quality (the global model still improves by learning from diverse data sources).
</extrainfo>

Mitigation Strategies: Preventing Overfitting

The most practical solution to overfitting is regularization: incorporating a penalty for model complexity into the learning algorithm. The core idea is to penalize the algorithm for creating complex models.
Formally, instead of minimizing just the training error, we minimize:

$$\text{Training Error} + \lambda \cdot \text{Model Complexity}$$

where $\lambda$ is a tuning parameter that controls how much we penalize complexity. How this works:
- If $\lambda$ is large, the algorithm avoids complex models (risk of underfitting)
- If $\lambda$ is small, complexity is barely penalized (risk of overfitting)
- The right value of $\lambda$ balances these risks

Common complexity penalties include:
- L2 regularization (ridge regression): penalizes the sum of squared weights
- L1 regularization (lasso): penalizes the sum of absolute weights (and also performs feature selection)

The visualization shows how regularization smooths model predictions: the orange regularized model is smoother than a fully unregularized version would be, avoiding the wiggles that would fit noise.

Other techniques for preventing overfitting include:
- Using simpler models
- Early stopping during training
- Cross-validation to tune hyperparameters
- Data augmentation to increase training data diversity
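The penalized objective above can be made concrete with L2 regularization, where minimizing the squared error plus a squared-weight penalty has a closed-form solution. This is a minimal sketch with assumed synthetic data and assumed $\lambda$ values, not a production implementation:

```python
# Illustrative sketch of L2 regularization (ridge regression): minimizing
#   ||Xw - y||^2 + lambda * ||w||^2
# has the closed-form solution w = (X^T X + lambda * I)^(-1) X^T y.
# Larger lambda shrinks the weights toward zero, i.e. a simpler model.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(0, 0.1, size=30)   # noisy targets

def ridge(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

norms = {}
for lam in (0.0, 1.0, 100.0):
    w = ridge(X, y, lam)
    norms[lam] = np.linalg.norm(w)
    print(f"lambda={lam:6.1f}  ||w|| = {norms[lam]:.3f}")
```

The printed weight norms shrink as $\lambda$ grows, which is exactly the "penalize complexity" behavior described above: a large $\lambda$ forces small weights (simpler, possibly underfit models), while $\lambda = 0$ recovers the unregularized least-squares fit.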
Flashcards
What is the definition of generalization in the context of machine learning?
The ability of a learning machine to perform accurately on unseen examples after training.
When does underfitting occur in a machine learning model?
When the hypothesis is too simple to capture the underlying function.
When does overfitting occur in a machine learning model?
When the hypothesis is too complex, causing it to fit noise and reducing performance on new data.
When is a learning problem considered feasible within computational learning theory?
If it can be solved in polynomial time.
What do positive and negative results in learning theory indicate about function classes?
Positive results show they can be learned in polynomial time; negative results prove they cannot (unless P = NP).
What type of performance bounds does learning theory typically provide?
Probabilistic performance bounds (rather than deterministic guarantees).
Which two types of complexity are evaluated in the theoretical analysis of learning?
Sample complexity (the number of training examples needed) and computational complexity (the time required for learning).
What three characteristics must training datasets have for machine learning models to achieve accurate predictions?
Large, reliable, and representative.
What is the primary cause of algorithmic bias in machine learning models?
Training models on biased or unrepresentative data.
How does federated learning preserve user privacy during model training?
By decentralizing training and keeping user data on local devices.

Key Concepts
Model Performance Issues
Generalization (machine learning)
Overfitting
Underfitting
Regularization (machine learning)
Learning Theory and Complexity
Computational learning theory
Sample complexity
Data and Ethics in Learning
Federated learning
Algorithmic bias