Supervised Learning Algorithms
Understand the core ideas and distinctions of common supervised learning algorithms, including SVMs, regression models, decision trees, and neural networks.
Summary
Common Supervised Learning Algorithms
Introduction
Supervised learning algorithms form the foundation of machine learning. These algorithms learn from labeled training data—where we know both the input features and the correct output—to make predictions on new, unseen data. This outline covers eight major supervised learning algorithms you'll encounter throughout your machine learning studies. Each has distinct strengths, works best on different types of problems, and makes different assumptions about the data.
The algorithms fall into two main categories: classification algorithms (which predict discrete categories) and regression algorithms (which predict continuous values). Many algorithms can be adapted for both tasks. Understanding how and when to use each is essential for practical machine learning.
Linear Regression
CRITICAL: COVERED ON EXAM
Linear regression models the relationship between input features and a continuous output as a linear combination of those features.
Core Concept
Linear regression assumes that the output $y$ can be expressed as:
$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n + \epsilon$$
where $x_i$ are the input features, $w_i$ are weights (parameters) the algorithm learns, $w_0$ is the intercept, and $\epsilon$ represents random error.
The goal is to find the weights that minimize the error between predicted values and actual values on the training data. Typically, we minimize the sum of squared errors—the sum of $(y_{\text{predicted}} - y_{\text{actual}})^2$ across all training examples.
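For a single feature, minimizing the sum of squared errors has a closed-form solution: the slope is the covariance of $x$ and $y$ divided by the variance of $x$. A minimal plain-Python sketch (illustrative only; the function name is ours, and real libraries handle many features at once):

```python
def fit_simple_linear_regression(xs, ys):
    """Closed-form least squares for one feature: minimizes the sum of squared errors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope w1 = covariance(x, y) / variance(x); intercept w0 = y_mean - w1 * x_mean
    w1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
         sum((x - x_mean) ** 2 for x in xs)
    w0 = y_mean - w1 * x_mean
    return w0, w1

# Toy data generated from y = 1 + 2x
w0, w1 = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
```

With noise-free data the fit recovers the generating weights exactly; with real data the weights are the best linear approximation in the squared-error sense.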
Why Use It?
Linear regression is simple, interpretable, and computationally efficient. It's ideal when you expect a linear relationship between features and output. The learned weights directly tell you how much each feature influences the prediction: larger weights mean stronger influence.
Key Limitation
Linear regression assumes the relationship is actually linear. If your data has nonlinear patterns, linear regression will perform poorly. This is a critical point to understand—the algorithm itself doesn't adapt to capture curves or complex interactions.
Logistic Regression
CRITICAL: COVERED ON EXAM
Logistic regression models the probability of a binary class (two possible outcomes) using a logistic function applied to a linear combination of features.
Core Concept
Despite its name, logistic regression is a classification algorithm, not a regression algorithm. It works by:
Computing a linear combination of features: $z = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
Applying the logistic function (also called sigmoid): $P(\text{class}=1) = \frac{1}{1 + e^{-z}}$
This function maps any real-valued input to a value between 0 and 1, making it ideal for expressing probabilities. If the probability is greater than 0.5, we predict class 1; otherwise, we predict class 0.
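The two steps above—a weighted sum followed by the sigmoid and a 0.5 threshold—can be sketched in a few lines of plain Python (function names are ours, for illustration):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features, threshold=0.5):
    """Return (predicted class, probability of class 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return (1 if p > threshold else 0), p
```

For example, with a single weight of 2.0 and zero bias, an input of 1.0 gives $z = 2$, a probability of about 0.88, and therefore class 1.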
Why Use It?
Logistic regression is widely used because it:
Provides probability estimates, not just class labels
Is interpretable (like linear regression, the weights show feature importance)
Works well when classes are roughly linearly separable
Is computationally efficient
Common Confusion Point
Students often confuse logistic regression with linear regression. Remember: linear regression predicts continuous values; logistic regression predicts probabilities for a binary outcome. They're related (both use weighted combinations of features) but serve different purposes.
Support Vector Machines
CRITICAL: COVERED ON EXAM
Support Vector Machines (SVMs) find a hyperplane that maximally separates different classes by identifying support vectors—the most informative training examples.
Core Concept
In its simplest form, an SVM for binary classification finds a line (or more generally, a hyperplane) that separates the two classes with the maximum margin. The margin is the distance between the separating line and the nearest training examples on each side.
The "support vectors" are the training points that lie closest to this separating line—these points alone determine the position and orientation of the line. This is actually an elegant property: SVMs ignore many training examples and focus only on the critical ones.
The Kernel Trick
A key strength of SVMs is their ability to handle nonlinear problems through the kernel trick. Instead of trying to find a line that separates the classes in the original feature space (where it might be impossible), SVMs can implicitly transform the data into a higher-dimensional space where a separating line exists.
Common kernels include:
Linear kernel: for linearly separable data
Polynomial kernel: for moderately nonlinear boundaries
RBF (Radial Basis Function) kernel: for complex, nonlinear boundaries
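A kernel is just a function of two examples that computes an inner product in the transformed space without ever constructing that space. As a small illustration (plain Python; `gamma` is the usual RBF width parameter), the RBF kernel depends only on the squared distance between two points:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: an implicit inner product in a high-dimensional feature space.
    Equals 1 when x == y and decays toward 0 as the points move apart."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

An SVM using this kernel never computes the high-dimensional coordinates; it only evaluates such pairwise similarities, which is what makes the kernel trick efficient.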
Why Use It?
SVMs work well in high-dimensional spaces and with relatively small datasets. They're particularly powerful for binary classification and image recognition tasks.
Key Limitation
SVMs don't naturally extend to multi-class problems (though methods exist) and can be computationally expensive on very large datasets. Additionally, the choice of kernel and its parameters significantly affects performance, requiring careful tuning.
Naive Bayes Classifier
CRITICAL: COVERED ON EXAM
Naive Bayes assumes that features are conditionally independent given the class and computes class probabilities using Bayes' theorem.
Core Concept
Naive Bayes applies Bayes' theorem from probability theory to compute the probability of each class given the features:
$$P(\text{class} \mid x_1, x_2, \dots, x_n) = \frac{P(x_1, x_2, \dots, x_n \mid \text{class}) \cdot P(\text{class})}{P(x_1, x_2, \dots, x_n)}$$
The "naive" assumption is that all features are independent given the class—that is, knowing the value of one feature tells you nothing additional about another feature once you know the class.
While this assumption is almost never true in practice, the algorithm performs surprisingly well despite the simplification. Training and prediction proceed by:
Estimating the probability of each class in the training data
For each feature, estimating the probability distribution of that feature within each class
For a new example, multiplying these probabilities together to predict the most likely class
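The three steps above can be sketched as a tiny multinomial Naive Bayes text classifier in plain Python (function names and the Laplace smoothing constant of 1 are our illustrative choices; log-probabilities are summed instead of multiplying raw probabilities to avoid underflow):

```python
import math
from collections import Counter

def train_nb(examples):
    """examples: list of (word_list, label) pairs. Counts classes and per-class words."""
    priors = Counter(label for _, label in examples)       # step 1: class frequencies
    word_counts = {label: Counter() for label in priors}
    vocab = set()
    for words, label in examples:
        word_counts[label].update(words)                   # step 2: word counts per class
        vocab.update(words)
    return priors, word_counts, vocab, len(examples)

def predict_nb(model, words):
    """Step 3: pick the class with the highest log P(class) + sum of log P(word | class)."""
    priors, word_counts, vocab, n = model
    def log_score(label):
        total = sum(word_counts[label].values())
        return math.log(priors[label] / n) + sum(
            math.log((word_counts[label][w] + 1) / (total + len(vocab)))  # Laplace smoothing
            for w in words)
    return max(priors, key=log_score)
```

On a toy corpus where "win" appears only in spam, a message containing "win" scores higher under the spam class and is classified accordingly.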
Why Use It?
Naive Bayes is:
Extremely fast to train and predict
Relatively data-efficient, needing less training data than many other algorithms
Robust and surprisingly effective despite its oversimplifying assumption
Particularly popular for text classification and spam filtering
When Does It Fail?
When features are heavily dependent on each other, Naive Bayes can perform poorly because its independence assumption is severely violated. For example, in image recognition where neighboring pixels are highly correlated, Naive Bayes is not ideal.
Linear Discriminant Analysis
CRITICAL: COVERED ON EXAM
Linear Discriminant Analysis (LDA) models the distribution of features within each class as a multivariate normal distribution and finds a decision boundary that best separates the classes.
Core Concept
LDA assumes that:
The features within each class follow a normal (Gaussian) distribution
All classes share the same covariance matrix (they have the same "shape")
Given these assumptions, LDA estimates the mean and covariance for each class from the training data, then uses these to determine the most likely class for a new example.
The decision boundary between two classes is linear, which is why it's called "linear" discriminant analysis. This linear boundary emerges naturally from the probabilistic model and the normality assumption.
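In one dimension the whole procedure fits in a few lines: estimate each class mean, pool the variance (the shared-covariance assumption), and score each class with its linear discriminant. A minimal sketch in plain Python (function names are ours; the score is the standard discriminant $x\mu_k/\sigma^2 - \mu_k^2/(2\sigma^2) + \log \pi_k$):

```python
import math

def lda_1d_fit(class0, class1):
    """Fit 1-D LDA: per-class means, pooled (shared) variance, and priors."""
    n0, n1 = len(class0), len(class1)
    mu0 = sum(class0) / n0
    mu1 = sum(class1) / n1
    # Pooled variance: both classes are assumed to share the same spread
    var = (sum((x - mu0) ** 2 for x in class0) +
           sum((x - mu1) ** 2 for x in class1)) / (n0 + n1 - 2)
    priors = (n0 / (n0 + n1), n1 / (n0 + n1))
    return (mu0, mu1), var, priors

def lda_1d_predict(x, means, var, priors):
    """Pick the class with the larger linear discriminant score."""
    scores = [x * mu / var - mu ** 2 / (2 * var) + math.log(p)
              for mu, p in zip(means, priors)]
    return scores.index(max(scores))
```

With equal priors, the resulting boundary sits at the midpoint between the two class means, which is exactly the linear boundary described above.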
LDA vs. Logistic Regression
Both LDA and logistic regression produce linear decision boundaries, but they approach the problem differently:
LDA models the distribution of features within each class
Logistic regression models the probability of the class directly
When features are approximately normally distributed, LDA often works better because it's using the true generative model. When the normality assumption is violated, logistic regression can be more robust.
Why Use It?
LDA is computationally efficient, interpretable, and works well for multi-class problems (unlike some classifiers). It also provides probabilistic predictions.
Key Limitation
The assumption of shared covariance across classes is restrictive. If different classes have very different spreads in feature space, LDA may not work well. A related technique called Quadratic Discriminant Analysis (QDA) relaxes this assumption but requires more data to estimate the extra per-class covariance parameters.
Decision Trees
CRITICAL: COVERED ON EXAM
Decision trees recursively split the input space based on feature thresholds, creating a tree of decision rules that guide predictions.
Core Concept
A decision tree works like a flowchart. At each node, the algorithm selects a feature and a threshold value, then splits the data into two groups: examples where the feature value is below the threshold and examples where it's above.
This process repeats recursively on each split until reaching a stopping condition (like a maximum tree depth or minimum examples per leaf). The final groups at the bottom of the tree are called leaves, and each leaf is assigned a predicted class (for classification) or value (for regression).
How Are Splits Chosen?
The algorithm greedily selects splits that best separate the classes. "Best" typically means maximizing information gain—how much the split reduces uncertainty in the class labels. A split is good if it groups similar examples together.
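Information gain can be computed directly from class entropies. A short plain-Python sketch (function names are ours; this evaluates one candidate binary split rather than building a full tree):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Parent entropy minus the size-weighted entropy of the two child groups."""
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

A split that sends all "a" examples left and all "b" examples right reduces a 50/50 parent (entropy 1 bit) to two pure children (entropy 0), for the maximum possible gain of 1 bit.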
Advantages
Decision trees are:
Interpretable: you can explain decisions by following paths down the tree
Nonlinear: they naturally capture nonlinear relationships and interactions between features
Non-parametric: they make no assumptions about the data distribution
Efficient: both training and prediction are fast
Critical Limitation: Overfitting
Decision trees tend to overfit severely—they can grow so deep that they memorize the training data, including its noise, rather than learning the underlying pattern. A tree that perfectly classifies every training example might perform terribly on new data.
This is addressed through pruning: removing branches that don't improve performance on a validation set. Without pruning, deep decision trees almost always overfit.
Multi-way Splits
When a feature is categorical with many possible values, a decision tree can split into multiple branches (one per category) rather than just two. However, binary splits (splitting into two groups) are more common in practice.
k Nearest Neighbor Algorithm
CRITICAL: COVERED ON EXAM
The k-Nearest Neighbor algorithm predicts the output of a new instance by examining the k closest training examples, using their majority class (classification) or average value (regression).
Core Concept
kNN is one of the simplest algorithms: to make a prediction for a new example, find the k training examples that are closest to it, then:
For classification: predict the class that appears most frequently among those k examples
For regression: predict the average of their values
The notion of "closest" typically means Euclidean distance in the feature space, though other distance metrics can be used.
Worked Example
Imagine classifying emails as spam or not spam, and k=3. For a new email, you find the 3 most similar emails in your training set. If 2 are spam and 1 is not, you predict the new email is spam.
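The same majority-vote procedure, with Euclidean distance on numeric features, fits in a few lines of plain Python (function name is ours; real implementations use spatial indexes instead of sorting all examples):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (feature_tuple, label) pairs.
    Returns the majority label among the k nearest training examples."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

Note that there is no training step at all: the "model" is simply the stored training data, which is why kNN is called an instance-based (or lazy) learner.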
Why Use It?
kNN is:
Simple to understand and implement
Naturally nonlinear—it captures complex decision boundaries
Effective with relatively small, clean datasets
Adaptable to both classification and regression
Critical Limitations
Computational cost: With kNN, you must store all training data and compute distances to all examples for each prediction. For large datasets, this is slow.
Curse of dimensionality: In high-dimensional feature spaces, all points become roughly equally distant from each other, making kNN ineffective. As the number of features increases, you need exponentially more training data for kNN to work well.
Sensitivity to k: The choice of k dramatically affects performance. Small k (like k=1) can cause overfitting; large k can cause underfitting. There's no universal best value—it depends on your data.
Feature scaling: kNN relies on distances, so features must be scaled appropriately. A feature with values 0–1000 will dominate distance calculations over a feature with values 0–1.
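The scaling problem in the last point has a simple common remedy: min-max normalization, which rescales each feature into [0, 1] so that no single feature dominates the distance calculation. A minimal sketch (function name is ours):

```python
def min_max_scale(column):
    """Rescale a list of feature values linearly into [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]
```

Applied per feature before running kNN, a column ranging 0–1000 and a column ranging 0–1 end up contributing on the same scale.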
Neural Networks (Multilayer Perceptron)
CRITICAL: COVERED ON EXAM
Neural networks consist of layers of interconnected units (neurons) that learn nonlinear mappings between inputs and outputs through backpropagation.
Core Concept
A neural network architecture is organized in layers:
Input layer: one neuron per feature
Hidden layers: intermediate layers (zero or more) that transform the data
Output layer: one neuron per output (class or continuous value)
Each neuron performs a simple computation: it takes a weighted sum of its inputs, adds a bias term, and applies an activation function—typically a nonlinear function like ReLU or sigmoid. This nonlinearity is crucial; without it, stacking layers would be equivalent to a single linear transformation.
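A single neuron's computation is small enough to write out directly. A plain-Python sketch (function names are ours; real frameworks vectorize this across whole layers):

```python
def relu(z):
    """Rectified linear unit: the common nonlinear activation max(0, z)."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    """One unit: weighted sum of inputs, plus bias, passed through an activation."""
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)
```

If the weighted sum plus bias is negative, ReLU outputs 0; otherwise it passes the value through unchanged—this is the nonlinearity that makes stacked layers more powerful than a single linear map.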
Learning Through Backpropagation
Neural networks learn their weights through backpropagation, an efficient algorithm for computing gradients. In essence:
Forward pass: compute predictions for training examples
Compute the error between predictions and true values
Backward pass: propagate error back through layers to update weights
Repeat until convergence
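For a single sigmoid unit the backward pass collapses to one application of the chain rule, which makes the loop above easy to see end to end. A minimal plain-Python sketch (illustrative only: one neuron trained on the AND function with cross-entropy loss; real networks stack many units and propagate gradients layer by layer):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_single_neuron(data, epochs=1000, lr=0.5):
    """data: list of ((x1, x2), target) pairs with targets in {0, 1}."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)        # forward pass
            err = p - y            # gradient of cross-entropy loss w.r.t. pre-activation z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]  # backward pass: update weights
            b -= lr * err
    return w, b

# AND is linearly separable, so one neuron suffices
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_single_neuron(and_data)
```

After training, the neuron outputs a probability above 0.5 only for the input (1, 1). XOR, by contrast, is not linearly separable and needs at least one hidden layer.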
Why Use Neural Networks?
Neural networks can:
Learn highly nonlinear, complex patterns in data
Handle high-dimensional input (images, text)
Achieve state-of-the-art performance on many tasks
Transfer learned representations from one task to another
Important Limitations
Computational intensity: Training requires many passes through data and significant computation, especially on large networks.
Hyperparameter sensitivity: You must choose architecture (number of layers, layer sizes), learning rate, activation functions, and more. Poor choices cause the network to fail to learn.
Data requirements: Neural networks typically need much more training data than simpler algorithms to avoid overfitting.
Black box nature: Unlike decision trees, it's hard to interpret why a neural network made a particular prediction. Understanding the learned patterns requires visualization and analysis.
Training instability: Neural networks can be difficult to train. Training can get stuck in bad local minima, and the learning process requires careful tuning of parameters like learning rate.
<extrainfo>
Additional Algorithms
Beyond the eight main algorithms, machine learning offers several other important approaches:
Random Forests: An ensemble method that trains many decision trees independently (on random subsets of data and features) and aggregates their predictions. This dramatically reduces overfitting compared to single trees.
Ensemble Methods: General techniques that combine multiple simple models to make better predictions. Besides random forests, this includes boosting (adaptively training models to focus on hard examples) and bagging (training on bootstrap samples).
Similarity Learning: Algorithms that learn distance metrics or similarity functions between examples, useful for ranking, recommendation, and clustering applications.
These are powerful techniques but are often built on top of the core algorithms already covered.
</extrainfo>
Summary: Choosing the Right Algorithm
You now understand eight fundamental supervised learning algorithms. Each has different strengths:
Linear Regression and Logistic Regression: Simple, interpretable, fast—use when you suspect linear relationships
Support Vector Machines: Powerful for nonlinear boundaries, works well in high dimensions
Naive Bayes: Fast, data-efficient, ideal for text classification
Linear Discriminant Analysis: Probabilistic approach, efficient, good for multi-class
Decision Trees: Interpretable, naturally nonlinear, prone to overfitting
k-Nearest Neighbor: Simple nonlinear method, computationally expensive, needs careful tuning
Neural Networks: Powerful and flexible but requires significant data and computation
In practice, you'll typically try several algorithms and select based on validation performance. Understanding the assumptions, strengths, and limitations of each gives you the foundation to make informed choices.
Flashcards
What is the primary goal of Support Vector Machines (SVMs) regarding class separation?
Finding a hyperplane that maximally separates classes in a transformed feature space.
How does Linear Regression model a continuous output?
As a linear combination of input features.
What function is applied to a linear combination of features in Logistic Regression to model binary class probability?
Logistic function.
What key assumption does the Naive Bayes Classifier make about features?
Conditional independence.
How does Linear Discriminant Analysis (LDA) model class conditional densities?
As multivariate normal distributions with shared covariance.
By what process do Decision Trees create a tree of decision rules?
Recursively splitting the input space based on feature thresholds.
How does the $k$-nearest neighbor algorithm predict the output for classification tasks?
Majority vote of the $k$ closest training instances.
How does the $k$-nearest neighbor algorithm predict the output for regression tasks?
Average of the $k$ closest training instances.
Through what process do Multilayer Perceptrons learn non-linear mappings?
Backpropagation.
What is the basic structure of a Neural Network?
Layers of interconnected units.
Quiz
Supervised Learning Algorithms Quiz
Question 1: In linear regression, the predicted output is expressed as what type of combination of the input features?
- A linear combination (correct)
- A polynomial combination
- A logistic function of the features
- A set of decision rules
Question 2: In Linear Discriminant Analysis, what is assumed about the covariance matrices of different classes?
- All classes share a common covariance matrix (correct)
- Each class has its own distinct covariance matrix
- Covariance matrices are diagonal
- Covariance matrices are zero
Question 3: How do decision trees partition the input space?
- By recursively splitting on feature thresholds (correct)
- By projecting data onto a separating hyperplane
- By averaging neighboring data points
- By clustering data into groups
Question 4: What learning algorithm is commonly used to train multilayer perceptron neural networks?
- Backpropagation (correct)
- Gradient ascent
- K‑means clustering
- Expectation‑maximization
Question 5: Which of the following is an example of an ensemble method widely used in supervised learning?
- Random forests (correct)
- Logistic regression
- Naive Bayes
- K‑nearest neighbor
Question 6: Naive Bayes classifiers are examples of which type of learning model?
- Generative (correct)
- Discriminative
- Instance‑based
- Ensemble
Question 7: How are the input features combined in logistic regression before the logistic function is applied?
- a linear combination (correct)
- their product
- the maximum value
- a nonlinear transformation
Question 8: When making a prediction, which training instances does the k‑nearest neighbor algorithm use?
- the k closest training instances (correct)
- randomly selected training points
- all training points
- the k farthest training instances
Question 9: What does a support vector machine maximize when selecting the separating hyperplane in the transformed feature space?
- The margin (distance) between the classes (correct)
- The number of support vectors used
- The training accuracy of the model
- The complexity (number of parameters) of the model
Key Concepts
Supervised Learning Models
Support Vector Machine
Linear Regression
Logistic Regression
Naive Bayes Classifier
Linear Discriminant Analysis
Decision Tree
k-Nearest Neighbor
Multilayer Perceptron
Random Forest
Ensemble Method
Definitions
Support Vector Machine
A supervised learning model that finds the optimal hyperplane separating classes in a transformed feature space.
Linear Regression
A statistical method that predicts a continuous target as a linear combination of input features.
Logistic Regression
A classification technique that models binary class probabilities using a logistic function applied to a linear predictor.
Naive Bayes Classifier
A probabilistic classifier that applies Bayes’ theorem with the assumption of feature independence.
Linear Discriminant Analysis
A method that models class-conditional densities as multivariate normals with shared covariance to achieve dimensionality reduction and classification.
Decision Tree
A flowchart-like model that recursively splits data based on feature thresholds to form a hierarchy of decision rules.
k-Nearest Neighbor
An instance-based algorithm that classifies or regresses a query point by aggregating the labels of its k closest training examples.
Multilayer Perceptron
A type of feedforward neural network composed of multiple layers of interconnected neurons trained via backpropagation.
Random Forest
An ensemble learning technique that builds multiple decision trees on bootstrapped samples and aggregates their predictions.
Ensemble Method
A strategy that combines the outputs of several base learners to improve predictive performance and robustness.