RemNote Community

Supervised Learning Algorithms

Understand the core ideas and distinctions of common supervised learning algorithms, including SVMs, regression models, decision trees, and neural networks.


Summary

Common Supervised Learning Algorithms

Introduction

Supervised learning algorithms form the foundation of machine learning. These algorithms learn from labeled training data, where we know both the input features and the correct output, to make predictions on new, unseen data. This outline covers eight major supervised learning algorithms you'll encounter throughout your machine learning studies. Each has distinct strengths, works best on different types of problems, and makes different assumptions about the data.

The algorithms fall into two main categories: classification algorithms (which predict discrete categories) and regression algorithms (which predict continuous values). Many algorithms can be adapted for both tasks. Understanding how and when to use each is essential for practical machine learning.

Linear Regression

Linear regression models the relationship between input features and a continuous output as a linear combination of those features.

Core Concept

Linear regression assumes that the output $y$ can be expressed as: $$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n + \epsilon$$ where $x_i$ are the input features, $w_i$ are weights (parameters) the algorithm learns, $w_0$ is the intercept, and $\epsilon$ represents random error.

The goal is to find the weights that minimize the error between predicted values and actual values on the training data. Typically, we minimize the sum of squared errors, $\sum (y_{\text{predicted}} - y_{\text{actual}})^2$, across all training examples.

Why Use It?

Linear regression is simple, interpretable, and computationally efficient. It's ideal when you expect a linear relationship between features and output. The learned weights directly tell you how much each feature influences the prediction: larger weights mean stronger influence.

Key Limitation

Linear regression assumes the relationship is actually linear. If your data has nonlinear patterns, linear regression will perform poorly.
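As a minimal sketch of the idea, the least-squares weights can be found in closed form with numpy; the data below is invented purely for illustration:

```python
import numpy as np

# Invented toy data: y is roughly 1 + 2*x1 + 3*x2 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # 100 examples, 2 features
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so the intercept w0 is learned like any other weight
X1 = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimize the sum of squared errors
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(w)  # close to [1.0, 2.0, 3.0]
```

With enough data and little noise, the recovered weights approach the true coefficients, which is exactly the "weights tell you feature influence" interpretability mentioned above.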
This is a critical point to understand: the algorithm itself doesn't adapt to capture curves or complex interactions.

Logistic Regression

Logistic regression models the probability of a binary class (two possible outcomes) using a logistic function applied to a linear combination of features.

Core Concept

Despite its name, logistic regression is a classification algorithm, not a regression algorithm. It works by:
- Computing a linear combination of features: $z = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
- Applying the logistic function (also called the sigmoid): $P(\text{class}=1) = \frac{1}{1 + e^{-z}}$

This function squashes any value into the range between 0 and 1, making it perfect for expressing probabilities. If the probability is greater than 0.5, we predict class 1; otherwise, we predict class 0.

Why Use It?

Logistic regression is widely used because it:
- Provides probability estimates, not just class labels
- Is interpretable (like linear regression, the weights show feature importance)
- Works well when classes are roughly linearly separable
- Is computationally efficient

Common Confusion Point

Students often confuse logistic regression with linear regression. Remember: linear regression predicts continuous values; logistic regression predicts probabilities for a binary outcome. They're related (both use weighted combinations of features) but serve different purposes.

Support Vector Machines

Support Vector Machines (SVMs) find a hyperplane that maximally separates different classes by identifying support vectors, the most informative training examples.

Core Concept

In its simplest form, an SVM for binary classification finds a line (or more generally, a hyperplane) that separates the two classes with the maximum margin. The margin is the distance between the separating line and the nearest training examples on each side.
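As a small numeric illustration of the margin idea, the line and points below are invented by hand, not learned by an SVM:

```python
import numpy as np

# A fixed separating line w.x + b = 0 in 2D (chosen by hand for illustration)
w = np.array([1.0, 1.0])
b = -3.0

# Invented training points, two on each side of the line
points = np.array([[1.0, 1.0], [0.5, 1.5], [3.0, 2.0], [2.5, 3.0]])

# Distance of each point to the line; the closest points set the margin
distances = np.abs(points @ w + b) / np.linalg.norm(w)
print(distances.min())  # the nearest points define the margin of this line
```

An SVM would search over all valid lines for the one whose smallest such distance is as large as possible.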
The "support vectors" are the training points that lie closest to this separating line; these points alone determine the position and orientation of the line. This is an elegant property: SVMs ignore many training examples and focus only on the critical ones.

The Kernel Trick

A key strength of SVMs is their ability to handle nonlinear problems through the kernel trick. Instead of trying to find a line that separates the classes in the original feature space (where it might be impossible), SVMs can implicitly transform the data into a higher-dimensional space where a separating line exists. Common kernels include:
- Linear kernel: for linearly separable data
- Polynomial kernel: for moderately nonlinear boundaries
- RBF (Radial Basis Function) kernel: for complex, nonlinear boundaries

Why Use It?

SVMs work well in high-dimensional spaces and with relatively small datasets. They're particularly powerful for binary classification and image recognition tasks.

Key Limitation

SVMs don't naturally extend to multi-class problems (though methods exist) and can be computationally expensive on very large datasets. Additionally, the choice of kernel and its parameters significantly affects performance, requiring careful tuning.

Naive Bayes Classifier

Naive Bayes assumes that features are conditionally independent given the class and computes class probabilities using Bayes' theorem.

Core Concept

Naive Bayes applies Bayes' theorem from probability theory to compute the probability of each class given the features: $$P(\text{class} \mid x_1, x_2, \dots, x_n) = \frac{P(x_1, x_2, \dots, x_n \mid \text{class}) \cdot P(\text{class})}{P(x_1, x_2, \dots, x_n)}$$

The "naive" assumption is that all features are independent given the class; that is, knowing the value of one feature tells you nothing additional about another feature once you know the class. While this assumption is almost never true in practice, the algorithm works surprisingly well despite this simplification.
It works by:
- Estimating the probability of each class in the training data
- For each feature, estimating the probability distribution of that feature within each class
- For a new example, multiplying these probabilities together to predict the most likely class

Why Use It?

Naive Bayes:
- Is extremely fast to train and predict
- Requires relatively little training data compared to other algorithms
- Is robust and surprisingly effective despite its oversimplifying assumption
- Is particularly popular for text classification and spam filtering

When Does It Fail?

When features are heavily dependent on each other, Naive Bayes can perform poorly because its independence assumption is severely violated. For example, in image recognition, where neighboring pixels are highly correlated, Naive Bayes is not ideal.

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) models the distribution of features within each class as a multivariate normal distribution and finds a decision boundary that best separates the classes.

Core Concept

LDA assumes that:
- The features within each class follow a normal (Gaussian) distribution
- All classes share the same covariance matrix (they have the same "shape")

Given these assumptions, LDA estimates the mean and covariance for each class from the training data, then uses these to determine the most likely class for a new example. The decision boundary between two classes is linear, which is why it's called "linear" discriminant analysis. This linear boundary emerges naturally from the probabilistic model and the normality assumption.

LDA vs. Logistic Regression

Both LDA and logistic regression produce linear decision boundaries, but they approach the problem differently:
- LDA models the distribution of features within each class
- Logistic regression models the probability of the class directly

When features are approximately normally distributed, LDA often works better because it uses the true generative model.
When the normality assumption is violated, logistic regression can be more robust.

Why Use It?

LDA is computationally efficient, interpretable, and works well for multi-class problems (unlike some classifiers). It also provides probabilistic predictions.

Key Limitation

The assumption of shared covariance across classes is restrictive. If different classes have very different spreads in feature space, LDA may not work well. A related technique called Quadratic Discriminant Analysis (QDA) relaxes this assumption but requires more data to estimate its extra parameters.

Decision Trees

Decision trees recursively split the input space based on feature thresholds, creating a tree of decision rules that guide predictions.

Core Concept

A decision tree works like a flowchart. At each node, the algorithm selects a feature and a threshold value, then splits the data into two groups: examples where the feature value is below the threshold and examples where it's above. This process repeats recursively on each split until a stopping condition is reached (like a maximum tree depth or a minimum number of examples per leaf). The final groups at the bottom of the tree are called leaves, and each leaf is assigned a predicted class (for classification) or value (for regression).

How Are Splits Chosen?

The algorithm greedily selects splits that best separate the classes. "Best" typically means maximizing information gain, i.e., how much the split reduces uncertainty in the class labels. A split is good if it groups similar examples together.
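A sketch of how entropy-based information gain could score a candidate split; the class counts here are invented:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Invented example: a parent node with 10 positive and 10 negative examples
parent = [10, 10]
left, right = [9, 1], [1, 9]   # the two child nodes a candidate split produces

# Information gain = parent entropy minus the size-weighted child entropy
n = sum(parent)
children = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
gain = entropy(parent) - children
print(round(gain, 3))  # about 0.531 bits of uncertainty removed
```

A split that left both children at a 50/50 mix would have zero gain, while a split producing pure children would have the maximum gain of 1 bit.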
Advantages

Decision trees are:
- Interpretable: you can explain decisions by following paths down the tree
- Nonlinear: they naturally capture nonlinear relationships and interactions between features
- Non-parametric: they make no assumptions about the data distribution
- Efficient: both training and prediction are fast

Critical Limitation: Overfitting

Decision trees tend to overfit severely; they can grow so deep that they memorize the training data, including its noise, rather than learning the underlying pattern. A tree that perfectly classifies every training example might perform terribly on new data. This is addressed through pruning: removing branches that don't improve performance on a validation set. Without pruning, deep decision trees almost always overfit.

Multi-way Splits

When a feature is categorical with many possible values, a decision tree can split into multiple branches (one per category) rather than just two. However, binary splits (splitting into two groups) are more common in practice.

k-Nearest Neighbor Algorithm

The k-Nearest Neighbor (kNN) algorithm predicts the output for a new instance by examining the k closest training examples, using their majority class (classification) or average value (regression).

Core Concept

kNN is one of the simplest algorithms: to make a prediction for a new example, find the k training examples that are closest to it, then:
- For classification: predict the class that appears most frequently among those k examples
- For regression: predict the average of their values

"Closest" typically means Euclidean distance in the feature space, though other distance metrics can be used.

Worked Example

Imagine classifying emails as spam or not spam, with k = 3. For a new email, you find the 3 most similar emails in your training set. If 2 are spam and 1 is not, you predict that the new email is spam.

Why Use It?
kNN is:
- Simple to understand and implement
- Naturally nonlinear: it captures complex decision boundaries
- Effective with relatively small, clean datasets
- Adaptable to both classification and regression

Critical Limitations

Computational cost: kNN must store all training data and compute distances to every stored example for each prediction. For large datasets, this is slow.

Curse of dimensionality: In high-dimensional feature spaces, all points become roughly equally distant from each other, making kNN ineffective. As the number of features increases, you need exponentially more training data for kNN to work well.

Sensitivity to k: The choice of k dramatically affects performance. Small k (like k = 1) can cause overfitting; large k can cause underfitting. There's no universal best value; it depends on your data.

Feature scaling: kNN relies on distances, so features must be scaled appropriately. A feature with values 0–1000 will otherwise dominate the distance calculation over a feature with values 0–1.

Neural Networks (Multilayer Perceptron)

Neural networks consist of layers of interconnected units (neurons) that learn nonlinear mappings between inputs and outputs through backpropagation.

Core Concept

A neural network is organized in layers:
- Input layer: one neuron per feature
- Hidden layers: intermediate layers (zero or more) that transform the data
- Output layer: one neuron per output (class or continuous value)

Each neuron performs a simple computation: it takes a weighted sum of its inputs, adds a bias term, and applies an activation function, typically a nonlinear function like ReLU or sigmoid. This nonlinearity is crucial; without it, stacking layers would be equivalent to a single linear transformation.

Learning Through Backpropagation

Neural networks learn their weights through backpropagation, an efficient algorithm for computing gradients.
In essence:
- Forward pass: compute predictions for training examples
- Compute the error between predictions and true values
- Backward pass: propagate the error back through the layers to update the weights
- Repeat until convergence

Why Use Neural Networks?

Neural networks can:
- Learn highly nonlinear, complex patterns in data
- Handle high-dimensional input (images, text)
- Achieve state-of-the-art performance on many tasks
- Transfer learned representations from one task to another

Important Limitations

Computational intensity: Training requires many passes through the data and significant computation, especially for large networks.

Hyperparameter sensitivity: You must choose the architecture (number of layers, layer sizes), learning rate, activation functions, and more. Poor choices cause the network to fail to learn.

Data requirements: Neural networks typically need much more training data than simpler algorithms to avoid overfitting.

Black-box nature: Unlike decision trees, it's hard to interpret why a neural network made a particular prediction. Understanding the learned patterns requires visualization and analysis.

Training instability: Neural networks can be difficult to train. Training can get stuck in bad local minima, and the learning process requires careful tuning of parameters like the learning rate.

Additional Algorithms

Beyond the eight main algorithms, machine learning offers several other important approaches:

Random Forests: An ensemble method that trains many decision trees independently (on random subsets of data and features) and aggregates their predictions. This dramatically reduces overfitting compared to single trees.

Ensemble Methods: General techniques that combine multiple simple models to make better predictions. Besides random forests, this includes boosting (adaptively training models to focus on hard examples) and bagging (training on bootstrap samples).

Similarity Learning: Algorithms that learn distance metrics or similarity functions between examples, useful for ranking, recommendation, and clustering applications.

These are powerful techniques but are often built on top of the core algorithms already covered.

Summary: Choosing the Right Algorithm

You now understand eight fundamental supervised learning algorithms. Each has different strengths:
- Linear Regression and Logistic Regression: simple, interpretable, fast; use when you suspect linear relationships
- Support Vector Machines: powerful for nonlinear boundaries, works well in high dimensions
- Naive Bayes: fast, data-efficient, ideal for text classification
- Linear Discriminant Analysis: probabilistic approach, efficient, good for multi-class problems
- Decision Trees: interpretable, naturally nonlinear, prone to overfitting
- k-Nearest Neighbor: simple nonlinear method, computationally expensive, needs careful tuning
- Neural Networks: powerful and flexible but require significant data and computation

In practice, you'll typically try several algorithms and select based on validation performance. Understanding the assumptions, strengths, and limitations of each gives you the foundation to make informed choices.
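That "try several and validate" workflow can be sketched end to end in plain numpy; the dataset, split sizes, and the two competing methods (a from-scratch kNN and an LDA-like nearest-centroid baseline) are all illustrative choices, not a prescribed recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented two-class data: Gaussian blobs around different centers
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Shuffle, then hold out the last 50 examples as a validation set
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

def knn_predict(x, k=5):
    """Predict by majority class among the k nearest training points."""
    d = np.linalg.norm(X_tr - x, axis=1)
    nearest = y_tr[np.argsort(d)[:k]]
    return np.bincount(nearest).argmax()

def centroid_predict(x):
    """Predict the class whose training mean is closest (a crude LDA-like baseline)."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    return int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

# Compare the candidates on held-out data, then keep the better one
for name, predict in [("5-NN", knn_predict), ("nearest centroid", centroid_predict)]:
    acc = np.mean([predict(x) == t for x, t in zip(X_va, y_va)])
    print(f"{name}: validation accuracy {acc:.2f}")
```

On well-separated blobs like these, both methods score highly; the point is the procedure, selecting by held-out accuracy, not the winner on this toy data.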
Flashcards
What is the primary goal of Support Vector Machines (SVMs) regarding class separation?
Finding a hyperplane that maximally separates classes in a transformed feature space.
How does Linear Regression model a continuous output?
As a linear combination of input features.
What function is applied to a linear combination of features in Logistic Regression to model binary class probability?
Logistic function.
What key assumption does the Naive Bayes Classifier make about features?
Conditional independence.
How does Linear Discriminant Analysis (LDA) model class conditional densities?
As multivariate normal distributions with shared covariance.
By what process do Decision Trees create a tree of decision rules?
Recursively splitting the input space based on feature thresholds.
How does the $k$-nearest neighbor algorithm predict the output for classification tasks?
Majority vote of the $k$ closest training instances.
How does the $k$-nearest neighbor algorithm predict the output for regression tasks?
Average of the $k$ closest training instances.
Through what process do Multilayer Perceptrons learn non-linear mappings?
Backpropagation.
What is the basic structure of a Neural Network?
Layers of interconnected units.

Quiz

In linear regression, the predicted output is expressed as what type of combination of the input features?
Key Concepts
Supervised Learning Models
Support Vector Machine
Linear Regression
Logistic Regression
Naive Bayes Classifier
Linear Discriminant Analysis
Decision Tree
k-Nearest Neighbor
Multilayer Perceptron
Random Forest
Ensemble Method