Machine learning - Fundamental Model Families
Understand the core concepts, structures, and typical uses of neural networks, decision trees, support vector machines, regression models, Bayesian networks, and Gaussian processes.
Summary
Machine Learning Algorithms and Methods
Introduction to Machine Learning
Machine learning is a subset of artificial intelligence focused on systems that can learn from data and make predictions or decisions without being explicitly programmed. Several key algorithms have become foundational in this field, each with distinct strengths and applications. This guide covers the most important algorithmic approaches you need to understand: artificial neural networks, decision trees, support vector machines, and regression analysis, along with two probabilistic approaches, Bayesian networks and Gaussian processes.
Artificial Neural Networks
How Artificial Neurons Work
Artificial neural networks are inspired by how biological brains process information. At the heart of these systems is the artificial neuron, a simple computational unit that mimics the behavior of biological neurons.
An artificial neuron receives multiple input signals and processes them through a specific mechanism. The neuron calculates an aggregate input (a weighted sum of all inputs) and compares this sum against a threshold value. The key insight is that the neuron only produces an output signal when this aggregate input exceeds the threshold. This all-or-nothing behavior is called activation.
The strength of the signal at each connection between neurons is called a weight. These weights determine how much influence one neuron has on another. If a weight is large, that input has a strong effect; if small, it has little effect. By adjusting these weights during training, the network learns patterns in data.
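The threshold behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the weights and the threshold value of 0.5 are illustrative choices, not values from the text.

```python
# Minimal sketch of a single artificial neuron with a step activation.
# The weights and the 0.5 threshold are illustrative assumptions.

def neuron(inputs, weights, threshold=0.5):
    """Fire (return 1) only if the weighted sum of inputs exceeds the threshold."""
    aggregate = sum(x * w for x, w in zip(inputs, weights))
    return 1 if aggregate > threshold else 0

# Two inputs; the first connection carries more weight than the second.
print(neuron([1.0, 1.0], [0.6, 0.1]))  # aggregate 0.7 > 0.5, so the neuron fires
print(neuron([0.0, 1.0], [0.6, 0.1]))  # aggregate 0.1 <= 0.5, so it stays silent
```

Adjusting the `weights` list is exactly what training does: it changes how strongly each input pushes the aggregate toward or away from the threshold.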
Organization into Layers
Artificial neurons are rarely used in isolation. Instead, they are organized into layers, with each layer performing a specific transformation on its inputs. A typical neural network has three types of layers:
Input layer: receives the raw data
Hidden layers: perform intermediate computations and transformations
Output layer: produces the final predictions or classifications
Signals travel forward through the network from the input layer, passing through hidden layers, and finally reaching the output layer. This unidirectional flow is why the process is called forward propagation. In recurrent networks, by contrast, signals can loop back and traverse parts of the network multiple times to refine the output.
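Forward propagation through a layered network can be sketched directly. This toy network (2 inputs, one hidden layer of 2 neurons, 1 output) uses a smooth sigmoid activation rather than the hard threshold above; all weight values are illustrative assumptions.

```python
# Sketch of forward propagation through a tiny fully connected network
# (2 inputs -> 2 hidden neurons -> 1 output). Weights are illustrative.
import math

def sigmoid(z):
    """Smooth activation mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weight_matrix):
    # Each row of the weight matrix feeds one neuron in the next layer.
    return [sigmoid(sum(x * w for x, w in zip(inputs, row)))
            for row in weight_matrix]

def forward(x, layers):
    # Signals flow one way: input -> hidden layer(s) -> output.
    for w in layers:
        x = layer(x, w)
    return x

hidden = [[0.5, -0.2], [0.3, 0.8]]   # weights into the 2 hidden neurons
output = [[1.0, -1.0]]               # weights into the single output neuron
print(forward([1.0, 0.5], [hidden, output]))
```

A deep network is just this loop with more weight matrices in the `layers` list, one per hidden layer.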
Deep Learning
When an artificial neural network contains multiple hidden layers, it becomes a deep neural network, and training such networks is called deep learning. The motivation for using deep networks is that they can model complex patterns in data, much as the brain builds perception in stages, turning light into vision and sound into hearing. Each hidden layer learns to represent the data at a different level of abstraction, with early layers detecting simple patterns and deeper layers combining these into more complex concepts.
Decision Tree Methods
The Basic Concept
Decision trees represent a fundamentally different approach to machine learning. Rather than using continuous mathematical functions like neural networks, decision trees use a tree structure to map observations about an item to conclusions about its target value. Think of a decision tree as a flowchart: you start at the root and follow branches based on the features of your data, eventually reaching a leaf that gives you the answer.
The power of decision trees lies in their interpretability—you can actually trace through the logic and understand why the tree made a particular prediction.
Classification Trees
In classification trees, the goal is to predict which category (class) an item belongs to. The tree works as follows:
Branches represent conditions on features (like "Is age > 30?" or "Is income > $50,000?")
Leaves represent class labels (the categories you're predicting, such as "approved" or "denied")
Each path from the root to a leaf represents a conjunction of features—multiple conditions that must all be true. For example, a path might represent "age > 30 AND income > $50,000 AND credit score > 700" leading to the "loan approved" class.
The example above shows a classification tree for predicting Titanic passenger survival. You can see how the tree splits on gender first, then uses age and sibling/spouse count to make further predictions.
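The loan-approval path described above can be written as nested conditions. This is a hand-built sketch of a single tree path, not a learned model; the feature names and thresholds come from the example in the text.

```python
# The "age > 30 AND income > $50,000 AND credit score > 700" path,
# written as the flowchart logic of a classification tree.

def classify(applicant):
    if applicant["age"] > 30:
        if applicant["income"] > 50_000:
            if applicant["credit_score"] > 700:
                return "approved"
    return "denied"

print(classify({"age": 45, "income": 60_000, "credit_score": 720}))  # approved
print(classify({"age": 25, "income": 60_000, "credit_score": 720}))  # denied
```

Real tree-learning algorithms choose these split features and thresholds automatically from data; the interpretability comes from being able to read the result back as exactly this kind of rule chain.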
Regression Trees
Regression trees solve a different problem: predicting continuous numerical values rather than categories. Instead of leaves containing class labels, the leaves contain real numbers (like predicted house prices or temperature values). The tree structure remains similar, but now each path leads to a numerical prediction rather than a category.
Support Vector Machines
Linear Classification
Support Vector Machines (SVMs) are powerful algorithms for classification problems, particularly when you need to separate two groups of data. An SVM creates a non-probabilistic binary linear classifier, meaning it draws a dividing line (or hyperplane in higher dimensions) that separates two categories of training examples.
The key insight of SVMs is that they find the line that provides the maximum margin—the largest possible distance between the dividing line and the closest points from each class. This approach tends to generalize well to new data because it finds the most "stable" separation.
The diagram above illustrates this concept: the solid line is the decision boundary, and the dashed lines show the margin around it. The circled points are the support vectors—the critical points closest to the boundary that determine where the dividing line is placed.
Non-Linear Classification with the Kernel Trick
The main limitation of basic SVMs is that they assume the data can be separated by a straight line (or hyperplane). Real-world data is often not linearly separable. To handle this, SVMs use a clever technique called the kernel trick.
The kernel trick works by mapping the input data into a higher-dimensional feature space where a linear separation becomes possible. For example, imagine you have data that forms a spiral pattern in 2D space—impossible to separate with a line. By mapping this into 3D space, a 2D plane might now separate the two classes perfectly.
The beauty of the kernel trick is that this transformation happens implicitly—you don't need to explicitly create all the new dimensions, which would be computationally expensive. Instead, kernel functions calculate relationships between data points in the high-dimensional space directly.
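The implicit mapping can be verified numerically. The sketch below uses a quadratic polynomial kernel, k(x, z) = (x · z)², and checks that it equals the inner product of an explicit degree-2 feature map, without the kernel ever constructing those features.

```python
# Sketch of the kernel trick: the polynomial kernel (x . z)^2 computes
# an inner product in a higher-dimensional feature space implicitly.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def poly_kernel(x, z):
    # Implicit: works only with the original 2-D inputs.
    return dot(x, z) ** 2

def explicit_features(x):
    # Explicit degree-2 map for 2-D input: (x1^2, x2^2, sqrt(2) * x1 * x2).
    return [x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1]]

x, z = [1.0, 2.0], [3.0, 0.5]
print(poly_kernel(x, z))                                # implicit computation
print(dot(explicit_features(x), explicit_features(z)))  # explicit, same value
```

For inputs with many dimensions the explicit feature space grows rapidly, while the kernel's cost stays that of one dot product plus a squaring, which is the computational payoff the text describes.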
Probabilistic Outputs
By default, SVMs produce only class predictions (category A or category B). However, if you need probabilistic outputs (confidence scores between 0 and 1 representing the probability of each class), Platt scaling can be applied to convert SVM outputs into probabilities.
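Platt scaling fits a sigmoid that maps the SVM's raw decision score f(x) to a probability. In the sketch below, the coefficients `A` and `B` are illustrative stand-ins for values that would be learned from held-out data, not fitted parameters.

```python
# Sketch of Platt scaling: P(y = 1 | x) = 1 / (1 + exp(A * f(x) + B)),
# where f(x) is the SVM decision score. A and B here are illustrative.
import math

def platt(score, A=-1.5, B=0.0):
    return 1.0 / (1.0 + math.exp(A * score + B))

print(platt(2.0))   # score deep on the positive side -> high probability
print(platt(0.0))   # exactly on the decision boundary -> 0.5
print(platt(-2.0))  # negative side -> low probability
```

The further a point sits from the decision boundary, the more confident the calibrated probability becomes, which matches the intuition that margin distance reflects certainty.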
Regression Analysis
Linear Regression Fundamentals
Linear regression is one of the most fundamental techniques in statistics and machine learning. The goal is simple: find the best straight line that fits your data. "Best" is defined mathematically as the line that minimizes the sum of squared differences between the actual data points and the predicted values on the line. This quantity is called the sum of squared errors or residual sum of squares.
$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where $y_i$ is the observed value and $\hat{y}_i$ is the predicted value.
In the figure above, the red line is the best-fit linear regression line, and each blue dot is a data point. The vertical distances from points to the line are the errors that the regression minimizes.
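For one input variable, the SSE-minimizing slope and intercept have a closed form, sketched below in plain Python.

```python
# Closed-form simple linear regression: the slope and intercept that
# minimize the sum of squared errors for one input variable.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # data lying exactly on y = 2x + 1
print(fit_line(xs, ys))     # -> (2.0, 1.0)
```

Because this data lies exactly on a line, the residuals are all zero and the fit recovers the true coefficients; with noisy data the same formulas give the line with the smallest total squared vertical distance.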
Why Use Regularization?
When fitting a model to data, there's always a risk of overfitting: the model learns the noise and peculiarities of the training data rather than the true underlying pattern. This causes poor performance on new, unseen data.
Regularization techniques address this by adding a penalty term to the objective function, discouraging the model from becoming too complex. Ridge regression is one common approach that adds a penalty based on the squared magnitude of the model parameters:
$$\text{Objective} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
The $\lambda$ parameter controls the strength of the penalty—higher values push the model toward simpler solutions, while lower values allow the model to fit the data more closely.
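The shrinkage effect of $\lambda$ is easy to see in the one-feature, no-intercept case, where the ridge objective has the closed-form minimizer $\beta = \sum x_i y_i \,/\, (\sum x_i^2 + \lambda)$. The sketch below uses that simplified setting to show how the penalty pulls the coefficient toward zero.

```python
# One-feature ridge regression sketch (no intercept):
# beta = sum(x * y) / (sum(x^2) + lambda). Larger lambda shrinks beta.

def ridge_coef(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # true relationship y = 2x
print(ridge_coef(xs, ys, 0.0))    # lambda = 0 recovers least squares: 2.0
print(ridge_coef(xs, ys, 14.0))   # the penalty shrinks the estimate: 1.0
```

With $\lambda = 0$ the penalty vanishes and ordinary least squares is recovered; as $\lambda$ grows, the denominator grows and the coefficient shrinks toward zero, trading a little bias for lower variance.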
Extensions Beyond Simple Linear Regression
Linear regression assumes a linear relationship between inputs and outputs, which isn't always realistic. Several important extensions exist:
Polynomial regression generalizes linear regression by fitting a polynomial curve to the data instead of a line. For example, a quadratic regression might fit a parabola: $y = \beta_0 + \beta_1 x + \beta_2 x^2$.
Logistic regression handles a different kind of problem: binary classification (predicting yes/no outcomes). Despite its name, logistic regression is a classification method, not a regression method. It models the probability of an item belonging to the positive class using an S-shaped curve.
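The S-shaped curve is the logistic (sigmoid) function applied to a linear combination of the inputs. In this sketch the coefficients are illustrative, not fitted from data.

```python
# Sketch of logistic regression's link function:
# P(y = 1 | x) = sigmoid(b0 + b1 * x). Coefficients are illustrative.
import math

def predict_proba(x, b0=-4.0, b1=2.0):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

print(predict_proba(0.0))  # far below the decision boundary -> near 0
print(predict_proba(2.0))  # on the boundary (b0 + b1 * x = 0) -> 0.5
print(predict_proba(4.0))  # far above the boundary -> near 1
```

Classification then comes from thresholding this probability, typically at 0.5, which is why logistic regression is a classification method despite the "regression" in its name.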
Kernel regression uses the kernel trick (similar to kernel-based SVMs) to introduce non-linearity. By mapping input variables into a higher-dimensional space, kernel regression can fit curved relationships in data while remaining computationally efficient.
Multiple Dependent Variables
So far we've discussed predicting a single output variable. Multivariate linear regression extends this to estimate relationships between multiple input variables and multiple dependent variables simultaneously. This is useful when you need to predict several related outputs at once, and the outputs may influence each other through the shared input space.
<extrainfo>
Bayesian Networks
Probabilistic Graphical Models
A Bayesian network is a directed acyclic graph (a network with arrows pointing in one direction and no cycles) that represents random variables and their conditional dependencies. Each node represents a random variable, and arrows indicate which variables directly influence which other variables.
The power of Bayesian networks is that they can represent complex probability relationships in an intuitive visual form. Instead of specifying all possible joint probabilities (which grows exponentially with the number of variables), you only need to specify conditional probabilities for each variable given its parents in the network.
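The factorization can be made concrete with the classic Rain/Sprinkler/WetGrass network, where the joint probability decomposes as P(R) · P(S) · P(W | R, S). All probability values below are illustrative.

```python
# Sketch of a Bayesian network's factorized joint distribution for
# Rain -> WetGrass <- Sprinkler. Probability values are illustrative.

p_rain = 0.2
p_sprinkler = 0.1
# Conditional probability table: P(WetGrass = true | Rain, Sprinkler)
p_wet = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    # Product of each variable's probability given its parents.
    pr = p_rain if rain else 1 - p_rain
    ps = p_sprinkler if sprinkler else 1 - p_sprinkler
    pw = p_wet[(rain, sprinkler)] if wet else 1 - p_wet[(rain, sprinkler)]
    return pr * ps * pw

# Marginal P(WetGrass = true), obtained by summing out the parents:
p_wet_total = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(round(p_wet_total, 4))
```

Note the saving the text describes: three variables need only 2 + 2 + 4 conditional probability entries here, instead of the 2³ entries of a full joint table, and the gap widens exponentially as variables are added.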
Inference and Learning
Two fundamental tasks with Bayesian networks are:
Inference: Given some observed evidence (you know certain variables' values), compute the probability distribution over unknown variables. Efficient algorithms exist to perform this computation even in large networks.
Learning: Given data, discover both the structure of the network (which variables influence which others) and the probability values. This is more complex but essential when you don't know the network structure in advance.
Dynamic Bayesian Networks
Dynamic Bayesian networks extend the basic framework to handle sequences of variables over time, such as speech signals, protein sequences, or stock prices. These networks represent how the state at time $t$ depends on the state at time $t-1$, enabling prediction of future states based on observed past behavior.
</extrainfo>
<extrainfo>
Gaussian Processes
Understanding Gaussian Processes
A Gaussian process is a type of stochastic process (a mathematical object representing randomness evolving over time or space) with a special property: any finite collection of random variables follows a multivariate normal distribution (a generalization of the bell curve to multiple dimensions).
This might sound abstract, but the practical implication is powerful: Gaussian processes provide a flexible way to model uncertainty in predictions.
The Covariance Function
The behavior of a Gaussian process is entirely determined by two functions: a mean function and a covariance function (also called a kernel). The covariance function models how pairs of points relate to each other based on their distance in the input space. Points that are close together tend to have outputs that are highly correlated, while distant points have outputs that are less correlated.
Making Predictions
The key strength of Gaussian processes appears when making predictions: Given a set of observed input-output examples, a Gaussian process doesn't just produce a single point prediction. Instead, it computes the entire distribution of possible outputs for a new input, incorporating uncertainty from observed data and their covariances.
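A minimal numerical sketch of that prediction step, assuming a zero mean function, an RBF (squared-exponential) covariance function, and a small observation-noise term for numerical stability:

```python
# Sketch of Gaussian-process regression: predictive mean and variance
# for new inputs, conditioned on observed data via an RBF kernel.
import numpy as np

def rbf(a, b, length_scale=1.0):
    # Covariance decays with squared distance between input points.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(x_train, y_train, x_new, noise=1e-6):
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    K_star = rbf(x_new, x_train)
    mean = K_star @ np.linalg.solve(K, y_train)              # predictive mean
    cov = rbf(x_new, x_new) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.diag(cov)                                # per-point variance

x = np.array([0.0, 1.0, 2.0])
y = np.sin(x)
mean, var = gp_predict(x, y, np.array([1.0, 5.0]))
print(mean[0], var[0])  # at a training point: mean near sin(1), tiny variance
print(mean[1], var[1])  # far from the data: variance near the prior value of 1
```

The variance output is the payoff: near observed data the GP is confident, and far from it the predictive variance reverts to the prior, signaling that the model does not know.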
Application to Bayesian Optimisation
Gaussian processes serve as surrogate models in Bayesian optimization, a technique for finding the best hyperparameter settings for machine learning models. Rather than training a model many times with different hyperparameters (expensive), Bayesian optimization uses a Gaussian process to predict which hyperparameters might work best, intelligently exploring the hyperparameter space.
</extrainfo>
Flashcards
Under what condition does an artificial neuron transmit a signal?
When the aggregate input exceeds a specified threshold.
What determines the amount of influence one artificial neuron has on another?
The strength of the signal at the connection.
What structural feature of an artificial neural network defines deep learning?
The use of multiple hidden layers.
In a classification tree, what do the leaves represent?
Class labels.
What do the leaves of a regression tree represent?
Continuous target values (e.g., real numbers).
What type of non-probabilistic classifier is originally created by a support vector machine?
A binary linear classifier.
Which technique allows support vector machines to perform efficient non-linear classification?
The kernel trick.
How does the kernel trick enable non-linear classification in support vector machines?
By mapping inputs into a high-dimensional feature space.
Which method can be applied to support vector machines to produce probabilistic outputs?
Platt scaling.
By what criteria does linear regression fit a line to a set of data?
Minimizing the sum of squared differences.
What does ridge regression add to the linear regression objective function?
A penalty term.
Which regression model is used to model binary outcomes for statistical classification?
Logistic regression.
How does kernel regression introduce non-linearity into a model?
By using the kernel trick to map input variables into a higher-dimensional space.
What is estimated by multivariate linear regression?
Relationships between multiple input variables and multiple dependent variables.
What mathematical structure defines a Bayesian network?
A directed acyclic graph (DAG).
What two things does a Bayesian network represent?
Random variables
Conditional dependencies
What are the two primary computational tasks performed with Bayesian networks?
Inference (computing posterior probabilities)
Learning the network structure
What is the primary purpose of a dynamic Bayesian network?
To model sequences of variables.
What distribution must any finite collection of random variables in a Gaussian process follow?
A multivariate normal distribution.
In a Gaussian process, what is the role of the covariance function (kernel)?
To model how pairs of points relate based on their locations.
What role do Gaussian processes play in Bayesian optimisation?
They serve as surrogate models for hyperparameter tuning.
Quiz
Machine learning - Fundamental Model Families Quiz Question 1: What type of model does a support vector machine initially construct?
- A binary linear classifier that separates two categories. (correct)
- A probabilistic multi‑class classifier.
- A regression model for continuous outcomes.
- A decision tree that partitions the feature space.
Machine learning - Fundamental Model Families Quiz Question 2: What structural form does a Bayesian network use to represent random variables and their dependencies?
- A directed acyclic graph (DAG). (correct)
- An undirected cyclic graph.
- A hierarchical tree without direction.
- A flat list of independent variables.
Machine learning - Fundamental Model Families Quiz Question 3: Dynamic Bayesian networks are particularly suited for modeling which type of data?
- Sequences of variables that evolve over time (correct)
- Independent and identically distributed observations
- Hierarchical categorical data
- Static relational structures
Machine learning - Fundamental Model Families Quiz Question 4: What do the leaf nodes of a regression tree contain?
- Continuous numeric target values (correct)
- Discrete class labels
- Feature‑splitting rules
- Probability distributions over classes
Machine learning - Fundamental Model Families Quiz Question 5: What technique can be applied to a support vector machine to obtain calibrated probability estimates for its classifications?
- Platt scaling (correct)
- Kernel PCA
- Cross‑validation
- Bagging
Machine learning - Fundamental Model Families Quiz Question 6: Which regression model is specifically used to predict binary outcomes?
- Logistic regression (correct)
- Polynomial regression
- Linear regression
- Ridge regression
Machine learning - Fundamental Model Families Quiz Question 7: Which data structure does a decision‑tree model employ to encode decision rules?
- A hierarchical tree of nodes (correct)
- A flat list of independent rules
- A directed acyclic graph without a root
- A circular linked list
Machine learning - Fundamental Model Families Quiz Question 8: When an SVM uses the kernel trick, what is computed instead of explicit coordinates in the high‑dimensional space?
- The inner product via a kernel function (correct)
- The Euclidean distance between original inputs
- The gradient of the loss function
- The exact coordinates after explicit transformation
Machine learning - Fundamental Model Families Quiz Question 9: In ordinary linear regression, the predicted value is a linear combination of what?
- The input features (predictor variables) (correct)
- The residual errors of previous predictions
- The hyperparameters of the learning algorithm
- The output values of a separate model
Machine learning - Fundamental Model Families Quiz Question 10: What advantage does kernel regression provide over ordinary linear regression?
- It can capture non‑linear relationships between variables (correct)
- It guarantees a perfect fit to the training data
- It eliminates the need for any hyperparameters
- It requires only a single data point to train
Machine learning - Fundamental Model Families Quiz Question 11: In Bayesian optimisation, what role does a Gaussian process play?
- It serves as a surrogate model of the objective function (correct)
- It directly searches the hyperparameter space without modeling
- It replaces the acquisition function entirely
- It optimises the gradient of the loss function analytically
Machine learning - Fundamental Model Families Quiz Question 12: What does the magnitude of a connection weight indicate in a neural network?
- How strongly the source neuron influences the target neuron (correct)
- The number of layers the signal will travel through
- The type of activation function used by the receiving neuron
- The probability that the connection will be removed during pruning
Machine learning - Fundamental Model Families Quiz Question 13: How are artificial neurons organized within a neural network architecture?
- Into layers, each performing a distinct transformation on its inputs (correct)
- Randomly scattered without any grouping
- In a circular chain where each neuron connects only to its predecessor
- All neurons directly connect to the output layer
Machine learning - Fundamental Model Families Quiz Question 14: What is the typical direction of information flow in a feed‑forward neural network?
- From the input layer through hidden layers to the output layer (correct)
- From the output layer back to the input layer
- Bidirectional between input and output simultaneously
- Only within hidden layers, skipping input and output
Machine learning - Fundamental Model Families Quiz Question 15: What purpose does a decision tree serve in decision analysis?
- It visualizes choices and their possible outcomes (correct)
- It optimizes continuous objective functions using gradient descent
- It clusters unlabeled data into groups
- It reduces the dimensionality of the feature space
Machine learning - Fundamental Model Families Quiz Question 16: What is the effect of the penalty term added by ridge regression to the ordinary least squares objective?
- It shrinks coefficient estimates toward zero, reducing variance (correct)
- It forces coefficients to be exactly zero, performing variable selection
- It increases the magnitude of coefficients to improve fit
- It replaces the loss function with a classification error
Machine learning - Fundamental Model Families Quiz Question 17: In Bayesian networks, which computational task updates probability estimates when new evidence is observed?
- Performing inference to compute posterior probabilities (correct)
- Clustering variables into groups
- Optimizing hyperparameters via gradient descent
- Reducing data dimensionality through PCA
Machine learning - Fundamental Model Families Quiz Question 18: To which broader class of processes does a Gaussian process belong?
- Stochastic processes (correct)
- Deterministic processes
- Markov processes
- Poisson processes
Machine learning - Fundamental Model Families Quiz Question 19: How does a Gaussian process produce a predictive distribution for a new input point?
- By conditioning on the observed data and their covariances (correct)
- By selecting the nearest neighbor's output as the prediction
- By fitting a straight line through all training points
- By drawing a random value from a uniform prior
Machine learning - Fundamental Model Families Quiz Question 20: In Gaussian processes, the covariance function is also known as what?
- Kernel function (correct)
- Mean function
- Likelihood function
- Activation function
Key Concepts
Neural Networks and Deep Learning
Artificial neural network
Deep learning
Supervised Learning Techniques
Decision tree
Support vector machine
Kernel trick
Linear regression
Ridge regression
Probabilistic Models
Bayesian network
Dynamic Bayesian network
Gaussian process
Definitions
Artificial neural network
A computational model composed of interconnected artificial neurons that process information in layers to perform tasks such as classification or regression.
Deep learning
A subfield of machine learning that employs neural networks with many hidden layers to automatically learn hierarchical representations of data.
Decision tree
A flowchart‑like model that splits data based on feature values to predict categorical or continuous outcomes.
Support vector machine
A supervised learning algorithm that finds the optimal hyperplane separating classes in a high‑dimensional feature space.
Kernel trick
A technique that implicitly maps inputs into a higher‑dimensional space using a kernel function, enabling linear algorithms to perform non‑linear classification or regression.
Linear regression
A statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a straight line.
Ridge regression
A regularized version of linear regression that adds an L2 penalty to the loss function to reduce overfitting.
Bayesian network
A directed acyclic graph representing random variables and their conditional dependencies, used for probabilistic inference.
Dynamic Bayesian network
An extension of Bayesian networks that models temporal sequences of variables, capturing time‑dependent probabilistic relationships.
Gaussian process
A non‑parametric stochastic process where any finite set of points follows a multivariate normal distribution, commonly used for regression and Bayesian optimisation.