Machine learning - Fundamental Model Families
Understand the core concepts, structures, and typical uses of neural networks, decision trees, support vector machines, regression models, Bayesian networks, and Gaussian processes.
Summary
Machine Learning Algorithms and Methods
Introduction to Machine Learning
Machine learning is a subset of artificial intelligence focused on systems that can learn from data and make predictions or decisions without being explicitly programmed. Several key algorithms have become foundational in this field, each with distinct strengths and applications. This guide covers the most important algorithmic approaches you need to understand: artificial neural networks, decision trees, support vector machines, and regression analysis, along with two probabilistic approaches, Bayesian networks and Gaussian processes.
Artificial Neural Networks
How Artificial Neurons Work
Artificial neural networks are inspired by how biological brains process information. At the heart of these systems is the artificial neuron, a simple computational unit that mimics the behavior of biological neurons.
An artificial neuron receives multiple input signals and processes them through a specific mechanism. The neuron calculates an aggregate input (a weighted sum of all inputs) and compares this sum against a threshold value. The key insight is that the neuron only produces an output signal when this aggregate input exceeds the threshold. This all-or-nothing behavior is called activation.
The strength of the signal at each connection between neurons is called a weight. These weights determine how much influence one neuron has on another. If a weight is large, that input has a strong effect; if small, it has little effect. By adjusting these weights during training, the network learns patterns in data.
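The threshold behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the weights and the threshold value of 0.5 are illustrative choices, not values from the text.

```python
# Minimal sketch of a single artificial neuron with a step activation.
# The weights and the 0.5 threshold are illustrative assumptions.

def neuron(inputs, weights, threshold=0.5):
    """Fire (return 1) only if the weighted sum of inputs exceeds the threshold."""
    aggregate = sum(x * w for x, w in zip(inputs, weights))
    return 1 if aggregate > threshold else 0

# Two inputs; the first connection carries more weight than the second.
print(neuron([1.0, 1.0], [0.6, 0.1]))  # aggregate 0.7 > 0.5, so the neuron fires
print(neuron([0.0, 1.0], [0.6, 0.1]))  # aggregate 0.1 <= 0.5, so it stays silent
```

Adjusting the `weights` list is exactly what training does: it changes how strongly each input pushes the aggregate toward or away from the threshold.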
Organization into Layers
Artificial neurons are rarely used in isolation. Instead, they are organized into layers, with each layer performing a specific transformation on its inputs. A typical neural network has three types of layers:
Input layer: receives the raw data
Hidden layers: perform intermediate computations and transformations
Output layer: produces the final predictions or classifications
Signals travel forward through the network from the input layer, passing through hidden layers, and finally reaching the output layer. This unidirectional flow is why the process is called forward propagation. In recurrent networks, by contrast, signals can loop back and traverse parts of the network multiple times to refine the output.
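Forward propagation through a layered network can be sketched directly. This toy network (2 inputs, one hidden layer of 2 neurons, 1 output) uses a smooth sigmoid activation rather than the hard threshold above; all weight values are illustrative assumptions.

```python
# Sketch of forward propagation through a tiny fully connected network
# (2 inputs -> 2 hidden neurons -> 1 output). Weights are illustrative.
import math

def sigmoid(z):
    """Smooth activation mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weight_matrix):
    # Each row of the weight matrix feeds one neuron in the next layer.
    return [sigmoid(sum(x * w for x, w in zip(inputs, row)))
            for row in weight_matrix]

def forward(x, layers):
    # Signals flow one way: input -> hidden layer(s) -> output.
    for w in layers:
        x = layer(x, w)
    return x

hidden = [[0.5, -0.2], [0.3, 0.8]]   # weights into the 2 hidden neurons
output = [[1.0, -1.0]]               # weights into the single output neuron
print(forward([1.0, 0.5], [hidden, output]))
```

A deep network is just this loop with more weight matrices in the `layers` list, one per hidden layer.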
Deep Learning
When an artificial neural network contains multiple hidden layers, it becomes a deep neural network, and training such networks is called deep learning. The motivation for using deep networks is that they can model complex patterns in data, much as the brain builds perception in stages, turning light into vision and sound into hearing. Each hidden layer learns to represent the data at a different level of abstraction, with early layers detecting simple patterns and deeper layers combining these into more complex concepts.
Decision Tree Methods
The Basic Concept
Decision trees represent a fundamentally different approach to machine learning. Rather than using continuous mathematical functions like neural networks, decision trees use a tree structure to map observations about an item to conclusions about its target value. Think of a decision tree as a flowchart: you start at the root and follow branches based on the features of your data, eventually reaching a leaf that gives you the answer.
The power of decision trees lies in their interpretability—you can actually trace through the logic and understand why the tree made a particular prediction.
Classification Trees
In classification trees, the goal is to predict which category (class) an item belongs to. The tree works as follows:
Branches represent conditions on features (like "Is age > 30?" or "Is income > $50,000?")
Leaves represent class labels (the categories you're predicting, such as "approved" or "denied")
Each path from the root to a leaf represents a conjunction of features—multiple conditions that must all be true. For example, a path might represent "age > 30 AND income > $50,000 AND credit score > 700" leading to the "loan approved" class.
The example above shows a classification tree for predicting Titanic passenger survival. You can see how the tree splits on gender first, then uses age and sibling/spouse count to make further predictions.
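The loan-approval path described above can be written as nested conditions. This is a hand-built sketch of a single tree path, not a learned model; the feature names and thresholds come from the example in the text.

```python
# The "age > 30 AND income > $50,000 AND credit score > 700" path,
# written as the flowchart logic of a classification tree.

def classify(applicant):
    if applicant["age"] > 30:
        if applicant["income"] > 50_000:
            if applicant["credit_score"] > 700:
                return "approved"
    return "denied"

print(classify({"age": 45, "income": 60_000, "credit_score": 720}))  # approved
print(classify({"age": 25, "income": 60_000, "credit_score": 720}))  # denied
```

Real tree-learning algorithms choose these split features and thresholds automatically from data; the interpretability comes from being able to read the result back as exactly this kind of rule chain.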
Regression Trees
Regression trees solve a different problem: predicting continuous numerical values rather than categories. Instead of leaves containing class labels, the leaves contain real numbers (like predicted house prices or temperature values). The tree structure remains similar, but now each path leads to a numerical prediction rather than a category.
Support Vector Machines
Linear Classification
Support Vector Machines (SVMs) are powerful algorithms for classification problems, particularly when you need to separate two groups of data. An SVM creates a non-probabilistic binary linear classifier, meaning it draws a dividing line (or hyperplane in higher dimensions) that separates two categories of training examples.
The key insight of SVMs is that they find the line that provides the maximum margin—the largest possible distance between the dividing line and the closest points from each class. This approach tends to generalize well to new data because it finds the most "stable" separation.
The diagram above illustrates this concept: the solid line is the decision boundary, and the dashed lines show the margin around it. The circled points are the support vectors—the critical points closest to the boundary that determine where the dividing line is placed.
Non-Linear Classification with the Kernel Trick
The main limitation of basic SVMs is that they assume the data can be separated by a straight line (or hyperplane). Real-world data is often not linearly separable. To handle this, SVMs use a clever technique called the kernel trick.
The kernel trick works by mapping the input data into a higher-dimensional feature space where a linear separation becomes possible. For example, imagine you have data that forms a spiral pattern in 2D space—impossible to separate with a line. By mapping this into 3D space, a 2D plane might now separate the two classes perfectly.
The beauty of the kernel trick is that this transformation happens implicitly—you don't need to explicitly create all the new dimensions, which would be computationally expensive. Instead, kernel functions calculate relationships between data points in the high-dimensional space directly.
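The implicit mapping can be verified numerically. The sketch below uses a quadratic polynomial kernel, k(x, z) = (x · z)², and checks that it equals the inner product of an explicit degree-2 feature map, without the kernel ever constructing those features.

```python
# Sketch of the kernel trick: the polynomial kernel (x . z)^2 computes
# an inner product in a higher-dimensional feature space implicitly.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def poly_kernel(x, z):
    # Implicit: works only with the original 2-D inputs.
    return dot(x, z) ** 2

def explicit_features(x):
    # Explicit degree-2 map for 2-D input: (x1^2, x2^2, sqrt(2) * x1 * x2).
    return [x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1]]

x, z = [1.0, 2.0], [3.0, 0.5]
print(poly_kernel(x, z))                                # implicit computation
print(dot(explicit_features(x), explicit_features(z)))  # explicit, same value
```

For inputs with many dimensions the explicit feature space grows rapidly, while the kernel's cost stays that of one dot product plus a squaring, which is the computational payoff the text describes.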
Probabilistic Outputs
By default, SVMs produce only class predictions (category A or category B). However, if you need probabilistic outputs (confidence scores between 0 and 1 representing the probability of each class), Platt scaling can be applied to convert SVM outputs into probabilities.
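Platt scaling fits a sigmoid that maps the SVM's raw decision score f(x) to a probability. In the sketch below, the coefficients `A` and `B` are illustrative stand-ins for values that would be learned from held-out data, not fitted parameters.

```python
# Sketch of Platt scaling: P(y = 1 | x) = 1 / (1 + exp(A * f(x) + B)),
# where f(x) is the SVM decision score. A and B here are illustrative.
import math

def platt(score, A=-1.5, B=0.0):
    return 1.0 / (1.0 + math.exp(A * score + B))

print(platt(2.0))   # score deep on the positive side -> high probability
print(platt(0.0))   # exactly on the decision boundary -> 0.5
print(platt(-2.0))  # negative side -> low probability
```

The further a point sits from the decision boundary, the more confident the calibrated probability becomes, which matches the intuition that margin distance reflects certainty.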
Regression Analysis
Linear Regression Fundamentals
Linear regression is one of the most fundamental techniques in statistics and machine learning. The goal is simple: find the best straight line that fits your data. "Best" is defined mathematically as the line that minimizes the sum of squared differences between the actual data points and the predicted values on the line. This quantity is called the sum of squared errors or residual sum of squares.
$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where $y_i$ is the observed value and $\hat{y}_i$ is the predicted value.
In the figure above, the red line is the best-fit linear regression line, and each blue dot is a data point. The vertical distances from points to the line are the errors that the regression minimizes.
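For one input variable, the SSE-minimizing slope and intercept have a closed form, sketched below in plain Python.

```python
# Closed-form simple linear regression: the slope and intercept that
# minimize the sum of squared errors for one input variable.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # data lying exactly on y = 2x + 1
print(fit_line(xs, ys))     # -> (2.0, 1.0)
```

Because this data lies exactly on a line, the residuals are all zero and the fit recovers the true coefficients; with noisy data the same formulas give the line with the smallest total squared vertical distance.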
Why Use Regularization?
When fitting a model to data, there's always a risk of overfitting: the model learns the noise and peculiarities of the training data rather than the true underlying pattern. This causes poor performance on new, unseen data.
Regularization techniques address this by adding a penalty term to the objective function, discouraging the model from becoming too complex. Ridge regression is one common approach that adds a penalty based on the squared magnitude of the model parameters:
$$\text{Objective} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
The $\lambda$ parameter controls the strength of the penalty—higher values push the model toward simpler solutions, while lower values allow the model to fit the data more closely.
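The shrinkage effect of $\lambda$ is easy to see in the one-feature, no-intercept case, where the ridge objective has the closed-form minimizer $\beta = \sum x_i y_i \,/\, (\sum x_i^2 + \lambda)$. The sketch below uses that simplified setting to show how the penalty pulls the coefficient toward zero.

```python
# One-feature ridge regression sketch (no intercept):
# beta = sum(x * y) / (sum(x^2) + lambda). Larger lambda shrinks beta.

def ridge_coef(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # true relationship y = 2x
print(ridge_coef(xs, ys, 0.0))    # lambda = 0 recovers least squares: 2.0
print(ridge_coef(xs, ys, 14.0))   # the penalty shrinks the estimate: 1.0
```

With $\lambda = 0$ the penalty vanishes and ordinary least squares is recovered; as $\lambda$ grows, the denominator grows and the coefficient shrinks toward zero, trading a little bias for lower variance.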
Extensions Beyond Simple Linear Regression
Linear regression assumes a linear relationship between inputs and outputs, which isn't always realistic. Several important extensions exist:
Polynomial regression generalizes linear regression by fitting a polynomial curve to the data instead of a line. For example, a quadratic regression might fit a parabola: $y = \beta_0 + \beta_1 x + \beta_2 x^2$.
Logistic regression handles a different kind of problem: binary classification (predicting yes/no outcomes). Despite its name, logistic regression is a classification method, not a regression method. It models the probability of an item belonging to the positive class using an S-shaped curve.
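The S-shaped curve is the logistic (sigmoid) function applied to a linear combination of the inputs. In this sketch the coefficients are illustrative, not fitted from data.

```python
# Sketch of logistic regression's link function:
# P(y = 1 | x) = sigmoid(b0 + b1 * x). Coefficients are illustrative.
import math

def predict_proba(x, b0=-4.0, b1=2.0):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

print(predict_proba(0.0))  # far below the decision boundary -> near 0
print(predict_proba(2.0))  # on the boundary (b0 + b1 * x = 0) -> 0.5
print(predict_proba(4.0))  # far above the boundary -> near 1
```

Classification then comes from thresholding this probability, typically at 0.5, which is why logistic regression is a classification method despite the "regression" in its name.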
Kernel regression uses the kernel trick (similar to kernel-based SVMs) to introduce non-linearity. By mapping input variables into a higher-dimensional space, kernel regression can fit curved relationships in data while remaining computationally efficient.
Multiple Dependent Variables
So far we've discussed predicting a single output variable. Multivariate linear regression extends this to estimate relationships between multiple input variables and multiple dependent variables simultaneously. This is useful when you need to predict several related outputs at once, and the outputs may influence each other through the shared input space.
<extrainfo>
Bayesian Networks
Probabilistic Graphical Models
A Bayesian network is a directed acyclic graph (a network with arrows pointing in one direction and no cycles) that represents random variables and their conditional dependencies. Each node represents a random variable, and arrows indicate which variables directly influence which other variables.
The power of Bayesian networks is that they can represent complex probability relationships in an intuitive visual form. Instead of specifying all possible joint probabilities (which grows exponentially with the number of variables), you only need to specify conditional probabilities for each variable given its parents in the network.
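The factorization can be made concrete with the classic Rain/Sprinkler/WetGrass network, where the joint probability decomposes as P(R) · P(S) · P(W | R, S). All probability values below are illustrative.

```python
# Sketch of a Bayesian network's factorized joint distribution for
# Rain -> WetGrass <- Sprinkler. Probability values are illustrative.

p_rain = 0.2
p_sprinkler = 0.1
# Conditional probability table: P(WetGrass = true | Rain, Sprinkler)
p_wet = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    # Product of each variable's probability given its parents.
    pr = p_rain if rain else 1 - p_rain
    ps = p_sprinkler if sprinkler else 1 - p_sprinkler
    pw = p_wet[(rain, sprinkler)] if wet else 1 - p_wet[(rain, sprinkler)]
    return pr * ps * pw

# Marginal P(WetGrass = true), obtained by summing out the parents:
p_wet_total = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(round(p_wet_total, 4))
```

Note the saving the text describes: three variables need only 2 + 2 + 4 conditional probability entries here, instead of the 2³ entries of a full joint table, and the gap widens exponentially as variables are added.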
Inference and Learning
Two fundamental tasks with Bayesian networks are:
Inference: Given some observed evidence (you know certain variables' values), compute the probability distribution over unknown variables. Efficient algorithms exist to perform this computation even in large networks.
Learning: Given data, discover both the structure of the network (which variables influence which others) and the probability values. This is more complex but essential when you don't know the network structure in advance.
Dynamic Bayesian Networks
Dynamic Bayesian networks extend the basic framework to handle sequences of variables over time, such as speech signals, protein sequences, or stock prices. These networks represent how the state at time $t$ depends on the state at time $t-1$, enabling prediction of future states based on observed past behavior.
</extrainfo>
<extrainfo>
Gaussian Processes
Understanding Gaussian Processes
A Gaussian process is a type of stochastic process (a mathematical object representing randomness evolving over time or space) with a special property: any finite collection of random variables follows a multivariate normal distribution (a generalization of the bell curve to multiple dimensions).
This might sound abstract, but the practical implication is powerful: Gaussian processes provide a flexible way to model uncertainty in predictions.
The Covariance Function
The behavior of a Gaussian process is entirely determined by two functions: a mean function and a covariance function (also called a kernel). The covariance function models how pairs of points relate to each other based on their distance in the input space. Points that are close together tend to have outputs that are highly correlated, while distant points have outputs that are less correlated.
Making Predictions
The key strength of Gaussian processes appears when making predictions: Given a set of observed input-output examples, a Gaussian process doesn't just produce a single point prediction. Instead, it computes the entire distribution of possible outputs for a new input, incorporating uncertainty from observed data and their covariances.
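A minimal numerical sketch of that prediction step, assuming a zero mean function, an RBF (squared-exponential) covariance function, and a small observation-noise term for numerical stability:

```python
# Sketch of Gaussian-process regression: predictive mean and variance
# for new inputs, conditioned on observed data via an RBF kernel.
import numpy as np

def rbf(a, b, length_scale=1.0):
    # Covariance decays with squared distance between input points.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(x_train, y_train, x_new, noise=1e-6):
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    K_star = rbf(x_new, x_train)
    mean = K_star @ np.linalg.solve(K, y_train)              # predictive mean
    cov = rbf(x_new, x_new) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.diag(cov)                                # per-point variance

x = np.array([0.0, 1.0, 2.0])
y = np.sin(x)
mean, var = gp_predict(x, y, np.array([1.0, 5.0]))
print(mean[0], var[0])  # at a training point: mean near sin(1), tiny variance
print(mean[1], var[1])  # far from the data: variance near the prior value of 1
```

The variance output is the payoff: near observed data the GP is confident, and far from it the predictive variance reverts to the prior, signaling that the model does not know.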
Application to Bayesian Optimisation
Gaussian processes serve as surrogate models in Bayesian optimization, a technique for finding the best hyperparameter settings for machine learning models. Rather than training a model many times with different hyperparameters (expensive), Bayesian optimization uses a Gaussian process to predict which hyperparameters might work best, intelligently exploring the hyperparameter space.
</extrainfo>
Flashcards
Under what condition does an artificial neuron transmit a signal?
When the aggregate input exceeds a specified threshold.
What determines the amount of influence one artificial neuron has on another?
The strength of the signal at the connection.
What structural feature of an artificial neural network defines deep learning?
The use of multiple hidden layers.
In a classification tree, what do the leaves represent?
Class labels.
What do the leaves of a regression tree represent?
Continuous target values (e.g., real numbers).
What type of non-probabilistic classifier is originally created by a support vector machine?
A binary linear classifier.
Which technique allows support vector machines to perform efficient non-linear classification?
The kernel trick.
How does the kernel trick enable non-linear classification in support vector machines?
By mapping inputs into a high-dimensional feature space.
Which method can be applied to support vector machines to produce probabilistic outputs?
Platt scaling.
By what criteria does linear regression fit a line to a set of data?
Minimizing the sum of squared differences.
What does ridge regression add to the linear regression objective function?
A penalty term.
Which regression model is used to model binary outcomes for statistical classification?
Logistic regression.
How does kernel regression introduce non-linearity into a model?
By using the kernel trick to map input variables into a higher-dimensional space.
What is estimated by multivariate linear regression?
Relationships between multiple input variables and multiple dependent variables.
What mathematical structure defines a Bayesian network?
A directed acyclic graph (DAG).
What two things does a Bayesian network represent?
Random variables
Conditional dependencies
What are the two primary computational tasks performed with Bayesian networks?
Inference (computing posterior probabilities)
Learning the network structure
What is the primary purpose of a dynamic Bayesian network?
To model sequences of variables.
What distribution must any finite collection of random variables in a Gaussian process follow?
A multivariate normal distribution.
In a Gaussian process, what is the role of the covariance function (kernel)?
To model how pairs of points relate based on their locations.
What role do Gaussian processes play in Bayesian optimisation?
They serve as surrogate models for hyperparameter tuning.
Quiz
Machine learning - Fundamental Model Families Quiz Question 1: What type of model does a support vector machine initially construct?
- A binary linear classifier that separates two categories. (correct)
- A probabilistic multi‑class classifier.
- A regression model for continuous outcomes.
- A decision tree that partitions the feature space.
Machine learning - Fundamental Model Families Quiz Question 2: What structural form does a Bayesian network use to represent random variables and their dependencies?
- A directed acyclic graph (DAG). (correct)
- An undirected cyclic graph.
- A hierarchical tree without direction.
- A flat list of independent variables.
Machine learning - Fundamental Model Families Quiz Question 3: Dynamic Bayesian networks are particularly suited for modeling which type of data?
- Sequences of variables that evolve over time (correct)
- Independent and identically distributed observations
- Hierarchical categorical data
- Static relational structures
Machine learning - Fundamental Model Families Quiz Question 4: What do the leaf nodes of a regression tree contain?
- Continuous numeric target values (correct)
- Discrete class labels
- Feature‑splitting rules
- Probability distributions over classes
Machine learning - Fundamental Model Families Quiz Question 5: What technique can be applied to a support vector machine to obtain calibrated probability estimates for its classifications?
- Platt scaling (correct)
- Kernel PCA
- Cross‑validation
- Bagging
Machine learning - Fundamental Model Families Quiz Question 6: Which regression model is specifically used to predict binary outcomes?
- Logistic regression (correct)
- Polynomial regression
- Linear regression
- Ridge regression
Machine learning - Fundamental Model Families Quiz Question 7: Which data structure does a decision‑tree model employ to encode decision rules?
- A hierarchical tree of nodes (correct)
- A flat list of independent rules
- A directed acyclic graph without a root
- A circular linked list
Machine learning - Fundamental Model Families Quiz Question 8: When an SVM uses the kernel trick, what is computed instead of explicit coordinates in the high‑dimensional space?
- The inner product via a kernel function (correct)
- The Euclidean distance between original inputs
- The gradient of the loss function
- The exact coordinates after explicit transformation
Machine learning - Fundamental Model Families Quiz Question 9: In ordinary linear regression, the predicted value is a linear combination of what?
- The input features (predictor variables) (correct)
- The residual errors of previous predictions
- The hyperparameters of the learning algorithm
- The output values of a separate model
Machine learning - Fundamental Model Families Quiz Question 10: What advantage does kernel regression provide over ordinary linear regression?
- It can capture non‑linear relationships between variables (correct)
- It guarantees a perfect fit to the training data
- It eliminates the need for any hyperparameters
- It requires only a single data point to train
Machine learning - Fundamental Model Families Quiz Question 11: In Bayesian optimisation, what role does a Gaussian process play?
- It serves as a surrogate model of the objective function (correct)
- It directly searches the hyperparameter space without modeling
- It replaces the acquisition function entirely
- It optimises the gradient of the loss function analytically
Machine learning - Fundamental Model Families Quiz Question 12: What does the magnitude of a connection weight indicate in a neural network?
- How strongly the source neuron influences the target neuron (correct)
- The number of layers the signal will travel through
- The type of activation function used by the receiving neuron
- The probability that the connection will be removed during pruning
Machine learning - Fundamental Model Families Quiz Question 13: How are artificial neurons organized within a neural network architecture?
- Into layers, each performing a distinct transformation on its inputs (correct)
- Randomly scattered without any grouping
- In a circular chain where each neuron connects only to its predecessor
- All neurons directly connect to the output layer
Machine learning - Fundamental Model Families Quiz Question 14: What is the typical direction of information flow in a feed‑forward neural network?
- From the input layer through hidden layers to the output layer (correct)
- From the output layer back to the input layer
- Bidirectional between input and output simultaneously
- Only within hidden layers, skipping input and output
Machine learning - Fundamental Model Families Quiz Question 15: What purpose does a decision tree serve in decision analysis?
- It visualizes choices and their possible outcomes (correct)
- It optimizes continuous objective functions using gradient descent
- It clusters unlabeled data into groups
- It reduces the dimensionality of the feature space
Machine learning - Fundamental Model Families Quiz Question 16: What is the effect of the penalty term added by ridge regression to the ordinary least squares objective?
- It shrinks coefficient estimates toward zero, reducing variance (correct)
- It forces coefficients to be exactly zero, performing variable selection
- It increases the magnitude of coefficients to improve fit
- It replaces the loss function with a classification error
Machine learning - Fundamental Model Families Quiz Question 17: In Bayesian networks, which computational task updates probability estimates when new evidence is observed?
- Performing inference to compute posterior probabilities (correct)
- Clustering variables into groups
- Optimizing hyperparameters via gradient descent
- Reducing data dimensionality through PCA
Machine learning - Fundamental Model Families Quiz Question 18: To which broader class of processes does a Gaussian process belong?
- Stochastic processes (correct)
- Deterministic processes
- Markov processes
- Poisson processes
Machine learning - Fundamental Model Families Quiz Question 19: How does a Gaussian process produce a predictive distribution for a new input point?
- By conditioning on the observed data and their covariances (correct)
- By selecting the nearest neighbor's output as the prediction
- By fitting a straight line through all training points
- By drawing a random value from a uniform prior
Machine learning - Fundamental Model Families Quiz Question 20: In Gaussian processes, the covariance function is also known as what?
- Kernel function (correct)
- Mean function
- Likelihood function
- Activation function
Key Concepts
Neural Networks and Deep Learning
Artificial neural network
Deep learning
Supervised Learning Techniques
Decision tree
Support vector machine
Kernel trick
Linear regression
Ridge regression
Probabilistic Models
Bayesian network
Dynamic Bayesian network
Gaussian process
Definitions
Artificial neural network
A computational model composed of interconnected artificial neurons that process information in layers to perform tasks such as classification or regression.
Deep learning
A subfield of machine learning that employs neural networks with many hidden layers to automatically learn hierarchical representations of data.
Decision tree
A flowchart‑like model that splits data based on feature values to predict categorical or continuous outcomes.
Support vector machine
A supervised learning algorithm that finds the optimal hyperplane separating classes in a high‑dimensional feature space.
Kernel trick
A technique that implicitly maps inputs into a higher‑dimensional space using a kernel function, enabling linear algorithms to perform non‑linear classification or regression.
Linear regression
A statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a straight line.
Ridge regression
A regularized version of linear regression that adds an L2 penalty to the loss function to reduce overfitting.
Bayesian network
A directed acyclic graph representing random variables and their conditional dependencies, used for probabilistic inference.
Dynamic Bayesian network
An extension of Bayesian networks that models temporal sequences of variables, capturing time‑dependent probabilistic relationships.
Gaussian process
A non‑parametric stochastic process where any finite set of points follows a multivariate normal distribution, commonly used for regression and Bayesian optimisation.