
Machine learning - Fundamental Model Families

Understand the core concepts, structures, and typical uses of neural networks, decision trees, support vector machines, regression models, Bayesian networks, and Gaussian processes.


Summary

Machine Learning Algorithms and Methods

Introduction to Machine Learning

Machine learning is a subset of artificial intelligence focused on systems that learn from data and make predictions or decisions without being explicitly programmed. Several key algorithms have become foundational in this field, each with distinct strengths and applications. This guide covers the most important algorithmic approaches you need to understand: artificial neural networks, decision trees, support vector machines, and regression analysis, with additional coverage of Bayesian networks and Gaussian processes.

Artificial Neural Networks

How Artificial Neurons Work

Artificial neural networks are inspired by how biological brains process information. At the heart of these systems is the artificial neuron, a simple computational unit that mimics the behavior of biological neurons.

An artificial neuron receives multiple input signals and processes them through a specific mechanism: it calculates an aggregate input (a weighted sum of all inputs) and compares this sum against a threshold value. The key insight is that the neuron only produces an output signal when the aggregate input exceeds the threshold. This all-or-nothing behavior is called activation.

The strength of the signal at each connection between neurons is called a weight. Weights determine how much influence one neuron has on another: a large weight gives an input a strong effect, a small weight gives it little effect. By adjusting these weights during training, the network learns patterns in the data.

Organization into Layers

Artificial neurons are rarely used in isolation. Instead, they are organized into layers, with each layer performing a specific transformation on its inputs. A typical neural network has three types of layers:

- Input layer: receives the raw data
- Hidden layers: perform intermediate computations and transformations
- Output layer: produces the final predictions or classifications

Signals travel forward through the network from the input layer, through the hidden layers, to the output layer. This unidirectional flow is why the process is called forward propagation. In some networks (recurrent architectures, for example), signals may traverse the network multiple times to refine the output.

Deep Learning

When an artificial neural network contains multiple hidden layers, it is a deep neural network, and training such networks is called deep learning. The motivation for deep networks is that they can model complex patterns in data, much as the brain processes sensory information in stages (light into vision, sound into hearing). Each hidden layer represents the data at a different level of abstraction: early layers detect simple patterns, and deeper layers combine these into more complex concepts.
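To make the mechanics concrete, here is a minimal sketch in Python with NumPy (all names and numbers are illustrative, not from the original text): a classic threshold neuron, and forward propagation through layers using ReLU as a smooth stand-in for the hard threshold.

```python
import numpy as np

def step_neuron(inputs, weights, threshold):
    """Classic threshold unit: fire only when the weighted sum exceeds the threshold."""
    aggregate = np.dot(weights, inputs)   # aggregate input = weighted sum of all inputs
    return 1 if aggregate > threshold else 0

def forward(x, layers):
    """Forward propagation: signals flow from input through hidden layers to output."""
    for W, b in layers:                   # each layer transforms its inputs
        x = np.maximum(0.0, W @ x + b)    # ReLU activation in each layer
    return x

# Hypothetical two-input neuron: fires because 0.4*0.5 + 0.9*1.2 = 1.28 > 1.0
print(step_neuron(np.array([0.4, 0.9]), np.array([0.5, 1.2]), threshold=1.0))
```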
Decision Tree Methods

The Basic Concept

Decision trees represent a fundamentally different approach to machine learning. Rather than using continuous mathematical functions like neural networks, decision trees use a tree structure to map observations about an item to conclusions about its target value.

Think of a decision tree as a flowchart: you start at the root and follow branches based on the features of your data, eventually reaching a leaf that gives you the answer. The power of decision trees lies in their interpretability: you can trace through the logic and understand why the tree made a particular prediction.

Classification Trees

In classification trees, the goal is to predict which category (class) an item belongs to. The tree works as follows:

- Branches represent conditions on features (like "Is age > 30?" or "Is income > $50,000?")
- Leaves represent class labels (the categories you're predicting, such as "approved" or "denied")

Each path from the root to a leaf represents a conjunction of features: multiple conditions that must all be true. For example, a path might represent "age > 30 AND income > $50,000 AND credit score > 700" leading to the "loan approved" class.

A classic illustration is a classification tree for predicting Titanic passenger survival: the tree splits on gender first, then uses age and sibling/spouse count to make further predictions.

Regression Trees

Regression trees solve a different problem: predicting continuous numerical values rather than categories. Instead of class labels, the leaves contain real numbers (like predicted house prices or temperature values). The tree structure remains the same, but each path now leads to a numerical prediction rather than a category.

Support Vector Machines

Linear Classification

Support vector machines (SVMs) are powerful algorithms for classification problems, particularly when you need to separate two groups of data. An SVM creates a non-probabilistic binary linear classifier, meaning it draws a dividing line (or hyperplane in higher dimensions) that separates the two categories of training examples.

The key insight of SVMs is that they find the line with the maximum margin: the largest possible distance between the dividing line and the closest points from each class. This approach tends to generalize well to new data because it finds the most "stable" separation. In the standard picture, the solid line is the decision boundary, dashed lines mark the margin around it, and the circled points closest to the boundary are the support vectors: the critical points that determine where the dividing line is placed.

Non-Linear Classification with the Kernel Trick

The main limitation of basic SVMs is that they assume the data can be separated by a straight line (or hyperplane). Real-world data is often not linearly separable. To handle this, SVMs use a clever technique called the kernel trick.

The kernel trick works by mapping the input data into a higher-dimensional feature space where a linear separation becomes possible. For example, imagine data that forms a spiral pattern in 2D space, impossible to separate with a line. After mapping into 3D space, a 2D plane might separate the two classes perfectly.

The beauty of the kernel trick is that this transformation happens implicitly: you never explicitly construct all the new dimensions, which would be computationally expensive. Instead, kernel functions compute relationships between data points in the high-dimensional space directly.

Probabilistic Outputs

By default, SVMs produce only class predictions (category A or category B). If you need probabilistic outputs (confidence scores between 0 and 1 representing the probability of each class), Platt scaling can be applied to convert SVM outputs into probabilities. Runnable sketches of both a decision tree and a kernel SVM follow below.
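First, a minimal decision-tree sketch using scikit-learn (the Iris dataset and the depth limit are illustrative choices, not from the original text); export_text prints the branch conditions and leaf class labels so you can trace the tree's logic:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small, easily readable classification tree
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each printed branch is a feature condition; each leaf is a class label
print(export_text(tree))
```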
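And a kernel-SVM sketch, again with scikit-learn under illustrative choices: the two-moons toy dataset is not linearly separable, so an RBF kernel is used, and probability=True asks scikit-learn for Platt-scaled probability estimates.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles: no straight line separates them
X, y = make_moons(noise=0.1, random_state=0)

# The RBF kernel applies the kernel trick; probability=True enables Platt scaling
clf = SVC(kernel="rbf", probability=True).fit(X, y)

print(clf.predict(X[:3]))        # hard class predictions
print(clf.predict_proba(X[:3]))  # calibrated class probabilities
```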
Regression Analysis

Linear Regression Fundamentals

Linear regression is one of the most fundamental techniques in statistics and machine learning. The goal is simple: find the best straight line that fits your data. "Best" is defined mathematically as the line that minimizes the sum of squared differences between the actual data points and the predicted values on the line. This quantity is called the sum of squared errors (or residual sum of squares):

$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the observed value and $\hat{y}_i$ is the predicted value. Visually, the fitted line runs through the cloud of data points, and the vertical distance from each point to the line is the error that the regression minimizes.

Why Use Regularization?

When fitting a model to data, there is always a risk of overfitting: the model learns the noise and peculiarities of the training data rather than the true underlying pattern, which causes poor performance on new, unseen data. Regularization techniques address this by adding a penalty term to the objective function, discouraging the model from becoming too complex. Ridge regression is one common approach; it adds a penalty based on the squared magnitude of the model parameters:

$$\text{Objective} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

The $\lambda$ parameter controls the strength of the penalty: higher values push the model toward simpler solutions, while lower values allow the model to fit the data more closely.

Extensions Beyond Simple Linear Regression

Linear regression assumes a linear relationship between inputs and outputs, which isn't always realistic. Several important extensions exist:

- Polynomial regression generalizes linear regression by fitting a polynomial curve to the data instead of a line. For example, a quadratic regression fits a parabola: $y = \beta_0 + \beta_1 x + \beta_2 x^2$.
- Logistic regression handles a different kind of problem: binary classification (predicting yes/no outcomes). Despite its name, logistic regression is a classification method, not a regression method. It models the probability of an item belonging to the positive class using an S-shaped curve.
- Kernel regression uses the kernel trick (as in kernel-based SVMs) to introduce non-linearity. By implicitly mapping input variables into a higher-dimensional space, kernel regression can fit curved relationships while remaining computationally efficient.

Multiple Dependent Variables

So far we have discussed predicting a single output variable. Multivariate linear regression extends this to estimate relationships between multiple input variables and multiple dependent variables simultaneously. This is useful when you need to predict several related outputs at once, and the outputs may influence each other through the shared input space.
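A minimal sketch of ordinary least squares versus ridge regression with scikit-learn, plus a polynomial-regression variant; the synthetic line y = 2x + 1 and the penalty strength are illustrative assumptions, not from the original text:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data scattered around the line y = 2x + 1
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=1.0, size=50)

ols = LinearRegression().fit(X, y)    # minimizes the SSE above
ridge = Ridge(alpha=10.0).fit(X, y)   # adds the lambda * sum(beta_j^2) penalty

sse = np.sum((y - ols.predict(X)) ** 2)   # residual sum of squares
print(sse, ols.coef_, ridge.coef_)        # ridge coefficient is shrunk toward zero

# Polynomial regression: fit a parabola by adding squared-feature columns
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
```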
Bayesian Networks

Probabilistic Graphical Models

A Bayesian network is a directed acyclic graph (a network with arrows pointing in one direction and no cycles) that represents random variables and their conditional dependencies. Each node represents a random variable, and arrows indicate which variables directly influence which others.

The power of Bayesian networks is that they can represent complex probability relationships in an intuitive visual form. Instead of specifying the full joint probability table (whose size grows exponentially with the number of variables), you only need to specify the conditional probabilities of each variable given its parents in the network.

Inference and Learning

Two fundamental tasks with Bayesian networks are:

- Inference: given some observed evidence (you know certain variables' values), compute the probability distribution over the unknown variables. Efficient algorithms exist to perform this computation even in large networks; a worked two-variable example follows below.
- Learning: given data, discover both the structure of the network (which variables influence which others) and the probability values. This is more complex, but essential when you don't know the network structure in advance.

Dynamic Bayesian Networks

Dynamic Bayesian networks extend the basic framework to handle sequences of variables over time, such as speech signals, protein sequences, or stock prices. These networks represent how the state at time $t$ depends on the state at time $t-1$, enabling prediction of future states from observed past behavior.

Gaussian Processes

Understanding Gaussian Processes

A Gaussian process is a type of stochastic process (a mathematical object representing randomness evolving over time or space) with a special property: any finite collection of its random variables has a joint multivariate normal distribution (the generalization of the bell curve to multiple dimensions). This might sound abstract, but the practical implication is powerful: Gaussian processes provide a flexible way to model uncertainty in predictions.

The Covariance Function

The behavior of a Gaussian process is entirely determined by two functions: a mean function and a covariance function (also called a kernel). The covariance function models how pairs of points relate to each other based on their distance in the input space: points that are close together tend to have highly correlated outputs, while distant points have outputs that are less correlated.

Making Predictions

The key strength of Gaussian processes appears when making predictions. Given a set of observed input-output examples, a Gaussian process doesn't just produce a single point prediction; it computes an entire distribution over possible outputs for a new input, incorporating the uncertainty implied by the observed data and their covariances.

Application to Bayesian Optimisation

Gaussian processes serve as surrogate models in Bayesian optimisation, a technique for finding good hyperparameter settings for machine learning models. Rather than training a model many times with different hyperparameters (which is expensive), Bayesian optimisation uses a Gaussian process to predict which hyperparameters might work best, exploring the hyperparameter space intelligently. A short Gaussian-process sketch follows the Bayesian-network example below.
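First, the worked Bayesian-network example: a hypothetical two-node network (Rain → WetGrass) with made-up probabilities, showing exact inference by enumeration in plain Python:

```python
# Hypothetical two-node network: Rain -> WetGrass (all probabilities are made up)
p_rain = 0.2                                # P(Rain = true)
p_wet_given_rain = {True: 0.9, False: 0.1}  # P(WetGrass = true | Rain)

# The joint distribution factorizes as P(Rain) * P(WetGrass | Rain), so the
# evidence "the grass is wet" updates our belief about rain via Bayes' rule.
p_wet = p_rain * p_wet_given_rain[True] + (1 - p_rain) * p_wet_given_rain[False]
p_rain_given_wet = p_rain * p_wet_given_rain[True] / p_wet

print(round(p_rain_given_wet, 3))  # 0.692: wet grass makes rain much more likely
```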
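And a minimal Gaussian-process regression sketch with scikit-learn (the sine data and RBF length scale are illustrative assumptions); note that the prediction returns a standard deviation as well as a mean, i.e. a full predictive distribution rather than a point estimate:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A handful of noiseless observations of a sine curve
X = np.array([[1.0], [3.0], [5.0], [6.0]])
y = np.sin(X).ravel()

# The RBF kernel encodes "nearby inputs have highly correlated outputs"
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X, y)

# Predict at a new input: mean and uncertainty, not just a number
mean, std = gp.predict(np.array([[4.0]]), return_std=True)
print(mean, std)  # std grows for inputs far from the observed points
```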
Flashcards
Under what condition does an artificial neuron transmit a signal?
When the aggregate input exceeds a specified threshold.
What determines the amount of influence one artificial neuron has on another?
The weight of the connection (the strength of the signal at that connection).
What structural feature of an artificial neural network defines deep learning?
The use of multiple hidden layers.
In a classification tree, what do the leaves represent?
Class labels.
What do the leaves of a regression tree represent?
Continuous target values (e.g., real numbers).
What type of non-probabilistic classifier does a basic support vector machine construct?
A binary linear classifier.
Which technique allows support vector machines to perform efficient non-linear classification?
The kernel trick.
How does the kernel trick enable non-linear classification in support vector machines?
By mapping inputs into a high-dimensional feature space.
Which method can be applied to support vector machines to produce probabilistic outputs?
Platt scaling.
By what criterion does linear regression fit a line to a set of data?
Minimizing the sum of squared differences.
What does ridge regression add to the linear regression objective function?
A penalty term.
Which regression model is used to model binary outcomes for statistical classification?
Logistic regression.
How does kernel regression introduce non-linearity into a model?
By using the kernel trick to map input variables into a higher-dimensional space.
What is estimated by multivariate linear regression?
Relationships between multiple input variables and multiple dependent variables.
What mathematical structure defines a Bayesian network?
A directed acyclic graph (DAG).
What two things does a Bayesian network represent?
Random variables and their conditional dependencies.
What are the two primary computational tasks performed with Bayesian networks?
Inference (computing posterior probabilities) and learning (discovering the network structure and probability values).
What is the primary purpose of a dynamic Bayesian network?
To model sequences of variables.
What distribution must any finite collection of random variables in a Gaussian process follow?
A multivariate normal distribution.
In a Gaussian process, what is the role of the covariance function (kernel)?
To model how pairs of points relate based on their distance in the input space.
What role do Gaussian processes play in Bayesian optimisation?
They serve as surrogate models for hyperparameter tuning.

Key Concepts
Neural Networks and Deep Learning
Artificial neural network
Deep learning
Supervised Learning Techniques
Decision tree
Support vector machine
Kernel trick
Linear regression
Ridge regression
Probabilistic Models
Bayesian network
Dynamic Bayesian network
Gaussian process