RemNote Community

Artificial intelligence - Core AI Techniques and Advanced Models

Understand core AI techniques, deep learning architectures, and landmark AI research breakthroughs.


Summary

Techniques in Artificial Intelligence: A Study Guide

Introduction

Artificial intelligence has evolved into a diverse field with techniques spanning from classical logic-based approaches to modern deep learning systems. These techniques solve problems in different ways: some search through possible solutions, others use probability to reason under uncertainty, and still others learn patterns from data. Understanding the core techniques and how they fit together is essential for working with AI systems. This guide covers the major problem-solving approaches in AI, from foundational methods to contemporary practices.

Part I: Search and Optimization

Understanding State-Space Search

At the heart of many AI problems lies a fundamental question: how do we explore possibilities to find a solution? State-space search frames a problem as a tree of possible states. You begin at an initial state and explore various transitions (actions) until you reach a goal state. Think of it like finding your way through a maze: you start at the entrance and explore different paths until you reach the exit.

The key challenge is that without guidance, exploring every possibility becomes computationally expensive. A naive approach might examine millions of states unnecessarily.

Key motivation: we need intelligent ways to explore only the promising parts of the search space.

Using Heuristics to Guide Search

A heuristic is an informed guess that tells you which branches of the search tree are most likely to lead to a goal. Rather than exploring blindly, heuristics direct your attention toward promising directions. For example, in pathfinding, a heuristic might estimate the straight-line distance to your destination; this helps the search algorithm avoid exploring paths that clearly go the wrong direction. Heuristics don't guarantee finding the optimal solution, but they dramatically reduce the computational work by focusing effort where it matters.
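To make this concrete, here is a minimal sketch of greedy best-first search, where a heuristic orders the frontier. The grid world, the Manhattan-distance heuristic, and all function names are illustrative choices, not something prescribed by the guide:

```python
import heapq

def best_first_search(start, goal, neighbors, heuristic):
    """Greedy best-first search: always expand the state whose
    heuristic value suggests it is closest to the goal."""
    frontier = [(heuristic(start, goal), start)]
    came_from = {start: None}
    while frontier:
        _, state = heapq.heappop(frontier)
        if state == goal:
            # Reconstruct the path by walking parent links backward.
            path = []
            while state is not None:
                path.append(state)
                state = came_from[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in came_from:
                came_from[nxt] = state
                heapq.heappush(frontier, (heuristic(nxt, goal), nxt))
    return None  # goal unreachable

# Toy 5x5 grid: states are (x, y), moves are the 4 compass directions.
def grid_neighbors(state):
    x, y = state
    return [(nx, ny) for nx, ny in [(x+1, y), (x-1, y), (x, y+1), (x, y-1)]
            if 0 <= nx < 5 and 0 <= ny < 5]

def manhattan(a, b):  # distance estimate, analogous to straight-line distance
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

path = best_first_search((0, 0), (4, 4), grid_neighbors, manhattan)
print(len(path) - 1)  # number of moves in the path found
```

Prioritizing by path-cost-so-far plus heuristic, instead of the heuristic alone, would turn this sketch into A* search, which does guarantee optimal paths when the heuristic never overestimates.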
Adversarial Search in Games

When you're not just searching for any goal, but competing against an opponent trying to prevent you from winning, you enter the realm of adversarial search. This is used in game-playing AI.

The key insight: in a two-player game, you want to find moves that maximize your chances of winning, while your opponent tries to minimize your chances. This creates a game tree in which the players alternate choosing moves. Classic approaches like the minimax algorithm evaluate positions in the game tree by assuming both players play optimally. The algorithm works backward from end-game positions, determining which moves lead to winning positions.

Local Search and Iterative Improvement

Not all problems require exploring a complete search tree. Some problems are better solved by starting with an initial guess and repeatedly making small improvements.

Local search works like this: you start at a random or initial solution, then iteratively make small changes that improve your current solution. You keep improving until you reach a local optimum, a solution where no small change makes it better. This approach works well for optimization problems like scheduling, route planning, and resource allocation.

The tradeoff: local search is fast and memory-efficient, but it might get stuck at a local optimum rather than finding the global best solution.

Gradient Descent: Optimizing Continuous Parameters

Many AI problems involve adjusting numerical parameters to minimize some measure of error called a loss function. This is where gradient descent becomes powerful. Imagine you're standing on a hill and want to reach the valley below (the minimum loss).
Gradient descent works by:
1. Calculating the slope (gradient) of the loss function at your current position
2. Taking a step downhill in the direction of steepest descent
3. Repeating until you reach a valley

Mathematically, if you have parameters $\theta$ and a loss function $L(\theta)$, gradient descent updates the parameters as:

$$\theta_{new} = \theta_{old} - \alpha \nabla L(\theta_{old})$$

where $\alpha$ is the learning rate (how big each step is) and $\nabla L(\theta)$ is the gradient (the direction of steepest ascent; subtracting it moves you downhill).

Why it's critical: gradient descent is the fundamental algorithm for training neural networks. Understanding how it works is essential for understanding modern AI.

One important caution: gradient descent finds local minima, not necessarily the global minimum. The loss landscape may have multiple valleys, and you might end up in any of them depending on where you start.

Evolutionary Computation: Evolving Solutions

Instead of starting with a single solution and improving it, evolutionary computation maintains a population of candidate solutions and evolves them over generations. The process mimics natural evolution:
1. Start with a population of random candidate solutions
2. Evaluate how good each solution is (its fitness)
3. Select the better solutions
4. Create new solutions by mutation (random changes) and recombination (combining pieces of good solutions)
5. Repeat for many generations

This approach is useful when the problem landscape is rough and has many local optima, because the population-based approach helps you explore multiple promising regions simultaneously.

<extrainfo> Swarm intelligence extends this idea further by simulating how distributed agents (like particles or ants) can solve problems collectively. Particle swarm optimization simulates particles moving through solution space, influenced by their best known position and the best position found by the swarm. Ant colony optimization simulates ants laying pheromone trails, with more successful routes attracting future ants. These algorithms are inspired by nature but are quite general problem-solving techniques. </extrainfo>

Part II: Reasoning and Logic

Propositional Logic: True/False Statements

Propositional logic is the simplest formal logic system. It operates on propositions (statements that are either true or false) using logical connectives. The basic connectives are:
- AND (conjunction): "A and B" is true only if both A and B are true
- OR (disjunction): "A or B" is true if at least one is true
- NOT (negation): "not A" flips the truth value
- IMPLIES (implication): "A implies B" is false only when A is true and B is false

You can use these to build complex logical statements and reason about them.

Limitation: propositional logic cannot express relationships between objects. You can say "A is true," but not "All cats are animals."

Predicate Logic: Adding Objects and Relations

Predicate logic (also called first-order logic) extends propositional logic by introducing objects, predicates (relationships between objects), and quantifiers. With predicates, you can write statements like:
- $\text{Cat}(x)$: "x is a cat"
- $\text{Mortal}(x)$: "x is mortal"
- $\forall x \, (\text{Cat}(x) \to \text{Animal}(x))$: "All cats are animals"
- $\exists x \, \text{Cat}(x)$: "There exists at least one cat"

The quantifiers are:
- $\forall$ (for all): universal quantification
- $\exists$ (there exists): existential quantification

This is much more expressive than propositional logic and is the foundation for knowledge representation in AI.

Deductive Reasoning and Inference

Deductive reasoning is the process of proving that a conclusion must be true given a set of premises. For example:
- Premise 1: All cats are animals.
- Premise 2: Whiskers is a cat.
- Conclusion: Therefore, Whiskers is an animal.

If the premises are true and the reasoning is valid, the conclusion must be true.
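Inference like the cat syllogism can be mechanized. Below is a minimal forward-chaining sketch over ground (variable-free) if-then rules; the fact strings and the extra "all animals are mortal" rule are hypothetical simplifications for illustration, not full first-order inference:

```python
def forward_chain(facts, rules):
    """Repeatedly apply if-then rules until no new facts can be derived.
    Each rule is (premises, conclusion), all expressed as plain strings."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in facts and all(p in facts for p in premises):
                facts.add(conclusion)
                changed = True
    return facts

rules = [
    (["cat(whiskers)"], "animal(whiskers)"),    # all cats are animals
    (["animal(whiskers)"], "mortal(whiskers)"), # all animals are mortal
]
derived = forward_chain(["cat(whiskers)"], rules)
print("mortal(whiskers)" in derived)  # True
```

Real first-order reasoners add unification so a single rule with variables, such as Cat(x) implies Animal(x), covers every object at once; that machinery is what resolution and Prolog provide.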
<extrainfo> Resolution is a single, unified inference rule that is refutation-complete for first-order logic: in principle, any valid conclusion can be proven with it. Rather than using many different inference rules, resolution provides a systematic way to prove theorems. This elegance made it important for automated reasoning systems. Horn-clause logic restricts first-order logic to a specific form that is more efficient to compute. It underlies the Prolog programming language and is powerful enough to express any Turing-computable function. This means it is theoretically as powerful as any programming language, despite its restrictions. </extrainfo>

Handling Uncertainty with Fuzzy Logic

Real-world statements are often vague. Is a person "tall"? This isn't simply true or false; it's a matter of degree. Fuzzy logic assigns truth values between 0 and 1 (instead of just true/false) to handle vague propositions. A person with height 5'10" might have a "tallness" value of 0.7 (somewhat tall). You can then use fuzzy logic operations:
- AND: take the minimum truth value
- OR: take the maximum truth value
- NOT: subtract the truth value from 1

This allows AI systems to reason about imprecise, real-world concepts.

<extrainfo> Non-monotonic logics support default reasoning: concluding something is probably true unless evidence proves otherwise. For instance, "birds typically fly" is a default rule, but "penguins don't fly" is an exception. Non-monotonic logics let you retract conclusions when new information arrives, unlike classical deductive logic, where conclusions, once proven, never change. </extrainfo>

Part III: Probabilistic Reasoning

Reasoning Under Uncertainty

The real world is uncertain. You rarely have complete information. Probabilistic methods handle this uncertainty by working with probability distributions rather than certain facts.

The core insight: instead of asking "Is it raining?", you ask "What's the probability it's raining?" This captures your degree of belief given the available evidence.
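As a minimal sketch of this kind of belief update, Bayes' theorem combines a prior belief with evidence to produce a posterior belief. The probabilities below are made up purely for illustration:

```python
# P(rain | wet grass) via Bayes' theorem, with illustrative numbers.
p_rain = 0.2                 # prior belief that it is raining
p_wet_given_rain = 0.9       # grass is usually wet when it rains
p_wet_given_no_rain = 0.15   # sprinklers etc. can also wet the grass

# Total probability of observing wet grass (law of total probability).
p_wet = p_wet_given_rain * p_rain + p_wet_given_no_rain * (1 - p_rain)

# Posterior: belief in rain after seeing the evidence.
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(round(p_rain_given_wet, 3))  # 0.6
```

Seeing wet grass triples the belief in rain here (from 0.2 to 0.6), yet the belief stays well short of certainty because sprinklers explain the same evidence.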
Decision theory extends this to decision-making: given uncertain information, what action should you take? It combines your beliefs (probabilities) about the world with your preferences (what outcomes you value) to recommend the best decision.

Bayesian Networks: Reasoning About Relationships

A Bayesian network is a graphical model that represents probabilistic relationships between variables. It shows:
- Which variables depend on which other variables
- The conditional probabilities that quantify these relationships

For example, a simple network might show that "sprinkler" and "rain" both influence "wet grass," and "wet grass" influences "slippery ground."

Why use them? Bayesian networks enable several reasoning tasks:
- Inference: given evidence about some variables, what can we infer about others?
- Learning: can we estimate the network structure and probabilities from data?
- Planning: what action should we take to achieve a goal?
- Perception: given observations, what's the most likely hidden explanation?

The network structure makes reasoning efficient by exploiting independence relationships: if two variables are unrelated, you don't need to consider their interaction.

Temporal Reasoning: Hidden Markov Models and Kalman Filters

Many real-world processes change over time. You might observe noisy sensor readings and want to infer what's actually happening, or predict what happens next. Hidden Markov models assume:
- The world has hidden states that evolve over time
- You observe noisy measurements related to these states
- The current state depends only on the previous state (the Markov property)

Common tasks include:
- Filtering: given observations up to time $t$, what's the probability of the current hidden state?
- Prediction: what's the probability of a future state?
- Smoothing: given all observations (past and future), what's the probability of a past state?
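The filtering task can be sketched with the forward algorithm, which alternates a predict step (push the belief through the transition model) and an update step (reweight by the observation). The two-state weather model and its probabilities below are illustrative:

```python
def hmm_filter(prior, transition, emission, observations):
    """Forward-algorithm filtering for a discrete HMM: after each
    observation, return the normalized belief over hidden states."""
    belief = list(prior)
    n = len(prior)
    for obs in observations:
        # Predict: how does the hidden state evolve one step?
        predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                     for j in range(n)]
        # Update: weight each state by how well it explains the observation.
        updated = [predicted[j] * emission[j][obs] for j in range(n)]
        total = sum(updated)
        belief = [u / total for u in updated]
    return belief

# States: 0 = rainy, 1 = sunny. Observations: 0 = umbrella seen, 1 = none.
transition = [[0.7, 0.3], [0.3, 0.7]]
emission = [[0.9, 0.1], [0.2, 0.8]]
belief = hmm_filter([0.5, 0.5], transition, emission, [0, 0, 1])
print([round(b, 3) for b in belief])
```

After two umbrella sightings and then none, the belief swings toward "sunny", showing how each noisy observation reshapes the estimate of the hidden state.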
Kalman filters are a continuous-valued version of hidden Markov models, useful for tracking objects in space, estimating vehicle positions, or smoothing sensor data.

Part IV: Learning from Data

The Classification Problem

Many practical AI problems boil down to classification: given an observation (input), assign it to one of several predefined categories. Examples:
- Email classification: is this email spam or legitimate?
- Image classification: what object is in this image?
- Medical diagnosis: given patient data, what disease does the patient have?

A classifier is a learned model that maps inputs to categories. The learning happens through seeing examples (pairs of inputs with their correct categories) and adjusting the classifier until it gets them right.

Decision Trees: Simple and Interpretable

A decision tree is a tree of if-then rules that classify data. At each internal node, the tree asks a question about a feature value. Depending on the answer, you follow one branch or another until you reach a leaf node that predicts the category. For example, a tree for deciding whether to play tennis might ask: "Is it sunny?" If yes, ask "Is humidity high?" If yes, predict "Don't play." If no, predict "Play."

Advantages:
- Easy to understand and interpret
- Can handle mixed numerical and categorical data
- Fast predictions

Disadvantages:
- May overfit (memorize training data too specifically)
- Often less accurate than more complex methods

Distance-Based Classification: k-Nearest Neighbour

The k-Nearest Neighbour (kNN) classifier uses a simple idea: to classify a new example, look at the $k$ most similar examples in your training data and take a vote. For example, with $k=3$, you find the 3 closest training examples to your new point. If 2 are labeled "cat" and 1 is labeled "dog," you predict "cat."

The method requires:
- A distance metric (how do you measure similarity?)
- A choice of $k$ (how many neighbors to consider?)
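A minimal kNN sketch, assuming 2-D numeric points and a Euclidean distance metric (the training data and names are illustrative):

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points. `train` is a list of ((x, y), label) pairs."""
    def dist(p, q):  # Euclidean distance metric
        return math.hypot(p[0] - q[0], p[1] - q[1])
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"),
         ((5.0, 5.0), "dog"), ((5.2, 4.9), "dog"), ((4.8, 5.1), "dog")]
print(knn_classify(train, (1.1, 0.9), k=3))  # cat
```

Note that there is no training step at all: the "model" is just the stored data, which is exactly why classification is slow when the dataset grows large.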
It's simple and can learn complex patterns, but it requires storing all the training data and is slow to classify new examples.

Finding Optimal Boundaries: Support Vector Machines

Support vector machines (SVMs) solve a different problem: find the line (or hyperplane in higher dimensions) that best separates two classes of data. The key insight: the best separating line is the one that is furthest from the data points of both classes. These margin-maximizing boundaries are robust; they generalize well to new data. SVMs can also use "kernel tricks" to handle non-linear separation by implicitly mapping data to higher dimensions where linear separation becomes possible. They're powerful classifiers that work well on many problems, though they can be computationally expensive on large datasets.

Probabilistic Classification: Naïve Bayes

The Naïve Bayes classifier uses Bayes' theorem to classify data. It estimates the probability of each category given the observed features:

$$P(\text{category}|\text{features}) \propto P(\text{features}|\text{category}) \times P(\text{category})$$

The "naïve" part: it assumes all features are independent given the category. This assumption is usually wrong in reality, but it surprisingly often works well anyway.

Advantages:
- Fast to train and predict
- Works well with limited data
- Naturally handles missing data

It's particularly popular for text classification tasks like spam detection.

Universal Classifiers: Neural Networks

Neural networks are a fundamentally different approach: instead of hand-crafting rules or decision boundaries, you learn a complex function through training. We'll discuss neural networks in detail in the next section, but the key point is that after training on examples, a neural network becomes a classifier that can handle very complex patterns.

Part V: Artificial Neural Networks

The Building Block: Artificial Neurons

An artificial neuron (or perceptron) is a simple computational unit inspired by biological neurons.
It:
1. Takes multiple numerical inputs
2. Multiplies each input by a weight
3. Sums these weighted inputs
4. Applies a nonlinear function (activation function) to produce an output

Mathematically:

$$\text{output} = f(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b)$$

where the $w_i$ are weights, $b$ is a bias term, and $f$ is an activation function like sigmoid or ReLU. The weights and bias are the parameters that get learned during training.

Organizing Neurons into Networks

Neural networks organize neurons into layers:
- Input layer: receives the raw data
- Hidden layers: perform intermediate computations
- Output layer: produces the final prediction

A deep neural network has at least two hidden layers (though "deep" often means many more).

Why layers? Each layer can learn to detect increasingly abstract features. Lower layers might learn simple patterns, while higher layers combine these into more complex concepts.

The Backpropagation Algorithm: How Networks Learn

Backpropagation is the algorithm that trains neural networks. It's essentially gradient descent applied to neural networks. Here's the process:
1. Forward pass: push training data through the network to get predictions
2. Compute loss: measure how wrong the predictions are
3. Backward pass: calculate how much each weight contributed to the error (using calculus)
4. Update weights: adjust weights in the direction that reduces error (gradient descent)
5. Repeat: do this for many examples until the network improves

The "back" in backpropagation refers to propagating error information backward through the network to understand each weight's responsibility.

Key insight: backpropagation efficiently computes gradients for all weights simultaneously, making it practical to train networks with millions of parameters.

Feedforward vs. Recurrent Networks

Feedforward networks process data in one direction: input → hidden layers → output. Each layer passes information forward, with no feedback loops. Recurrent networks feed outputs back as inputs, creating loops.
This gives them memory: they can process sequences of data where the current output depends on previous inputs. Recurrent networks excel at sequential data like text, speech, or time series. However, they can suffer from the vanishing gradient problem: the gradient signals can become too small to effectively train early layers.

Long Short-Term Memory Networks

Long Short-Term Memory (LSTM) networks solve the vanishing gradient problem of standard recurrent networks. They include special mechanisms called "gates" that control what information flows through the network. Key components:
- Forget gate: decides what to discard from memory
- Input gate: decides what new information to store
- Output gate: decides what memory to output

These gates let LSTM networks learn to keep important information for long periods while forgetting irrelevant details. This makes them excellent for processing sequences with long-term dependencies, like understanding text where a word near the end might depend on context from the beginning.

Convolutional Neural Networks for Spatial Data

Images have spatial structure: pixels near each other are usually related, but distant pixels typically aren't. Convolutional neural networks (CNNs) exploit this structure. Instead of fully connecting every neuron to every input (which wastes computation), CNNs use convolution: small learned filters slide across the image, detecting local patterns like edges, corners, or textures. Key features:
- Convolutional layers: apply filters to detect spatial patterns
- Pooling layers: reduce data size while preserving important information
- Fully connected layers: at the end, combine detected features for the final classification

CNNs are dramatically more efficient at image processing than fully connected networks because they share weights across the image and focus computation on local patterns.

Part VI: Deep Learning

Why Depth Matters

Deep learning stacks many neural layers to automatically extract hierarchical features from data.
This is the key advantage: you don't manually engineer features; the network learns them. What happens in a deep network?
- Lower layers learn simple, local patterns (like edges in images)
- Middle layers combine these into more complex patterns (like textures or shapes)
- Upper layers assemble these into high-level concepts (like faces or objects)

This hierarchical feature extraction is powerful because it mirrors how humans understand visual and other information: we first see edges, then shapes, then objects.

The Deep Learning Revolution

Deep learning is not new; the core ideas existed for decades. But in 2012, something changed dramatically.

The 2012 breakthrough: AlexNet, a deep convolutional network, won the ImageNet competition by a huge margin. This success sparked the deep learning revolution. Two factors enabled it:
- GPU computing: graphics processing units are designed for parallel computation. Modern GPUs with AI-specific enhancements can train large networks thousands of times faster than CPUs.
- Massive labeled datasets: ImageNet and similar datasets provided millions of labeled examples that deep networks could learn from.

With faster computation and more data, researchers could train deeper networks that learned more complex patterns. Each breakthrough led to more funding and better hardware, creating a rapid acceleration.

Part VII: Generative Pre-trained Transformers (GPT)

What GPT Does

GPT (Generative Pre-trained Transformer) models are large language models that predict the next token (word or subword) in a text sequence. Fundamentally, that's it. Given some text, they calculate which token is most likely to come next. But this simple task, performed over and over across billions of parameters, produces surprisingly intelligent text generation.
Example:
- Input: "The capital of France is"
- Output: highly likely to predict "Paris"

Pre-training: Learning World Knowledge

The power of GPT comes from pre-training on massive text corpora (like a large fraction of the internet). During this training:
- The model sees billions of text examples
- It learns to predict the next word in each example
- To do this well, it must learn about language, facts, reasoning, and more

This pre-training is computationally expensive (millions of dollars worth of GPU time), but it's done once. The resulting model has absorbed enormous amounts of knowledge. The result: a model that can continue text on almost any topic intelligently, because it has implicitly learned patterns across all these topics.

Fine-tuning with Human Feedback

Pre-trained GPT models, while impressive, can have problems:
- They might confidently state false information (hallucination)
- They might produce biased, harmful, or unhelpful content
- They might refuse reasonable requests

Reinforcement learning from human feedback (RLHF) addresses this. The process:
1. Generate multiple completions for a prompt
2. Have human raters rank these completions for quality, truthfulness, and safety
3. Train a reward model to predict human preferences
4. Fine-tune the language model using reinforcement learning to maximize the reward

This doesn't eliminate hallucination completely, but it significantly reduces it and makes models more helpful and harmless.

Important caveat: current GPT systems can still hallucinate. They're good but not perfect. Always verify factual claims from AI systems.

Multimodal GPT: Beyond Text

Recent GPT systems extend beyond text. Multimodal models can process:
- Text together with images
- Video
- Audio
- Combinations of these

For instance, a multimodal model can answer questions about images because it has learned relationships between text and visual information during training.
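As an extremely simplified illustration of the next-token task, the toy bigram model below just counts which word follows which. This is emphatically not how GPT works internally (GPT learns transformer weights over billions of parameters), but it makes the prediction objective concrete. The tiny corpus is made up:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count word-pair frequencies: a drastically simplified
    stand-in for learning to predict the next token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequently observed next word after `word`."""
    return counts[word].most_common(1)[0][0]

corpus = [
    "the capital of france is paris",
    "paris is the capital of france",
    "france is paris",
]
model = train_bigram(corpus)
print(predict_next(model, "is"))  # paris
```

The gap between this toy and a real language model is context: a bigram model sees only one previous word, whereas a transformer attends over thousands of previous tokens at once.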
<extrainfo> The transformer architecture underlying GPT models uses self-attention mechanisms that allow the model to weigh the importance of different words when processing each word. Rather than processing sequentially (like older recurrent networks), transformers can process all words in parallel, making training faster. The attention mechanism lets the model learn long-range dependencies efficiently, which is critical for understanding coherent text. </extrainfo>

Part VIII: Hardware, Software, and Tools

For practical AI development:
- Hardware: modern AI training happens on graphics processing units (GPUs) rather than CPUs. GPUs excel at the parallel matrix operations needed for neural networks. Specialized accelerators with AI-specific enhancements have made large-scale training feasible.
- Software languages: early AI research used Lisp and Prolog. Today, Python dominates AI development due to its simplicity and extensive libraries (such as TensorFlow and PyTorch for deep learning).

These are implementation details, but they are important context for understanding how modern AI works.

Appendix: Foundational Research Papers

This guide references several landmark papers that advanced AI:
- "Attention Is All You Need" (Vaswani et al., 2017): introduced the transformer architecture and self-attention mechanisms, enabling efficient sequence modeling without recurrence.
- "Human-level control through deep reinforcement learning" (Mnih et al., 2015): Deep Q-Networks achieved human-level or better performance on many Atari video games, showing that deep learning could work for reinforcement learning.
- "Mastering the game of Go with deep neural networks and tree search" (Silver et al., 2016): combined neural networks with Monte Carlo tree search to defeat top human Go players, demonstrating deep learning's power for complex strategic games.
- "Highly accurate protein structure prediction with AlphaFold" (Jumper et al., 2021): deep learning solved a 50-year-old problem in biology, predicting protein structures at atomic-level accuracy.
- "Deep learning" (LeCun, Bengio, and Hinton, 2015): a comprehensive review of deep learning architectures, training, and applications.

These papers represent major breakthroughs that shaped modern AI. Understanding the techniques they introduced will help you understand contemporary AI systems.
Flashcards
What is the primary purpose of heuristics in state-space search?
To guide the search toward promising branches and reduce computational work.
What does adversarial search evaluate to locate winning strategies?
Game move trees.
What is the mechanism of gradient descent in training neural networks?
Adjusting numerical parameters to minimize a loss function.
Through what two processes does evolutionary computation evolve candidate solutions?
Mutation and recombination.
On what type of statements does propositional logic operate using logical connectives?
True/false statements.
What three elements does predicate logic add to express relations?
Objects, predicates, and quantifiers.
What is the goal of deductive reasoning?
To prove conclusions from given premises.
Which Turing-complete programming language is based on Horn-clause logic?
Prolog.
How does fuzzy logic handle vague propositions?
By assigning truth values between $0$ and $1$.
What type of reasoning is supported by non-monotonic logics?
Default reasoning.
What four tasks do Bayesian networks enable through probabilistic inference?
Inference, learning, planning, and perception.
On what basis does the k-Nearest Neighbour algorithm classify data?
Proximity to labeled examples.
What is the primary objective of support vector machines in classification?
To find optimal separating hyperplanes between classes.
What specific assumption do Naïve Bayes classifiers apply to Bayes' theorem?
Strong independence assumptions.
In what three layers are artificial neurons typically organized?
Input, hidden, and output layers.
How many hidden layers must a neural network contain to be considered "deep"?
At least two.
What is the purpose of adjusting connection weights during backpropagation?
To minimize output error.
What is the difference between feedforward and recurrent neural networks?
Feedforward networks propagate signals in one direction; recurrent networks feed outputs back as inputs.
What is the specialization of convolutional neural networks (CNNs)?
Processing spatially local patterns (e.g., image edges).
What is the benefit of stacking multiple neural layers in deep learning?
To automatically extract hierarchical features.
How do features differ between lower and higher layers in deep learning?
Lower layers learn simple patterns (edges); higher layers capture complex concepts (faces).
What two factors contributed to the 2012 deep learning breakthrough?
Increased GPU computing power and massive labeled datasets (e.g., ImageNet).
What is the core task of a GPT model?
Predicting the next token in a text sequence.
What is the purpose of Reinforcement Learning from Human Feedback (RLHF) in GPT models?
To fine-tune models for truthfulness, usefulness, and safety.
What is a "hallucination" in the context of GPT systems?
The generation of false statements.
What defines a "multimodal" GPT model?
The ability to process text along with images, video, or audio.
Which hardware has largely replaced CPUs for large-scale AI model training?
Graphics Processing Units (GPUs) with AI-specific enhancements.
Which programming language currently dominates AI development?
Python.
Which mechanism did the paper "Attention is all you need" use to replace recurrent networks?
Self-attention mechanisms.
What milestone was achieved by the 2015 Deep Q-Network paper?
Human-level performance on several Atari video games.
What breakthrough in biology was demonstrated by AlphaFold?
Predicting protein 3D shapes with atomic-level accuracy using deep learning.

Key Concepts
Neural Network Architectures
Transformer architecture
Generative Pre‑trained Transformer (GPT)
Convolutional neural network (CNN)
Deep reinforcement learning
AlphaFold
Optimization and Learning Techniques
Bayesian network
Evolutionary computation
Particle swarm optimization
Fuzzy logic
Support vector machine (SVM)