Introduction to Probability Theory
Understand the fundamentals of probability, conditional probability and independence, and the core concepts of random variables, expectation, and variance.
Summary
Foundations of Probability Theory
What Is Probability Theory?
Probability theory is the branch of mathematics that quantifies and studies uncertainty. In real life, we often encounter situations where outcomes are unpredictable—a coin flip, the weather tomorrow, or the result of a medical test. Probability theory provides a systematic framework for assigning numerical values, called probabilities, to these uncertain outcomes. These probabilities range from 0 (impossible) to 1 (certain), allowing us to reason rigorously about random events.
Sample Space and Events
To work with probability formally, we need to define what outcomes are possible. The sample space, denoted $\Omega$, is the set of all possible outcomes of a random experiment. For example:
If we flip a coin, the sample space is $\Omega = \{\text{Heads}, \text{Tails}\}$
If we roll a die, the sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$
An event is any collection of outcomes—in other words, any subset of the sample space. For instance, if we roll a die, the event "rolling an even number" corresponds to the subset $\{2, 4, 6\}$. Events allow us to group outcomes that share a common characteristic and assign probabilities to them.
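The die example above can be sketched directly with Python sets. This is an illustrative sketch, not from the text, and it assumes equally likely outcomes so that $P(A) = |A| / |\Omega|$:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # sample space for one die roll
even = {2, 4, 6}             # event "rolling an even number"

def prob(event, sample_space):
    """P(A) = |A| / |Omega|, valid when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

print(prob(even, omega))     # 1/2
```

Using `Fraction` keeps the probabilities exact rather than approximating them with floats.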
The Kolmogorov Axioms
Any valid probability assignment must satisfy three fundamental rules called the Kolmogorov axioms. These axioms ensure that probabilities behave consistently and intuitively.
The non-negativity axiom states that probabilities cannot be negative. For any event $A$: $$P(A) \geq 0$$
This makes sense: you cannot have "negative certainty."
The normalization axiom states that the probability of the entire sample space equals one: $$P(\Omega) = 1$$
This reflects the fact that one of the outcomes in the sample space must occur with certainty.
The additivity axiom applies to mutually exclusive events—events that cannot occur simultaneously. If events $A$ and $B$ are mutually exclusive, then: $$P(A \cup B) = P(A) + P(B)$$
In other words, the probability that either event occurs is the sum of their individual probabilities. This axiom extends naturally to any finite or countable collection of mutually exclusive events.
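As a quick sanity check (not from the text), the three axioms can be verified mechanically for any finite probability assignment. Here the assignment is a fair die, and the event names are illustrative:

```python
from fractions import Fraction

# Candidate probability assignment on a six-element sample space.
pmf = {w: Fraction(1, 6) for w in range(1, 7)}
p = lambda E: sum(pmf[w] for w in E)

# Non-negativity: every outcome has P >= 0.
assert all(q >= 0 for q in pmf.values())

# Normalization: P(Omega) = 1.
assert sum(pmf.values()) == 1

# Additivity for the mutually exclusive events {1, 2} and {5, 6}.
A, B = {1, 2}, {5, 6}
assert p(A | B) == p(A) + p(B)
```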
Derived Probability Rules
From the Kolmogorov axioms, we can derive several useful rules.
The complement rule concerns the probability of an event not occurring. If $A^c$ denotes the complement of $A$ (all outcomes not in $A$), then: $$P(A^c) = 1 - P(A)$$
This is straightforward: the event $A$ and its complement $A^c$ are mutually exclusive and together cover the entire sample space, so their probabilities must sum to 1.
The inclusion–exclusion formula handles the probability of the union of events that may overlap. For two events $A$ and $B$: $$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
The key insight is that when we add $P(A)$ and $P(B)$, we double-count the overlap $P(A \cap B)$, so we must subtract it once. This generalizes to more than two events, though the formula becomes more complex with additional terms.
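The inclusion–exclusion formula can be confirmed by direct enumeration. A sketch with two overlapping events on a fair die (the event choices here are illustrative):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even
B = {4, 5, 6}   # at least 4

p = lambda E: Fraction(len(E), len(omega))

lhs = p(A | B)                      # P(A or B), counted once
rhs = p(A) + p(B) - p(A & B)        # add both, subtract the overlap
assert lhs == rhs                   # both equal 2/3
```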
Conditional Probability and Independence
Conditional Probability: Updating Beliefs
Conditional probability answers a fundamental question: how should we update our assessment of probability when we learn that another event has occurred?
The conditional probability of $A$ given $B$, denoted $P(A \mid B)$, is defined as: $$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{(provided } P(B) > 0\text{)}$$
This formula has an intuitive interpretation. The numerator $P(A \cap B)$ is the probability that both events occur. The denominator $P(B)$ is the probability that $B$ occurs. By dividing, we're asking: "Of all the scenarios where $B$ happens, what fraction also have $A$ occurring?"
For example, suppose a test for a disease has a 99% accuracy rate. Before taking the test, your prior probability of having the disease might be 1% (based on population prevalence). After testing positive, you'd want to compute $P(\text{disease} \mid \text{positive test})$ to understand your actual risk—and you'll find it's much lower than 99% because the disease is so rare that false positives are common. This is conditional probability in action.
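The disease-test calculation can be carried out numerically. This sketch assumes "99% accuracy" means both a 99% true-positive rate and a 99% true-negative rate, and takes the 1% prevalence from the text; the variable names are illustrative:

```python
prevalence = 0.01      # P(disease)
sensitivity = 0.99     # P(positive | disease)
specificity = 0.99     # P(negative | no disease)

# Total probability of a positive test: true positives + false positives.
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Conditional probability: P(disease | positive) = P(disease and positive) / P(positive).
p_disease_given_pos = sensitivity * prevalence / p_pos
print(round(p_disease_given_pos, 3))   # 0.5
```

Despite the 99% accuracy, a positive result here implies only a 50% chance of disease, because false positives from the large healthy population match the true positives from the small diseased one.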
Independent Events
Two events are independent when the occurrence of one does not change the probability of the other. Mathematically, events $A$ and $B$ are independent if and only if: $$P(A \cap B) = P(A) \cdot P(B)$$
An intuitive example: rolling a die and flipping a coin are independent—the result of the coin flip tells you nothing about the die roll.
An important consequence: if $A$ and $B$ are independent, then: $$P(A \mid B) = P(A)$$
This makes sense from the definition of conditional probability: if knowing $B$ occurred doesn't change the probability of $A$, then the conditional probability should equal the unconditional probability.
The Multiplication Rule
The multiplication rule, which follows directly from the definition of conditional probability, is useful for computing joint probabilities: $$P(A \cap B) = P(A \mid B) \cdot P(B)$$
This says: the probability that both events occur is the probability that $B$ occurs, multiplied by the probability that $A$ occurs given that $B$ occurred. This extends naturally to chains of events and is particularly elegant when events are independent (in which case $P(A \mid B) = P(A)$, simplifying to $P(A \cap B) = P(A) \cdot P(B)$).
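Both the multiplication rule and independence can be checked by enumerating a joint experiment. A sketch using the die-plus-coin example from earlier (the specific events chosen are illustrative):

```python
from fractions import Fraction
from itertools import product

# Joint sample space: (die result, coin result).
omega = set(product(range(1, 7), ["H", "T"]))
A = {(d, c) for (d, c) in omega if d == 6}     # die shows 6
B = {(d, c) for (d, c) in omega if c == "H"}   # coin shows heads

p = lambda E: Fraction(len(E), len(omega))
p_a_given_b = p(A & B) / p(B)                  # definition of P(A | B)

assert p(A & B) == p_a_given_b * p(B)          # multiplication rule
assert p(A & B) == p(A) * p(B)                 # independence holds here
```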
Random Variables and Their Distributions
From Outcomes to Numbers
A random variable is a function that assigns a numerical value to each outcome in the sample space. Rather than thinking about abstract outcomes like "Heads" or "Tails," a random variable translates these into numbers we can analyze mathematically.
For instance, suppose we flip a coin three times and count the number of heads. The sample space has outcomes like HHT, HTH, THH, etc., but our random variable might simply be $X$ = number of heads, taking values in $\{0, 1, 2, 3\}$.
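The three-flip example can be made concrete by enumerating the sample space and tabulating the random variable. A sketch of that construction:

```python
from fractions import Fraction
from itertools import product

# All 8 outcomes: HHH, HHT, ..., TTT.
outcomes = list(product("HT", repeat=3))

# X = number of heads; build its distribution by counting.
pmf = {}
for outcome in outcomes:
    x = outcome.count("H")
    pmf[x] = pmf.get(x, 0) + Fraction(1, len(outcomes))

print(sorted(pmf.items()))   # [(0, 1/8), (1, 3/8), (2, 3/8), (3, 1/8)]
```

The dictionary built here is exactly the probability mass function of $X$, discussed next.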
Discrete and Continuous Random Variables
Random variables fall into two main categories.
Discrete random variables take on a countable set of values—think of the integers or any finite set. Examples include the number of heads in coin flips, the score on a test, or the number of customers arriving at a store. When working with discrete random variables, we can assign a probability to each specific value.
Continuous random variables take on values from an uncountable interval of real numbers. Examples include height, weight, temperature, or time. Because there are infinitely many possible values, the probability of any single exact value is typically zero; instead, we talk about the probability that the variable falls within a range.
Probability Mass Functions
For a discrete random variable $X$, the probability mass function (PMF) gives the probability that $X$ equals each specific value. Denoted $P(X = x)$ or sometimes $p(x)$, the PMF tells us the entire distribution of $X$.
Key properties of a PMF:
$P(X = x) \geq 0$ for all $x$ (probabilities are non-negative)
$\sum_{x} P(X = x) = 1$ (probabilities sum to 1, reflecting the normalization axiom)
Plotted as a histogram, a PMF has one bar per possible value, and the height of each bar represents the probability of that value occurring.
Probability Density Functions
For a continuous random variable $X$, we use a probability density function (PDF), denoted $f_X(x)$ or simply $f(x)$. Unlike a PMF, the PDF is not a probability itself; rather, it describes the relative likelihood of values in infinitesimal neighborhoods.
The key relationship is: $$P(a \leq X \leq b) = \int_a^b f_X(x) \, dx$$
In other words, to find the probability that $X$ falls in an interval, we integrate the PDF over that interval. This is analogous to area under a curve: the total area under the PDF curve equals 1, and the area under the curve within any interval gives the probability for that interval.
Key properties of a PDF:
$f_X(x) \geq 0$ for all $x$ (density is non-negative)
$\int_{-\infty}^{\infty} f_X(x) \, dx = 1$ (total area equals 1)
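The interval-probability relationship can be illustrated numerically. This sketch (not from the text) uses the standard exponential density $f(x) = e^{-x}$ on $[0, \infty)$ and a simple midpoint-rule integrator:

```python
import math

def f(x):
    return math.exp(-x)   # PDF of an Exponential(rate=1) random variable

def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

# P(0 <= X <= 1) should equal 1 - e^{-1} ~ 0.632.
print(round(integrate(f, 0.0, 1.0), 4))
```

The same integrator applied over a wide enough range approximates the total-area property, returning a value close to 1.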
Summary Measures of Random Variables
Expected Value: The Long-Run Average
The expected value (also called the mean) of a random variable is its long-run average. If we were to repeat a random experiment many times, the expected value tells us what average outcome we'd observe.
For a discrete random variable $X$: $$E[X] = \sum_{x} x \cdot P(X = x)$$
Each possible value $x$ is weighted by its probability of occurring.
For a continuous random variable $X$ with PDF $f_X(x)$: $$E[X] = \int x \cdot f_X(x) \, dx$$
Example: A fair die has expected value $E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \cdots + 6 \cdot \frac{1}{6} = 3.5$. This doesn't mean you'll ever roll 3.5, but over many rolls, the average approaches 3.5.
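The die calculation is a direct translation of the discrete formula. A sketch:

```python
from fractions import Fraction

# Fair die: each face 1..6 has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum over x of x * P(X = x)
expected = sum(x * p for x, p in pmf.items())
print(expected)   # 7/2, i.e. 3.5
```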
Variance: Measuring Spread
While expected value tells us the center of a distribution, variance measures how much the values tend to spread around that center. Variance is defined as: $$\operatorname{Var}(X) = E\big[(X - E[X])^2\big]$$
In words: take the squared deviation from the mean for each value, weight it by probability, and sum. Squaring ensures all deviations contribute positively. A larger variance indicates the random variable's values are more scattered; a smaller variance means values cluster tightly around the mean.
Key computational formula: Variance is often easier to compute using: $$\operatorname{Var}(X) = E[X^2] - (E[X])^2$$
This says: variance equals the expected value of the square minus the square of the expected value. This formula avoids computing deviations explicitly.
Standard deviation, denoted $\sigma$ or $\text{SD}(X)$, is the square root of variance: $\text{SD}(X) = \sqrt{\operatorname{Var}(X)}$. Standard deviation is often preferred because it has the same units as the original variable, making it more interpretable.
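Both variance formulas can be computed for the fair die and checked against each other. A sketch (the helper `E` is illustrative):

```python
from fractions import Fraction
from math import sqrt

pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E[g(X)] = sum over x of g(x) * P(X = x)
E = lambda g: sum(g(x) * p for x, p in pmf.items())

mean = E(lambda x: x)                        # 7/2
var_def = E(lambda x: (x - mean) ** 2)       # definition of variance
var_short = E(lambda x: x * x) - mean ** 2   # computational shortcut

assert var_def == var_short                  # both equal 35/12
print(sqrt(var_short))                       # standard deviation ~ 1.708
```

The agreement of the two results illustrates that the shortcut formula is an algebraic rearrangement of the definition, not a different quantity.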
Summary
The foundations you've learned here—probability axioms, conditional probability, independence, random variables, and their distributions—form the bedrock of probability and statistics. These tools allow us to:
Formally define and assign probabilities to uncertain events
Update beliefs when new information arrives (conditional probability)
Recognize when events don't influence each other (independence)
Convert abstract outcomes into numerical summaries (random variables)
Characterize distributions via expected value and variance
Mastering these concepts is essential for all applications of probability in science, engineering, economics, and beyond.
Flashcards
Which branch of mathematics deals with the study of uncertainty?
Probability theory
What is the term for the set of all possible outcomes that can occur in an experiment?
Sample space
What symbol is often used to denote the sample space?
$\Omega$ (Omega)
How is an event defined in relation to the sample space?
It is a collection of outcomes (a subset of the sample space).
What are the three fundamental Kolmogorov axioms of probability?
Non-negativity axiom
Normalization axiom
Additivity axiom
What does the non-negativity axiom state regarding the probability of an event $A$?
$P(A) \ge 0$
According to the normalization axiom, what is the probability of the entire sample space $\Omega$?
$P(\Omega) = 1$
What is the additivity axiom formula for two mutually exclusive events $A$ and $B$?
$P(A \cup B) = P(A) + P(B)$
What is the formula for the complement rule regarding an event $A$?
$P(A^c) = 1 - P(A)$
Which formula is used to find the probability of the union of events by correcting for overlapping probabilities?
Inclusion–exclusion formula
What concept quantifies how belief changes when another event is known to have occurred?
Conditional probability
What is the mathematical definition of the conditional probability of $A$ given $B$?
$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$ (where $P(B) > 0$)
What is the multiplication rule derived from the definition of conditional probability?
$P(A \cap B) = P(A \mid B) P(B)$
When are two events $A$ and $B$ considered independent?
When the occurrence of $B$ does not affect the likelihood of $A$.
What mathematical equation expresses the independence of events $A$ and $B$?
$P(A \cap B) = P(A) P(B)$
If $A$ and $B$ are independent, what does the conditional probability $P(A \mid B)$ simplify to?
$P(A)$
What is the definition of a random variable?
A function that assigns a numerical value to each outcome in the sample space.
What is the difference between discrete and continuous random variables regarding their possible values?
Discrete variables take a countable set of values, while continuous variables take values from an uncountable interval.
What function provides the probability that a discrete random variable equals a specific value?
Probability mass function
What function describes the relative likelihood of a continuous random variable falling within infinitesimal intervals?
Probability density function
What does the expected value represent for a random variable?
The long-run average
What is the formula for the expected value $E[X]$ of a discrete random variable?
$E[X] = \sum_{x} x \, P(X = x)$
What is the formula for the expected value $E[X]$ of a continuous random variable with density $f_X(x)$?
$E[X] = \int x \, f_X(x) \, dx$
What does the variance of a random variable measure?
The spread of values around the mean (dispersion).
What is the formal definition of variance $\operatorname{Var}(X)$ in terms of expectation?
$\operatorname{Var}(X) = E[(X - E[X])^2]$
What is the computational formula for variance involving the raw second moment and the mean?
$\operatorname{Var}(X) = E[X^2] - (E[X])^2$
Quiz
Introduction to Probability Theory Quiz
Question 1: Which axiom of probability states that the probability of any event cannot be negative?
- The probability of any event is greater than or equal to zero. (correct)
- The probability of the entire sample space equals one.
- The probability of mutually exclusive events adds.
- The probability of an event’s complement equals one minus the event.
Question 2: What does the expected value of a random variable represent?
- The long‑run average (mean) of the random variable. (correct)
- The most likely (mode) value of the random variable.
- The maximum possible value the random variable can take.
- The probability that the random variable equals its mean.
Question 3: Which formula can be used to compute the variance of a random variable?
- $\operatorname{Var}(X)=E[X^{2}]-(E[X])^{2}$ (correct)
- $\operatorname{Var}(X)=E[X]^{2}-E[X^{2}]$
- $\operatorname{Var}(X)=\big(E[X]\big)^{2}$
- $\operatorname{Var}(X)=\sqrt{E[X^{2}]-(E[X])^{2}}$
Key Concepts
Fundamentals of Probability
Probability theory
Sample space
Kolmogorov axioms
Key Concepts in Probability
Conditional probability
Independent events
Random variable
Probability Functions and Measures
Probability mass function
Probability density function
Expected value
Variance
Definitions
Probability theory
The branch of mathematics that studies uncertainty and assigns numerical probabilities to outcomes of random experiments.
Sample space
The set of all possible outcomes of a random experiment, typically denoted Ω.
Kolmogorov axioms
The three foundational rules (non‑negativity, normalization, and additivity) that define a probability measure.
Conditional probability
The probability of an event occurring given that another event has already occurred, expressed as P(A | B)=P(A∩B)/P(B).
Independent events
Two events whose occurrence does not affect each other's probabilities, satisfying P(A∩B)=P(A)P(B).
Random variable
A function that maps each outcome in a sample space to a numerical value.
Probability mass function
A function that gives the probability that a discrete random variable equals each possible value.
Probability density function
A function that describes the relative likelihood of a continuous random variable taking values in infinitesimal intervals.
Expected value
The long‑run average or mean of a random variable, calculated as E[X]=∑_x x·P(X=x) or E[X]=∫x f_X(x) dx.
Variance
A measure of the dispersion of a random variable around its mean, defined as Var(X)=E[(X−E[X])²].