RemNote Community

Information theory - Core Information Measures

Understand entropy, mutual information, and KL divergence as fundamental measures of uncertainty and information in information theory.

Summary

Core Information Theory Quantities

Introduction

Information theory provides mathematical tools to measure and quantify uncertainty. At its heart are a few core quantities that describe randomness, predictability, and how much we can learn by observing events. Mastering these concepts (entropy, mutual information, and divergence) forms the foundation for understanding everything from data compression to communication channels.

Entropy: Measuring Uncertainty

Entropy quantifies how much uncertainty or randomness exists in an information source. For a discrete random variable $X$ with possible outcomes and their probabilities, Shannon entropy is defined as:

$$H(X) = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the probability of the $i$-th outcome.

Understanding What Entropy Measures

Entropy answers a fundamental question: on average, how surprised should we be by the next symbol we see? The more unpredictable the source, the higher the entropy.

Self-information (also called surprisal) captures this idea for a single outcome. When you observe a symbol with probability $p$, its self-information is:

$$\text{Self-information} = -\log_2 p$$

A low-probability event carries more information (higher surprise) than a high-probability event. Entropy is the average self-information across all outcomes.

Key Properties of Entropy

Entropy is maximized when all symbols are equally likely. If you have $n$ possible symbols, the maximum entropy is:

$$H_{\max} = \log_2 n$$

For example, a fair coin flip (two equally likely outcomes) has maximum entropy of $\log_2 2 = 1$ bit. A fair six-sided die has maximum entropy of $\log_2 6 \approx 2.58$ bits. Conversely, entropy is zero when the outcome is completely predictable, that is, when one symbol has probability 1 and all others have probability 0.
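To make the entropy and self-information formulas concrete, here is a minimal Python sketch; the helper names `entropy` and `self_information` are my own for illustration, not from the source.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum_i p_i * log2(p_i).

    Terms with p = 0 contribute nothing (the limit p*log2(p) -> 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def self_information(p):
    """Surprisal of a single outcome with probability p, in bits."""
    return -math.log2(p)

# A fair coin reaches the maximum entropy for 2 symbols: log2(2) = 1 bit
print(entropy([0.5, 0.5]))                # 1.0
# A fair six-sided die: log2(6) ~ 2.585 bits
print(entropy([1/6] * 6))
# A fully predictable source has zero entropy
print(entropy([1.0, 0.0]) == 0)           # True
# Rare events carry more self-information than common ones
print(self_information(0.01) > self_information(0.5))  # True
```

Skipping zero-probability terms in the sum mirrors the usual convention $0 \log 0 = 0$.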
Units Matter

The logarithm base determines the unit:

Base-2 logarithm → bits (shannons)
Natural logarithm (base $e$) → nats
Base-10 logarithm → decimal digits (hartleys)

Most of computer science works in bits, while physics and mathematics often use nats. Always note which unit is being used; the numerical value changes with the base.

The Binary Entropy Function

A particularly important special case is when you have just two outcomes. The binary entropy function describes the entropy of a Bernoulli trial (a single yes/no event):

$$H_{\text{binary}}(p) = -p\log_2 p - (1-p)\log_2(1-p)$$

where $p$ is the probability of one outcome and $(1-p)$ is the probability of the other. This function has convenient properties: it is zero when $p = 0$ or $p = 1$ (no uncertainty), and it reaches its maximum of 1 bit when $p = 0.5$ (maximum uncertainty). This curve is so common in information theory that you should develop an intuition for its shape.

Joint Entropy: Multiple Variables

When you have two random variables $X$ and $Y$, their joint entropy measures the total uncertainty in both:

$$H(X,Y) = -\sum_{x,y} p(x,y)\log_2 p(x,y)$$

where $p(x,y)$ is the probability of the pair $(x,y)$ occurring together.

Important insight: if $X$ and $Y$ are independent, then:

$$H(X,Y) = H(X) + H(Y)$$

This makes intuitive sense: if the variables are unrelated, the total uncertainty is simply the sum of their individual uncertainties. If they are correlated, however, $H(X,Y)$ will be less than $H(X) + H(Y)$, because knowing one variable gives you information about the other.

Conditional Entropy: Uncertainty Given Information

What if you already know the value of $Y$? How much uncertainty remains about $X$? This is measured by conditional entropy:

$$H(X|Y) = -\sum_{x,y} p(x,y)\log_2 p(x|y)$$

This is the average entropy of $X$ across all possible values of $Y$, weighted by how likely each value of $Y$ is.
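A short Python sketch of the binary entropy function and joint entropy, checked against the stated properties (zero at the endpoints, 1 bit at $p = 0.5$, additivity for independent variables); the function names are illustrative, not from the source.

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def joint_entropy(joint):
    """H(X,Y) = -sum_{x,y} p(x,y) * log2 p(x,y) for a dict {(x, y): prob}."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

# Maximum uncertainty at p = 0.5, none at the endpoints
print(binary_entropy(0.5))   # 1.0
print(binary_entropy(0.0))   # 0.0

# Two independent fair coins: H(X,Y) = H(X) + H(Y) = 1 + 1 = 2 bits
indep = {(x, y): 0.25 for x in "HT" for y in "HT"}
print(joint_entropy(indep))  # 2.0
```

Sweeping `binary_entropy` over $p \in [0, 1]$ traces the symmetric hump-shaped curve described above.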
A crucial relationship connects these quantities:

$$H(X,Y) = H(X) + H(Y|X)$$

This can be rewritten as:

$$H(Y|X) = H(X,Y) - H(X)$$

This relationship captures the intuition: the joint uncertainty equals the uncertainty in $X$ plus whatever additional uncertainty remains in $Y$ once you know $X$.

Mutual Information: How Much Two Variables Tell Each Other

Mutual information quantifies how much knowing one variable reduces uncertainty about another. It is one of the most important concepts in information theory:

$$I(X;Y) = H(X) - H(X|Y)$$

This reads as: "the mutual information between $X$ and $Y$ equals the original entropy of $X$ minus the remaining entropy of $X$ after observing $Y$." In other words, it is the reduction in uncertainty about $X$ caused by learning $Y$.

Multiple Formulations

Mutual information can be expressed in several equivalent ways:

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$$

It can also be written directly in terms of probabilities:

$$I(X;Y) = \sum_{x,y} p(x,y)\log_2 \frac{p(x,y)}{p(x)p(y)}$$

Key Properties

Mutual information is symmetric: $I(X;Y) = I(Y;X)$. Learning about $Y$ reduces uncertainty in $X$ by exactly the same amount that learning about $X$ reduces uncertainty in $Y$.

Mutual information is always non-negative: $I(X;Y) \geq 0$, and it equals zero if and only if $X$ and $Y$ are independent.

Mutual information is bounded: $I(X;Y) \leq \min(H(X), H(Y))$. You cannot learn more about $Y$ than it contains in total.

Kullback-Leibler Divergence: Measuring Distribution Mismatch

Suppose you are trying to compress data that actually comes from distribution $P$, but you mistakenly think it comes from distribution $Q$. How many extra bits per symbol will you need? The Kullback-Leibler (KL) divergence answers this:

$$D_{\text{KL}}(P \| Q) = \sum_{i} p_i \log_2 \frac{p_i}{q_i}$$

where the $p_i$ are the true probabilities and the $q_i$ are your assumed probabilities.

Interpretation

KL divergence measures how different $P$ is from $Q$.
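The equivalent formulations of mutual information can be checked numerically. Below is a small sketch, assuming an arbitrary correlated joint distribution of my own choosing (not from the source), that computes $I(X;Y)$ both as $H(X) + H(Y) - H(X,Y)$ and via the probability-ratio sum.

```python
import math

def H(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An illustrative correlated joint distribution p(x, y) over {0,1} x {0,1}
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y)
px = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
py = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

Hxy = H(list(joint.values()))
Hx, Hy = H(px), H(py)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
I = Hx + Hy - Hxy

# Same value from I(X;Y) = sum p(x,y) * log2( p(x,y) / (p(x) p(y)) )
I2 = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

print(abs(I - I2) < 1e-9)   # True: the two formulations agree
print(I > 0)                # True: X and Y are correlated, not independent
# Chain rule: H(Y|X) = H(X,Y) - H(X), and conditioning cannot raise entropy
print(Hxy - Hx <= Hy)       # True
```

Replacing `joint` with a product of its marginals would drive both `I` and `I2` to zero, matching the independence property.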
It is the average number of extra bits required when you use the wrong model. When $P = Q$ everywhere, $D_{\text{KL}}(P \| Q) = 0$.

Important Property: KL Divergence is Not Symmetric

Note that $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ in general. The direction matters: $D_{\text{KL}}(P \| Q)$ penalizes $Q$ more heavily for assigning low probability to events that actually have high probability under $P$. This asymmetry makes sense: if the true distribution has a high-probability event that your model thinks is rare, you waste many bits encoding it; if your model instead wastes bits on rare events, that is less costly on average.

The Entropy Bound Theorem

A fundamental upper bound on entropy states that for any discrete random variable $X$ with support of size $|\mathcal{X}|$:

$$H(X) \leq \log_2 |\mathcal{X}|$$

Equality holds if and only if $X$ is uniformly distributed over all possible outcomes. This should make intuitive sense: maximum entropy occurs when all outcomes are equally likely. If any outcome is more likely than the others, you have less uncertainty, because you can predict slightly better. For example, a random variable over 8 possible outcomes can have at most $\log_2 8 = 3$ bits of entropy, achieved only when each outcome has probability $1/8$.

How These Concepts Fit Together

These quantities form a coherent mathematical framework. Here are the key relationships:

Entropy ($H$) measures raw uncertainty in a single variable or joint distribution.
Conditional entropy ($H(Y|X)$) tells us how much uncertainty remains after removing the information from another variable.
Mutual information ($I(X;Y)$) quantifies the connection between variables: how much one variable reveals about another.
KL divergence ($D_{\text{KL}}$) measures how far a guessed model strays from reality.
The entropy bound provides an absolute limit on how uncertain any variable can be.

Understanding these relationships will help you recognize which quantity you need for any given problem. Are you measuring total uncertainty?
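A quick Python sketch of KL divergence and the entropy bound; the two example distributions `P` and `Q` are arbitrary choices of mine to demonstrate the asymmetry, not from the source.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log2(p_i / q_i), in bits.

    Assumes q_i > 0 wherever p_i > 0 (absolute continuity)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

P = [0.7, 0.2, 0.1]   # "true" distribution
Q = [0.4, 0.3, 0.3]   # assumed (wrong) model

print(kl_divergence(P, P))                          # 0.0: identical distributions
print(kl_divergence(P, Q) > 0)                      # True: non-negative, P != Q
# Asymmetry: the direction matters
print(kl_divergence(P, Q) == kl_divergence(Q, P))   # False

# Entropy bound: H(X) <= log2 |support|, equality only for uniform
print(H(P) <= math.log2(3))                          # True
print(abs(H([1/3] * 3) - math.log2(3)) < 1e-9)       # True: uniform hits the bound
```

The `pi > 0` filter encodes the convention $0 \log(0/q) = 0$; a model that assigns $q_i = 0$ to a possible event would make the divergence infinite.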
Uncertainty conditional on something else? How much two variables reveal about each other? The mathematical framework provides the right tool for each question.
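One unifying observation that follows from the formulas above: the probability-ratio form of $I(X;Y)$ is exactly the KL divergence between the joint distribution $p(x,y)$ and the product of marginals $p(x)p(y)$, which is why independence gives zero mutual information. A minimal sketch, using an illustrative joint distribution of my own:

```python
import math

def kl_bits(p, q):
    """D_KL(P || Q) in bits, for two aligned lists of probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative correlated joint distribution over {0,1} x {0,1}
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {v: sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)}
py = {v: sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)}

pairs = sorted(joint)
p_joint = [joint[k] for k in pairs]
p_prod = [px[x] * py[y] for x, y in pairs]   # what independence would predict

# I(X;Y) = D_KL( p(x,y) || p(x)p(y) )
mi = kl_bits(p_joint, p_prod)
print(mi > 0)   # True: the joint departs from independence
```

Viewed this way, mutual information measures how far a joint distribution strays from the "independent" model, in the same extra-bits currency as any other KL divergence.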
Flashcards
What is the formula for Shannon entropy $H$ in bits per symbol?
$H = -\sum_{i} p_i \log_2 p_i$ (where $p_i$ is the probability of the $i$-th symbol)
Under what condition is the entropy of $n$ symbols maximized?
When all symbols are equiprobable
What is the formula for the maximum entropy $H_{\max}$ of $n$ symbols?
$H_{\max} = \log_2 n$
According to the Entropy Bound Theorem, when does $H(X) = \log_2 |\mathcal{X}|$?
If and only if $X$ is uniformly distributed over its support
What is the formula for the self-information (surprisal) of a symbol with probability $p$?
$-\log_2 p$
What are the common units of entropy and their corresponding logarithm bases?
Bits (shannons): base-2 logarithm; nats: natural logarithm; hartleys (decimal digits): base-10 logarithm
What is the formula for the entropy of a Bernoulli trial with probability $p$ of one outcome?
$H_{\text{binary}}(p) = -p\log_2 p - (1-p)\log_2(1-p)$
What is the formula for the joint entropy $H(X,Y)$ of two discrete variables?
$H(X,Y) = -\sum_{x,y} p(x,y)\log_2 p(x,y)$
If $X$ and $Y$ are independent, what is the relationship between their joint entropy and individual entropies?
$H(X,Y) = H(X) + H(Y)$
What is the formula for the conditional entropy $H(X|Y)$?
$H(X|Y) = -\sum_{x,y} p(x,y)\log_2 p(x|y)$
What concept quantifies the reduction in uncertainty of $X$ after observing $Y$?
Mutual information $I(X;Y)$
What are three ways to express mutual information $I(X;Y)$ using entropy and conditional entropy?
$I(X;Y) = H(X) - H(X|Y)$; $I(X;Y) = H(Y) - H(Y|X)$; $I(X;Y) = H(X) + H(Y) - H(X,Y)$
What property describes the relationship between $I(X;Y)$ and $I(Y;X)$?
Symmetry ($I(X;Y) = I(Y;X)$)
What is the summation formula for mutual information $I(X;Y)$ using probability distributions?
$I(X;Y) = \sum_{x,y} p(x,y)\log_2 \frac{p(x,y)}{p(x)p(y)}$
What is the formula for KL divergence $D_{\text{KL}}(P\|Q)$ between a true distribution $P$ and assumed distribution $Q$?
$D_{\text{KL}}(P\|Q) = \sum_{i} p_i \log_2 \frac{p_i}{q_i}$
In terms of data compression, what does KL divergence measure?
The average extra bits required when using the wrong model ($Q$) instead of the true model ($P$)
Which measure extends the concept of entropy to continuous distributions?
Differential entropy
What does conditional mutual information measure?
The mutual information between two variables conditioned on a third variable

Key Concepts
Entropy Concepts
Shannon entropy
Self‑information
Binary entropy function
Joint entropy
Conditional entropy
Differential entropy
Entropy bound theorem
Information Measures
Mutual information
Conditional mutual information
Kullback–Leibler divergence