RemNote Community

Information theory - Core Information Measures

Understand entropy, mutual information, and KL divergence as fundamental measures of uncertainty and information in information theory.

Summary

Core Information Theory Quantities

Introduction

Information theory provides mathematical tools to measure and quantify uncertainty. At its heart are a few core quantities that describe randomness, predictability, and how much we can learn by observing events. Mastering these concepts (entropy, mutual information, and divergence) forms the foundation for understanding everything from data compression to communication channels.

Entropy: Measuring Uncertainty

Entropy quantifies how much uncertainty or randomness exists in an information source. For a discrete random variable $X$ with possible outcomes and their probabilities, Shannon entropy is defined as:

$$H(X) = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the probability of the $i$-th outcome.

Understanding What Entropy Measures

Entropy answers a fundamental question: on average, how surprised should we be by the next symbol we see? The more unpredictable the source, the higher the entropy.

Self-information (also called surprisal) captures this idea for a single outcome. When you observe a symbol with probability $p$, its self-information is:

$$\text{Self-information} = -\log_2 p$$

A low-probability event carries more information (higher surprise) than a high-probability event. Entropy is the average self-information across all outcomes.

Key Properties of Entropy

Entropy is maximized when all symbols are equally likely. If you have $n$ possible symbols, the maximum entropy is:

$$H_{\max} = \log_2 n$$

For example, a fair coin flip (two equally likely outcomes) has maximum entropy of $\log_2 2 = 1$ bit. A fair six-sided die has maximum entropy of $\log_2 6 \approx 2.58$ bits. Conversely, entropy is zero when the outcome is completely predictable, that is, when one symbol has probability 1 and all others have probability 0.
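To make the entropy and self-information formulas concrete, here is a minimal Python sketch; the helper names `entropy` and `self_information` are my own for illustration, not from the source.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum_i p_i * log2(p_i).

    Terms with p = 0 contribute nothing (the limit p*log2(p) -> 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def self_information(p):
    """Surprisal of a single outcome with probability p, in bits."""
    return -math.log2(p)

# A fair coin reaches the maximum entropy for 2 symbols: log2(2) = 1 bit
print(entropy([0.5, 0.5]))                # 1.0
# A fair six-sided die: log2(6) ~ 2.585 bits
print(entropy([1/6] * 6))
# A fully predictable source has zero entropy
print(entropy([1.0, 0.0]) == 0)           # True
# Rare events carry more self-information than common ones
print(self_information(0.01) > self_information(0.5))  # True
```

Skipping zero-probability terms in the sum mirrors the usual convention $0 \log 0 = 0$.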
Units Matter

The logarithm base determines the unit:

Base-2 logarithm → bits (shannons)
Natural logarithm (base $e$) → nats
Base-10 logarithm → decimal digits (hartleys)

Most of computer science works in bits, while physics and mathematics often use nats. Always note which unit is being used; the numerical value changes with the base.

The Binary Entropy Function

A particularly important special case is when you have just two outcomes. The binary entropy function describes the entropy of a Bernoulli trial (a single yes/no event):

$$H_{\text{binary}}(p) = -p\log_2 p - (1-p)\log_2(1-p)$$

where $p$ is the probability of one outcome and $(1-p)$ is the probability of the other. This function has convenient properties: it is zero when $p = 0$ or $p = 1$ (no uncertainty), and it reaches its maximum of 1 bit when $p = 0.5$ (maximum uncertainty). This curve is so common in information theory that you should develop an intuition for its shape.

Joint Entropy: Multiple Variables

When you have two random variables $X$ and $Y$, their joint entropy measures the total uncertainty in both:

$$H(X,Y) = -\sum_{x,y} p(x,y)\log_2 p(x,y)$$

where $p(x,y)$ is the probability of the pair $(x,y)$ occurring together.

Important insight: if $X$ and $Y$ are independent, then:

$$H(X,Y) = H(X) + H(Y)$$

This makes intuitive sense: if the variables are unrelated, the total uncertainty is simply the sum of their individual uncertainties. If they are correlated, however, $H(X,Y)$ will be less than $H(X) + H(Y)$, because knowing one variable gives you information about the other.

Conditional Entropy: Uncertainty Given Information

What if you already know the value of $Y$? How much uncertainty remains about $X$? This is measured by conditional entropy:

$$H(X|Y) = -\sum_{x,y} p(x,y)\log_2 p(x|y)$$

This is the average entropy of $X$ across all possible values of $Y$, weighted by how likely each value of $Y$ is.
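A short Python sketch of the binary entropy function and joint entropy, checked against the stated properties (zero at the endpoints, 1 bit at $p = 0.5$, additivity for independent variables); the function names are illustrative, not from the source.

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def joint_entropy(joint):
    """H(X,Y) = -sum_{x,y} p(x,y) * log2 p(x,y) for a dict {(x, y): prob}."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

# Maximum uncertainty at p = 0.5, none at the endpoints
print(binary_entropy(0.5))   # 1.0
print(binary_entropy(0.0))   # 0.0

# Two independent fair coins: H(X,Y) = H(X) + H(Y) = 1 + 1 = 2 bits
indep = {(x, y): 0.25 for x in "HT" for y in "HT"}
print(joint_entropy(indep))  # 2.0
```

Sweeping `binary_entropy` over $p \in [0, 1]$ traces the symmetric hump-shaped curve described above.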
A crucial relationship connects these quantities:

$$H(X,Y) = H(X) + H(Y|X)$$

This can be rewritten as:

$$H(Y|X) = H(X,Y) - H(X)$$

This relationship captures the intuition: the joint uncertainty equals the uncertainty in $X$ plus whatever additional uncertainty remains in $Y$ once you know $X$.

Mutual Information: How Much Two Variables Tell Each Other

Mutual information quantifies how much knowing one variable reduces uncertainty about another. It is one of the most important concepts in information theory:

$$I(X;Y) = H(X) - H(X|Y)$$

This reads as: "the mutual information between $X$ and $Y$ equals the original entropy of $X$ minus the remaining entropy of $X$ after observing $Y$." In other words, it is the reduction in uncertainty about $X$ caused by learning $Y$.

Multiple Formulations

Mutual information can be expressed in several equivalent ways:

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$$

It can also be written directly in terms of probabilities:

$$I(X;Y) = \sum_{x,y} p(x,y)\log_2 \frac{p(x,y)}{p(x)p(y)}$$

Key Properties

Mutual information is symmetric: $I(X;Y) = I(Y;X)$. Learning about $Y$ reduces uncertainty in $X$ by exactly the same amount that learning about $X$ reduces uncertainty in $Y$.

Mutual information is always non-negative: $I(X;Y) \geq 0$, and it equals zero if and only if $X$ and $Y$ are independent.

Mutual information is bounded: $I(X;Y) \leq \min(H(X), H(Y))$. You cannot learn more about $Y$ than it contains in total.

Kullback-Leibler Divergence: Measuring Distribution Mismatch

Suppose you are trying to compress data that actually comes from distribution $P$, but you mistakenly think it comes from distribution $Q$. How many extra bits per symbol will you need? The Kullback-Leibler (KL) divergence answers this:

$$D_{\text{KL}}(P \| Q) = \sum_{i} p_i \log_2 \frac{p_i}{q_i}$$

where the $p_i$ are the true probabilities and the $q_i$ are your assumed probabilities.

Interpretation

KL divergence measures how different $P$ is from $Q$.
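The equivalent formulations of mutual information can be checked numerically. Below is a small sketch, assuming an arbitrary correlated joint distribution of my own choosing (not from the source), that computes $I(X;Y)$ both as $H(X) + H(Y) - H(X,Y)$ and via the probability-ratio sum.

```python
import math

def H(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An illustrative correlated joint distribution p(x, y) over {0,1} x {0,1}
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y)
px = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
py = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

Hxy = H(list(joint.values()))
Hx, Hy = H(px), H(py)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
I = Hx + Hy - Hxy

# Same value from I(X;Y) = sum p(x,y) * log2( p(x,y) / (p(x) p(y)) )
I2 = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

print(abs(I - I2) < 1e-9)   # True: the two formulations agree
print(I > 0)                # True: X and Y are correlated, not independent
# Chain rule: H(Y|X) = H(X,Y) - H(X), and conditioning cannot raise entropy
print(Hxy - Hx <= Hy)       # True
```

Replacing `joint` with a product of its marginals would drive both `I` and `I2` to zero, matching the independence property.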
It is the average number of extra bits required when you use the wrong model. When $P = Q$ everywhere, $D_{\text{KL}}(P \| Q) = 0$.

Important Property: KL Divergence is Not Symmetric

Note that $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ in general. The direction matters: $D_{\text{KL}}(P \| Q)$ penalizes $Q$ more heavily for assigning low probability to events that actually have high probability under $P$. This asymmetry makes sense: if the true distribution has a high-probability event that your model thinks is rare, you waste many bits encoding it; if your model instead wastes bits on rare events, that is less costly on average.

The Entropy Bound Theorem

A fundamental upper bound on entropy states that for any discrete random variable $X$ with support of size $|\mathcal{X}|$:

$$H(X) \leq \log_2 |\mathcal{X}|$$

Equality holds if and only if $X$ is uniformly distributed over all possible outcomes. This should make intuitive sense: maximum entropy occurs when all outcomes are equally likely. If any outcome is more likely than the others, you have less uncertainty, because you can predict slightly better. For example, a random variable over 8 possible outcomes can have at most $\log_2 8 = 3$ bits of entropy, achieved only when each outcome has probability $1/8$.

How These Concepts Fit Together

These quantities form a coherent mathematical framework. Here are the key relationships:

Entropy ($H$) measures raw uncertainty in a single variable or joint distribution.
Conditional entropy ($H(Y|X)$) tells us how much uncertainty remains after removing the information from another variable.
Mutual information ($I(X;Y)$) quantifies the connection between variables: how much one variable reveals about another.
KL divergence ($D_{\text{KL}}$) measures how far a guessed model strays from reality.
The entropy bound provides an absolute limit on how uncertain any variable can be.

Understanding these relationships will help you recognize which quantity you need for any given problem. Are you measuring total uncertainty?
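A quick Python sketch of KL divergence and the entropy bound; the two example distributions `P` and `Q` are arbitrary choices of mine to demonstrate the asymmetry, not from the source.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log2(p_i / q_i), in bits.

    Assumes q_i > 0 wherever p_i > 0 (absolute continuity)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

P = [0.7, 0.2, 0.1]   # "true" distribution
Q = [0.4, 0.3, 0.3]   # assumed (wrong) model

print(kl_divergence(P, P))                          # 0.0: identical distributions
print(kl_divergence(P, Q) > 0)                      # True: non-negative, P != Q
# Asymmetry: the direction matters
print(kl_divergence(P, Q) == kl_divergence(Q, P))   # False

# Entropy bound: H(X) <= log2 |support|, equality only for uniform
print(H(P) <= math.log2(3))                          # True
print(abs(H([1/3] * 3) - math.log2(3)) < 1e-9)       # True: uniform hits the bound
```

The `pi > 0` filter encodes the convention $0 \log(0/q) = 0$; a model that assigns $q_i = 0$ to a possible event would make the divergence infinite.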
Uncertainty conditional on something else? How much two variables reveal about each other? The mathematical framework provides the right tool for each question.
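One unifying observation that follows from the formulas above: the probability-ratio form of $I(X;Y)$ is exactly the KL divergence between the joint distribution $p(x,y)$ and the product of marginals $p(x)p(y)$, which is why independence gives zero mutual information. A minimal sketch, using an illustrative joint distribution of my own:

```python
import math

def kl_bits(p, q):
    """D_KL(P || Q) in bits, for two aligned lists of probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative correlated joint distribution over {0,1} x {0,1}
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {v: sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)}
py = {v: sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)}

pairs = sorted(joint)
p_joint = [joint[k] for k in pairs]
p_prod = [px[x] * py[y] for x, y in pairs]   # what independence would predict

# I(X;Y) = D_KL( p(x,y) || p(x)p(y) )
mi = kl_bits(p_joint, p_prod)
print(mi > 0)   # True: the joint departs from independence
```

Viewed this way, mutual information measures how far a joint distribution strays from the "independent" model, in the same extra-bits currency as any other KL divergence.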
Flashcards
What is the formula for Shannon entropy $H$ in bits per symbol?
$H = -\sum_{i} p_i \log_2 p_i$ (where $p_i$ is the probability of the $i$-th symbol)
Under what condition is the entropy of $n$ symbols maximized?
When all symbols are equiprobable
What is the formula for the maximum entropy $H_{\max}$ of $n$ symbols?
$H_{\max} = \log_2 n$
According to the Entropy Bound Theorem, when does $H(X) = \log_2 |\mathcal{X}|$?
If and only if $X$ is uniformly distributed over its support
What is the formula for the self-information (surprisal) of a symbol with probability $p$?
$-\log_2 p$
What are the common units of entropy and their corresponding logarithm bases?
Bits (shannons): base-2 logarithm; nats: natural logarithm; hartleys (decimal digits): base-10 logarithm
What is the formula for the entropy of a Bernoulli trial with probability $p$ of one outcome?
$H_{\text{binary}}(p) = -p\log_2 p - (1-p)\log_2(1-p)$
What is the formula for the joint entropy $H(X,Y)$ of two discrete variables?
$H(X,Y) = -\sum_{x,y} p(x,y)\log_2 p(x,y)$
If $X$ and $Y$ are independent, what is the relationship between their joint entropy and individual entropies?
$H(X,Y) = H(X) + H(Y)$
What is the formula for the conditional entropy $H(X|Y)$?
$H(X|Y) = -\sum_{x,y} p(x,y)\log_2 p(x|y)$
What concept quantifies the reduction in uncertainty of $X$ after observing $Y$?
Mutual information $I(X;Y)$
What are three ways to express mutual information $I(X;Y)$ using entropy and conditional entropy?
$I(X;Y) = H(X) - H(X|Y)$; $I(X;Y) = H(Y) - H(Y|X)$; $I(X;Y) = H(X) + H(Y) - H(X,Y)$
What property describes the relationship between $I(X;Y)$ and $I(Y;X)$?
Symmetry ($I(X;Y) = I(Y;X)$)
What is the summation formula for mutual information $I(X;Y)$ using probability distributions?
$I(X;Y) = \sum_{x,y} p(x,y)\log_2 \frac{p(x,y)}{p(x)p(y)}$
What is the formula for KL divergence $D_{\text{KL}}(P\|Q)$ between a true distribution $P$ and assumed distribution $Q$?
$D_{\text{KL}}(P\|Q) = \sum_{i} p_i \log_2 \frac{p_i}{q_i}$
In terms of data compression, what does KL divergence measure?
The average extra bits required when using the wrong model ($Q$) instead of the true model ($P$)
Which measure extends the concept of entropy to continuous distributions?
Differential entropy
What does conditional mutual information measure?
The mutual information between two variables conditioned on a third variable

Key Concepts
Entropy Concepts
Shannon entropy
Self‑information
Binary entropy function
Joint entropy
Conditional entropy
Differential entropy
Entropy bound theorem
Information Measures
Mutual information
Conditional mutual information
Kullback–Leibler divergence