Information theory - Fundamental Theorems and Advanced Concepts
Understand the core theorems of information theory, the role of directed information, and its connections to thermodynamics.
Summary
Information Theory Theorems and Properties
Introduction
Information theory provides us with fundamental limits on how efficiently we can encode information, transmit it over noisy channels, and process it. This section covers the key theorems that form the foundation of modern communication systems: theorems that tell us the absolute best we can do, regardless of how clever our algorithms become.
The Source Coding Theorem
The Problem: When we want to store or transmit data, we need to represent it using bits. The question is: what's the absolute minimum number of bits required to encode our data without losing information?
The Source Coding Theorem provides the answer: the minimum average number of bits needed per symbol equals the entropy $H(X)$ of the source.
Understanding the Result
Shannon's entropy, defined as $H(X) = -\sum_x p(x) \log_2 p(x)$, isn't just an abstract measure—it's literally the fundamental limit on compression. If your source has entropy $H(X) = 3$ bits, then:
You cannot reliably compress it below 3 bits per symbol on average
But you can get arbitrarily close to 3 bits with the right encoding
Example: Consider a source that outputs either A or B. If A occurs with probability 0.99 and B with probability 0.01, the entropy is approximately 0.08 bits. This makes intuitive sense: A is almost always produced, so we need very few bits per symbol on average. In contrast, if both occur with probability 0.5, the entropy is 1 bit, and we'd need closer to 1 bit per symbol.
The key insight: entropy measures uncertainty. More uncertain sources require more bits to encode.
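The entropy values quoted above can be checked numerically. A minimal sketch (the `entropy` helper is my own naming, not from the text):

```python
from math import log2

def entropy(probs):
    """Shannon entropy H(X) in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.99, 0.01]))  # skewed A/B source: ~0.08 bits per symbol
print(entropy([0.5, 0.5]))    # fair coin: exactly 1 bit per symbol
```

The skewed source compresses to well under a tenth of a bit per symbol on average, matching the example in the text.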
The Channel Capacity Theorem
The Problem: Real-world communication channels are noisy—bits get flipped, signals get corrupted. How much information can we reliably transmit through a noisy channel?
The Channel Capacity Theorem answers this. For a discrete memoryless channel, the capacity is:
$$C = \max_{p(x)} I(X;Y)$$
where $I(X;Y)$ is the mutual information between input $X$ and output $Y$, and we maximize over all possible input distributions $p(x)$.
What This Means
Channel capacity $C$ is measured in bits per channel use. It represents the maximum rate at which we can transmit information such that the error probability can be made arbitrarily small (though we may need very long codes to achieve this).
Importantly, achieving capacity requires careful choice of the input distribution. For some channels, the optimal distribution is uniform; for others, it's not. The theorem guarantees capacity is achievable, but doesn't prescribe exactly how.
Why This Matters: If you try to send information faster than capacity, no amount of clever encoding will help—errors are unavoidable. If you send slower than capacity, error-free communication is theoretically possible.
Common Channel Example
For a binary symmetric channel (BSC) where bits are flipped with probability $p$: $$C = 1 - H(p)$$
where $H(p) = -p\log_2 p - (1-p)\log_2(1-p)$ is the entropy function. When $p = 0$ (no noise), $C = 1$ bit. When $p = 0.5$ (maximum noise), $C = 0$—you gain no information through the channel.
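The BSC capacity formula is easy to evaluate directly; a small sketch (function names are mine):

```python
from math import log2

def binary_entropy(p):
    """H(p) in bits, with the convention H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity C = 1 - H(p) of a binary symmetric channel with crossover p."""
    return 1 - binary_entropy(p)

print(bsc_capacity(0.0))  # noiseless channel: 1.0 bit per use
print(bsc_capacity(0.5))  # pure noise: 0.0 bits per use
```

Note how quickly capacity falls: even a 10% flip probability already costs almost half the channel's capacity.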
The Data Processing Inequality
The Principle: Information cannot increase as we process it. This seemingly simple idea is profound.
If we have a Markov chain $X \rightarrow Y \rightarrow Z$ (meaning $Z$ depends on $X$ only through $Y$), then:
$$I(X;Z) \le I(X;Y)$$
Why This Holds
When $Y$ is processed to produce $Z$, any information about $X$ can only stay the same or decrease. The processing of $Y$ cannot create new information about $X$—it can only lose or discard it. Mathematically, $Z$ cannot be more informative about $X$ than $Y$ itself is.
Practical Implications
This inequality appears throughout information theory:
In cascade channels: information degrades as it passes through successive noisy channels
In proofs: it's often used to establish upper bounds on what's achievable
In systems: it justifies why intermediate processing steps cannot improve end-to-end information rates
The equality $I(X;Z) = I(X;Y)$ holds only when $Z$ is a sufficient statistic for $X$ given $Y$—that is, no information about $X$ is lost in going from $Y$ to $Z$.
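The cascade-channel case can be made concrete with two binary symmetric channels in series. A sketch with assumed crossover probabilities and a uniform input $X$:

```python
from math import log2

def h2(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# X -> Y -> Z as two BSCs in cascade; p and q are assumed crossover probabilities.
p, q = 0.1, 0.2
r = p * (1 - q) + q * (1 - p)  # effective crossover of the cascade

I_XY = 1 - h2(p)  # information surviving the first channel
I_XZ = 1 - h2(r)  # information surviving both channels
assert I_XZ <= I_XY  # the data processing inequality holds
print(I_XY, I_XZ)
```

The second channel strictly degrades the information about $X$ here, as the inequality predicts; equality would require the second channel to be noiseless.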
Asymptotic Equipartition Property (AEP)
The Intuition: For a long sequence from a source, most sequences are "typical"—they behave like you'd expect from the source's probabilities. Atypical sequences are exceedingly rare.
Formally, for an i.i.d. source, as we encode longer and longer sequences of length $n$:
Almost all sequences fall within a typical set of size approximately $2^{nH(X)}$
Each typical sequence has probability approximately $2^{-nH(X)}$
The probability of seeing an atypical sequence decreases exponentially to zero
The Typical Set Explained
For large $n$, a sequence $(x_1, x_2, \ldots, x_n)$ is "typical" if its empirical distribution closely matches the true distribution $p(x)$. This means the actual number of times each symbol appears is close to what we'd expect.
Why This Matters: The AEP justifies the source coding theorem. To encode $n$ symbols reliably, we only need to distinguish among the $2^{nH(X)}$ typical sequences, not all possible sequences. This requires $nH(X)$ bits, or $H(X)$ bits per symbol on average.
Example
Suppose we have a coin with $P(H) = P(T) = 0.5$. For $n = 100$ flips:
Entropy is $H = 1$ bit
The typical set has size roughly $2^{100}$ sequences
We could encode any sequence of 100 flips using about 100 bits
A sequence with 99 heads and 1 tail is atypical (exponentially unlikely)
We almost never see it, so we don't need to encode it efficiently
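The concentration behind the AEP can be observed in simulation. A sketch using an assumed biased coin (a biased coin is used because for a fair coin every sequence has exactly probability $2^{-n}$, so the effect is trivial):

```python
import random
from math import log2

random.seed(0)
p, n = 0.3, 10_000  # assumed bias and sequence length
H = -p * log2(p) - (1 - p) * log2(1 - p)  # source entropy, ~0.881 bits

# Draw one long i.i.d. sequence and compute its per-symbol "surprise",
# -(1/n) log2 p(x^n); the AEP says this concentrates around H.
seq = [1 if random.random() < p else 0 for _ in range(n)]
rate = -sum(log2(p) if x else log2(1 - p) for x in seq) / n
print(abs(rate - H))  # small: the drawn sequence is typical
```

Rerunning with different seeds, the gap stays small essentially every time; sequences whose surprise rate is far from $H$ almost never appear.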
<extrainfo>
Relation to Central Limit Theorem
When using natural logarithms (base $e$, giving entropy in units of "nats" rather than bits), the entropy of a continuous distribution connects to properties of the Gaussian distribution. Specifically, among all continuous distributions with a fixed variance, the Gaussian has maximum differential entropy.
This connection deepens when considering sequences of many samples: the distribution of their normalized sums approaches a Gaussian (by the Central Limit Theorem), and the entropy of that normalized sum approaches the entropy of the limiting Gaussian—the maximum possible for its variance. While mathematically elegant, this relationship is often not central to discrete information theory courses.
</extrainfo>
<extrainfo>
Directed and Transfer Information
Beyond mutual information, directed information $I(X^n \to Y^n)$ captures the flow of information from a source sequence to an output sequence, accounting for causality and feedback. It's defined as:
$$I(X^n \to Y^n) = \sum_{k=1}^n I(X^k; Y_k \mid Y^{k-1})$$
This measure is particularly useful in:
Feedback channels: analyzing how feedback improves channel capacity
Portfolio theory: measuring information flow in financial systems
Data compression: optimal compression with side information
Hypothesis testing: sequential detection problems
Directed information generalizes mutual information in settings where the flow of information over time matters, making it valuable for dynamic systems. However, it's typically an advanced topic beyond core information theory courses.
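For small alphabets, the defining sum can be evaluated by brute force. A sketch for an assumed toy setup—two uses of a binary symmetric channel with i.i.d. uniform inputs and no feedback (all parameters and helper names are mine):

```python
from itertools import product
from math import log2

p_flip, n = 0.1, 2  # assumed BSC crossover probability and block length

def W(y, x):
    """BSC transition probability p(y|x)."""
    return 1 - p_flip if y == x else p_flip

# Joint pmf over (x^n, y^n): uniform i.i.d. inputs through a memoryless BSC.
joint = {}
for xs in product((0, 1), repeat=n):
    for ys in product((0, 1), repeat=n):
        prob = 1.0
        for x, y in zip(xs, ys):
            prob *= 0.5 * W(y, x)
        joint[(xs, ys)] = prob

def marginal(key_fn):
    """Marginalize the joint pmf onto key_fn((xs, ys))."""
    out = {}
    for k, p in joint.items():
        out[key_fn(k)] = out.get(key_fn(k), 0.0) + p
    return out

def cond_mi(k):
    """I(X^k ; Y_k | Y^{k-1}) by direct summation over the joint pmf."""
    p_xy = marginal(lambda t: (t[0][:k], t[1][:k]))        # p(x^k, y^k)
    p_y = marginal(lambda t: t[1][:k])                     # p(y^k)
    p_xyp = marginal(lambda t: (t[0][:k], t[1][:k - 1]))   # p(x^k, y^{k-1})
    p_yp = marginal(lambda t: t[1][:k - 1])                # p(y^{k-1})
    total = 0.0
    for (xs, ys), p in p_xy.items():
        if p > 0:
            total += p * log2(p * p_yp[ys[:k - 1]]
                              / (p_y[ys] * p_xyp[(xs, ys[:k - 1])]))
    return total

directed = sum(cond_mi(k) for k in range(1, n + 1))
# With no feedback this reduces to I(X^n; Y^n) = n * (1 - H(p_flip)).
print(directed)
```

The result matching $n \cdot (1 - H(p))$ illustrates the known fact that without feedback, directed information coincides with ordinary mutual information; the two diverge only when the input can depend on past outputs.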
</extrainfo>
<extrainfo>
Information Theory in Physics and Thermodynamics
Entropy in Thermodynamics vs. Information Theory
The mathematical form of thermodynamic entropy and Shannon entropy are remarkably similar:
Thermodynamic entropy: $S = -k_B \sum_i p_i \ln p_i$ (where $k_B$ is Boltzmann's constant)
Shannon entropy: $H(X) = -\sum_x p(x) \ln p(x)$ (measured in nats)
Despite identical mathematical structure, they describe different things. Thermodynamic entropy measures the physical disorder of a system, while Shannon entropy measures uncertainty in information. The connection is deep but interpretive: both describe the number of microscopic states consistent with macroscopic observations.
Maxwell's Demon and Landauer's Principle
Maxwell's demon is a thought experiment: a tiny being that could separate fast (hot) molecules from slow (cold) ones, apparently violating the second law of thermodynamics. The resolution involves information and computation.
Landauer's Principle states that erasing one bit of information necessarily dissipates at least $k_B T \ln 2$ of heat into the environment, where $T$ is temperature. This links the abstract notion of computational information to physical thermodynamic cost.
The implications:
Computation has thermodynamic limits
Logically irreversible operations, such as bit erasure, carry an unavoidable energy cost
The demon's trick fails because resetting its memory dissipates heat
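The Landauer bound is tiny at everyday temperatures, which a one-line calculation makes vivid (the choice of $T = 300\,\mathrm{K}$ is an assumed room temperature):

```python
from math import log

k_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)
T = 300.0           # assumed room temperature in kelvin

# Landauer bound: minimum heat dissipated to erase one bit.
E_bit = k_B * T * log(2)
print(E_bit)  # on the order of 1e-21 joules
```

Real hardware dissipates many orders of magnitude more than this per bit, so the bound is a fundamental limit rather than a practical engineering constraint today.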
These topics bridge information theory and physics, showing how abstract concepts connect to fundamental physical laws. They're intellectually fascinating but typically outside standard information theory exam material.
</extrainfo>
Flashcards
What is the minimum average number of bits required to encode a source without loss?
Its entropy $H(X)$ (where $H(X)$ is the entropy of the source $X$)
What is the mathematical definition for the capacity $C$ of a discrete memoryless channel?
$C = \max_{p(x)} I(X;Y)$ (where $I(X;Y)$ is the mutual information between input $X$ and output $Y$)
If $X \rightarrow Y \rightarrow Z$ form a Markov chain, what is the relationship between the mutual information $I(X;Z)$ and $I(X;Y)$?
$I(X;Z) \le I(X;Y)$
For large $n$, what is the approximate size of the typical set of sequences from an i.i.d. source?
$2^{nH(X)}$ (where $n$ is the sequence length and $H(X)$ is the entropy)
How do Thermodynamic entropy and Shannon entropy relate in terms of mathematical form and physical interpretation?
They share the same mathematical form but differ in physical interpretation
What is the minimum amount of heat dissipated when erasing one bit of information?
$k_B T \ln 2$ (where $k_B$ is the Boltzmann constant and $T$ is the temperature)
Quiz
Information theory - Fundamental Theorems and Advanced Concepts Quiz Question 1: What does the source coding theorem state about the minimum average number of bits required to encode a source without loss?
- It equals the source entropy $H(X)$. (correct)
- It is greater than the source entropy $H(X)$.
- It is less than the source entropy $H(X)$.
- It is unrelated to the source entropy.
Question 2: Directed information is commonly applied in which of the following fields?
- Portfolio theory (correct)
- Quantum error correction
- Image classification
- Natural language parsing
Question 3: According to Landauer’s principle, erasing one bit of information dissipates at least how much heat?
- $k_B\,T\ln 2$ (correct)
- $k_B\,T$
- $2\,k_B\,T\ln 2$
- Zero heat
Question 4: What is the formula for the capacity C of a discrete memoryless channel?
- C = max_{p(x)} I(X;Y) (correct)
- C = min_{p(x)} H(Y|X)
- C = I(X;Y) for a uniform input distribution
- C = H(Y) - H(Y|X)
Question 5: If X → Y → Z form a Markov chain, which inequality is always true?
- I(X;Z) ≤ I(X;Y) (correct)
- I(X;Z) ≥ I(X;Y)
- I(Y;Z) ≤ I(X;Y)
- I(X;Y) = I(Y;Z)
Question 6: According to the Asymptotic Equipartition Property, the size of the typical set for an i.i.d. source of length n is approximately what?
- 2^{nH(X)} (correct)
- 2^{nH(Y)}
- 2^{nI(X;Y)}
- 2^{n}
Question 7: When entropy is measured in nats (log base e), its connection to the Central Limit Theorem involves which distribution?
- Gaussian distribution (correct)
- Poisson distribution
- Uniform distribution
- Exponential distribution
Key Concepts
Information Theory
Source coding theorem
Channel capacity theorem
Data processing inequality
Asymptotic equipartition property
Directed information
Shannon entropy
Thermodynamics and Information
Thermodynamic entropy
Maxwell’s demon
Landauer’s principle
Definitions
Source coding theorem
States that the minimum average number of bits required to losslessly encode a source equals its entropy \(H(X)\).
Channel capacity theorem
Gives the maximum reliable communication rate of a discrete memoryless channel as \(C = \max_{p(x)} I(X;Y)\).
Data processing inequality
Asserts that processing data cannot increase mutual information: for \(X\rightarrow Y\rightarrow Z\), \(I(X;Z) \le I(X;Y)\).
Asymptotic equipartition property
Shows that long i.i.d. sequences are almost uniformly likely within a typical set of size \(2^{nH(X)}\).
Directed information
Measures the flow of information in systems with feedback, quantifying causal influence from one process to another.
Shannon entropy
Quantifies the average uncertainty or information content of a random variable in bits (or nats).
Thermodynamic entropy
A physical quantity representing the degree of disorder or energy dispersal in a thermodynamic system.
Maxwell’s demon
A thought experiment that seemingly violates the second law of thermodynamics by using information to extract work.
Landauer’s principle
States that erasing one bit of information dissipates at least \(k_B T \ln 2\) of heat, linking computation to thermodynamics.