Information theory - Fundamental Theorems and Advanced Concepts
Understand the core theorems of information theory, the role of directed information, and its connections to thermodynamics.
Summary
Information Theory Theorems and Properties
Introduction
Information theory provides us with fundamental limits on how efficiently we can encode information, transmit it over noisy channels, and process it. This section covers the key theorems that form the foundation of modern communication systems: theorems that tell us the absolute best we can do, regardless of how clever our algorithms become.
The Source Coding Theorem
The Problem: When we want to store or transmit data, we need to represent it using bits. The question is: what's the absolute minimum number of bits required to encode our data without losing information?
The Source Coding Theorem provides the answer: the minimum average number of bits needed per symbol equals the entropy $H(X)$ of the source.
Understanding the Result
Shannon's entropy, defined as $H(X) = -\sum_x p(x) \log_2 p(x)$, isn't just an abstract measure—it's literally the fundamental limit on compression. If your source has entropy $H(X) = 3$ bits, then:
You cannot reliably compress it below 3 bits per symbol on average
But you can get arbitrarily close to 3 bits with the right encoding
Example: Consider a source that outputs either A or B. If A occurs with probability 0.99 and B with probability 0.01, the entropy is approximately 0.08 bits. This makes intuitive sense: A is almost always produced, so we need very few bits per symbol on average. In contrast, if both occur with probability 0.5, the entropy is 1 bit, and we'd need closer to 1 bit per symbol.
The key insight: entropy measures uncertainty. More uncertain sources require more bits to encode.
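The entropy values quoted above can be checked numerically. A minimal sketch (the `entropy` helper is my own naming, not from the text):

```python
from math import log2

def entropy(probs):
    """Shannon entropy H(X) in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.99, 0.01]))  # skewed A/B source: ~0.08 bits per symbol
print(entropy([0.5, 0.5]))    # fair coin: exactly 1 bit per symbol
```

The skewed source compresses to well under a tenth of a bit per symbol on average, matching the example in the text.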
The Channel Capacity Theorem
The Problem: Real-world communication channels are noisy—bits get flipped, signals get corrupted. How much information can we reliably transmit through a noisy channel?
The Channel Capacity Theorem answers this. For a discrete memoryless channel, the capacity is:
$$C = \max_{p(x)} I(X;Y)$$
where $I(X;Y)$ is the mutual information between input $X$ and output $Y$, and we maximize over all possible input distributions $p(x)$.
What This Means
Channel capacity $C$ is measured in bits per channel use. It represents the maximum rate at which we can transmit information such that the error probability can be made arbitrarily small (though we may need very long codes to achieve this).
Importantly, achieving capacity requires careful choice of the input distribution. For some channels, the optimal distribution is uniform; for others, it's not. The theorem guarantees capacity is achievable, but doesn't prescribe exactly how.
Why This Matters: If you try to send information faster than capacity, no amount of clever encoding will help—errors are unavoidable. If you send slower than capacity, error-free communication is theoretically possible.
Common Channel Example
For a binary symmetric channel (BSC) where bits are flipped with probability $p$: $$C = 1 - H(p)$$
where $H(p) = -p\log_2 p - (1-p)\log_2(1-p)$ is the entropy function. When $p = 0$ (no noise), $C = 1$ bit. When $p = 0.5$ (maximum noise), $C = 0$—you gain no information through the channel.
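The BSC capacity formula is easy to evaluate directly; a small sketch (function names are mine):

```python
from math import log2

def binary_entropy(p):
    """H(p) in bits, with the convention H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity C = 1 - H(p) of a binary symmetric channel with crossover p."""
    return 1 - binary_entropy(p)

print(bsc_capacity(0.0))  # noiseless channel: 1.0 bit per use
print(bsc_capacity(0.5))  # pure noise: 0.0 bits per use
```

Note how quickly capacity falls: even a 10% flip probability already costs almost half the channel's capacity.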
The Data Processing Inequality
The Principle: Information cannot increase as we process it. This seemingly simple idea is profound.
If we have a Markov chain $X \rightarrow Y \rightarrow Z$ (meaning $Z$ depends on $X$ only through $Y$), then:
$$I(X;Z) \le I(X;Y)$$
Why This Holds
When $Y$ is processed to produce $Z$, any information about $X$ can only stay the same or decrease. The processing of $Y$ cannot create new information about $X$—it can only lose or discard it. Mathematically, $Z$ cannot be more informative about $X$ than $Y$ itself is.
Practical Implications
This inequality appears throughout information theory:
In cascade channels: information degrades as it passes through successive noisy channels
In proofs: it's often used to establish upper bounds on what's achievable
In systems: it justifies why intermediate processing steps cannot improve end-to-end information rates
The equality $I(X;Z) = I(X;Y)$ holds only when $Z$ is a sufficient statistic for $X$ given $Y$—that is, no information about $X$ is lost in going from $Y$ to $Z$.
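The cascade-channel case can be made concrete with two binary symmetric channels in series. A sketch with assumed crossover probabilities and a uniform input $X$:

```python
from math import log2

def h2(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# X -> Y -> Z as two BSCs in cascade; p and q are assumed crossover probabilities.
p, q = 0.1, 0.2
r = p * (1 - q) + q * (1 - p)  # effective crossover of the cascade

I_XY = 1 - h2(p)  # information surviving the first channel
I_XZ = 1 - h2(r)  # information surviving both channels
assert I_XZ <= I_XY  # the data processing inequality holds
print(I_XY, I_XZ)
```

The second channel strictly degrades the information about $X$ here, as the inequality predicts; equality would require the second channel to be noiseless.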
Asymptotic Equipartition Property (AEP)
The Intuition: For a long sequence from a source, most sequences are "typical"—they behave like you'd expect from the source's probabilities. Atypical sequences are exceedingly rare.
Formally, for an i.i.d. source, as we encode longer and longer sequences of length $n$:
Almost all sequences fall within a typical set of size approximately $2^{nH(X)}$
Each typical sequence has probability approximately $2^{-nH(X)}$
The probability of seeing an atypical sequence decreases exponentially to zero
The Typical Set Explained
For large $n$, a sequence $(x_1, x_2, \ldots, x_n)$ is "typical" if its empirical distribution closely matches the true distribution $p(x)$. This means the actual number of times each symbol appears is close to what we'd expect.
Why This Matters: The AEP justifies the source coding theorem. To encode $n$ symbols reliably, we only need to distinguish among the $2^{nH(X)}$ typical sequences, not all possible sequences. This requires $nH(X)$ bits, or $H(X)$ bits per symbol on average.
Example
Suppose we have a coin with $P(H) = P(T) = 0.5$. For $n = 100$ flips:
Entropy is $H = 1$ bit
The typical set has size roughly $2^{100}$ sequences
We could encode any sequence of 100 flips using about 100 bits
A sequence with 99 heads and 1 tail is atypical (exponentially unlikely)
We almost never see it, so we don't need to encode it efficiently
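The concentration behind the AEP can be observed in simulation. A sketch using an assumed biased coin (a biased coin is used because for a fair coin every sequence has exactly probability $2^{-n}$, so the effect is trivial):

```python
import random
from math import log2

random.seed(0)
p, n = 0.3, 10_000  # assumed bias and sequence length
H = -p * log2(p) - (1 - p) * log2(1 - p)  # source entropy, ~0.881 bits

# Draw one long i.i.d. sequence and compute its per-symbol "surprise",
# -(1/n) log2 p(x^n); the AEP says this concentrates around H.
seq = [1 if random.random() < p else 0 for _ in range(n)]
rate = -sum(log2(p) if x else log2(1 - p) for x in seq) / n
print(abs(rate - H))  # small: the drawn sequence is typical
```

Rerunning with different seeds, the gap stays small essentially every time; sequences whose surprise rate is far from $H$ almost never appear.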
<extrainfo>
Relation to Central Limit Theorem
When using natural logarithms (base $e$, giving entropy in units of "nats" rather than bits), the entropy of a continuous distribution connects to properties of the Gaussian distribution. Specifically, among all continuous distributions with a fixed variance, the Gaussian has maximum differential entropy.
This connection deepens when considering sequences of many samples: the distribution of their normalized sums approaches a Gaussian (by the Central Limit Theorem), and the entropy of that normalized sum approaches the entropy of the limiting Gaussian—the maximum possible for its variance. While mathematically elegant, this relationship is often not central to discrete information theory courses.
</extrainfo>
<extrainfo>
Directed and Transfer Information
Beyond mutual information, directed information $I(X^n \to Y^n)$ captures the flow of information from a source sequence to an output sequence, accounting for causality and feedback. It's defined as:
$$I(X^n \to Y^n) = \sum_{k=1}^n I(X^k; Y_k \mid Y^{k-1})$$
This measure is particularly useful in:
Feedback channels: analyzing how feedback improves channel capacity
Portfolio theory: measuring information flow in financial systems
Data compression: optimal compression with side information
Hypothesis testing: sequential detection problems
Directed information generalizes mutual information in settings where the flow of information over time matters, making it valuable for dynamic systems. However, it's typically an advanced topic beyond core information theory courses.
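For small alphabets, the defining sum can be evaluated by brute force. A sketch for an assumed toy setup—two uses of a binary symmetric channel with i.i.d. uniform inputs and no feedback (all parameters and helper names are mine):

```python
from itertools import product
from math import log2

p_flip, n = 0.1, 2  # assumed BSC crossover probability and block length

def W(y, x):
    """BSC transition probability p(y|x)."""
    return 1 - p_flip if y == x else p_flip

# Joint pmf over (x^n, y^n): uniform i.i.d. inputs through a memoryless BSC.
joint = {}
for xs in product((0, 1), repeat=n):
    for ys in product((0, 1), repeat=n):
        prob = 1.0
        for x, y in zip(xs, ys):
            prob *= 0.5 * W(y, x)
        joint[(xs, ys)] = prob

def marginal(key_fn):
    """Marginalize the joint pmf onto key_fn((xs, ys))."""
    out = {}
    for k, p in joint.items():
        out[key_fn(k)] = out.get(key_fn(k), 0.0) + p
    return out

def cond_mi(k):
    """I(X^k ; Y_k | Y^{k-1}) by direct summation over the joint pmf."""
    p_xy = marginal(lambda t: (t[0][:k], t[1][:k]))        # p(x^k, y^k)
    p_y = marginal(lambda t: t[1][:k])                     # p(y^k)
    p_xyp = marginal(lambda t: (t[0][:k], t[1][:k - 1]))   # p(x^k, y^{k-1})
    p_yp = marginal(lambda t: t[1][:k - 1])                # p(y^{k-1})
    total = 0.0
    for (xs, ys), p in p_xy.items():
        if p > 0:
            total += p * log2(p * p_yp[ys[:k - 1]]
                              / (p_y[ys] * p_xyp[(xs, ys[:k - 1])]))
    return total

directed = sum(cond_mi(k) for k in range(1, n + 1))
# With no feedback this reduces to I(X^n; Y^n) = n * (1 - H(p_flip)).
print(directed)
```

The result matching $n \cdot (1 - H(p))$ illustrates the known fact that without feedback, directed information coincides with ordinary mutual information; the two diverge only when the input can depend on past outputs.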
</extrainfo>
<extrainfo>
Information Theory in Physics and Thermodynamics
Entropy in Thermodynamics vs. Information Theory
The mathematical form of thermodynamic entropy and Shannon entropy are remarkably similar:
Thermodynamic entropy: $S = -k_B \sum_i p_i \ln p_i$ (where $k_B$ is Boltzmann's constant)
Shannon entropy: $H(X) = -\sum_x p(x) \ln p(x)$ (measured in nats)
Despite identical mathematical structure, they describe different things. Thermodynamic entropy measures the physical disorder of a system, while Shannon entropy measures uncertainty in information. The connection is deep but interpretive: both describe the number of microscopic states consistent with macroscopic observations.
Maxwell's Demon and Landauer's Principle
Maxwell's demon is a thought experiment: a tiny being that could separate fast (hot) molecules from slow (cold) ones, apparently violating the second law of thermodynamics. The resolution involves information and computation.
Landauer's Principle states that erasing one bit of information necessarily dissipates at least $k_B T \ln 2$ of heat into the environment, where $T$ is temperature. This links the abstract notion of computational information to physical thermodynamic cost.
The implications:
Computation has thermodynamic limits
Logically irreversible operations, such as bit erasure, carry an unavoidable energy cost
The demon's trick fails because resetting its memory dissipates heat
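The Landauer bound is tiny at everyday temperatures, which a one-line calculation makes vivid (the choice of $T = 300\,\mathrm{K}$ is an assumed room temperature):

```python
from math import log

k_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)
T = 300.0           # assumed room temperature in kelvin

# Landauer bound: minimum heat dissipated to erase one bit.
E_bit = k_B * T * log(2)
print(E_bit)  # on the order of 1e-21 joules
```

Real hardware dissipates many orders of magnitude more than this per bit, so the bound is a fundamental limit rather than a practical engineering constraint today.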
These topics bridge information theory and physics, showing how abstract concepts connect to fundamental physical laws. They're intellectually fascinating but typically outside standard information theory exam material.
</extrainfo>
Flashcards
What is the minimum average number of bits required to encode a source without loss?
Its entropy $H(X)$ (where $H(X)$ is the entropy of the source $X$)
What is the mathematical definition for the capacity $C$ of a discrete memoryless channel?
$C = \max_{p(x)} I(X;Y)$ (where $I(X;Y)$ is the mutual information between input $X$ and output $Y$)
If $X \rightarrow Y \rightarrow Z$ form a Markov chain, what is the relationship between the mutual information $I(X;Z)$ and $I(X;Y)$?
$I(X;Z) \le I(X;Y)$
For large $n$, what is the approximate size of the typical set of sequences from an i.i.d. source?
$2^{nH(X)}$ (where $n$ is the sequence length and $H(X)$ is the entropy)
How do Thermodynamic entropy and Shannon entropy relate in terms of mathematical form and physical interpretation?
They share the same mathematical form but differ in physical interpretation
What is the minimum amount of heat dissipated when erasing one bit of information?
$k_B T \ln 2$ (where $k_B$ is the Boltzmann constant and $T$ is the temperature)
Quiz
Information theory - Fundamental Theorems and Advanced Concepts Quiz Question 1: What does the source coding theorem state about the minimum average number of bits required to encode a source without loss?
- It equals the source entropy $H(X)$. (correct)
- It is greater than the source entropy $H(X)$.
- It is less than the source entropy $H(X)$.
- It is unrelated to the source entropy.
Question 2: Directed information is commonly applied in which of the following fields?
- Portfolio theory (correct)
- Quantum error correction
- Image classification
- Natural language parsing
Question 3: According to Landauer’s principle, erasing one bit of information dissipates at least how much heat?
- $k_B\,T\ln 2$ (correct)
- $k_B\,T$
- $2\,k_B\,T\ln 2$
- Zero heat
Question 4: What is the formula for the capacity C of a discrete memoryless channel?
- C = max_{p(x)} I(X;Y) (correct)
- C = min_{p(x)} H(Y|X)
- C = I(X;Y) for a uniform input distribution
- C = H(Y) - H(Y|X)
Question 5: If X → Y → Z form a Markov chain, which inequality is always true?
- I(X;Z) ≤ I(X;Y) (correct)
- I(X;Z) ≥ I(X;Y)
- I(Y;Z) ≤ I(X;Y)
- I(X;Y) = I(Y;Z)
Question 6: According to the Asymptotic Equipartition Property, the size of the typical set for an i.i.d. source of length n is approximately what?
- 2^{nH(X)} (correct)
- 2^{nH(Y)}
- 2^{nI(X;Y)}
- 2^{n}
Question 7: When entropy is measured in nats (log base e), its connection to the Central Limit Theorem involves which distribution?
- Gaussian distribution (correct)
- Poisson distribution
- Uniform distribution
- Exponential distribution
Key Concepts
Information Theory
Source coding theorem
Channel capacity theorem
Data processing inequality
Asymptotic equipartition property
Directed information
Shannon entropy
Thermodynamics and Information
Thermodynamic entropy
Maxwell’s demon
Landauer’s principle
Definitions
Source coding theorem
States that the minimum average number of bits required to losslessly encode a source equals its entropy \(H(X)\).
Channel capacity theorem
Gives the maximum reliable communication rate of a discrete memoryless channel as \(C = \max_{p(x)} I(X;Y)\).
Data processing inequality
Asserts that processing data cannot increase mutual information: for \(X\rightarrow Y\rightarrow Z\), \(I(X;Z) \le I(X;Y)\).
Asymptotic equipartition property
Shows that long i.i.d. sequences are almost uniformly likely within a typical set of size \(2^{nH(X)}\).
Directed information
Measures the flow of information in systems with feedback, quantifying causal influence from one process to another.
Shannon entropy
Quantifies the average uncertainty or information content of a random variable in bits (or nats).
Thermodynamic entropy
A physical quantity representing the degree of disorder or energy dispersal in a thermodynamic system.
Maxwell’s demon
A thought experiment that seemingly violates the second law of thermodynamics by using information to extract work.
Landauer’s principle
States that erasing one bit of information dissipates at least \(k_B T \ln 2\) of heat, linking computation to thermodynamics.