Information Theory Study Guide
📖 Core Concepts
Information – reduction of uncertainty about a random variable; measured in bits (log₂), nats (ln), or hartleys (log₁₀).
Entropy \(H\) – average "surprisal" of a source: \(H(X)= -\sum_{x} p(x)\log_2 p(x)\). Higher \(H\) → more unpredictable.
Self‑information – surprisal of a single outcome: \(-\log_2 p\).
Joint Entropy \(H(X,Y)\) – uncertainty about a pair of variables together.
Conditional Entropy \(H(X|Y)\) – remaining uncertainty of \(X\) after observing \(Y\).
Mutual Information \(I(X;Y)\) – reduction in uncertainty of one variable given the other: \(I = H(X)-H(X|Y) = H(Y)-H(Y|X) = H(X)+H(Y)-H(X,Y)\).
Kullback–Leibler (KL) Divergence \(D_{\text{KL}}(P\|Q)\) – extra bits needed when using the wrong distribution: \(\displaystyle D_{\text{KL}}(P\|Q)=\sum_i p_i\log_2\frac{p_i}{q_i}\).
Channel Capacity \(C\) – maximum achievable mutual information over all input distributions: \(C=\max_{p(x)} I(X;Y)\).
Source Coding Theorem – the minimum achievable average rate for lossless compression equals the source entropy \(H\) (bits per symbol).
Channel Coding Theorem – reliable communication is possible iff the transmission rate \(R\) is less than \(C\).
📌 Must Remember
Entropy formula: \(H = -\sum_i p_i \log_2 p_i\).
Binary entropy: \(H_{\text{bin}}(p)= -p\log_2 p -(1-p)\log_2(1-p)\).
Maximum entropy: \(H_{\max}= \log_2 n\) for \(n\) equiprobable symbols.
Channel capacity of BSC: \(C = 1 - H_{\text{bin}}(p)\).
Channel capacity of BEC: \(C = 1-p\).
Shannon–Hartley (Gaussian) capacity: \(C = B\log_2\!\left(1+\frac{S}{N}\right)\).
Data Processing Inequality: If \(X \rightarrow Y \rightarrow Z\), then \(I(X;Z) \le I(X;Y)\).
Entropy bound: \(H(X) \le \log |\mathcal{X}|\), equality only for uniform distribution.
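The capacity formulas above can be checked numerically. A minimal sketch in Python (the function name `h_bin` and the example channel parameters are illustrative, not from the source):

```python
import math

def h_bin(p: float) -> float:
    """Binary entropy H_bin(p) in bits; by convention 0*log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# BSC with crossover probability p = 0.1
p = 0.1
c_bsc = 1 - h_bin(p)                 # ≈ 0.531 bits per channel use
# BEC with erasure probability p = 0.1
c_bec = 1 - p                        # 0.9 bits per channel use
# Shannon–Hartley: bandwidth B = 3000 Hz, S/N = 1000 (30 dB)
c_awgn = 3000 * math.log2(1 + 1000)  # ≈ 29,901.7 bits per second
```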
🔄 Key Processes
Computing Entropy
List probabilities \(p_i\).
Compute \(-p_i\log_2 p_i\) for each.
Sum the terms → \(H\).
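The three steps above can be sketched in a few lines of Python (the function name `entropy` is my own):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))           # fair coin → 1.0 bit
print(entropy([0.5, 0.25, 0.25]))    # → 1.5 bits
```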
Finding Mutual Information
Obtain joint distribution \(p(x,y)\) and marginals \(p(x),p(y)\).
Compute \(I = \sum_{x,y} p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)}\).
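This computation can be sketched directly from a joint distribution given as a 2‑D list (the function name is illustrative):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution joint[x][y] = p(x,y)."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))    # perfectly correlated bits → 1.0
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # independent bits → 0.0
```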
Channel Capacity Maximization (discrete memoryless)
Guess an input distribution \(p(x)\).
Compute \(I(X;Y)\) using channel transition probabilities.
Adjust \(p(x)\) to increase \(I\) (use calculus of variations or Blahut‑Arimoto algorithm).
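The guess-compute-adjust loop above is exactly what the Blahut–Arimoto algorithm automates. A minimal sketch, assuming the channel is given as a transition matrix `W[x][y] = p(y|x)` (the function name and convention are my own):

```python
import math

def blahut_arimoto(W, iters=200):
    """Capacity (bits/use) of a discrete memoryless channel via Blahut-Arimoto."""
    n, m = len(W), len(W[0])
    p = [1.0 / n] * n                          # start from a uniform input
    for _ in range(iters):
        # Backward distribution q(x|y) ∝ p(x) W(y|x)
        q = [[p[x] * W[x][y] for x in range(n)] for y in range(m)]
        for y in range(m):
            s = sum(q[y])
            q[y] = [c / s if s > 0 else 0.0 for c in q[y]]
        # Update p(x) ∝ exp( sum_y W(y|x) ln q(x|y) )
        r = [math.exp(sum(W[x][y] * math.log(q[y][x])
                          for y in range(m) if W[x][y] > 0))
             for x in range(n)]
        z = sum(r)
        p = [v / z for v in r]
    # Evaluate I(X;Y) at the optimizing input distribution
    py = [sum(p[x] * W[x][y] for x in range(n)) for y in range(m)]
    return sum(p[x] * W[x][y] * math.log2(W[x][y] / py[y])
               for x in range(n) for y in range(m)
               if W[x][y] > 0 and py[y] > 0)

# BSC with crossover 0.1: should recover C = 1 - H_bin(0.1) ≈ 0.531
print(blahut_arimoto([[0.9, 0.1], [0.1, 0.9]]))
```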
Designing a Lossless Code (Huffman)
Sort symbols by probability.
Combine two least‑probable symbols iteratively, building a binary tree.
Assign 0/1 along branches → codewords.
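The merge-and-label procedure above can be sketched with a priority queue (the helper name `huffman_codes` is my own; a counter breaks probability ties deterministically):

```python
import heapq

def huffman_codes(freqs):
    """Build Huffman codewords from {symbol: probability}."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # least probable subtree
        p2, _, c2 = heapq.heappop(heap)   # second least probable subtree
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
# Dyadic probabilities → codeword lengths match -log2 p: a→1, b→2, c→3, d→3
```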
Evaluating Error‑Correction Feasibility
Compare desired rate \(R\) with channel capacity \(C\).
If \(R<C\): choose a code (e.g., LDPC, Turbo) with a block length long enough to drive the error probability toward zero.
🔍 Key Comparisons
Entropy vs. KL Divergence
Entropy quantifies uncertainty of a single distribution.
KL divergence measures a "distance" between two distributions; it is asymmetric, so it is not a true metric.
Binary Symmetric Channel (BSC) vs. Binary Erasure Channel (BEC)
BSC flips bits with probability \(p\); capacity \(1-H_{\text{bin}}(p)\).
BEC erases bits (unknown) with probability \(p\); capacity \(1-p\).
Lossless vs. Lossy Compression
Lossless: exact reconstruction; rate ≥ \(H\).
Lossy: allows distortion; governed by rate‑distortion function (not detailed in outline).
Shannon entropy vs. Thermodynamic entropy
Same mathematical form; Shannon quantifies information, thermodynamic entropy quantifies physical disorder.
⚠️ Common Misunderstandings
“Entropy is a measure of information content.”
Correct: Entropy measures average uncertainty, not the information of a particular outcome (that's self‑information).
“Higher mutual information always means a better channel.”
Correct: Mutual information depends on the chosen input distribution; a single high value of \(I(X;Y)\) only lower-bounds the capacity. To compare channels, compare their capacities \(C=\max_{p(x)} I(X;Y)\), not \(I\) values for particular inputs.
“KL divergence is symmetric.”
Wrong: \(D{\text{KL}}(P\|Q) \neq D{\text{KL}}(Q\|P)\) in general.
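The asymmetry is easy to demonstrate numerically (the function name `kl` and the example distributions are illustrative):

```python
import math

def kl(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.9, 0.1]
Q = [0.5, 0.5]
print(kl(P, Q))  # ≈ 0.531
print(kl(Q, P))  # ≈ 0.737 — the two directions disagree
```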
“If \(R<C\) then any code works.”
Wrong: Only codes with sufficiently long block length and proper design achieve arbitrarily low error.
🧠 Mental Models / Intuition
Typical Set (AEP): For large \(n\), most source sequences are "typical": each has probability ≈ \(2^{-nH}\), and there are about \(2^{nH}\) of them. Think of the source as pouring essentially all its probability mass into this relatively tiny set.
Channel as a Filter: The channel passes some information unchanged (the mutual information) while discarding the rest (the equivocation).
KL as "Penalty": Using the wrong probability model costs extra bits equal to the KL divergence (like a tax for a bad guess).
🚩 Exceptions & Edge Cases
Zero‑probability events: \(-\log_2 0\) is undefined; in entropy sums the term \(0\log_2 0\) is taken as \(0\) by convention, since such outcomes never occur.
Continuous variables: Use differential entropy (not covered in detail) – it can be negative and isn’t directly comparable to discrete entropy.
Infinite alphabets: Entropy can be infinite (and no uniform distribution exists on a countably infinite alphabet); the bound \(H(X)\le\log|\mathcal{X}|\) applies only to finite \(|\mathcal{X}|\).
📍 When to Use Which
Entropy vs. Binary Entropy: Use binary entropy when dealing with a single Bernoulli trial (e.g., BSC crossover).
Channel Capacity Formula:
Use Shannon–Hartley for continuous Gaussian channels.
Use \(1-H_{\text{bin}}(p)\) for BSC.
Use \(1-p\) for BEC.
KL Divergence vs. Mutual Information:
Use KL when comparing a model to the true distribution (e.g., compression mismatch).
Use Mutual Information when quantifying shared information between two variables.
Huffman vs. Arithmetic Coding:
Huffman for small alphabets, easy implementation.
Arithmetic for higher compression efficiency when symbol probabilities are skewed.
👀 Patterns to Recognize
Entropy maximization: Whenever a problem says "maximum uncertainty" or "equiprobable outcomes", the answer is \(\log_2 n\).
Capacity‑rate comparison: If a question gives a rate \(R\) and a channel description, check: \(R < C\) → reliable communication possible; \(R > C\) → impossible.
Symmetry in formulas: Mutual information is symmetric (\(I(X;Y)=I(Y;X)\)), whereas KL divergence is not; questions often test this contrast.
🗂️ Exam Traps
Confusing bits with nats: Remember base‑2 → bits, base‑e → nats. A formula using \(\log\) (no subscript) may imply natural log.
Mixing up conditional entropy and equivocation: Both are \(H(X|Y)\), but "equivocation" specifically refers to uncertainty at the receiver after a noisy channel.
Assuming independence: Joint entropy equals sum of entropies only when variables are independent; many distractors forget this condition.
Channel capacity of BSC: Some answer choices incorrectly write \(C = H_{\text{bin}}(p)\); the correct expression is \(1 - H_{\text{bin}}(p)\).
KL divergence sign: Distractors may write \(\sum_i p_i \log \frac{q_i}{p_i}\); the correct order is \(\log \frac{p_i}{q_i}\).
Source coding theorem misstatement: Entropy is the minimum average number of bits per symbol, achievable only in the limit of long blocks; options claiming an "exact" per-symbol count for finite blocks are false.