Information Theory Study Guide
📖 Core Concepts
Information – reduction of uncertainty about a random variable; measured in bits (log₂), nats (ln), or hartleys (log₁₀).
Entropy \(H\) – average "surprisal" of a source: \(H(X)= -\sum_{x} p(x)\log_2 p(x)\). Higher \(H\) → more unpredictable.
Self‑information – surprisal of a single outcome: \(-\log_2 p\).
Joint Entropy \(H(X,Y)\) – uncertainty about a pair of variables together.
Conditional Entropy \(H(X|Y)\) – remaining uncertainty of \(X\) after observing \(Y\).
Mutual Information \(I(X;Y)\) – reduction in uncertainty of one variable given the other: \(I = H(X)-H(X|Y) = H(Y)-H(Y|X) = H(X)+H(Y)-H(X,Y)\).
Kullback–Leibler (KL) Divergence \(D_{\text{KL}}(P\|Q)\) – extra bits needed when using the wrong distribution: \(\displaystyle D_{\text{KL}}(P\|Q)=\sum_i p_i\log_2\frac{p_i}{q_i}\).
Channel Capacity \(C\) – maximum achievable mutual information over all input distributions: \(C=\max_{p(x)} I(X;Y)\).
Source Coding Theorem – the minimum achievable average rate for lossless compression equals the source entropy \(H\) (bits per symbol).
Channel Coding Theorem – reliable communication is possible iff the transmission rate \(R\) is less than \(C\).
📌 Must Remember
Entropy formula: \(H = -\sum_i p_i \log_2 p_i\).
Binary entropy: \(H_{\text{bin}}(p)= -p\log_2 p -(1-p)\log_2(1-p)\).
Maximum entropy: \(H_{\max}= \log_2 n\) for \(n\) equiprobable symbols.
Channel capacity of BSC: \(C = 1 - H_{\text{bin}}(p)\).
Channel capacity of BEC: \(C = 1-p\).
Shannon–Hartley (Gaussian) capacity: \(C = B\log_2\!\left(1+\frac{S}{N}\right)\).
Data Processing Inequality: If \(X \rightarrow Y \rightarrow Z\), then \(I(X;Z) \le I(X;Y)\).
Entropy bound: \(H(X) \le \log |\mathcal{X}|\), equality only for uniform distribution.
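The capacity formulas above can be checked numerically. A minimal sketch in Python (the function name `h_bin` and the example channel parameters are illustrative, not from the source):

```python
import math

def h_bin(p: float) -> float:
    """Binary entropy H_bin(p) in bits; by convention 0*log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# BSC with crossover probability p = 0.1
p = 0.1
c_bsc = 1 - h_bin(p)                 # ≈ 0.531 bits per channel use
# BEC with erasure probability p = 0.1
c_bec = 1 - p                        # 0.9 bits per channel use
# Shannon–Hartley: bandwidth B = 3000 Hz, S/N = 1000 (30 dB)
c_awgn = 3000 * math.log2(1 + 1000)  # ≈ 29,901.7 bits per second
```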
🔄 Key Processes
Computing Entropy
List probabilities \(p_i\).
Compute \(-p_i\log_2 p_i\) for each.
Sum the terms → \(H\).
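The three steps above can be sketched in a few lines of Python (the function name `entropy` is my own):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))           # fair coin → 1.0 bit
print(entropy([0.5, 0.25, 0.25]))    # → 1.5 bits
```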
Finding Mutual Information
Obtain joint distribution \(p(x,y)\) and marginals \(p(x),p(y)\).
Compute \(I = \sum_{x,y} p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)}\).
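This computation can be sketched directly from a joint distribution given as a 2‑D list (the function name is illustrative):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution joint[x][y] = p(x,y)."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))    # perfectly correlated bits → 1.0
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # independent bits → 0.0
```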
Channel Capacity Maximization (discrete memoryless)
Guess an input distribution \(p(x)\).
Compute \(I(X;Y)\) using channel transition probabilities.
Adjust \(p(x)\) to increase \(I\) (use calculus of variations or Blahut‑Arimoto algorithm).
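The guess-compute-adjust loop above is exactly what the Blahut–Arimoto algorithm automates. A minimal sketch, assuming the channel is given as a transition matrix `W[x][y] = p(y|x)` (the function name and convention are my own):

```python
import math

def blahut_arimoto(W, iters=200):
    """Capacity (bits/use) of a discrete memoryless channel via Blahut-Arimoto."""
    n, m = len(W), len(W[0])
    p = [1.0 / n] * n                          # start from a uniform input
    for _ in range(iters):
        # Backward distribution q(x|y) ∝ p(x) W(y|x)
        q = [[p[x] * W[x][y] for x in range(n)] for y in range(m)]
        for y in range(m):
            s = sum(q[y])
            q[y] = [c / s if s > 0 else 0.0 for c in q[y]]
        # Update p(x) ∝ exp( sum_y W(y|x) ln q(x|y) )
        r = [math.exp(sum(W[x][y] * math.log(q[y][x])
                          for y in range(m) if W[x][y] > 0))
             for x in range(n)]
        z = sum(r)
        p = [v / z for v in r]
    # Evaluate I(X;Y) at the optimizing input distribution
    py = [sum(p[x] * W[x][y] for x in range(n)) for y in range(m)]
    return sum(p[x] * W[x][y] * math.log2(W[x][y] / py[y])
               for x in range(n) for y in range(m)
               if W[x][y] > 0 and py[y] > 0)

# BSC with crossover 0.1: should recover C = 1 - H_bin(0.1) ≈ 0.531
print(blahut_arimoto([[0.9, 0.1], [0.1, 0.9]]))
```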
Designing a Lossless Code (Huffman)
Sort symbols by probability.
Combine two least‑probable symbols iteratively, building a binary tree.
Assign 0/1 along branches → codewords.
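The merge-and-label procedure above can be sketched with a priority queue (the helper name `huffman_codes` is my own; a counter breaks probability ties deterministically):

```python
import heapq

def huffman_codes(freqs):
    """Build Huffman codewords from {symbol: probability}."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # least probable subtree
        p2, _, c2 = heapq.heappop(heap)   # second least probable subtree
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
# Dyadic probabilities → codeword lengths match -log2 p: a→1, b→2, c→3, d→3
```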
Evaluating Error‑Correction Feasibility
Compare desired rate \(R\) with channel capacity \(C\).
If \(R<C\): choose a code (e.g., LDPC, Turbo) with a block length long enough to drive the error probability toward zero.
🔍 Key Comparisons
Entropy vs. KL Divergence
Entropy quantifies uncertainty of a single distribution.
KL divergence measures a "distance" between two distributions; it is asymmetric, so it is not a true metric.
Binary Symmetric Channel (BSC) vs. Binary Erasure Channel (BEC)
BSC flips bits with probability \(p\); capacity \(1-H_{\text{bin}}(p)\).
BEC erases bits (unknown) with probability \(p\); capacity \(1-p\).
Lossless vs. Lossy Compression
Lossless: exact reconstruction; rate ≥ \(H\).
Lossy: allows distortion; governed by rate‑distortion function (not detailed in outline).
Shannon entropy vs. Thermodynamic entropy
Same mathematical form; Shannon quantifies information, thermodynamic entropy quantifies physical disorder.
⚠️ Common Misunderstandings
“Entropy is a measure of information content.”
Correct: Entropy measures average uncertainty, not the information of a particular outcome (that's self‑information).
“Higher mutual information always means a better channel.”
Correct: Mutual information depends on the chosen input distribution; a single high value of \(I(X;Y)\) only lower-bounds the capacity. To compare channels, compare their capacities \(C=\max_{p(x)} I(X;Y)\), not \(I\) values for particular inputs.
“KL divergence is symmetric.”
Wrong: \(D{\text{KL}}(P\|Q) \neq D{\text{KL}}(Q\|P)\) in general.
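The asymmetry is easy to demonstrate numerically (the function name `kl` and the example distributions are illustrative):

```python
import math

def kl(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.9, 0.1]
Q = [0.5, 0.5]
print(kl(P, Q))  # ≈ 0.531
print(kl(Q, P))  # ≈ 0.737 — the two directions disagree
```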
“If \(R<C\) then any code works.”
Wrong: Only codes with sufficiently long block length and proper design achieve arbitrarily low error.
🧠 Mental Models / Intuition
Typical Set (AEP): For large \(n\), most source sequences are "typical": each has probability ≈ \(2^{-nH}\), and there are about \(2^{nH}\) of them. Think of the source as pouring essentially all its probability mass into this relatively tiny set.
Channel as a Filter: The channel passes some information unchanged (the mutual information) while discarding the rest (the equivocation).
KL as "Penalty": Using the wrong probability model costs extra bits equal to the KL divergence (like a tax for a bad guess).
🚩 Exceptions & Edge Cases
Zero‑probability events: \(-\log_2 0\) is undefined; in entropy sums the term \(0\log_2 0\) is taken as \(0\) by convention, since such outcomes never occur.
Continuous variables: Use differential entropy (not covered in detail) – it can be negative and isn’t directly comparable to discrete entropy.
Infinite alphabets: Entropy can be infinite (and no uniform distribution exists on a countably infinite alphabet); the bound \(H(X)\le\log|\mathcal{X}|\) applies only to finite \(|\mathcal{X}|\).
📍 When to Use Which
Entropy vs. Binary Entropy: Use binary entropy when dealing with a single Bernoulli trial (e.g., BSC crossover).
Channel Capacity Formula:
Use Shannon–Hartley for continuous Gaussian channels.
Use \(1-H_{\text{bin}}(p)\) for BSC.
Use \(1-p\) for BEC.
KL Divergence vs. Mutual Information:
Use KL when comparing a model to the true distribution (e.g., compression mismatch).
Use Mutual Information when quantifying shared information between two variables.
Huffman vs. Arithmetic Coding:
Huffman for small alphabets, easy implementation.
Arithmetic for higher compression efficiency when symbol probabilities are skewed.
👀 Patterns to Recognize
Entropy maximization: Whenever a problem says "maximum uncertainty" or "equiprobable outcomes", the answer is \(\log_2 n\).
Capacity‑rate comparison: If a question gives a rate \(R\) and a channel description, check: \(R < C\) → reliable communication possible; \(R > C\) → impossible.
Symmetry in formulas: Mutual information is symmetric (\(I(X;Y)=I(Y;X)\)), whereas KL divergence is not; questions often test this contrast.
🗂️ Exam Traps
Confusing bits with nats: Remember base‑2 → bits, base‑e → nats. A formula using \(\log\) (no subscript) may imply natural log.
Mixing up conditional entropy and equivocation: Both are \(H(X|Y)\), but "equivocation" specifically refers to uncertainty at the receiver after a noisy channel.
Assuming independence: Joint entropy equals sum of entropies only when variables are independent; many distractors forget this condition.
Channel capacity of BSC: Some answer choices incorrectly write \(C = H_{\text{bin}}(p)\); the correct expression is \(1 - H_{\text{bin}}(p)\).
KL divergence sign: Distractors may write \(\sum_i p_i \log \frac{q_i}{p_i}\); the correct order is \(\log \frac{p_i}{q_i}\).
Source coding theorem misstatement: Entropy is the minimum average number of bits per symbol, achievable only in the limit of long blocks; options claiming an "exact" per-symbol count for finite blocks are false.