Information Retrieval Study Guide
📖 Core Concepts
Information Retrieval (IR) – Finding and returning resources from a collection (documents, images, audio, video, etc.) that satisfy a user’s information need.
Information Need – Expressed by the user as a search query.
Ranking – IR returns a ranked list sorted by a numeric relevance score, unlike unordered database results.
Cross‑Modal Retrieval – Use a query in one modality (e.g., text) to retrieve items in another (e.g., images).
Model Families –
Set‑theoretic / Boolean – Documents as sets of words, similarity via set operations.
Algebraic (Vector Space) – Documents & queries are vectors; similarity = scalar product (dot‑product).
Probabilistic – Retrieval modeled as inference; key example is BM25.
Neural – Sparse, dense, or hybrid representations learned by deep models.
Evaluation – Requires ground‑truth relevance labels; metrics quantify how well the ranked list matches these labels.
---
📌 Must Remember
IR returns ranked results; higher scores = higher relevance.
Sparse models → term‑based, interpretable (e.g., TF‑IDF, BM25, learned sparse).
Dense models → continuous vectors from transformers (e.g., dual‑encoders, ColBERT).
Hybrid models = lexical + semantic fusion (score‑fusion, late interaction).
Precision = relevant retrieved / total retrieved.
Recall = relevant retrieved / total relevant.
Top‑k metrics (e.g., P@k, R@k) focus on the first k ranks.
BM25 is the classic probabilistic relevance function; still a baseline in modern systems.
TREC (1992) pioneered large‑scale text‑retrieval evaluation.
PageRank (1998) introduced hyperlink‑based importance.
BERT (2018) and later ColBERT (2020) brought contextualized neural ranking.
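The precision and recall definitions above translate directly into code. A minimal sketch, assuming binary relevance labels (a set of relevant document IDs) and an illustrative ranked list:

```python
def precision_recall(retrieved, relevant):
    """retrieved: ranked list of doc IDs; relevant: set of relevant doc IDs."""
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def precision_at_k(retrieved, relevant, k):
    # Top-k variant: evaluate only the first k ranks.
    return precision_recall(retrieved[:k], relevant)[0]

def recall_at_k(retrieved, relevant, k):
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0

# Illustrative data (not from the guide): 2 of 5 retrieved docs are relevant.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
p, r = precision_recall(ranked, relevant)  # p = 2/5, r = 2/3
```

Note that P@k and R@k only look at the first k ranks, which is why a system can have high overall recall yet poor top-k precision.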
---
🔄 Key Processes
Query Formulation – User types a query → system tokenizes & possibly expands it.
Document Representation –
Sparse: compute TF‑IDF / BM25 weights → inverted index.
Dense: encode with transformer → dense vector stored in ANN index.
Scoring – Compute relevance score for each candidate:
Sparse: BM25 formula (term frequency, document length, inverse document frequency).
Dense: cosine similarity $ \cos(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q}\cdot\mathbf{d}}{\|\mathbf{q}\|\|\mathbf{d}\|} $.
Ranking – Sort candidates by descending score.
Evaluation – Compare ranked list to ground‑truth using precision, recall, P@k, etc.
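The sparse scoring and ranking steps above can be sketched as follows. This is a minimal, illustrative BM25 implementation (the tiny corpus and the parameter values k1 and b are assumptions, though those parameter names are standard):

```python
import math

k1, b = 1.5, 0.75  # common BM25 parameter choices

# Toy corpus: doc ID -> pre-tokenized document.
corpus = {
    "d1": "the quick brown fox".split(),
    "d2": "the lazy dog sleeps".split(),
    "d3": "quick quick fox jumps over the lazy dog".split(),
}
N = len(corpus)
avgdl = sum(len(doc) for doc in corpus.values()) / N  # average doc length

def idf(term):
    # Inverse document frequency: rare terms get higher weight.
    df = sum(1 for doc in corpus.values() if term in doc)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query, doc):
    # Combines term frequency, document length, and IDF, as listed above.
    score = 0.0
    for term in query:
        tf = doc.count(term)
        if tf == 0:
            continue
        norm = tf + k1 * (1 - b + b * len(doc) / avgdl)  # length normalization
        score += idf(term) * tf * (k1 + 1) / norm
    return score

# Score-then-sort: rank all candidates by descending relevance score.
query = "quick fox".split()
ranking = sorted(corpus, key=lambda d: bm25(query, corpus[d]), reverse=True)
# d2 contains no query terms, so it ranks last.
```

In a real system the inverted index would restrict scoring to documents containing at least one query term, rather than looping over the whole corpus.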
---
🔍 Key Comparisons
Sparse vs. Dense Retrieval
Sparse: interpretable terms, relies on exact matches, efficient inverted indexes.
Dense: captures semantics, tolerant to lexical mismatch, requires ANN search.
Boolean Model vs. Vector Space Model
Boolean: strict true/false matching (AND, OR, NOT).
Vector Space: graded similarity via dot‑product, supports ranking.
Traditional IR (TF‑IDF/BM25) vs. Neural IR (BERT/ColBERT)
Traditional: hand‑crafted term statistics, fast, less semantic understanding.
Neural: learns contextual embeddings, higher effectiveness on ambiguous queries, more compute‑intensive.
Sparse vs. Hybrid Retrieval
Sparse: only lexical signals.
Hybrid: adds dense semantic scores → usually better overall performance.
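One common way to combine lexical and dense signals is weighted score fusion. A sketch under assumed inputs (the score values and the weight alpha are illustrative; min-max normalization is one of several options):

```python
def min_max(scores):
    # Normalize each ranker's scores to [0, 1] so they are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def fuse(lexical, dense, alpha=0.5):
    # Weighted sum of normalized scores; missing docs contribute 0.
    lex, den = min_max(lexical), min_max(dense)
    docs = set(lex) | set(den)
    return {d: alpha * lex.get(d, 0.0) + (1 - alpha) * den.get(d, 0.0)
            for d in docs}

# Illustrative scores: BM25 on its own scale, cosine similarity in [-1, 1].
bm25_scores = {"d1": 12.0, "d2": 3.5, "d3": 7.8}
dense_scores = {"d1": 0.61, "d2": 0.83, "d3": 0.55}
fused = fuse(bm25_scores, dense_scores)
ranking = sorted(fused, key=fused.get, reverse=True)
```

Normalization matters here because BM25 scores and cosine similarities live on very different scales; summing them raw would let one ranker dominate.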
---
⚠️ Common Misunderstandings
“Higher recall always means better system.” – Recall ignores ranking; a system that retrieves everything gets 100 % recall but terrible precision.
“BM25 is outdated.” – Still a strong baseline; many hybrid models combine BM25 with neural scores.
“Dense vectors replace inverted indexes.” – Dense retrieval needs ANN structures; they complement, not replace, classic indexes.
“Cross‑modal retrieval works without modality alignment.” – Requires learned joint embeddings or mapping functions; plain keyword search won’t retrieve images from text.
---
🧠 Mental Models / Intuition
“Relevance = similarity + importance.” – Think of each document as a point; the closer it is to the query vector and the higher its intrinsic importance (e.g., PageRank), the higher its score.
“Sparse = exact words, Dense = meaning.” – Sparse models match the words you typed; dense models match the idea behind them.
“Ranking is a ‘score‑then‑sort’ pipeline.” – Visualize a scoreboard where each document gets points; the scoreboard is then ordered from highest to lowest.
---
🚩 Exceptions & Edge Cases
Term Interdependence – Some models (e.g., language models) treat terms as dependent; pure BM25 assumes independence.
Relevance Shades – Relevance is not always binary; graded relevance (e.g., “highly relevant”, “partially relevant”) requires metrics such as NDCG rather than plain precision/recall.
Zero‑Shot Retrieval – The BEIR benchmark (2021) evaluates models on unseen domains; performance may drop sharply if the model was only trained on a narrow corpus.
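NDCG, mentioned above as the metric for graded relevance, discounts gains by rank so that relevant documents found early count more. A minimal sketch with illustrative graded labels:

```python
import math

def dcg(gains):
    # Discounted cumulative gain: gain at rank i is divided by log2(i + 2),
    # so position 0 is undiscounted and later positions count less.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_gains, k):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = sorted(ranked_gains, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_gains[:k]) / denom if denom else 0.0

# Graded labels in rank order: 2 = highly relevant, 1 = partial, 0 = not.
gains = [2, 0, 1, 2, 0]
score = ndcg_at_k(gains, 3)  # ≈ 0.665
```

A perfect ranking scores 1.0; placing a non-relevant document at rank 2, as here, pulls the score down.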
---
📍 When to Use Which
Sparse (TF‑IDF/BM25) → When you need fast, interpretable results on large corpora with limited compute.
Dense (ColBERT, dual‑encoder) → When queries are ambiguous or you need semantic matching across vocabularies.
Hybrid → Most production systems: start with BM25 to filter, then rescore with a dense model.
Cross‑Modal Retrieval → Use joint embedding models trained on paired text‑image data.
Data Fusion (CombSUM, Borda) → When you have multiple independent rankers and want to combine their strengths.
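The two fusion rules named above differ in what they combine: CombSUM sums raw scores across rankers, while Borda converts each ranking to points and sums those. A sketch with invented runs:

```python
def comb_sum(runs):
    """CombSUM: sum each document's scores across rankers.
    runs: list of {doc_id: score} dicts from independent rankers."""
    fused = {}
    for run in runs:
        for d, s in run.items():
            fused[d] = fused.get(d, 0.0) + s
    return fused

def borda(runs):
    """Borda count: each ranker awards n-1 points to its top document,
    n-2 to the next, and so on; points are summed across rankers."""
    fused = {}
    for run in runs:
        ranked = sorted(run, key=run.get, reverse=True)
        n = len(ranked)
        for i, d in enumerate(ranked):
            fused[d] = fused.get(d, 0) + (n - 1 - i)
    return fused

# Illustrative runs from two independent rankers.
run_a = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
run_b = {"d2": 0.8, "d1": 0.7, "d3": 0.2}
by_score = comb_sum([run_a, run_b])
by_rank = borda([run_a, run_b])
```

Borda ignores score magnitudes entirely, which makes it robust when rankers use incompatible scales; CombSUM assumes the scores are already comparable (or have been normalized first).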
---
👀 Patterns to Recognize
“Keyword mismatch → low BM25, high dense score.” – Indicates the query uses synonyms or paraphrasing.
“Long tail queries with few exact terms → dense models shine.”
“Top‑k precision spikes then drops → possible over‑fitting to frequent terms.”
“High recall but low precision → system is retrieving too many non‑relevant items; consider tighter ranking or additional filters.”
---
🗂️ Exam Traps
Choosing “BM25 is obsolete” – Wrong; BM25 remains a baseline and is often part of hybrid pipelines.
Confusing precision with recall – Remember: precision = relevant among retrieved; recall = relevant retrieved among all relevant.
Assuming dense models need no indexing – They still require an ANN index; forgetting this can lead to an answer that suggests linear scan.
Mixing up “sparse vs. dense” with “set‑theoretic vs. vector space” – Sparse refers to representation sparsity; set‑theoretic/Boolean is a separate modeling family.
Claiming “top‑k metrics only matter for web search” – Top‑k matters for any system where users only see the first few results (e.g., recommendation, QA).