Information Retrieval Study Guide
📖 Core Concepts
Information Retrieval (IR) – Finding and returning resources from a collection (documents, images, audio, video, etc.) that satisfy a user’s information need.
Information Need – Expressed by the user as a search query.
Ranking – IR returns a ranked list sorted by a numeric relevance score, unlike unordered database results.
Cross‑Modal Retrieval – Use a query in one modality (e.g., text) to retrieve items in another (e.g., images).
Model Families –
Set‑theoretic / Boolean – Documents as sets of words, similarity via set operations.
Algebraic (Vector Space) – Documents & queries are vectors; similarity = scalar product (dot‑product).
Probabilistic – Retrieval modeled as inference; key example is BM25.
Neural – Sparse, dense, or hybrid representations learned by deep models.
Evaluation – Requires ground‑truth relevance labels; metrics quantify how well the ranked list matches these labels.
---
📌 Must Remember
IR returns ranked results; higher scores = higher relevance.
Sparse models → term‑based, interpretable (e.g., TF‑IDF, BM25, learned sparse).
Dense models → continuous vectors from transformers (e.g., dual‑encoders, ColBERT).
Hybrid models = lexical + semantic fusion (score‑fusion, late interaction).
Precision = relevant retrieved / total retrieved.
Recall = relevant retrieved / total relevant.
Top‑k metrics (e.g., P@k, R@k) focus on the first k ranks.
BM25 is the classic probabilistic relevance function; still a baseline in modern systems.
TREC (1992) pioneered large‑scale text‑retrieval evaluation.
PageRank (1998) introduced hyperlink‑based importance.
BERT (2018) and later ColBERT (2020) brought contextualized neural ranking.
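The precision and recall definitions above translate directly into code. A minimal sketch, assuming binary relevance labels (a set of relevant document IDs) and an illustrative ranked list:

```python
def precision_recall(retrieved, relevant):
    """retrieved: ranked list of doc IDs; relevant: set of relevant doc IDs."""
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def precision_at_k(retrieved, relevant, k):
    # Top-k variant: evaluate only the first k ranks.
    return precision_recall(retrieved[:k], relevant)[0]

def recall_at_k(retrieved, relevant, k):
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0

# Illustrative data (not from the guide): 2 of 5 retrieved docs are relevant.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
p, r = precision_recall(ranked, relevant)  # p = 2/5, r = 2/3
```

Note that P@k and R@k only look at the first k ranks, which is why a system can have high overall recall yet poor top-k precision.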
---
🔄 Key Processes
Query Formulation – User types a query → system tokenizes & possibly expands it.
Document Representation –
Sparse: compute TF‑IDF / BM25 weights → inverted index.
Dense: encode with transformer → dense vector stored in ANN index.
Scoring – Compute relevance score for each candidate:
Sparse: BM25 formula (term frequency, document length, inverse document frequency).
Dense: cosine similarity $ \cos(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q}\cdot\mathbf{d}}{\|\mathbf{q}\|\|\mathbf{d}\|} $.
Ranking – Sort candidates by descending score.
Evaluation – Compare ranked list to ground‑truth using precision, recall, P@k, etc.
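The sparse scoring and ranking steps above can be sketched as follows. This is a minimal, illustrative BM25 implementation (the tiny corpus and the parameter values k1 and b are assumptions, though those parameter names are standard):

```python
import math

k1, b = 1.5, 0.75  # common BM25 parameter choices

# Toy corpus: doc ID -> pre-tokenized document.
corpus = {
    "d1": "the quick brown fox".split(),
    "d2": "the lazy dog sleeps".split(),
    "d3": "quick quick fox jumps over the lazy dog".split(),
}
N = len(corpus)
avgdl = sum(len(doc) for doc in corpus.values()) / N  # average doc length

def idf(term):
    # Inverse document frequency: rare terms get higher weight.
    df = sum(1 for doc in corpus.values() if term in doc)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query, doc):
    # Combines term frequency, document length, and IDF, as listed above.
    score = 0.0
    for term in query:
        tf = doc.count(term)
        if tf == 0:
            continue
        norm = tf + k1 * (1 - b + b * len(doc) / avgdl)  # length normalization
        score += idf(term) * tf * (k1 + 1) / norm
    return score

# Score-then-sort: rank all candidates by descending relevance score.
query = "quick fox".split()
ranking = sorted(corpus, key=lambda d: bm25(query, corpus[d]), reverse=True)
# d2 contains no query terms, so it ranks last.
```

In a real system the inverted index would restrict scoring to documents containing at least one query term, rather than looping over the whole corpus.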
---
🔍 Key Comparisons
Sparse vs. Dense Retrieval
Sparse: interpretable terms, relies on exact matches, efficient inverted indexes.
Dense: captures semantics, tolerant to lexical mismatch, requires ANN search.
Boolean Model vs. Vector Space Model
Boolean: strict true/false matching (AND, OR, NOT).
Vector Space: graded similarity via dot‑product, supports ranking.
Traditional IR (TF‑IDF/BM25) vs. Neural IR (BERT/ColBERT)
Traditional: hand‑crafted term statistics, fast, less semantic understanding.
Neural: learns contextual embeddings, higher effectiveness on ambiguous queries, more compute‑intensive.
Sparse vs. Hybrid Retrieval
Sparse: only lexical signals.
Hybrid: adds dense semantic scores → usually better overall performance.
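One common way to combine lexical and dense signals is weighted score fusion. A sketch under assumed inputs (the score values and the weight alpha are illustrative; min-max normalization is one of several options):

```python
def min_max(scores):
    # Normalize each ranker's scores to [0, 1] so they are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def fuse(lexical, dense, alpha=0.5):
    # Weighted sum of normalized scores; missing docs contribute 0.
    lex, den = min_max(lexical), min_max(dense)
    docs = set(lex) | set(den)
    return {d: alpha * lex.get(d, 0.0) + (1 - alpha) * den.get(d, 0.0)
            for d in docs}

# Illustrative scores: BM25 on its own scale, cosine similarity in [-1, 1].
bm25_scores = {"d1": 12.0, "d2": 3.5, "d3": 7.8}
dense_scores = {"d1": 0.61, "d2": 0.83, "d3": 0.55}
fused = fuse(bm25_scores, dense_scores)
ranking = sorted(fused, key=fused.get, reverse=True)
```

Normalization matters here because BM25 scores and cosine similarities live on very different scales; summing them raw would let one ranker dominate.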
---
⚠️ Common Misunderstandings
“Higher recall always means better system.” – Recall ignores ranking; a system that retrieves everything gets 100 % recall but terrible precision.
“BM25 is outdated.” – Still a strong baseline; many hybrid models combine BM25 with neural scores.
“Dense vectors replace inverted indexes.” – Dense retrieval needs ANN structures; they complement, not replace, classic indexes.
“Cross‑modal retrieval works without modality alignment.” – Requires learned joint embeddings or mapping functions; plain keyword search won’t retrieve images from text.
---
🧠 Mental Models / Intuition
“Relevance = similarity + importance.” – Think of each document as a point; the closer it is to the query vector and the higher its intrinsic importance (e.g., PageRank), the higher its score.
“Sparse = exact words, Dense = meaning.” – Sparse models match the words you typed; dense models match the idea behind them.
“Ranking is a ‘score‑then‑sort’ pipeline.” – Visualize a scoreboard where each document gets points; the scoreboard is then ordered from highest to lowest.
---
🚩 Exceptions & Edge Cases
Term Interdependence – Some models (e.g., language models) treat terms as dependent; pure BM25 assumes independence.
Relevance Shades – Relevance is not always binary; graded relevance (e.g., “highly relevant”, “partially relevant”) requires metrics such as NDCG rather than plain precision/recall.
Zero‑Shot Retrieval – The BEIR benchmark (2021) evaluates models on unseen domains; performance may drop sharply if the model was only trained on a narrow corpus.
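NDCG, mentioned above as the metric for graded relevance, discounts gains by rank so that relevant documents found early count more. A minimal sketch with illustrative graded labels:

```python
import math

def dcg(gains):
    # Discounted cumulative gain: gain at rank i is divided by log2(i + 2),
    # so position 0 is undiscounted and later positions count less.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_gains, k):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = sorted(ranked_gains, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_gains[:k]) / denom if denom else 0.0

# Graded labels in rank order: 2 = highly relevant, 1 = partial, 0 = not.
gains = [2, 0, 1, 2, 0]
score = ndcg_at_k(gains, 3)  # ≈ 0.665
```

A perfect ranking scores 1.0; placing a non-relevant document at rank 2, as here, pulls the score down.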
---
📍 When to Use Which
Sparse (TF‑IDF/BM25) → When you need fast, interpretable results on large corpora with limited compute.
Dense (ColBERT, dual‑encoder) → When queries are ambiguous or you need semantic matching across vocabularies.
Hybrid → Most production systems: start with BM25 to filter, then rescore with a dense model.
Cross‑Modal Retrieval → Use joint embedding models trained on paired text‑image data.
Data Fusion (CombSUM, Borda) → When you have multiple independent rankers and want to combine their strengths.
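The two fusion rules named above differ in what they combine: CombSUM sums raw scores across rankers, while Borda converts each ranking to points and sums those. A sketch with invented runs:

```python
def comb_sum(runs):
    """CombSUM: sum each document's scores across rankers.
    runs: list of {doc_id: score} dicts from independent rankers."""
    fused = {}
    for run in runs:
        for d, s in run.items():
            fused[d] = fused.get(d, 0.0) + s
    return fused

def borda(runs):
    """Borda count: each ranker awards n-1 points to its top document,
    n-2 to the next, and so on; points are summed across rankers."""
    fused = {}
    for run in runs:
        ranked = sorted(run, key=run.get, reverse=True)
        n = len(ranked)
        for i, d in enumerate(ranked):
            fused[d] = fused.get(d, 0) + (n - 1 - i)
    return fused

# Illustrative runs from two independent rankers.
run_a = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
run_b = {"d2": 0.8, "d1": 0.7, "d3": 0.2}
by_score = comb_sum([run_a, run_b])
by_rank = borda([run_a, run_b])
```

Borda ignores score magnitudes entirely, which makes it robust when rankers use incompatible scales; CombSUM assumes the scores are already comparable (or have been normalized first).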
---
👀 Patterns to Recognize
“Keyword mismatch → low BM25, high dense score.” – Indicates the query uses synonyms or paraphrasing.
“Long tail queries with few exact terms → dense models shine.”
“Top‑k precision spikes then drops → possible over‑fitting to frequent terms.”
“High recall but low precision → system is retrieving too many non‑relevant items; consider tighter ranking or additional filters.”
---
🗂️ Exam Traps
Choosing “BM25 is obsolete” – Wrong; BM25 remains a baseline and is often part of hybrid pipelines.
Confusing precision with recall – Remember: precision = relevant among retrieved; recall = relevant retrieved among all relevant.
Assuming dense models need no indexing – They still require an ANN index; forgetting this can lead to an answer that suggests linear scan.
Mixing up “sparse vs. dense” with “set‑theoretic vs. vector space” – Sparse refers to representation sparsity; set‑theoretic/Boolean is a separate modeling family.
Claiming “top‑k metrics only matter for web search” – Top‑k matters for any system where users only see the first few results (e.g., recommendation, QA).