Subjects/Science/Computer and Information Science/Computer Science/Query optimization

Query optimization Study Guide

Study Guide

📖 Core Concepts Query Optimization – automatic process that searches many execution plans and picks the one with the smallest estimated processing time. Cost‑Based Optimizer (CBO) – assigns a numeric cost to each plan (I/O, CPU, buffer use, disk service time, parallel interconnect) and chooses the lowest‑cost plan. Logical vs. Physical Optimization – Logical rewrites the query into an equivalent relational‑algebra tree; Physical decides the concrete operators (e.g., which join algorithm, which access path). Access Paths – ways to retrieve rows: primary‑key index scan, secondary‑index scan, or full table (sequential) scan. Join Techniques – merge join, hash join, and (rarely used) Cartesian product join. Cardinality & Selectivity – cardinality = estimated row count flowing through an edge; selectivity = fraction of rows that satisfy a predicate, derived from column statistics (histograms). Sargable Predicate – a condition that can be satisfied by an index lookup (e.g., col = 5, col BETWEEN a AND b). --- 📌 Must Remember CBO Goal: choose the plan with the lowest estimated cost. Join Order Impact: joining large tables first can blow up execution time; optimal order often starts with the smallest or most selective tables. Statistics are King: stale or missing statistics → bad cardinality → poor plan choice. Refresh after bulk loads or massive deletes. Sort‑Order Preservation: keeping an existing order can skip a costly sort operator later in the plan. Hint Usage: only when you’re sure the optimizer’s choice is sub‑optimal; otherwise rely on CBO. SPJ Flattening: if a nested SELECT‑PROJECT‑JOIN block can be rewritten as a single flat SPJ without changing semantics, the optimizer can produce a better plan. --- 🔄 Key Processes Cost‑Based Optimization Workflow Parse → build initial syntax tree. Logical Optimization → apply algebraic rewrites (predicate push‑down, join commutation). Access‑Path Enumeration (System R Stage 1) – list all viable scans for each table; keep cheapest overall and cheapest that yields required sort order. Pairwise Join Evaluation (Stage 2) – for every joinable pair, evaluate all join algorithms; store cheapest with/without output sort. Multi‑Table Join Expansion (Stage 3) – combine two‑table plans with remaining tables, propagating sort order information. Cost Calculation – sum estimated I/O, CPU, buffer, and other resource costs for each full plan. Select Plan – choose plan with minimum total cost. Execute – run the chosen physical plan. Cardinality / Selectivity Estimation Gather column statistics (histograms). For each predicate: Look up histogram → estimate selectivity (fraction). Multiply parent cardinality by selectivity → child cardinality. For multiple predicates on the same column, combine selectivities (assuming independence unless correlation is known). Query Rewriting Checklist Push predicates as low as possible (closest to leaf nodes). Replace IN (SELECT ...) with a join when it yields a sargable condition. Convert NOT EXISTS to an anti‑join if the engine handles anti‑joins efficiently. --- 🔍 Key Comparisons Merge Join vs. Hash Join Merge Join: requires both inputs sorted; great when sort order already exists. Hash Join: builds a hash table on the smaller input; works without pre‑sorted data but uses more memory. Primary Index Scan vs. Full Table Scan Primary Index Scan: uses the table’s clustered key; fast when predicate matches the key or a leading range. Full Table Scan: reads every row; only optimal when a large fraction of rows satisfy the predicate or no useful index exists. Logical Rewriting vs. Physical Tuning Logical: changes what the query does (e.g., predicate push‑down). Physical: changes how it does it (e.g., choose hash join). --- ⚠️ Common Misunderstandings “More indexes = faster queries.” Over‑indexing can increase write cost and may mislead the optimizer into expensive index‑scan plans. “The optimizer always finds the optimal plan.” Time‑budget limits and inaccurate statistics can force the CBO to settle for sub‑optimal plans. “All predicates are automatically sargable.” Functions on columns (LOWER(col) = 'x') or leading % in LIKE break sargability. --- 🧠 Mental Models / Intuition “Cost = distance traveled + effort spent.” Think of a plan as a road trip: each I/O is a mile, each CPU operation is fuel. Shorter, smoother routes (fewer miles, less fuel) win. “Join order is a chain reaction.” The first join creates an intermediate result that feeds the next join; a huge early intermediate = exponential slowdown. --- 🚩 Exceptions & Edge Cases Correlated Predicates – independent selectivity multiplication fails (e.g., model = 'Accord' implies make = 'Honda'). Sort‑Order Preservation Failure – if a later operator requires a different order, earlier preservation provides no benefit. Histogram Skew – heavily skewed data may not be captured by uniform bucket histograms; optimizer may underestimate selectivity. --- 📍 When to Use Which Choose Join Algorithm Use merge join when input streams are already sorted or can be produced sorted cheaply. Use hash join when inputs are unsorted and the smaller side fits in memory. Select Access Path Prefer index scan if predicate selectivity < 10 % and an appropriate index exists. Fall back to full scan for high‑selectivity predicates or when no covering index is available. Apply Hints Only after verifying that the CBO’s chosen plan is demonstrably slower (e.g., via EXPLAIN ANALYZE). --- 👀 Patterns to Recognize “Large‑to‑large join” pattern → likely a cost problem; look for a smaller driving table or a different join order. “Missing sort node” after a join → indicates the optimizer preserved order; verify that downstream operators truly need that order. “Seq Scan + Filter” on a column with an index → suspect non‑sargable predicate (function call, implicit conversion). --- 🗂️ Exam Traps Distractor: “Hash join is always faster than merge join.” Why tempting: hash join avoids sorting. Why wrong: if inputs are already sorted, merge join can skip the hash build phase and be cheaper. Distractor: “Statistics are optional for the optimizer.” Why tempting: some DBs can run with default assumptions. Why wrong: without up‑to‑date stats, cardinality estimates are inaccurate → poor plan choice. Distractor: “All predicates in the WHERE clause are evaluated before any joins.” Why tempting: natural reading order. Why wrong: optimizer may reorder predicates and push them below joins for cost reduction. ---

Or, immediately create your own study flashcards:

Upload a PDF.
Master Study Materials.

Start learning in seconds

Drop your PDFs here or