Introduction to Query Optimization
Understand how query optimizers transform SQL into efficient execution plans, evaluate costs using statistics, and select optimal join and index strategies.
Summary
Read Summary
Flashcards
Save Flashcards
Quiz
Take Quiz
Quick Practice
What is the primary goal of query optimization in a database management system?
1 of 8
Summary
Introduction to Query Optimization
What is Query Optimization?
Query optimization is the process by which a database management system automatically chooses the most efficient way to execute your SQL query. When you write a SQL query, there are often many different execution strategies that produce the correct answer. The optimizer's job is to find the strategy that minimizes query cost—the amount of computational resources needed, including input/output operations, CPU time, and memory usage.
Think of it like getting directions to a destination. There might be many routes that get you there, but you want the fastest one. Similarly, there are many ways for a database to retrieve the data you requested, and the optimizer finds the most efficient path.
Why We Need Optimization: The Naïve Approach
Without optimization, a database might execute the clauses of a SQL query in the exact order they're written. For example, if your SQL says "FROM table1, table2 WHERE condition", the system might literally process table1 first, then table2, then apply the condition.
However, this naive approach is often far from optimal. A smarter approach might apply filters first to reduce the data being processed, or choose different join strategies. Query optimization exists precisely because following the literal structure of SQL rarely yields the most efficient execution.
How Optimization Works
Step 1: Internal Representation
Before the optimizer can evaluate alternatives, it must translate your SQL query into an internal representation called a relational algebra tree. This tree shows the structure of operations—selections (filtering), projections (choosing columns), and joins—in a graphical form that the optimizer can manipulate.
The image above shows an example of a relational algebra tree where multiple tables are joined together in a particular order and structure.
Step 2: Applying Transformation Rules
The optimizer applies transformation rules that rearrange the tree while keeping the result identical. These rules allow the optimizer to explore different execution strategies. Common transformation rules include:
Pushing selections down: Move filtering operations earlier in the execution to reduce the amount of data processed
Merging projections: Combine multiple column-selection operations
Reordering joins: Change the sequence in which tables are combined
For example, if your query filters for "age > 30", the optimizer might push this selection down to happen immediately after reading the table, rather than waiting until after joining with other tables.
Step 3: Generating Candidate Plans
By applying different combinations of transformation rules, the optimizer generates many candidate execution plans. Each candidate represents a different way to answer the same query.
Notice how this image shows a different arrangement of the same tables compared to the first image. Both trees answer the query correctly, but they represent different execution orders that will have different performance characteristics.
Step 4: Cost Estimation and Selection
The optimizer uses a cost model to estimate how expensive each candidate plan would be. The cost model relies on statistics the database maintains about tables, including:
Table sizes (how many rows)
Available indexes
Data distribution (how values are spread across the table)
The optimizer then selects the candidate plan with the lowest estimated cost and executes that plan.
Types of Optimizers
Rule-Based (Heuristic) Optimizers
Rule-based optimizers apply a fixed set of heuristic rules without computing detailed costs. For example, they might follow rules like "always use an index if one exists on the filtering column" or "process selections before joins."
Advantages: Rule-based optimizers are fast because they don't need to estimate costs for many plans.
Disadvantages: They may not find the truly best plan, especially for complex queries where the rules don't apply well.
Cost-Based Optimizers
Cost-based optimizers are more sophisticated. They evaluate many candidate plans using the cost model to estimate resource usage for each one.
Advantages: They can find much better execution plans, especially for intricate queries with multiple joins and various filtering options.
Disadvantages: They require accurate statistics to work well, and the planning phase takes more computation time.
Most modern databases use cost-based optimization because the better execution plans typically save far more time during query execution than is lost during planning.
Key Concepts in Query Optimization
Indexes
An index is a data structure (typically a B-tree) that allows the database to quickly locate rows without scanning an entire table. Think of it like an index in a book—you can look up a topic and jump directly to relevant pages rather than reading every page.
The optimizer must decide: for each filter condition in the query, is using an index faster than scanning the entire table? This depends on how much data needs to be retrieved and the statistics available.
Join Order and Join Methods
For queries involving multiple tables, the order in which joins are performed can dramatically affect performance. Consider a query joining three tables: the optimizer might join table A with B first, then join the result with C. Or it might join B with C first, then join with A. These different orders can produce vastly different query costs.
Additionally, the optimizer chooses among different join algorithms:
Nested-loop join: For each row in the first table, scan all rows in the second table looking for matches. Simple but potentially slow.
Hash join: Build a hash table from one table, then probe it with rows from the other table. Faster for large joins.
Merge join: Sort both tables and scan through them together. Efficient when data is already sorted.
The optimizer selects both the join order and the join method to minimize overall cost.
Selectivity and Statistics
Selectivity is the estimated percentage of rows that satisfy a filter condition. For example, if a table has 1,000,000 rows and a filter is "age > 30", and selectivity estimates that 60% of rows satisfy this (600,000 rows), this helps the optimizer make better decisions.
Accurate selectivity estimates are crucial because they help the optimizer:
Predict the size of intermediate results (outputs from one operation fed into the next)
Decide whether using an index is worthwhile
Compare the relative costs of different join orders
Databases maintain statistics about tables to estimate selectivity. If these statistics are outdated, the optimizer may make poor choices.
Flashcards
What is the primary goal of query optimization in a database management system?
To automatically select the most efficient execution plan to retrieve requested data.
What internal representation does a DBMS translate SQL text into before optimization?
Relational algebra tree
Upon what does a rule-based (heuristic) optimizer rely to determine an execution plan?
A fixed set of rules and heuristics (e.g., "use an index if one exists").
What is a major disadvantage of using rule-based optimizers for complex queries?
They may miss the truly best plan compared to cost-based methods.
How do cost-based optimizers determine the best execution plan?
By evaluating many possible plans using a cost model.
What is the functional purpose of data structures like B-trees in a database?
To locate rows without scanning an entire table.
What decision must the optimizer make regarding data retrieval methods and indexes?
Whether to use an index or perform a full table scan.
In the context of database statistics, what does the term "selectivity" define?
An estimate of how many rows satisfy a predicate.
Quiz
Introduction to Query Optimization Quiz Question 1: Why does a query optimizer use a cost model?
- To estimate the expense of each candidate execution plan (correct)
- To encrypt the results of the query for security
- To automatically generate indexes on the queried tables
- To rewrite the SQL syntax into a different programming language
Introduction to Query Optimization Quiz Question 2: Which characteristic distinguishes cost‑based optimizers from rule‑based optimizers?
- They evaluate many possible plans using a cost model (correct)
- They rely only on a fixed set of heuristic rules
- They ignore statistical information about tables and indexes
- They always choose a full table scan regardless of indexes
Why does a query optimizer use a cost model?
1 of 2
Key Concepts
Query Optimization Techniques
Query optimization
Cost‑based optimizer
Rule‑based optimizer
Transformation rule (query optimization)
Database Operations and Structures
Relational algebra
Join algorithm
Database index
Selectivity (database)
Query execution plan
Database statistics
Definitions
Query optimization
The process by which a DBMS automatically selects the most efficient execution plan for a given query.
Relational algebra
A formal language of operations on relations used as an internal representation of SQL queries.
Cost‑based optimizer
An optimizer that evaluates many candidate plans using a cost model to choose the lowest‑estimated‑cost plan.
Rule‑based optimizer
An optimizer that applies a fixed set of heuristic rules to choose execution strategies without cost estimation.
Database index
A data structure, such as a B‑tree, that enables fast row retrieval based on key values, reducing the need for full table scans.
Join algorithm
Methods (e.g., nested‑loop, hash, merge) used to combine rows from two or more tables during query execution.
Selectivity (database)
An estimate of the fraction of rows that satisfy a predicate, used to predict result size and plan cost.
Query execution plan
A detailed, ordered set of operations the DBMS will perform to retrieve the requested data.
Database statistics
Metadata about table sizes, data distribution, and index availability that inform cost estimation.
Transformation rule (query optimization)
A rewrite of a query expression that preserves semantics while potentially lowering execution cost.