Subjects/Technology/Data and AI/Database/Index (database)

Index (database) - Advanced Index Techniques

Understand advanced index techniques, their performance trade‑offs, and the appropriate use of various index types such as bitmap, dense, sparse, inverted, hash, and covering indexes.

Summary

Read Summary

Flashcards

Save Flashcards

Quiz

Take Quiz

Quick Practice

Why is a query using a leading wildcard, such as LIKE '%@wikipedia.org', considered non-sargable?

1 of 14

Summary

Database Indexes: Applications and Limitations Indexes are powerful tools for improving query performance, but they come with important limitations and trade-offs. Understanding when indexes work efficiently, and when they fail, is essential for writing performant SQL queries and designing effective database schemas. When Indexes Fail: The Sargability Problem The term sargable comes from "Search ARGument ABLE"—a query is sargable when a database can efficiently use an index to answer it. Not all queries are sargable, and understanding why is critical. Leading Wildcard Searches Cannot Use Indexes A leading wildcard—where the wildcard character appears at the beginning of a search pattern—makes a query non-sargable. Consider this query: sql SELECT emailaddress FROM customers WHERE emailaddress LIKE '%@wikipedia.org'; This query searches for all emails ending with "@wikipedia.org". The problem is that the database cannot use a standard B-tree index because it doesn't know which row to start examining. With the wildcard at the front, the database must scan every value in the index (or every row if no index exists) to find matches. This defeats the purpose of indexing. Why this happens: Indexes store values in sorted order. To use an index efficiently, the database needs to know the starting point. With a leading wildcard, it doesn't—the matching values could appear anywhere in the sorted order. Solving Leading Wildcards with Reverse Indexes If you frequently search for patterns like '%@wikipedia.org', you can create a reverse index on the reversed values of the column: sql CREATE INDEX idxemailreversed ON customers(reverse(emailaddress)); Then rewrite your query as: sql SELECT emailaddress FROM customers WHERE reverse(emailaddress) LIKE reverse('%@wikipedia.org'); This converts the leading wildcard into a trailing wildcard (which is sargable). The wildcard now appears at the rightmost position in the reversed string, so the database can use the index to find the starting point and scan forward efficiently. Wildcards on Both Sides: No Index Solution A pattern like '%wikipedia%' has wildcards on both sides. This pattern is inherently non-sargable—there's no way to know where matches begin, so no index can help. The database must perform a full table scan or full index scan regardless of what indexes exist. Understanding Different Index Types Database systems offer various index types, each optimized for different use cases. This section covers the most common and important types. Dense Index vs. Sparse Index These two types differ in how completely they map the data: Dense indexes contain a key-pointer pair for every record in the data file. Each entry in the index points directly to one row in the table. Dense indexes require more storage but enable direct access to any row. Sparse indexes contain a key-pointer pair for each data block (or page), not for each row. Each index entry points to the first row in a block that has that key value. Sparse indexes use less storage but require a sequential scan within a block once you've located the block. Sparse indexes are typically used when the underlying data is physically sorted by the index key. Think of it this way: a dense index is like having a directory entry for every house on a street, while a sparse index is like having a directory entry only for each city block, and you have to search within the block once you arrive. Hash Index A hash index uses a hash function to map key values directly to bucket locations. When you search for a specific value, the hash function tells the database exactly where to look. This provides constant-time $O(1)$ average-case lookup performance for exact match queries. Important limitation: Hash indexes only work for equality conditions (WHERE id = 5). They cannot be used for range queries (WHERE id > 5) because hashing destroys the ordering of values. Bitmap Index A bitmap index represents all values in a column as bit arrays. It's most effective for columns with a small number of distinct values that repeat frequently, such as gender (typically just 2-3 values) or boolean flags. How it works: Instead of storing the value itself, a bitmap index stores one bit for each possible value. A 1 bit means the row contains that value, and a 0 means it doesn't. Queries are answered by performing fast bitwise logical operations on these bitmaps. When to use: Bitmap indexes excel in data warehouse scenarios where you're joining many small dimension tables with low-cardinality columns. They're not ideal for columns with many distinct values or for general-purpose OLTP (Online Transaction Processing) systems. Inverted Index An inverted index is specifically designed for full-text search. Instead of mapping rows to values (as a normal index does), an inverted index maps each distinct word or term to the list of rows (or documents) that contain that term. For example, if you're indexing the content of articles for search, an inverted index on the word "database" would point to all articles containing that word. This makes full-text search queries highly efficient. Secondary Index A secondary index is any index built on columns that are neither the table's primary key nor its ordering columns (the clustering key). Secondary indexes provide a dense key-pointer entry for every row, allowing you to efficiently search by non-primary columns. In practice, most of the indexes you create are secondary indexes, since primary indexes are typically created automatically by the database system. Covering Indexes: Eliminating Table Access A covering index is an index that contains all the columns needed to answer a query. When a query can be satisfied entirely by reading the index, without accessing the underlying table rows, we say the index "covers" the query. Why Covering Indexes Matter Covering indexes significantly improve query performance because: Faster access: Index pages are typically smaller and more cache-friendly than table pages Fewer disk reads: You access only the index, not the table Reduced I/O: This is the primary bottleneck in most databases For example, if you frequently run: sql SELECT id, name FROM customers WHERE customerid = 42; An index on customerid that includes both id and name as included columns will cover this query. Including Non-Key Columns Many relational database systems allow you to add non-key columns to the leaf level of an index without including them in the index key itself. This is done with the INCLUDE clause: sql CREATE INDEX idxcustomerlookup ON customers (customerid) INCLUDE (id, name); The advantages: The index remains smaller than if you made id and name part of the key The index still covers queries that reference these columns You avoid the space overhead of including them in the actual key structure This is a best-practice technique for creating efficient covering indexes without inflating the index key itself. The Cost-Benefit Trade-off of Indexes Indexes don't come free. Before creating an index, understand these trade-offs: Benefits: Faster read queries, especially for range scans and exact-match lookups Improved performance for WHERE, JOIN, and ORDER BY clauses Costs: Storage overhead: Indexes duplicate data and consume significant disk space Write overhead: Every INSERT, UPDATE, or DELETE operation must also update every index on that table. For heavily indexed tables, write operations become significantly slower Maintenance: The database must keep indexes current, which takes CPU resources Practical implications: Indexes are ideal for read-heavy workloads but can hurt performance in write-heavy systems. On a table where you insert thousands of rows per second, having many indexes will slow down the inserts substantially. When designing indexes, focus on: Queries that run frequently Queries that scan large result sets Columns used in WHERE and JOIN conditions The ratio of reads to writes on the table <extrainfo> Advanced Topics: Linear Hashing and Implementation Details Linear Hashing Index Linear hashing is a dynamic hash indexing method that gradually expands the hash table as more entries are inserted. Unlike traditional hash tables that require expensive rehashing of all entries when the table grows, linear hashing expands incrementally, one bucket at a time. This avoids the sudden performance penalty of full rehashing operations. This technique is useful for systems that cannot afford downtime for index maintenance, though it's a relatively specialized optimization. Underlying Data Structures Indexes are typically implemented using common data structures such as: B+ trees: Ordered trees that provide good performance for both range and equality queries Hash tables: Used in hash indexes for constant-time lookups Balanced trees: Various tree structures that maintain balance to ensure consistent performance The choice of data structure affects performance characteristics—B+ trees excel at range queries, while hash tables excel at equality queries. </extrainfo>

Flashcards

Why is a query using a leading wildcard, such as LIKE '%@wikipedia.org', considered non-sargable?

It forces a full index scan because the standard index cannot be used when the wildcard is at the beginning.

How can an index be used to efficiently search for a leading wildcard pattern like %@wikipedia.org?

By creating an index on the reverse() expression of the column and reversing the search pattern.

What is the performance impact of using wildcards on both sides of a search pattern, such as %wikipedia.org%?

It prevents index use and results in a sequential scan of the column data.

What are the primary general trade-offs of using indexes?

Increased storage requirements Added overhead to write operations (updates required when data changes)

How does a bitmap index store and query data?

It stores values as bit arrays and answers queries using fast bitwise logical operations.

For what type of data columns is a bitmap index most efficient?

Columns with a small number of frequently repeating distinct values (e.g., gender).

What characterizes the structure of a dense index?

It contains a key–pointer pair for every single record in the data file.

How does a sparse index differ from a dense index in its storage of pointers?

It stores a key–pointer pair for each data block rather than for each individual row.

What is the primary function of an inverted index in full-text search?

It maps each distinct word or term to a list of documents or rows containing that term.

On which columns is a secondary index typically built?

Columns that are neither the primary key nor the table's ordering columns.

What is the average lookup time complexity for exact matches using a hash index?

$O(1)$ (constant-time).

What is the main advantage of using linear hashing for dynamic indexing?

It gradually expands the hash table as entries are added, avoiding costly full rehashing operations.

What is the definition and main advantage of a covering index?

It contains all columns required by a query, allowing the query to be satisfied without accessing the underlying table rows.

How can non-key columns be added to an index to create a covering index without increasing the key size?

By adding them to the leaf level of the index using an INCLUDE clause.

Quiz

Which of the following is NOT a common index implementation?

1 of 4

Key Concepts

Index Types

Bitmap index

Dense index

Sparse index

Secondary index

Hash index

Linear hashing

B+ tree

Search Techniques

Leading wildcard search

Reverse index

Inverted index

Covering index

Definitions

Leading wildcard search

A query pattern that begins with a wildcard (e.g., ‘%text’) which prevents the use of standard indexes and forces a full scan.

Reverse index

An index built on the reversed values of a column, enabling efficient searches with leading wildcards by turning them into trailing wildcards.

Bitmap index

An index that stores column values as bit arrays, allowing fast bitwise logical operations especially for low‑cardinality columns.

Dense index

An index that contains a key‑pointer pair for every record in the data file, pointing directly to each row.

Sparse index

An index that stores a key‑pointer pair for each data block rather than each row, pointing to the first row in the block.

Inverted index

A data structure that maps each distinct term to a list of documents or rows containing that term, used for full‑text search.

Secondary index

An index created on columns that are neither the table’s primary key nor its ordering columns, providing dense entries for every row.

Hash index

An index that applies a hash function to keys to map them directly to bucket locations, offering average‑case O(1) lookup for exact matches.

Linear hashing

A dynamic hash‑based indexing method that incrementally expands the hash table as entries are added, avoiding costly full rehashes.

Covering index

An index that includes all columns required by a query, allowing the query to be satisfied using only the index without accessing the base table rows.

B+ tree

A balanced tree data structure commonly used for database indexes, supporting efficient range queries and ordered traversal.