Introduction to Hash Tables
Understand how hash tables store key‑value pairs, resolve collisions, and maintain efficient performance via load factor management and resizing.
Summary
Overview of Hash Tables
What Are Hash Tables and Why Do They Matter?
A hash table is a data structure that stores key-value pairs and allows you to find, add, or remove entries extremely quickly. Unlike arrays where you access elements by position, hash tables let you access elements by a key—for example, looking up a phone number using a person's name as the key.
The remarkable feature of hash tables is their speed. When designed well, hash tables provide average-case constant time complexity, written as $O(1)$, for their basic operations (find, insert, delete). This means that whether your hash table stores 10 items or 10 million items, these operations take roughly the same amount of time. This is why hash tables are used everywhere in computer science.
How a Hash Table is Structured
A hash table stores its entries in an array of buckets. Think of buckets as containers—each position in the array can hold one or more key-value pairs depending on how you handle collisions (we'll cover this shortly).
Here's the basic structure:
The hash table maintains an array with a fixed number of buckets
Each bucket can store entries—either one entry, or multiple entries depending on the collision handling method
The position where an entry is stored is determined by the key
The diagram above shows this visually: three different keys (John Smith, Lisa Smith, and Sandra Dee) are passed through a hash function, which assigns them to specific buckets in the array.
Hash Functions: The Core of Hash Tables
What a Hash Function Does
A hash function takes a key (which could be a string, number, or any object) and converts it into a bucket index—a number between 0 and the number of buckets minus 1.
For example, if you have 16 buckets, the hash function might convert:
"John Smith" → bucket index 1
"Sandra Dee" → bucket index 14
Critical Properties of a Good Hash Function
Determinism: The hash function must be deterministic, meaning it always returns the same index for the same key. If you hash "John Smith" today, you must get the same bucket index if you hash it again tomorrow. Without this property, you could never retrieve your data consistently.
Uniform Distribution: A good hash function should spread keys across all available buckets as evenly as possible. If most keys map to just a few buckets, you'll waste space and have poor performance. Think of it like distributing students evenly across classroom sections rather than cramming them all into one room.
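Both properties can be illustrated with a short Python sketch. Here `bucket_index` is a name invented for this example, and Python's built-in `hash` stands in for a real hash function (it is deterministic within a single process, which is the property we need):

```python
def bucket_index(key, num_buckets):
    """Map an arbitrary hashable key to an index in [0, num_buckets - 1]."""
    # Python's built-in hash() always returns the same value for the
    # same key within one process -- the determinism property.
    return hash(key) % num_buckets

# Determinism: the same key always maps to the same bucket.
assert bucket_index("John Smith", 16) == bucket_index("John Smith", 16)

# Every computed index falls inside the bucket array.
names = ["John Smith", "Lisa Smith", "Sandra Dee"]
assert all(0 <= bucket_index(n, 16) < 16 for n in names)
```

Taking the hash modulo the bucket count is the standard way to squeeze an arbitrary integer into the valid index range.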
Basic Operations on Hash Tables
Inserting Data
To insert a key-value pair into a hash table:
Compute the bucket index by passing the key through the hash function
Store the key-value pair in that bucket
For example, to insert "John Smith" → "555-1234":
Hash function computes: hash("John Smith") = 1
Store the pair in bucket 1
Looking Up Data
To find a value by its key:
Compute the bucket index using the same hash function
Search within that bucket for the matching key
Return the associated value if found, or indicate "not found" otherwise
This is why the hash function must be deterministic—you use the exact same computation to locate the data you previously stored.
Deleting Data
To remove a key-value pair:
Compute the bucket index using the hash function
Locate the entry with the matching key in that bucket
Remove that entry
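All three operations share one skeleton: hash the key to a bucket, then work within that bucket. A minimal sketch (the helper names are ours; each bucket is a list so it can hold more than one entry, which matters for collision resolution):

```python
NUM_BUCKETS = 16

def make_table():
    # An array of buckets; each bucket is a list of (key, value) pairs.
    return [[] for _ in range(NUM_BUCKETS)]

def insert(table, key, value):
    bucket = table[hash(key) % NUM_BUCKETS]
    for i, (k, _) in enumerate(bucket):
        if k == key:                  # key already present: overwrite the value
            bucket[i] = (key, value)
            return
    bucket.append((key, value))

def lookup(table, key):
    bucket = table[hash(key) % NUM_BUCKETS]   # the SAME computation as insert
    for k, v in bucket:
        if k == key:
            return v
    return None                       # "not found"

def delete(table, key):
    bucket = table[hash(key) % NUM_BUCKETS]
    for i, (k, _) in enumerate(bucket):
        if k == key:
            del bucket[i]
            return

table = make_table()
insert(table, "John Smith", "555-1234")
assert lookup(table, "John Smith") == "555-1234"
delete(table, "John Smith")
assert lookup(table, "John Smith") is None
```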
Collision Resolution: Handling When Keys Collide
Understanding Collisions
A collision occurs when two different keys produce the same bucket index. For example, if both "John Smith" and "Jon Smith" happen to hash to index 1, they collide.
Collisions are inevitable in hash tables—even with a perfect hash function, if you have more keys than buckets, collisions must occur by the pigeonhole principle. The key is to handle collisions efficiently.
Separate Chaining: Store Multiple Entries Per Bucket
In separate chaining, each bucket contains a list (or linked list, or another small container) of all entries that mapped to that bucket.
How it works:
If a collision occurs, the new entry is simply added to the list in that bucket
When looking up a key, you compute the bucket index, then search through the list in that bucket for the matching key
When deleting, you find the bucket, then remove the matching entry from the list
For example, if both "John Smith" and "Jon Smith" hash to bucket 1:
Bucket 1 contains a list: ["John Smith" → "555-1234", "Jon Smith" → "555-5678"]
To find "John Smith", you hash to bucket 1, then search the list for the exact key match
Separate chaining is simple and works well when the number of collisions is small.
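A one-bucket table makes the chaining behavior directly observable, since every key is forced into the same chain (the function names below are ours):

```python
buckets = [[]]   # a single bucket: every key "collides" into the same chain

def put(key, value):
    chain = buckets[hash(key) % len(buckets)]
    chain.append((key, value))

def get(key):
    chain = buckets[hash(key) % len(buckets)]
    for k, v in chain:               # scan the chain for the exact key
        if k == key:
            return v
    return None

put("John Smith", "555-1234")
put("Jon Smith", "555-5678")
assert len(buckets[0]) == 2                 # both entries share bucket 0
assert get("John Smith") == "555-1234"      # exact-key match inside the chain
assert get("Jon Smith") == "555-5678"
```

The chain scan is why long chains hurt: with $n$ keys crammed into one bucket, lookup degrades to $O(n)$.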
Open Addressing: Find an Alternative Bucket
In open addressing, every entry is stored directly in the bucket array itself. When a key's bucket is already occupied, instead of storing multiple entries in one bucket, you probe other buckets in a defined sequence until you find an empty one.
Linear Probing: Check buckets sequentially until you find an empty one.
If bucket $h(key)$ is occupied, check bucket $h(key) + 1$
If that's occupied, check $h(key) + 2$, and so on, wrapping around to the start of the array when you pass the end
When looking up a key, follow the same sequence: start at $h(key)$, and if it doesn't match, keep probing forward
Quadratic Probing: Check buckets at intervals that grow quadratically.
If bucket $h(key)$ is occupied, check bucket $h(key) + 1^2$
Then check bucket $h(key) + 2^2$, then $h(key) + 3^2$, and so on
This reduces clustering (the tendency of filled buckets to group together)
Double Hashing: Use a second hash function to determine the step size.
If bucket $h_1(key)$ is occupied, the next bucket to check is $h_1(key) + h_2(key)$
Then check $h_1(key) + 2 \cdot h_2(key)$, and so on
This provides even better distribution than linear or quadratic probing
The key insight with open addressing: lookup must follow the same probe sequence as insertion to find the correct entry.
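Linear probing can be sketched as follows. All names here are ours; the sketch assumes the table never fills completely (real implementations resize first), and it omits deletion, which under open addressing needs extra care (tombstone markers) so that probe sequences are not broken:

```python
SIZE = 8
EMPTY = object()            # sentinel marking an unused slot
slots = [EMPTY] * SIZE      # each slot holds at most one (key, value) pair

def probe_insert(key, value):
    i = hash(key) % SIZE
    while slots[i] is not EMPTY and slots[i][0] != key:
        i = (i + 1) % SIZE  # linear probing: try the next slot, wrapping around
    slots[i] = (key, value)

def probe_lookup(key):
    i = hash(key) % SIZE
    while slots[i] is not EMPTY:        # follow the SAME probe sequence
        if slots[i][0] == key:
            return slots[i][1]
        i = (i + 1) % SIZE
    return None                         # hit an empty slot: key is absent

probe_insert("John Smith", "555-1234")
probe_insert("Sandra Dee", "555-9999")
assert probe_lookup("John Smith") == "555-1234"
assert probe_lookup("Sandra Dee") == "555-9999"
assert probe_lookup("Lisa Smith") is None
```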
Load Factor and Performance
Defining Load Factor
The load factor is a measure of how full your hash table is:
$$\text{load factor} = \frac{\text{number of stored entries}}{\text{number of buckets}}$$
For example, if you have 100 entries and 200 buckets, your load factor is $100 \div 200 = 0.5$.
Why Load Factor Matters
The load factor directly affects performance:
Low load factor (e.g., 0.25): Buckets are mostly empty, collisions are rare, operations stay at $O(1)$
High load factor (e.g., 0.9 or higher): Buckets are very full, collisions become frequent, operations degrade toward $O(n)$ in the worst case
The higher the load factor, the more likely collisions become, and the more time you spend resolving them (either searching through long chains in separate chaining, or probing many buckets in open addressing).
Keeping Performance High
To maintain the $O(1)$ average-case performance, most implementations keep the load factor below a threshold (typically 0.7 or 0.75). When the load factor gets too high, the hash table is resized.
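The threshold check itself is a one-liner; a small sketch (the helper name is ours, and 0.75 matches the typical threshold mentioned above, though exact values vary by implementation):

```python
MAX_LOAD = 0.75   # a common threshold; exact values vary by implementation

def needs_resize(num_entries, num_buckets):
    # load factor = entries / buckets; resize once it crosses the threshold
    return num_entries / num_buckets > MAX_LOAD

assert not needs_resize(100, 200)   # load factor 0.5: fine
assert needs_resize(180, 200)       # load factor 0.9: time to resize
```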
Resizing and Rehashing
When Resizing Happens
When the load factor exceeds a predetermined threshold (like 0.75), the hash table is resized because performance is degrading due to too many collisions.
The Resizing Process
Resizing involves three steps:
Allocate a larger array: Create a new bucket array, typically about double the size of the current array
Recompute all bucket indices: Pass every existing entry through the hash function again. The hash function itself is unchanged, but because the index is taken modulo the bucket count, a different bucket count yields different indices. For example, an entry that landed in bucket 5 of a 16-bucket array may land in bucket 21 of a 32-bucket array
Reinsert all entries: Place each entry into its new bucket position in the larger array
This process is called rehashing. Although it requires reprocessing all existing entries, it's a relatively rare operation (you only resize when the load factor gets too high), so the average cost per operation remains constant.
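The three steps can be sketched for a separate-chaining table (each bucket is a list of pairs; `rehash` is our own name, and the placement of entries in `old` below is purely illustrative):

```python
def rehash(old_buckets):
    """Grow a separate-chaining table: double the buckets, reinsert everything."""
    # Step 1: allocate a larger array, typically double the size.
    new_buckets = [[] for _ in range(2 * len(old_buckets))]
    for chain in old_buckets:
        for key, value in chain:
            # Step 2: recompute the index against the NEW bucket count.
            i = hash(key) % len(new_buckets)
            # Step 3: reinsert the entry into its new bucket.
            new_buckets[i].append((key, value))
    return new_buckets

old = [[("a", 1), ("b", 2)], [("c", 3)]]      # 2 buckets, 3 entries
new = rehash(old)
assert len(new) == 4                          # bucket count doubled
# No entries were lost or duplicated in the move.
assert sorted(kv for chain in new for kv in chain) == [("a", 1), ("b", 2), ("c", 3)]
```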
Common Applications of Hash Tables
Hash tables are fundamental data structures used throughout computer science:
Dictionaries and Symbol Tables: Mapping words to definitions (like a dictionary) or variable names to their values in a compiler's symbol table.
Caches: Storing frequently accessed data so it can be retrieved by key in constant time, rather than computing or fetching it again.
Counting and Frequency Analysis: Tallying how many times each word appears in a document, or counting occurrences of events.
Sets: Storing unique items and checking membership quickly (e.g., "is this user in the banned list?").
Database Indexes: Creating fast lookups in databases by indexing records by key fields.
The common thread: whenever you need fast retrieval of data based on a key, hash tables are an excellent choice.
Flashcards
What is the primary purpose of a hash table?
To store key‑value pairs for fast find, insert, and delete operations.
What is the average time complexity for basic operations in a well-designed hash table?
$O(1)$ (Constant time).
What structural component of a hash table holds the stored entries?
An array of buckets.
How does a hash table determine the bucket index for a specific key?
By using a hash function to convert the key into an integer index.
What is the primary goal of a "good" hash function regarding key distribution?
To distribute keys uniformly across the bucket array.
Why must a hash function be deterministic?
To ensure it returns the same index for the same key every time.
What is the general process for inserting a (key, value) pair into a hash table?
Compute the bucket index with the hash function and store the pair in that bucket.
How is a lookup performed in a hash table?
Recompute the bucket index using the hash function and search that bucket for the matching key.
How is a key deleted from a hash table?
Locate the bucket via the hash function, find the matching entry, and remove it.
When does a collision occur in a hash table?
When two different keys produce the same bucket index.
How does the separate chaining technique handle collisions?
Each bucket holds a list (or small container) of all entries mapping to it.
What is the general mechanism of the open addressing technique for resolving collisions?
The algorithm probes alternative buckets until an empty slot is found.
In open addressing, how does linear probing find the next slot?
It checks the next sequential bucket each time.
How does double hashing determine the step size for probing?
By using a second hash function.
What is the mathematical definition of the load factor in a hash table?
$\text{load factor} = \frac{\text{number of stored entries}}{\text{number of buckets}}$
What is the relationship between a high load factor and hash table performance?
A higher load factor increases collisions, which can degrade performance.
Why is it important to keep the load factor low in a hash table?
To preserve the average-case $O(1)$ performance of operations.
What typically triggers the resizing of a hash table?
The load factor becoming too high.
What are the main steps in the resizing procedure of a hash table?
Allocate a larger bucket array, recompute each entry's bucket index, and reinsert every entry into the new array.
Quiz
Introduction to Hash Tables Quiz Question 1: How is the load factor of a hash table defined?
- Load factor = (number of stored entries) ÷ (number of buckets) (correct)
- Load factor = (number of buckets) ÷ (number of stored entries)
- Load factor = (number of collisions) ÷ (number of buckets)
- Load factor = (sum of key hashes) ÷ (number of buckets)
Introduction to Hash Tables Quiz Question 2: What is the average time complexity for basic operations (find, insert, delete) in a well‑designed hash table?
- $O(1)$ (correct)
- $O(\log n)$
- $O(n)$
- $O(n \log n)$
Introduction to Hash Tables Quiz Question 3: In linear probing, how is the next bucket selected after a collision?
- Move to the next sequential bucket (correct)
- Jump to a bucket based on a second hash function
- Select a bucket at a quadratic offset
- Choose a random bucket
Introduction to Hash Tables Quiz Question 4: What effect does a higher load factor have on a hash table?
- It increases the likelihood of collisions (correct)
- It decreases memory usage
- It guarantees faster lookups
- It eliminates the need for a hash function
Introduction to Hash Tables Quiz Question 5: What is the primary structural component of a hash table that stores entries?
- An array of buckets (correct)
- A linked list of nodes
- A binary search tree
- A stack
Introduction to Hash Tables Quiz Question 6: What determines how many entries a single bucket in a hash table may hold?
- The collision‑handling method (correct)
- The size of the key
- The hash function’s output range
- The current load factor
Introduction to Hash Tables Quiz Question 7: After computing the bucket index for a (key, value) pair, what is the next step in insertion?
- Store the pair in that bucket (correct)
- Re‑hash the key with a different function
- Search the entire table for duplicates
- Resize the table immediately
Introduction to Hash Tables Quiz Question 8: What term describes two different keys that map to the same bucket index?
- Collision (correct)
- Overflow
- Underflow
- Rehash
Introduction to Hash Tables Quiz Question 9: Which condition most often causes a hash table to be resized?
- Load factor exceeds a chosen threshold (correct)
- Number of buckets equals number of entries
- All buckets become empty
- Hash function changes
Introduction to Hash Tables Quiz Question 10: Why is it important to keep the load factor of a hash table low?
- It helps preserve average‑case O(1) operation time (correct)
- It reduces the total memory consumption of the table
- It guarantees that no collisions will ever occur
- It ensures that keys are stored in sorted order
Introduction to Hash Tables Quiz Question 11: Why are hash tables well‑suited for building caches?
- They provide fast key‑based retrieval of cached items (correct)
- They automatically evict the least‑recently‑used entries
- They compress cached data to save space
- They store cached items in a sorted sequence
Introduction to Hash Tables Quiz Question 12: When a new key‑value pair hashes to a bucket that already contains entries, how is it added using separate chaining?
- It is appended to the list stored in that bucket (correct)
- It replaces the first entry in the bucket
- The insertion is ignored because the bucket is full
- The whole table is resized before insertion
Introduction to Hash Tables Quiz Question 13: Which kind of application commonly uses a hash table to associate words with their definitions?
- A lexical dictionary (map) of words to definitions (correct)
- A sorted list that orders words alphabetically
- A stack that records recent word lookups
- A binary search tree that stores words in order
Key Concepts
Hash Table Fundamentals
Hash table
Hash function
Collision (hashing)
Load factor
Rehashing
Collision Resolution Techniques
Separate chaining
Open addressing
Applications of Hash Tables
Dictionary (data structure)
Cache (computing)
Symbol table
Definitions
Hash table
A data structure that stores key‑value pairs in an array of buckets to enable average‑case O(1) lookup, insertion, and deletion.
Hash function
A deterministic algorithm that maps a key to an integer index within the fixed range of a hash table’s buckets.
Collision (hashing)
An event where two distinct keys produce the same bucket index, requiring a resolution strategy.
Separate chaining
A collision‑resolution method where each bucket holds a list (or similar container) of all entries that hash to that bucket.
Open addressing
A collision‑resolution technique that probes alternative buckets in a defined sequence until an empty slot is found.
Load factor
The ratio of stored entries to total buckets in a hash table, influencing the likelihood of collisions and performance.
Rehashing
The process of resizing a hash table and recomputing bucket indices for all existing entries using the hash function.
Dictionary (data structure)
An abstract data type that maps unique keys to values, commonly implemented with hash tables.
Cache (computing)
A high‑speed storage layer that uses hash tables to provide fast key‑based retrieval of frequently accessed data.
Symbol table
A data structure used by compilers and interpreters to associate identifiers with information such as type or memory location, often built on hash tables.