RemNote Community

Introduction to Hash Tables

Understand how hash tables store key‑value pairs, resolve collisions, and maintain efficient performance via load factor management and resizing.


Summary

Overview of Hash Tables

What Are Hash Tables and Why Do They Matter?

A hash table is a data structure that stores key-value pairs and allows you to find, add, or remove entries extremely quickly. Unlike arrays, where you access elements by position, hash tables let you access elements by a key: for example, looking up a phone number using a person's name as the key.

The remarkable feature of hash tables is their speed. When designed well, hash tables provide constant time complexity, written as $O(1)$, for their basic operations (find, insert, delete). This means that whether your hash table stores 10 items or 10 million items, these operations take roughly the same amount of time. This is why hash tables are used everywhere in computer science.

How a Hash Table is Structured

A hash table stores its entries in an array of buckets. Think of buckets as containers: each position in the array can hold one or more key-value pairs, depending on how you handle collisions (we'll cover this shortly). Here's the basic structure:
- The hash table maintains an array with a fixed number of buckets
- Each bucket can store entries: either one entry, or multiple entries, depending on the collision handling method
- The position where an entry is stored is determined by the key

The diagram above shows this visually: three different keys (John Smith, Lisa Smith, and Sandra Dee) are passed through a hash function, which assigns them to specific buckets in the array.

Hash Functions: The Core of Hash Tables

What a Hash Function Does

A hash function takes a key (which could be a string, number, or any object) and converts it into a bucket index: a number between 0 and the number of buckets minus 1. For example, if you have 16 buckets, the hash function might convert:
- "John Smith" → bucket index 1
- "Sandra Dee" → bucket index 14

Critical Properties of a Good Hash Function

Determinism: The hash function must be deterministic, meaning it always returns the same index for the same key.
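A minimal deterministic hash function can be sketched as follows (the polynomial multiplier 31 and the 16-bucket default are illustrative choices, not part of the original description):

```python
NUM_BUCKETS = 16  # illustrative table size

def bucket_index(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a string key to a bucket index in [0, num_buckets - 1]."""
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)  # classic polynomial rolling hash
    return h % num_buckets
```

Because the result depends only on the key's characters, hashing "John Smith" produces the same bucket index on every call, in every run.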
If you hash "John Smith" today, you must get the same bucket index if you hash it again tomorrow. Without this property, you could never retrieve your data consistently.

Uniform Distribution: A good hash function should spread keys across all available buckets as evenly as possible. If most keys map to just a few buckets, you'll waste space and have poor performance. Think of it like distributing students evenly across classroom sections rather than cramming them all into one room.

Basic Operations on Hash Tables

Inserting Data

To insert a key-value pair into a hash table:
1. Compute the bucket index by passing the key through the hash function
2. Store the key-value pair in that bucket

For example, to insert "John Smith" → "555-1234":
1. The hash function computes: hash("John Smith") = 1
2. Store the pair in bucket 1

Looking Up Data

To find a value by its key:
1. Compute the bucket index using the same hash function
2. Search within that bucket for the matching key
3. Return the associated value if found, or indicate "not found" otherwise

This is why the hash function must be deterministic: you use the exact same computation to locate the data you previously stored.

Deleting Data

To remove a key-value pair:
1. Compute the bucket index using the hash function
2. Locate the entry with the matching key in that bucket
3. Remove that entry

Collision Resolution: When Keys Collide

Understanding Collisions

A collision occurs when two different keys produce the same bucket index. For example, if both "John Smith" and "Jon Smith" happen to hash to index 1, they collide.

Collisions are inevitable in hash tables: even with a perfect hash function, if you have more keys than buckets, collisions must occur by the pigeonhole principle. The key is to handle them efficiently.

Separate Chaining: Store Multiple Entries Per Bucket

In separate chaining, each bucket contains a list (or linked list, or another small container) of all entries that mapped to that bucket.
How it works:
- If a collision occurs, the new entry is simply added to the list in that bucket
- When looking up a key, you compute the bucket index, then search through the list in that bucket for the matching key
- When deleting, you find the bucket, then remove the matching entry from the list

For example, if both "John Smith" and "Jon Smith" hash to bucket 1:
- Bucket 1 contains a list: ["John Smith" → "555-1234", "Jon Smith" → "555-5678"]
- To find "John Smith", you hash to bucket 1, then search the list for the exact key match

Separate chaining is simple and works well when the number of collisions is small.

Open Addressing: Find an Alternative Bucket

In open addressing, when a bucket is already occupied, you do not store multiple entries in one bucket; instead, you search for another empty bucket nearby.

Linear Probing: Check buckets sequentially until you find an empty one.
- If bucket $h(key)$ is occupied, check bucket $h(key) + 1$
- If that's occupied, check $h(key) + 2$, and so on
- When looking up a key, follow the same sequence: start at $h(key)$, and if it doesn't match, keep probing forward

Quadratic Probing: Check buckets at intervals that grow quadratically.
- If bucket $h(key)$ is occupied, check bucket $h(key) + 1^2$
- Then check bucket $h(key) + 2^2$, then $h(key) + 3^2$, and so on
- This reduces clustering (the tendency of filled buckets to group together)

Double Hashing: Use a second hash function to determine the step size.
- If bucket $h_1(key)$ is occupied, the next bucket to check is $h_1(key) + h_2(key)$
- Then check $h_1(key) + 2 \cdot h_2(key)$, and so on
- This provides even better distribution than linear or quadratic probing

In all of these schemes, probe indices are taken modulo the number of buckets, so the search wraps around the end of the array. The key insight with open addressing: lookup must follow the same probe sequence as insertion to find the correct entry.
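The open-addressing idea can be sketched with linear probing. This is a minimal fixed-size table supporting insert and lookup only (deletion in open addressing requires tombstone markers, which are omitted here for brevity):

```python
class LinearProbingTable:
    """Minimal open-addressing hash table using linear probing."""

    def __init__(self, num_buckets=16):
        self.keys = [None] * num_buckets
        self.values = [None] * num_buckets

    def _probe(self, key):
        """Yield bucket indices in the probe sequence for this key."""
        start = hash(key) % len(self.keys)
        for i in range(len(self.keys)):
            yield (start + i) % len(self.keys)  # wrap around the array

    def put(self, key, value):
        for idx in self._probe(key):
            # Stop at the first empty slot, or update an existing entry.
            if self.keys[idx] is None or self.keys[idx] == key:
                self.keys[idx] = key
                self.values[idx] = value
                return
        raise RuntimeError("table is full")

    def get(self, key):
        # Lookup must follow the same probe sequence as insertion.
        for idx in self._probe(key):
            if self.keys[idx] is None:
                return None  # empty slot: the key was never inserted
            if self.keys[idx] == key:
                return self.values[idx]
        return None
```

Note how `get` retraces exactly the probe sequence `put` used; an empty slot along that path proves the key is absent.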
Load Factor and Performance

Defining Load Factor

The load factor is a measure of how full your hash table is:

$$\text{load factor} = \frac{\text{number of stored entries}}{\text{number of buckets}}$$

For example, if you have 100 entries and 200 buckets, your load factor is $100 \div 200 = 0.5$.

Why Load Factor Matters

The load factor directly affects performance:
- Low load factor (e.g., 0.25): buckets are mostly empty, collisions are rare, and operations stay at $O(1)$
- High load factor (e.g., 0.9 or higher): buckets are very full, collisions become frequent, and operations degrade toward $O(n)$ in the worst case

The higher the load factor, the more likely collisions become, and the more time you spend resolving them (either searching through long chains in separate chaining, or probing many buckets in open addressing).

Keeping Performance High

To maintain the $O(1)$ average-case performance, most implementations keep the load factor below a threshold (typically 0.7 or 0.75). When the load factor gets too high, the hash table is resized.

Resizing and Rehashing

When Resizing Happens

When the load factor exceeds a predetermined threshold (such as 0.75), the hash table is resized, because performance is degrading due to too many collisions.

The Resizing Process

Resizing involves three steps:
1. Allocate a larger array: create a new bucket array, typically about double the size of the current one
2. Recompute all bucket indices: pass every existing entry through the hash function again. Because the bucket count has changed, the indices will change; for example, indices computed for 16 buckets will generally differ from those computed for 32 buckets
3. Reinsert all entries: place each entry into its new bucket position in the larger array

This process is called rehashing.
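The resizing steps above can be sketched with a separate-chaining table that doubles its bucket array when the load factor crosses a threshold (the 0.75 threshold and the doubling factor are common conventions, used here for illustration):

```python
class ChainedHashTable:
    """Separate-chaining hash table that resizes when too full."""

    def __init__(self, num_buckets=8, max_load=0.75):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0
        self.max_load = max_load

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def load_factor(self):
        return self.size / len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # update an existing entry
                return
        bucket.append((key, value))
        self.size += 1
        if self.load_factor() > self.max_load:
            self._resize()

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def _resize(self):
        # 1. Collect all entries, 2. allocate a doubled array,
        # 3. reinsert each entry, which recomputes its bucket index.
        old_entries = [pair for bucket in self.buckets for pair in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        self.size = 0
        for k, v in old_entries:
            self.put(k, v)
```

Every `put` checks the load factor, so the table grows before long chains can degrade lookups.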
Although it requires reprocessing all existing entries, rehashing is a relatively rare operation (you only resize when the load factor gets too high), so the average cost per operation remains constant.

Common Applications of Hash Tables

Hash tables are fundamental data structures used throughout computer science:
- Dictionaries and Symbol Tables: mapping words to definitions (like a dictionary) or variable names to their values in a compiler's symbol table
- Caches: storing frequently accessed data so it can be retrieved by key in constant time, rather than computing or fetching it again
- Counting and Frequency Analysis: tallying how many times each word appears in a document, or counting occurrences of events
- Sets: storing unique items and checking membership quickly (e.g., "is this user in the banned list?")
- Database Indexes: creating fast lookups in databases by indexing records by key fields

The common thread: whenever you need fast retrieval of data based on a key, hash tables are an excellent choice.
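As a concrete instance of the counting use case, Python's built-in dict (itself backed by a hash table) makes word-frequency tallying a few lines:

```python
def word_counts(text: str) -> dict:
    """Tally how many times each word appears in a text."""
    counts = {}
    for word in text.lower().split():
        # Each lookup and update is O(1) on average, thanks to hashing.
        counts[word] = counts.get(word, 0) + 1
    return counts

print(word_counts("the quick brown fox jumps over the lazy dog"))
# "the" appears twice; every other word once
```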
Flashcards
What is the primary purpose of a hash table?
To store key‑value pairs for fast find, insert, and delete operations.
What is the average time complexity for basic operations in a well-designed hash table?
$O(1)$ (Constant time).
What structural component of a hash table holds the stored entries?
An array of buckets.
How does a hash table determine the bucket index for a specific key?
By using a hash function to convert the key into an integer index.
What is the primary goal of a "good" hash function regarding key distribution?
To distribute keys uniformly across the bucket array.
Why must a hash function be deterministic?
To ensure it returns the same index for the same key every time.
What is the general process for inserting a (key, value) pair into a hash table?
Compute the bucket index with the hash function and store the pair in that bucket.
How is a lookup performed in a hash table?
Recompute the bucket index using the hash function and search that bucket for the matching key.
How is a key deleted from a hash table?
Locate the bucket via the hash function, find the matching entry, and remove it.
When does a collision occur in a hash table?
When two different keys produce the same bucket index.
How does the separate chaining technique handle collisions?
Each bucket holds a list (or small container) of all entries mapping to it.
What is the general mechanism of the open addressing technique for resolving collisions?
The algorithm probes alternative buckets until an empty slot is found.
In open addressing, how does linear probing find the next slot?
It checks the next sequential bucket each time.
How does double hashing determine the step size for probing?
By using a second hash function.
What is the mathematical definition of the load factor in a hash table?
$\text{load factor} = \frac{\text{number of stored entries}}{\text{number of buckets}}$
What is the relationship between a high load factor and hash table performance?
A higher load factor increases collisions, which can degrade performance.
Why is it important to keep the load factor low in a hash table?
To preserve the average-case $O(1)$ performance of operations.
What typically triggers the resizing of a hash table?
The load factor becoming too high.
What are the two main steps in the resizing procedure of a hash table?
Allocating a larger bucket array and re-hashing every existing entry into it.

Key Concepts
Hash Table Fundamentals
Hash table
Hash function
Collision (hashing)
Load factor
Rehashing
Collision Resolution Techniques
Separate chaining
Open addressing
Applications of Hash Tables
Dictionary (data structure)
Cache (computing)
Symbol table