RemNote Community

Introduction to Hash Tables

Understand how hash tables store key‑value pairs, resolve collisions, and maintain efficient performance via load factor management and resizing.


Summary

Overview of Hash Tables

What Are Hash Tables and Why Do They Matter?

A hash table is a data structure that stores key-value pairs and allows you to find, add, or remove entries extremely quickly. Unlike arrays, where you access elements by position, hash tables let you access elements by a key: for example, looking up a phone number using a person's name as the key.

The remarkable feature of hash tables is their speed. When designed well, hash tables provide constant time complexity, written as $O(1)$, for their basic operations (find, insert, delete). This means that whether your hash table stores 10 items or 10 million items, these operations take roughly the same amount of time. This is why hash tables are used everywhere in computer science.

How a Hash Table is Structured

A hash table stores its entries in an array of buckets. Think of buckets as containers: each position in the array can hold one or more key-value pairs, depending on how you handle collisions (we'll cover this shortly). Here's the basic structure:
- The hash table maintains an array with a fixed number of buckets
- Each bucket can store entries: either one entry, or multiple entries, depending on the collision handling method
- The position where an entry is stored is determined by the key

The diagram above shows this visually: three different keys (John Smith, Lisa Smith, and Sandra Dee) are passed through a hash function, which assigns them to specific buckets in the array.

Hash Functions: The Core of Hash Tables

What a Hash Function Does

A hash function takes a key (which could be a string, number, or any object) and converts it into a bucket index: a number between 0 and the number of buckets minus 1. For example, if you have 16 buckets, the hash function might convert:
- "John Smith" → bucket index 1
- "Sandra Dee" → bucket index 14

Critical Properties of a Good Hash Function

Determinism: The hash function must be deterministic, meaning it always returns the same index for the same key.
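A minimal deterministic hash function can be sketched as follows (the polynomial multiplier 31 and the 16-bucket default are illustrative choices, not part of the original description):

```python
NUM_BUCKETS = 16  # illustrative table size

def bucket_index(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a string key to a bucket index in [0, num_buckets - 1]."""
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)  # classic polynomial rolling hash
    return h % num_buckets
```

Because the result depends only on the key's characters, hashing "John Smith" produces the same bucket index on every call, in every run.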
If you hash "John Smith" today, you must get the same bucket index if you hash it again tomorrow. Without this property, you could never retrieve your data consistently.

Uniform Distribution: A good hash function should spread keys across all available buckets as evenly as possible. If most keys map to just a few buckets, you'll waste space and have poor performance. Think of it like distributing students evenly across classroom sections rather than cramming them all into one room.

Basic Operations on Hash Tables

Inserting Data

To insert a key-value pair into a hash table:
1. Compute the bucket index by passing the key through the hash function
2. Store the key-value pair in that bucket

For example, to insert "John Smith" → "555-1234":
1. The hash function computes: hash("John Smith") = 1
2. Store the pair in bucket 1

Looking Up Data

To find a value by its key:
1. Compute the bucket index using the same hash function
2. Search within that bucket for the matching key
3. Return the associated value if found, or indicate "not found" otherwise

This is why the hash function must be deterministic: you use the exact same computation to locate the data you previously stored.

Deleting Data

To remove a key-value pair:
1. Compute the bucket index using the hash function
2. Locate the entry with the matching key in that bucket
3. Remove that entry

Collision Resolution: When Keys Collide

Understanding Collisions

A collision occurs when two different keys produce the same bucket index. For example, if both "John Smith" and "Jon Smith" happen to hash to index 1, they collide.

Collisions are inevitable in hash tables: even with a perfect hash function, if you have more keys than buckets, collisions must occur by the pigeonhole principle. The key is to handle them efficiently.

Separate Chaining: Store Multiple Entries Per Bucket

In separate chaining, each bucket contains a list (or linked list, or another small container) of all entries that mapped to that bucket.
How it works:
- If a collision occurs, the new entry is simply added to the list in that bucket
- When looking up a key, you compute the bucket index, then search through the list in that bucket for the matching key
- When deleting, you find the bucket, then remove the matching entry from the list

For example, if both "John Smith" and "Jon Smith" hash to bucket 1:
- Bucket 1 contains a list: ["John Smith" → "555-1234", "Jon Smith" → "555-5678"]
- To find "John Smith", you hash to bucket 1, then search the list for the exact key match

Separate chaining is simple and works well when the number of collisions is small.

Open Addressing: Find an Alternative Bucket

In open addressing, when a bucket is already occupied, you do not store multiple entries in one bucket; instead, you search for another empty bucket nearby.

Linear Probing: Check buckets sequentially until you find an empty one.
- If bucket $h(key)$ is occupied, check bucket $h(key) + 1$
- If that's occupied, check $h(key) + 2$, and so on
- When looking up a key, follow the same sequence: start at $h(key)$, and if it doesn't match, keep probing forward

Quadratic Probing: Check buckets at intervals that grow quadratically.
- If bucket $h(key)$ is occupied, check bucket $h(key) + 1^2$
- Then check bucket $h(key) + 2^2$, then $h(key) + 3^2$, and so on
- This reduces clustering (the tendency of filled buckets to group together)

Double Hashing: Use a second hash function to determine the step size.
- If bucket $h_1(key)$ is occupied, the next bucket to check is $h_1(key) + h_2(key)$
- Then check $h_1(key) + 2 \cdot h_2(key)$, and so on
- This provides even better distribution than linear or quadratic probing

In all of these schemes, probe indices are taken modulo the number of buckets, so the search wraps around the end of the array. The key insight with open addressing: lookup must follow the same probe sequence as insertion to find the correct entry.
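The open-addressing idea can be sketched with linear probing. This is a minimal fixed-size table supporting insert and lookup only (deletion in open addressing requires tombstone markers, which are omitted here for brevity):

```python
class LinearProbingTable:
    """Minimal open-addressing hash table using linear probing."""

    def __init__(self, num_buckets=16):
        self.keys = [None] * num_buckets
        self.values = [None] * num_buckets

    def _probe(self, key):
        """Yield bucket indices in the probe sequence for this key."""
        start = hash(key) % len(self.keys)
        for i in range(len(self.keys)):
            yield (start + i) % len(self.keys)  # wrap around the array

    def put(self, key, value):
        for idx in self._probe(key):
            # Stop at the first empty slot, or update an existing entry.
            if self.keys[idx] is None or self.keys[idx] == key:
                self.keys[idx] = key
                self.values[idx] = value
                return
        raise RuntimeError("table is full")

    def get(self, key):
        # Lookup must follow the same probe sequence as insertion.
        for idx in self._probe(key):
            if self.keys[idx] is None:
                return None  # empty slot: the key was never inserted
            if self.keys[idx] == key:
                return self.values[idx]
        return None
```

Note how `get` retraces exactly the probe sequence `put` used; an empty slot along that path proves the key is absent.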
Load Factor and Performance

Defining Load Factor

The load factor is a measure of how full your hash table is:

$$\text{load factor} = \frac{\text{number of stored entries}}{\text{number of buckets}}$$

For example, if you have 100 entries and 200 buckets, your load factor is $100 \div 200 = 0.5$.

Why Load Factor Matters

The load factor directly affects performance:
- Low load factor (e.g., 0.25): buckets are mostly empty, collisions are rare, and operations stay at $O(1)$
- High load factor (e.g., 0.9 or higher): buckets are very full, collisions become frequent, and operations degrade toward $O(n)$ in the worst case

The higher the load factor, the more likely collisions become, and the more time you spend resolving them (either searching through long chains in separate chaining, or probing many buckets in open addressing).

Keeping Performance High

To maintain the $O(1)$ average-case performance, most implementations keep the load factor below a threshold (typically 0.7 or 0.75). When the load factor gets too high, the hash table is resized.

Resizing and Rehashing

When Resizing Happens

When the load factor exceeds a predetermined threshold (such as 0.75), the hash table is resized, because performance is degrading due to too many collisions.

The Resizing Process

Resizing involves three steps:
1. Allocate a larger array: create a new bucket array, typically about double the size of the current one
2. Recompute all bucket indices: pass every existing entry through the hash function again. Because the bucket count has changed, the indices will change; for example, indices computed for 16 buckets will generally differ from those computed for 32 buckets
3. Reinsert all entries: place each entry into its new bucket position in the larger array

This process is called rehashing.
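The resizing steps above can be sketched with a separate-chaining table that doubles its bucket array when the load factor crosses a threshold (the 0.75 threshold and the doubling factor are common conventions, used here for illustration):

```python
class ChainedHashTable:
    """Separate-chaining hash table that resizes when too full."""

    def __init__(self, num_buckets=8, max_load=0.75):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0
        self.max_load = max_load

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def load_factor(self):
        return self.size / len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # update an existing entry
                return
        bucket.append((key, value))
        self.size += 1
        if self.load_factor() > self.max_load:
            self._resize()

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def _resize(self):
        # 1. Collect all entries, 2. allocate a doubled array,
        # 3. reinsert each entry, which recomputes its bucket index.
        old_entries = [pair for bucket in self.buckets for pair in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        self.size = 0
        for k, v in old_entries:
            self.put(k, v)
```

Every `put` checks the load factor, so the table grows before long chains can degrade lookups.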
Although it requires reprocessing all existing entries, rehashing is a relatively rare operation (you only resize when the load factor gets too high), so the average cost per operation remains constant.

Common Applications of Hash Tables

Hash tables are fundamental data structures used throughout computer science:
- Dictionaries and Symbol Tables: mapping words to definitions (like a dictionary) or variable names to their values in a compiler's symbol table
- Caches: storing frequently accessed data so it can be retrieved by key in constant time, rather than computing or fetching it again
- Counting and Frequency Analysis: tallying how many times each word appears in a document, or counting occurrences of events
- Sets: storing unique items and checking membership quickly (e.g., "is this user in the banned list?")
- Database Indexes: creating fast lookups in databases by indexing records by key fields

The common thread: whenever you need fast retrieval of data based on a key, hash tables are an excellent choice.
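As a concrete instance of the counting use case, Python's built-in dict (itself backed by a hash table) makes word-frequency tallying a few lines:

```python
def word_counts(text: str) -> dict:
    """Tally how many times each word appears in a text."""
    counts = {}
    for word in text.lower().split():
        # Each lookup and update is O(1) on average, thanks to hashing.
        counts[word] = counts.get(word, 0) + 1
    return counts

print(word_counts("the quick brown fox jumps over the lazy dog"))
# "the" appears twice; every other word once
```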
Flashcards
What is the primary purpose of a hash table?
To store key‑value pairs for fast find, insert, and delete operations.
What is the average time complexity for basic operations in a well-designed hash table?
$O(1)$ (Constant time).
What structural component of a hash table holds the stored entries?
An array of buckets.
How does a hash table determine the bucket index for a specific key?
By using a hash function to convert the key into an integer index.
What is the primary goal of a "good" hash function regarding key distribution?
To distribute keys uniformly across the bucket array.
Why must a hash function be deterministic?
To ensure it returns the same index for the same key every time.
What is the general process for inserting a (key, value) pair into a hash table?
Compute the bucket index with the hash function and store the pair in that bucket.
How is a lookup performed in a hash table?
Recompute the bucket index using the hash function and search that bucket for the matching key.
How is a key deleted from a hash table?
Locate the bucket via the hash function, find the matching entry, and remove it.
When does a collision occur in a hash table?
When two different keys produce the same bucket index.
How does the separate chaining technique handle collisions?
Each bucket holds a list (or small container) of all entries mapping to it.
What is the general mechanism of the open addressing technique for resolving collisions?
The algorithm probes alternative buckets until an empty slot is found.
In open addressing, how does linear probing find the next slot?
It checks the next sequential bucket each time.
How does double hashing determine the step size for probing?
By using a second hash function.
What is the mathematical definition of the load factor in a hash table?
$\text{load factor} = \frac{\text{number of stored entries}}{\text{number of buckets}}$
What is the relationship between a high load factor and hash table performance?
A higher load factor increases collisions, which can degrade performance.
Why is it important to keep the load factor low in a hash table?
To preserve the average-case $O(1)$ performance of operations.
What typically triggers the resizing of a hash table?
The load factor becoming too high.
What are the two main steps in the resizing procedure of a hash table?
Allocating a larger bucket array and re-hashing every existing entry into it.

Key Concepts
Hash Table Fundamentals
Hash table
Hash function
Collision (hashing)
Load factor
Rehashing
Collision Resolution Techniques
Separate chaining
Open addressing
Applications of Hash Tables
Dictionary (data structure)
Cache (computing)
Symbol table