Hash table - Designing Effective Hash Functions
Understand the purpose of hash functions, common construction methods (division and multiplication), and how to achieve uniform distribution to avoid clustering.
Summary
Hash Functions
Introduction
A hash function is a fundamental tool in data structures that enables efficient data storage and retrieval. Rather than storing data directly indexed by their keys (which may be very large or non-numeric), hash functions compress these keys into small integer indices that fit within a hash table. This compression is the source of both hash functions' power and their challenges.
Purpose and Properties
The core role of a hash function is to map keys from a potentially infinite universe of possible values into indices within a fixed-size table. Formally, we write this as:
$$h(k) = \text{hash}(k) \bmod m$$
where $k$ is a key, $m$ is the size of the hash table, and $h(k)$ is the resulting index between 0 and $m-1$.
The modulo operation is crucial here—it constrains any hash value to fall within our table's boundaries, regardless of the input key's magnitude.
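As a minimal sketch of this reduction step in Python: the function name `to_index` is hypothetical, and the raw hash values are arbitrary integers chosen only for illustration.

```python
def to_index(raw_hash: int, m: int) -> int:
    """Reduce an arbitrary integer hash value to a slot index in [0, m)."""
    if m <= 0:
        raise ValueError("table size m must be positive")
    # Python's % returns a non-negative result for positive m,
    # even when raw_hash is negative.
    return raw_hash % m


m = 10
for raw in (7, 12345678901234567890, -3):
    print(raw, "->", to_index(raw, m))
```

Whatever the magnitude or sign of the raw value, the result always lies between 0 and $m-1$. Note that this relies on Python's definition of `%`; in languages where the remainder can be negative, an extra adjustment would be needed.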
Perfect Hash Functions
In the ideal case, a perfect hash function is one that is injective on a given static set of keys. This means each key maps to a unique index with no collisions whatsoever. While perfect hash functions are theoretically elegant and useful for certain specialized applications, they're rarely achievable in practice for dynamic sets of keys (where keys are added or removed over time). Instead, we focus on designing hash functions that perform well on average.
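To make the injectivity condition concrete, here is a small sketch that checks whether a candidate function is perfect for a given static key set. The helper name `is_perfect` and the sample keys are assumptions made for this example, and the keys are assumed to be distinct.

```python
def is_perfect(h, keys):
    """Return True if h maps every key in `keys` to a distinct index."""
    indices = [h(k) for k in keys]
    return len(set(indices)) == len(indices)


static_keys = [10, 21, 32, 43]
h = lambda k: k % 4

print(is_perfect(h, static_keys))         # True: indices are 2, 1, 0, 3
print(is_perfect(h, static_keys + [14]))  # False: 14 and 10 both map to 2
```

Adding a single new key is enough to break the property, which illustrates why perfect hashing is mostly practical for key sets that never change.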
Uniform Distribution
The most important property of a practical hash function is uniform distribution: hash values should be spread evenly across all table indices. When keys hash to a non-uniform distribution, some table slots become crowded while others remain empty. This leads to collisions (multiple keys hashing to the same index) concentrated in certain regions, degrading performance.
A well-designed hash function treats the table as if keys were scattered randomly and uniformly across it, even though the function itself is deterministic.
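One simple way to inspect a distribution is to count how many keys land in each slot. The sketch below does this for two toy functions; `bucket_counts` is a hypothetical helper, and the deliberately bad function `(k * 10) % m` is chosen only to show the extreme case.

```python
from collections import Counter


def bucket_counts(keys, h, m):
    """Count how many keys land in each of the m slots."""
    counts = Counter(h(k) for k in keys)
    return [counts.get(i, 0) for i in range(m)]


m = 10
keys = range(1000)

even = bucket_counts(keys, lambda k: k % m, m)
skewed = bucket_counts(keys, lambda k: (k * 10) % m, m)

print(max(even))    # 100: every slot holds n / m keys
print(max(skewed))  # 1000: every key lands in slot 0
```

The closer the largest bucket is to $n / m$ (the number of keys divided by the table size), the more uniform the distribution.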
Common Construction Methods
Once we understand what makes a hash function valuable, the question becomes: how do we construct one? Two fundamental approaches dominate in practice.
The Division Method
The division method is the simplest construction:
$$h(k) = k \bmod m$$
Here, we simply divide the key by the table size and use the remainder as the index. This is fast to compute and straightforward to understand.
However, the choice of $m$ is critical for achieving uniform distribution. A common recommendation is to make $m$ a prime number, preferably one not too close to a power of two. Why? If $m$ shares a common factor with a regular pattern in the keys, only a fraction of the slots is ever used. For example, if $m$ is even and every key is even, all keys land in even-numbered slots and half the table stays empty. A prime $m$ has no divisors other than 1 and itself, so a stride in the keys can only interact badly with it when that stride is a multiple of $m$.
Example: With a prime table size like $m = 17$, keys with common strides (multiples of 2, 4, 10, and so on) still reach all 17 indices. If we instead chose $m = 16$ (a power of two), $k \bmod 16$ keeps only the four low-order bits of $k$, so keys that share those bits collide no matter how much their high-order bits differ.
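The effect of the table size can be checked directly. The following sketch hashes 100 keys that are all multiples of 4 (an assumed key pattern chosen for illustration) with $m = 16$ and $m = 17$, and reports which slots are actually used.

```python
def division_hash(k: int, m: int) -> int:
    """Division method: the remainder of k divided by the table size m."""
    return k % m


keys = range(0, 400, 4)  # 100 keys, all multiples of 4

slots_16 = {division_hash(k, 16) for k in keys}
slots_17 = {division_hash(k, 17) for k in keys}

print(sorted(slots_16))  # [0, 4, 8, 12]
print(len(slots_17))     # 17
```

With $m = 16$ the keys reach only 4 of the 16 slots, because 4 divides 16. With $m = 17$ they reach every slot.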
The Multiplication Method
The multiplication method is more sophisticated:
$$h(k) = \lfloor (k \cdot A \bmod 1) \cdot m \rfloor$$
This method multiplies the key by a constant $A$ (where $0 < A < 1$), takes the fractional part of this product, scales it to the table size, and floors the result.
The constant $A$ is typically chosen as the fractional part of an irrational number. A popular choice, suggested by Knuth, is $A = \frac{\sqrt{5} - 1}{2} \approx 0.618$, which is the reciprocal of the golden ratio. This choice tends to produce a good spread across many types of keys in practice.
Why the golden ratio? For any irrational $A$, the fractional parts of $A, 2A, 3A, \dots$ are equidistributed modulo 1, so consecutive integer keys spread across the whole fractional range instead of piling up. The golden ratio is a particularly good choice because it spaces those fractional parts especially evenly. This argument covers sequential keys; for arbitrary key sets it is a useful heuristic rather than a guarantee.
The multiplication method has an advantage: it works well with table sizes that are powers of two ($m = 2^p$), where the scaling step reduces to a cheap bit shift. In contrast, the division method generally needs a prime table size to handle patterned keys well.
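The sketch below shows two ways the multiplication method might be written in Python. The function names and the 32-bit word size `W` are assumptions for this example. The constant `S` is $\lfloor A \cdot 2^{32} \rfloor$ for the golden-ratio value of $A$. The floating-point form follows the formula literally, but it loses precision for very large keys, so the integer form is usually preferred when $m$ is a power of two.

```python
import math

A = (math.sqrt(5) - 1) / 2  # about 0.618


def multiplication_hash(k: int, m: int) -> int:
    """Floating-point form: floor(frac(k * A) * m)."""
    return math.floor(((k * A) % 1.0) * m)


W = 32          # assumed word size in bits
S = 2654435769  # floor(A * 2**32)


def multiplication_hash_pow2(k: int, p: int) -> int:
    """Integer form for m = 2**p (1 <= p <= W): top p bits of the low W bits of k * S."""
    return ((k * S) & ((1 << W) - 1)) >> (W - p)


print(multiplication_hash_pow2(123456, 14))  # 67
```

In the integer form, masking with `(1 << W) - 1` plays the role of "mod 1" (it discards the integer part of the product), and the right shift scales the result to a table of size $2^p$.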
Advanced Considerations
Beyond basic construction, hash function design must account for how collisions are resolved in your specific collision-resolution scheme.
Avoiding Clustering
When using open addressing (where collisions are resolved by finding alternative empty slots in the same table), a poor hash function can create clustering: a situation where colliding keys are placed in sequences of consecutive or nearby slots. This exacerbates the collision problem—once a region becomes crowded, subsequent collisions are more likely to occur nearby, creating larger and larger clusters.
A good hash function helps by spreading keys uniformly, so that fewer collisions occur to start a cluster in the first place. It cannot eliminate clustering on its own, though: with linear probing, clusters also grow because of the probe sequence itself, so the probing strategy matters as well. The multiplication method with the golden ratio tends to scatter sequential keys, which reduces the number of initial collisions.
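To see how a cluster forms, consider this minimal sketch of open addressing with linear probing. Both helper functions are hypothetical, the home slot is simply `key % m`, and the run-length measure ignores wrap-around at the end of the table for simplicity.

```python
def insert_linear_probing(table, key):
    """Place key in the first free slot at or after key % m, wrapping around."""
    m = len(table)
    for step in range(m):
        slot = (key + step) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


def longest_run(table):
    """Length of the longest stretch of occupied slots (ignoring wrap-around)."""
    best = current = 0
    for entry in table:
        current = current + 1 if entry is not None else 0
        best = max(best, current)
    return best


table = [None] * 8
for key in (3, 11, 19, 4):
    insert_linear_probing(table, key)

print(table)               # [None, None, None, 3, 11, 19, 4, None]
print(longest_run(table))  # 4
```

Keys 3, 11, and 19 all have home slot 3 and fill slots 3 to 5. Key 4 has a different home slot, yet it is pushed to slot 6 because the cluster already covers slots 4 and 5, so the cluster grows even without a direct hash collision.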
Table Size Considerations
The design of your hash function should align with your choice of table size:
For division method: Use prime numbers like 53, 97, 199, etc. Primes make the function less sensitive to regular patterns in the keys, although no table size can guarantee good distribution for every key set.
For multiplication method: Powers of two like 128, 256, 1024 are efficient computationally (the method reduces to an integer multiply followed by a bit shift) and work well with the multiplication method's properties.
Mixing these—for instance, using the division method with a power-of-two table size—typically results in poor distribution and should be avoided.
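When a table must hold at least some minimum number of slots, the size can be rounded up to suit the chosen method. The helpers below are a sketch under that assumption. Their names are hypothetical, and the trial-division primality test is only meant for table sizes of modest magnitude.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for table sizes of modest magnitude."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n: int) -> int:
    """Smallest prime >= n, for use with the division method."""
    while not is_prime(n):
        n += 1
    return n


def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n (n >= 1), for the multiplication method."""
    return 1 << (n - 1).bit_length()


print(next_prime(50), next_prime(100))  # 53 101
print(next_power_of_two(100))           # 128
```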
<extrainfo>
Additional Practical Notes
In real-world applications, hash functions are often specialized for the types of keys being hashed. String keys, for example, are usually hashed by combining all of their characters, so that strings differing in any position tend to get different hash values. Similarly, some applications use multiple hash functions in combination (as in Bloom filters or double hashing), where the independence of the functions from one another matters as much as the properties of any single function. These advanced applications go beyond the fundamental hash function design covered in most introductory courses.
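As one illustration of a string-oriented hash, the sketch below evaluates a polynomial over the bytes of the string using Horner's rule. The function name, the base 31, and the table size 101 are assumptions made for this example, not a recommendation for any particular application.

```python
def string_hash(s: str, m: int, base: int = 31) -> int:
    """Polynomial hash of s evaluated with Horner's rule, reduced modulo m."""
    h = 0
    for byte in s.encode("utf-8"):
        # Each byte shifts the running value by one "digit" in the given base.
        h = (h * base + byte) % m
    return h


print(string_hash("ab", 101))  # 75
print(string_hash("ba", 101))  # 4
```

Because every character's position affects the result, the anagrams "ab" and "ba" land in different slots, which a simple sum of character codes would not achieve.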
</extrainfo>
Flashcards
How does a hash function map a key $k$ to an integer index within a table of size $m$?
$h(k) = \text{hash}(k) \bmod m$
What is the primary characteristic of a perfect hash function for a given static key set?
It is injective (gives a unique index for each key)
Why is it essential for hash values to have a uniform distribution?
To minimize collisions
In the division method for hashing, how is the hash value $h(k)$ calculated given a table size $m$?
$h(k) = k \bmod m$
What type of number is typically chosen for the table size $m$ when using the division method?
A prime number
What is the formula for the multiplication method of hashing, using table size $m$ and constant $A$?
$h(k) = \lfloor (k \cdot A \bmod 1) \cdot m \rfloor$
In the context of open addressing, what phenomenon should hash functions avoid to prevent consecutive slots from being filled by colliding keys?
Clustering
What are two common types of table sizes $m$ for which a hash function must remain uniform?
Powers of two
Primes
Quiz
Question 1: What is the general form of a hash function that maps a key to an index within a table of size $m$?
- h(k) = hash(k) mod m (correct)
- h(k) = hash(k) * m
- h(k) = (hash(k) + m) mod size
- h(k) = hash(k) / m
Question 2: Why is a uniform distribution of hash values important?
- It minimizes the number of collisions (correct)
- It increases the size of hash values
- It simplifies the computation of the hash
- It ensures all keys are prime numbers
Question 3: In the division method of hashing, what is the typical choice for the table size $m$?
- A prime number (correct)
- A power of two
- The exact number of keys
- A composite number
Key Concepts
Hash Function Concepts
Hash function
Perfect hash function
Uniform hashing
Hash Construction Methods
Division method
Multiplication method
Collision Resolution Techniques
Open addressing
Clustering
Table size (hash table)
Definitions
Hash function
A deterministic algorithm that maps keys from a large universe to integer indices within a fixed-size table.
Perfect hash function
A hash function that is injective on a specific static key set, assigning a unique index to each key without collisions.
Uniform hashing
The property of a hash function that distributes keys evenly across all table slots, minimizing the likelihood of collisions.
Division method
A simple hash construction where the key is reduced modulo the table size, often using a prime number for the modulus.
Multiplication method
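A hash construction that multiplies the key by a constant between 0 and 1 (commonly derived from the golden ratio), keeps the fractional part of the product, and scales it to the table size; for power-of-two table sizes this amounts to extracting high-order bits of a fixed-width product.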
Open addressing
A collision resolution technique in hash tables where a colliding key probes alternative slots according to a defined sequence.
Clustering
The phenomenon where consecutive slots become filled due to collisions, leading to degraded performance in open addressing schemes.
Table size (hash table)
The number of slots in a hash table, typically chosen as a prime number or a power of two to complement the hash function’s distribution properties.