RemNote Community

Hash function - String Hashing Techniques

Learn why hashing every character matters, how folding and radix methods create efficient string hashes, and how rolling hashes enable fast substring matching.


Summary

Comprehensive Guide to String Hashing Techniques

Introduction to String Hashing

When we hash a string, we need to convert an entire sequence of characters into a single hash value that can serve as a table index. The key insight is that a good hash function must consider all characters in the string and treat characters at different positions differently. This is more complex than hashing a single integer, and the techniques we use have evolved significantly to handle both correctness and efficiency.

At the most basic level, a hash function takes a key (like a string) and produces a hash value that indicates where that key should be stored in a hash table. But how do we design such a function for strings?

Why All Characters Matter: The Distribution Problem

Natural-language text and most real-world strings have a non-uniform distribution of characters: certain characters appear much more frequently than others, and character pairs are not randomly distributed either. In English text, for example, the letter 'e' appears far more often than 'q', and combinations like 'th' or 'ing' are common while others are rare.

Because of this non-uniformity, if our hash function only looks at part of the string (say, only the first few characters or the last few), it ignores the variation that actually exists in the parts it skips. This creates a dangerous scenario: with simplistic hashing (such as summing only the first and last n characters, or using only the middle 4 characters), many different strings hash to the same value because the ignored portions can vary widely. The resulting collisions can degrade lookups to linear time, defeating the purpose of hashing.

For example, if you hash only the first 3 characters of people's names, then "John Smith" and "John Doe" hash identically, as do all names starting with "John". You have effectively thrown away the distinguishing information.
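To make the failure mode concrete, here is a small sketch (the names, the prime 31, and the table size 101 are illustrative choices, not from the text):

```python
def prefix_hash(s, table_size=101):
    """Hash only the first 3 characters -- a deliberately bad function."""
    return sum(ord(c) for c in s[:3]) % table_size

def full_hash(s, table_size=101):
    """Multiplicative folding over every character, for contrast."""
    h = 0
    for c in s:
        h = h * 31 + ord(c)   # 31 is a commonly used small prime
    return h % table_size

print(prefix_hash("John Smith") == prefix_hash("John Doe"))  # True: collision
print(full_hash("John Smith") == full_hash("John Doe"))      # False
```

Every name beginning with "John" lands in the same bucket under `prefix_hash`, while the full fold distinguishes the two strings.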
The only time simplistic hashing works acceptably is when the portions of strings you ignore are invariant constants (the same for all keys). For instance, if all your strings have zeros padded up to a certain position, ignoring that position loses no information. In practice, this is rare.

Character Folding: A Better Approach

To fix the problems with simplistic hashing, we use character folding techniques that process every character while combining them in mathematically sophisticated ways.

Multiplicative Folding with Prime Constants

The standard approach is multiplicative folding: maintain a running hash total, and before adding each new character's value, multiply the current total by a large prime number. Conceptually, the algorithm looks like this:

    hashvalue = 0
    for each character in string:
        hashvalue = hashvalue * PRIME + charactercode

Why multiply by a prime number? The prime acts as a multiplier that spreads the bits of each character across many positions of the hash. Because a prime shares no factors with the character codes or with typical table sizes, the products avoid short repeating cycles, so small changes in the input create large changes in the intermediate hash values. This distribution property helps prevent collisions.

The Bit Clustering Problem

Here's a subtle but important issue: ASCII character codes have a particular structure. All printable ASCII characters have the high bit cleared (they use only the lower 7 bits), and common strings avoid many of the lowest-value codes. The information in a string is therefore clustered in certain bit positions.

When we fold characters together with simple operations like addition, this bit clustering can persist in the final folded result: the bits that carry information remain concentrated in certain positions, creating patterns that increase collisions. This is why the multiplicative folding approach is superior; the multiplication by a prime scrambles these bits across more positions.
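A minimal sketch of the contrast between plain additive folding and multiplicative folding (the prime 1000003 and the 32-bit truncation are illustrative assumptions):

```python
def additive_fold(s):
    """Sum of character codes: information stays clustered in low bits."""
    return sum(ord(c) for c in s)

def multiplicative_fold(s, prime=1000003):
    """Multiply-then-add folding: the prime scrambles bits each step."""
    h = 0
    for c in s:
        h = h * prime + ord(c)
    return h & 0xFFFFFFFF  # keep a fixed 32-bit width

# Pure addition loses character order entirely, so anagrams collide.
print(additive_fold("stop") == additive_fold("pots"))              # True
print(multiplicative_fold("stop") == multiplicative_fold("pots"))  # False
```

The anagram collision is the order-insensitivity problem in miniature; the multiplicative version keeps positional information because each character is weighted by a different power of the prime.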
Other Folding Methods

Some hash functions use XOR (exclusive or) instead of addition when combining character values:

    hashvalue = 0
    for each character in string:
        hashvalue = hashvalue ^ charactercode

XOR is another plausible folding method, but note that plain XOR discards character order (anagrams collide), so it is usually paired with a multiplication or shift at each step, and the multiplier/constant must be chosen carefully for it to work as effectively as multiplicative folding.

Reducing to a Table Index

After processing all characters, the accumulated hash value is usually very large, so we need to convert it to a valid table index. This final step uses one of:

- A modulo operation: hashvalue % tablesize
- A bit mask: hashvalue & (tablesize - 1), which requires tablesize to be a power of two
- Some other reduction function

The choice depends on the implementation details, but the goal is always to map the full range of hash values into the range of available table indices.

Modern Word-Based Hashing

Modern CPUs can process data much more efficiently if we work with multiple bytes at once rather than individual characters. This has led to word-length folding strategies: instead of processing the string one character at a time, contemporary hash functions interpret the string as an array of 32-bit or 64-bit integers (depending on the processor architecture). Each word is then combined into the hash total using operations like:

- Multiplication by a constant (similar to multiplicative folding)
- Bit shifting (moving bits left or right to change their positions)
- XOR operations across words

For example, a function might take 8 bytes at a time, interpret them as a 64-bit integer, multiply by a prime, and add the result to the accumulating hash. This processes strings 8× faster than per-character hashing while maintaining all the benefits of considering every byte.

<extrainfo>
PJW Hash Method

The PJW hash method (named for Peter J. Weinberger and popularized in Aho, Sethi, and Ullman's compiler text) is a specific character folding technique that uses bit patterns intelligently. While it was important historically, modern implementations have adapted it for 64-bit processors to process larger word chunks at once.
</extrainfo>

Radix Conversion Hashing: The Polynomial Approach

An elegant way to think about string hashing is to treat the string as a polynomial in some base (radix). If a string has length $k$ with characters having numeric codes $x_0, x_1, \ldots, x_{k-1}$, the polynomial representation is:

$$\sum_{i=0}^{k-1} x_i a^{i}$$

where $a$ is the radix (the base). This formula treats the first character as the constant term, multiplies the second character by $a$, the third by $a^2$, and so on. It is the positional number system we use in decimal (base 10) or binary (base 2), applied to string characters.

Choosing the Radix

The radix $a$ should be chosen as a prime number larger than the number of distinct characters in your character set. For instance:

- For lowercase English letters only (26 characters), you might use 29 or 31
- For extended ASCII (256 characters), you might use 257 or 263

Using a prime number greater than the alphabet size ensures that the polynomial values distribute well and minimizes collisions for short strings.

Using the Polynomial as a Hash

Once you compute the polynomial value, you can either:

- Use it directly as the hash code (if the table is large enough)
- Apply a reduction function like modulo to fit it into your table size

This approach is mathematically clean and performs well in practice. The key insight is that the polynomial naturally incorporates every character's position and value, creating a different hash for almost every distinct string.

Rolling Hash Techniques: Efficient Substring Hashing

Suppose you need to find all occurrences of a pattern string in a text. One approach is to hash every substring of the text that matches the pattern's length and compare hashes. But computing each substring's hash from scratch is expensive.

The Rolling Hash Concept

A rolling hash uses a clever technique: instead of recomputing the full hash at each substring position, we update the previous hash value incrementally.
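The radix formula translates directly into code. A minimal sketch, assuming radix 257 and a power-of-two table size purely for illustration:

```python
def radix_hash(s, a=257, table_size=2**20):
    """Polynomial hash: sum of x_i * a^i, first character as constant term."""
    h = 0
    power = 1                # a^i, starting at a^0 for the first character
    for ch in s:
        h += ord(ch) * power # add x_i * a^i
        power *= a
    return h % table_size    # reduce to a valid table index

print(radix_hash("cat") != radix_hash("act"))  # True: positions matter
```

Keeping a running `power` avoids recomputing $a^i$ from scratch at each position; because every character is weighted by its own power of the radix, permuting the string changes the hash.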
Imagine sliding a fixed-size window across the text:

- When the window moves one position to the right, we drop the leftmost character and take in a new rightmost character
- The hash can be updated in constant time rather than recomputed from scratch

Complexity Improvement

Naive approach: extract each substring and hash it independently. For each of the $n$ positions in the text, we spend $k$ operations (where $k$ is the pattern length). This gives $O(k \cdot n)$ time.

Rolling hash approach: with an appropriate hash function, each update takes constant time. The total complexity becomes $O(mk + n)$, where $m$ is the number of pattern occurrences, since each candidate match still costs $O(k)$ to verify. When $m$ is small, this is dramatically faster.

The Rabin-Karp Algorithm

The Rabin-Karp algorithm is a pattern-matching algorithm that uses the Rabin fingerprint as its rolling hash function. The Rabin fingerprint is specifically designed for string searching and works well at avoiding collisions for 8-bit character strings.

The algorithm works as follows:

1. Compute the hash of the pattern
2. Use a rolling hash to compute the hashes of all substrings of the text with the pattern's length
3. When a substring's hash matches the pattern's hash, verify with a direct character comparison (to rule out hash collisions)

This approach is remarkably efficient for most real-world inputs.

Worst-Case Behavior

Like many hashing algorithms, Rabin-Karp has a weakness in pathological cases. When both the text and pattern consist entirely of a repeated character (like "aaaaaaa..."), every window's hash matches the pattern's hash, every match triggers a full verification, and the algorithm degrades to $O(n \cdot k)$ time. However, these cases are extremely rare in practice, so the average performance remains excellent.

<extrainfo> In some implementations, the worst case is handled by using multiple independent hash functions simultaneously, or by adding randomization to make the worst-case probability negligible. </extrainfo>
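A compact sketch of the whole pipeline follows. It uses a generic polynomial rolling hash as a stand-in for the true Rabin fingerprint, and it gives the leftmost character the highest power (the mirror image of the radix formula above) so the window can be rolled without division; the radix and modulus are illustrative choices:

```python
RADIX = 257          # prime just above the 256 possible byte values
MOD = (1 << 61) - 1  # large Mersenne prime keeps spurious collisions unlikely

def find_all(pattern, text):
    """Return all start indices where pattern occurs in text."""
    k, n = len(pattern), len(text)
    if k == 0 or k > n:
        return []
    high = pow(RADIX, k - 1, MOD)  # weight of the outgoing leftmost character
    p_hash = w_hash = 0
    for i in range(k):             # hash the pattern and the first window
        p_hash = (p_hash * RADIX + ord(pattern[i])) % MOD
        w_hash = (w_hash * RADIX + ord(text[i])) % MOD
    matches = []
    for i in range(n - k + 1):
        # on a hash match, verify by direct comparison to rule out collisions
        if w_hash == p_hash and text[i:i + k] == pattern:
            matches.append(i)
        if i + k < n:  # roll the window one character to the right, O(1)
            w_hash = ((w_hash - ord(text[i]) * high) * RADIX
                      + ord(text[i + k])) % MOD
    return matches

print(find_all("ab", "abcabab"))  # [0, 3, 5]
```

The per-step update subtracts the outgoing character's contribution, shifts everything up by one radix position, and adds the incoming character, which is what makes the scan $O(n)$ plus verification time.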
Flashcards
How are individual characters and character pairs distributed in natural-language strings?
They have highly non-uniform distributions.
What two properties must a good hash function for strings possess regarding its characters?
It must depend on every character of the string, and it must treat characters at different positions differently (so that permuted strings hash differently).
What is the primary risk of using simplistic hashing shortcuts like only processing the first and last $n$ characters?
It can produce linear-time behavior due to collisions from redundancies or clusters.
How does multiplicative folding process the next character value to improve the hash?
It multiplies the current total by a sizable prime number before adding the next character.
Why does bit clustering in ASCII (like cleared high bits) pose a problem for folding?
The clustering can remain in the folded result and increase collisions.
How do modern CPUs improve hashing efficiency over the traditional byte-at-a-time approach?
By interpreting strings as arrays of 32-bit or 64-bit integers.
How is a string represented mathematically in radix conversion hashing?
As a polynomial: $\sum_{i=0}^{k-1} x_i a^{i}$ (where $x_i$ is the character code and $a$ is the radix).
What criteria are typically used to choose the radix $a$ in polynomial hashing?
It is usually a prime number larger than the number of distinct characters in the set.
What is the time complexity of the naïve approach to hashing all $n$ substrings of length $k$?
$O(k \cdot n)$.
What specific rolling hash does the Rabin-Karp algorithm utilize?
The Rabin fingerprint.
Under what pathological condition does the Rabin-Karp algorithm degrade to $O(n \cdot k)$ time?
When both the text and the pattern consist of a repeated single character.

Key Concepts
Hashing Techniques
Hash function
String hashing
Character folding
PJW hash
Rolling hash
Polynomial rolling hash
Word‑oriented hashing
XOR folding
String Search Algorithms
Rabin–Karp algorithm
Character Analysis
Letter frequency distribution