Hash function Study Guide
Study Guide
📖 Core Concepts
Hash Function – deterministic algorithm that maps data of any size to a fixed‑size hash value (digest).
Hash Table – array indexed by hash values; enables near‑constant‑time lookup, insertion, deletion.
Collision – two distinct keys produce the same hash value; must be resolved.
Uniformity – ideal hash distributes keys evenly across the table, giving each slot probability $1/m$ (where m = #slots).
Determinism – same input always yields the same output; essential for retrieval.
Universal Hashing – pick a hash function randomly from a family; any two keys collide with probability $1/m$.
Cryptographic vs. Non‑cryptographic – cryptographic hashes emphasize security (pre‑image, collision resistance); non‑cryptographic prioritize speed and low collision rates for data structures.
📌 Must Remember
Division hashing: $h(K) = K \bmod M$, choose M prime ≈ table size.
Multiplicative hashing: $h(K)=\left\lfloor\frac{(a·K \bmod 2^{w})}{2^{w-m}}\right\rfloor$, a odd, w word size, m output bits.
Chained hashing – each bucket holds a linked list of colliding keys.
Open addressing – probe sequence (linear, quadratic, double hashing) finds the next empty slot.
Universal hash collision probability: $1/m$.
Rolling hash update (Rabin–Karp):
$$h{new}= (h{old} - x{old}\cdot a^{k-1})·a + x{new} \pmod{p}$$
Character folding (prime‑multiplier): hash = hash P + char, where P is a sizable prime.
Polynomial string hash: $H = \sum{i=0}^{k-1} xi a^{i} \pmod{p}$, a prime > alphabet size.
🔄 Key Processes
Insert into a hash table
Compute hash code → index.
If bucket empty → store.
If occupied → apply collision‑resolution (chain or probing).
Open addressing probing (linear example)
$i = 0$; while slot $(h(K)+i) \bmod m$ occupied, set $i \leftarrow i+1$.
Insert when an empty slot is found.
Universal hashing selection
Choose random parameters (e.g., a, b) from the family $h{a,b}(K) = ((aK+b) \bmod p) \bmod m$.
String hashing with folding
Initialise hash = 0.
For each character c: hash = (hash P + ord(c)) (modulo if needed).
Rolling hash computation
Compute initial hash for first k characters.
For each slide: subtract contribution of outgoing char, multiply by a, add incoming char, reduce modulo p.
🔍 Key Comparisons
Chained vs. Open Addressing
Chaining: unlimited bucket size, easy deletions, extra memory for pointers.
Open addressing: all data stored in table, cache‑friendly, requires careful probing to avoid clustering.
Division vs. Multiplicative Hashing
Division: simple mod M; quality depends heavily on choice of prime M.
Multiplicative: spreads bits more uniformly, less sensitive to poor M choice; uses bit shifts instead of division.
Simple character folding vs. Polynomial (radix) hashing
Folding: fast, but may retain ASCII bit‑clustering.
Polynomial: treats each position with a different weight, better for non‑uniform character distributions.
⚠️ Common Misunderstandings
“Hash functions must be cryptographically strong.” – For hash tables you only need speed and uniformity; cryptographic strength adds unnecessary overhead.
“Using only first/last characters is sufficient.” – Ignores most of the key; leads to many collisions unless ignored parts are constant.
“Collisions mean failure.” – Collisions are inevitable; proper resolution guarantees correctness.
“Universal hashing guarantees no collisions.” – It only reduces expected collision probability to $1/m$; collisions can still occur.
🧠 Mental Models / Intuition
“Hash as a random dice roll.” – Imagine each key throws a fair m-sided die; uniform hashing means each side (slot) is equally likely.
“Buckets as mailboxes.” – The hash value is the mailbox number; collisions mean two letters in the same box, requiring a list (chaining) or moving to the next box (probing).
“Rolling hash as a conveyor belt.” – When the window moves one step, you drop the oldest item and add a new one, updating the total in constant time.
🚩 Exceptions & Edge Cases
Dynamic resizing: When table grows, recompute hash codes or use minimal movement schemes to avoid moving most entries.
Bad key distribution: If keys are already clustered (e.g., many share low bits), simple truncation (least‑significant bits) yields many collisions.
All‑zero or constant fields: Can be safely ignored in custom hash functions; otherwise they waste computation.
Adversarial inputs: Uniform hashing assumption may be broken; universal hashing mitigates but does not eliminate risk.
📍 When to Use Which
Use chaining when: expected load factor > 0.7, deletions frequent, memory for pointers is acceptable.
Use open addressing when: memory must stay compact, cache performance is critical, deletions are rare (or use lazy deletion).
Division hashing for quick prototypes or when table size is a prime close to a power of two.
Multiplicative hashing for high‑performance integer keys on modern CPUs (avoids costly division).
Polynomial/radix hashing for strings when character distribution is highly non‑uniform.
Rolling hash (Rabin–Karp) when you need to compare many substrings (e.g., plagiarism detection).
👀 Patterns to Recognize
Clustered low‑order bits → suspect need for multiplicative or folding with prime multiplier.
Repeated collisions in a single bucket → likely poor hash function or non‑uniform key set; consider universal hashing or a different method.
Long probe sequences → table too full or poor probing strategy; resize or switch to chaining.
Identical prefixes/suffixes in many keys → avoid hash functions that only look at ends; use full‑character folding.
🗂️ Exam Traps
“Pick the first prime number larger than the table size for division hashing.” – The prime should be close to the table size, not necessarily larger; a larger prime can waste space.
“Open addressing never needs a secondary hash.” – Double hashing (a secondary hash) is essential to avoid primary clustering in linear probing.
“Universal hashing eliminates all collisions.” – It only reduces expected collision probability; collisions still occur.
“Cryptographic hash functions are the best choice for hash tables.” – They are slower; non‑cryptographic functions with good uniformity are preferred.
“If the average‑case lookup is $O(1)$, worst‑case can be ignored.” – In adversarial settings or poorly chosen hash functions, worst‑case $O(n)$ can appear on exams; know the distinction.
---
Study tip: Memorize the core formulas (division, multiplicative, polynomial, rolling update) and when each collision‑resolution strategy shines. Practice converting a sample key set into hash indices using at least two different methods to cement the intuition. Good luck!
or
Or, immediately create your own study flashcards:
Upload a PDF.
Master Study Materials.
Master Study Materials.
Start learning in seconds
Drop your PDFs here or
or