Hash Table Study Guide
📖 Core Concepts
Hash table – data structure that implements an associative array (map) by converting a key into an array index with a hash function.
Hash function – deterministic function h(k) = hash(k) mod m that maps any key k to a bucket index 0 … m‑1. Good hash functions give a uniform distribution to keep collisions rare.
Load factor (α) – ratio of stored entries n to bucket count m: α = n / m. Controls space‑time trade‑off.
Collision – two distinct keys produce the same bucket index. Must be resolved by a collision‑resolution strategy.
Separate chaining – each bucket stores a list (or vector) of all keys that hash there.
Open addressing – all keys live in the bucket array; a probe sequence searches for the next free slot.
Expected O(1) – average cost of lookup, insertion, and deletion when α stays within the recommended range; insertion is amortized O(1) once occasional resizing is accounted for.
Worst‑case O(n) – can happen with many collisions or a poor resolution method.
---
📌 Must Remember
Average‑case cost: O(1) for lookup/insert/delete (assuming good hash & α in range).
Worst‑case cost: O(n) if many keys collide or table is full.
Division method: h(k) = k mod m (choose m prime).
Multiplication method: h(k) = ⌊ (k·A mod 1)·m ⌋ (A ≈ (√5−1)/2).
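The two formulas above can be sketched directly for integer keys. This is an illustrative sketch, not a production hash: function names are my own, and the multiplication method uses Knuth's suggested constant A = (√5−1)/2.

```python
import math

def hash_division(k: int, m: int) -> int:
    """Division method: h(k) = k mod m (m should be prime)."""
    return k % m

def hash_multiplication(k: int, m: int) -> int:
    """Multiplication method: h(k) = floor((k*A mod 1) * m)."""
    A = (math.sqrt(5) - 1) / 2   # Knuth's constant, approx 0.618
    frac = (k * A) % 1.0         # fractional part of k*A
    return int(frac * m)

hash_division(123456, 101)        # → 34 (bucket in 0..100)
hash_multiplication(123456, 128)  # bucket in 0..127
```

Note that the multiplication method tolerates a power-of-two m, which is why it pairs well with tables sized 2^p.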
Load‑factor guidelines
Chaining: 1 ≤ α ≤ 3 (performance degrades gracefully).
Open addressing: 0.6 ≤ α ≤ 0.75 (keeps probe length short).
Resize trigger: when α exceeds the upper threshold → allocate new array (usually 2·m) and rehash every entry.
Open‑addressing probe formulas
Linear: index = (h(k) + i) mod m.
Quadratic: index = (h(k) + c₁·i + c₂·i²) mod m.
Double hashing: index = (h₁(k) + i·h₂(k)) mod m.
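The three probe formulas above translate line-for-line into code. A minimal sketch (function names are illustrative; h, h₁, h₂ are assumed to be precomputed hash values):

```python
def linear_probe(h, i, m):
    """Linear probing: (h(k) + i) mod m."""
    return (h + i) % m

def quadratic_probe(h, i, m, c1=1, c2=1):
    """Quadratic probing: (h(k) + c1*i + c2*i^2) mod m."""
    return (h + c1 * i + c2 * i * i) % m

def double_hash_probe(h1, h2, i, m):
    """Double hashing: (h1(k) + i*h2(k)) mod m."""
    return (h1 + i * h2) % m

m = 11
[linear_probe(3, i, m) for i in range(4)]          # → [3, 4, 5, 6]
[quadratic_probe(3, i, m) for i in range(4)]       # → [3, 5, 9, 4]
[double_hash_probe(3, 5, i, m) for i in range(4)]  # → [3, 8, 2, 7]
```

The example sequences show the characteristic shapes: linear walks consecutive slots, quadratic jumps with growing gaps, and double hashing strides by a key-dependent step.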
Cuckoo hashing gives worst‑case O(1) lookup by using two tables and two hash functions; insertion may trigger a cascade of evictions.
Robin Hood hashing swaps a new entry with an existing one that has a smaller probe distance, equalizing probe lengths.
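The Robin Hood swap rule can be sketched on top of linear probing. This is a simplified illustration (names are my own, Python's built-in `hash` stands in for a real hash function): each slot stores its entry's probe distance, and a newcomer evicts any resident that sits closer to its home slot.

```python
def robin_hood_insert(table, key, value, hash_fn):
    """Linear probing where the new entry evicts any resident that is
    closer to its home slot; slots hold (key, value, dist) or None."""
    m = len(table)
    entry = (key, value, 0)
    idx = hash_fn(key) % m
    while True:
        slot = table[idx]
        if slot is None:
            table[idx] = entry
            return
        if slot[0] == entry[0]:               # same key: overwrite
            table[idx] = entry
            return
        if slot[2] < entry[2]:                # resident is "richer": steal its slot
            table[idx], entry = entry, slot   # evicted entry keeps probing
        idx = (idx + 1) % m
        entry = (entry[0], entry[1], entry[2] + 1)

# keys 0 and 8 share home slot 0 in a table of 8 slots
table = [None] * 8
for k, v in [(0, "a"), (1, "b"), (8, "c")]:
    robin_hood_insert(table, k, v, hash)
# key 8 reaches slot 1 at distance 1, evicts key 1 (distance 0),
# and key 1 continues on to slot 2
```

The eviction in the demo is exactly the equalizing behavior described above: no entry ends up much farther from home than its neighbors.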
---
🔄 Key Processes
Insertion (separate chaining)
Compute bucket b = h(k).
Append (k, v) to the list at bucket b.
If the average chain length (i.e., α) exceeds a constant threshold, rehash into a larger table (increase m).
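The three steps above can be sketched as a minimal class (illustrative only; Python's built-in `hash` stands in for a real hash function, and the α > 3 trigger follows the chaining guideline above):

```python
class ChainedHashTable:
    """Minimal separate-chaining map; grows when alpha = n/m exceeds 3."""
    def __init__(self, m=4):
        self.m, self.n = m, 0
        self.buckets = [[] for _ in range(m)]

    def insert(self, key, value):
        b = hash(key) % self.m                     # 1. compute bucket
        for i, (k, _) in enumerate(self.buckets[b]):
            if k == key:                           # existing key: update
                self.buckets[b][i] = (key, value)
                return
        self.buckets[b].append((key, value))       # 2. append to the chain
        self.n += 1
        if self.n / self.m > 3:                    # 3. alpha too high: grow
            self._grow()

    def _grow(self):
        old = [e for bucket in self.buckets for e in bucket]
        self.m *= 2
        self.n = 0
        self.buckets = [[] for _ in range(self.m)]
        for k, v in old:                           # rehash every entry
            self.insert(k, v)

    def get(self, key):
        for k, v in self.buckets[hash(key) % self.m]:
            if k == key:
                return v
        raise KeyError(key)
```

A quick usage check: after 20 insertions into a table that starts at m = 4, the table has grown and every key is still retrievable.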
Insertion (open addressing – linear probing)
Compute h = h(k).
For i = 0,1,2,…
idx = (h + i) mod m.
If slot idx empty or marked tombstone, place (k, v) and stop.
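The linear-probing insert loop above can be sketched over a plain slot array (a simplified illustration; the `TOMBSTONE` sentinel and function name are my own, and Python's built-in `hash` stands in for a real hash function):

```python
TOMBSTONE = ("<deleted>",)   # sentinel marking a deleted slot

def oa_insert(table, key, value):
    """Linear-probing insert into a fixed-size slot array (None = empty)."""
    m = len(table)
    h = hash(key) % m
    for i in range(m):
        idx = (h + i) % m
        slot = table[idx]
        # place into an empty slot, a tombstone, or the key's existing slot
        if slot is None or slot is TOMBSTONE or slot[0] == key:
            table[idx] = (key, value)
            return idx
    raise RuntimeError("table full; rehash needed")

table = [None] * 8
oa_insert(table, 0, "a")   # home slot 0
oa_insert(table, 8, "b")   # collides with key 0, probes to slot 1
```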
Lookup (any method)
Compute b = h(k).
Follow the collision‑resolution rule (list traversal or probing) until key found or an empty slot/list end is reached.
Deletion (open addressing)
Locate key using lookup procedure.
Mark slot as tombstone (not empty) to preserve probe chains.
Optional: periodically rebuild the table, re-inserting live entries so tombstones are purged and probe sequences stay short.
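The lookup-then-tombstone procedure above can be sketched as follows (illustrative names; Python's built-in `hash` stands in for a real hash function). The demo shows why the tombstone matters: a later key in the same probe chain stays reachable.

```python
TOMBSTONE = ("<deleted>",)   # sentinel preserving probe chains

def find_slot(table, key):
    """Return the index of key, or None if absent (stops at first empty slot)."""
    m = len(table)
    h = hash(key) % m
    for i in range(m):
        idx = (h + i) % m
        slot = table[idx]
        if slot is None:                 # truly empty slot ends the chain
            return None
        if slot is not TOMBSTONE and slot[0] == key:
            return idx
    return None

def delete(table, key):
    idx = find_slot(table, key)
    if idx is None:
        raise KeyError(key)
    table[idx] = TOMBSTONE   # tombstone, NOT None: keeps later keys reachable

# demo: keys 0 and 8 both hash to slot 0 in a table of 8 slots;
# key 8 was placed at slot 1 by probing past key 0
table = [None] * 8
table[0], table[1] = (0, "a"), (8, "b")
delete(table, 0)
assert find_slot(table, 8) == 1   # still reachable thanks to the tombstone
```

Had `delete` written `None` instead of `TOMBSTONE`, the lookup for key 8 would stop at slot 0 and wrongly report the key missing.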
Full Rehash (resize)
Allocate new bucket array size m' (often 2·m).
For each existing entry (k, v) in old table: compute new index h'(k) = hash(k) mod m' and insert into new array using the chosen resolution method.
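The resize procedure above, sketched for a linear-probing table (illustrative; Python's built-in `hash` stands in for a real hash function):

```python
def rehash(table):
    """Grow a linear-probing slot array to 2*m and re-insert every live entry."""
    old = table
    m = 2 * len(old)                     # m' = 2*m
    new = [None] * m
    for slot in old:
        if slot is None:
            continue
        key, value = slot
        h = hash(key) % m                # new index under the new modulus
        for i in range(m):               # linear probing in the new array
            idx = (h + i) % m
            if new[idx] is None:
                new[idx] = (key, value)
                break
    return new

old = [(0, "a"), (8, "b"), None, None]   # m = 4, both keys hash to slot 0
new = rehash(old)                        # m' = 8; entries land at slots 0 and 1
```

Every entry must be re-inserted because its index depends on the modulus: a key's bucket under m' generally differs from its bucket under m.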
---
🔍 Key Comparisons
Separate chaining vs. Open addressing
Space: chaining needs extra pointers/containers; open addressing stores only keys/values.
Cache: open addressing has superior locality (sequential memory).
α limit: chaining works for any α; open addressing requires α < 1.
Linear probing vs. Quadratic probing vs. Double hashing
Clustering: linear → primary clustering; quadratic → reduces primary clustering; double hashing → minimizes both primary & secondary clustering.
Complexity: linear is simplest; quadratic needs careful constants c₁,c₂; double hashing needs a second hash that is relatively prime to m.
Cuckoo hashing vs. Robin Hood hashing
Lookup guarantee: cuckoo = worst‑case O(1); Robin Hood = O(1) average, but worst‑case can be longer.
Insertion cost: cuckoo may trigger multiple evictions; Robin Hood swaps once per probe.
---
⚠️ Common Misunderstandings
“Hash tables are always O(1).”
True only in the average case, with low α and a good hash function; the worst case can be O(n).
“Higher α always saves memory without penalty.”
In open addressing, α > 0.75 dramatically increases probe length; in chaining, α > 3 makes lists long and defeats O(1) expectation.
“Any hash function works if it’s fast.”
Speed matters, but uniformity is critical; a fast but clustered hash leads to many collisions.
“Deletion in open addressing leaves a hole that can be ignored.”
Must use tombstones; otherwise later lookups may stop prematurely.
---
🧠 Mental Models / Intuition
Buckets as “rooms” – keys are guests; the hash function hands each guest a room number. Collisions mean multiple guests share a room; chaining puts a list of guests on a coat‑rack, while open addressing asks guests to walk down the hallway until they find an empty room.
Load factor α as “crowd density” – low α = spacious hallway (short walks), high α = crowded hallway (long walks, more chance of bumping into others).
Rehashing = “renovating the building” – when the hallway gets too crowded, you build a bigger building and move every guest to a new room based on the same address rules.
---
🚩 Exceptions & Edge Cases
Prime vs. power‑of‑2 table sizes – Division method works best with a prime m to avoid patterns; multiplication method works well with powers of two.
Quadratic probing may not probe all slots unless m is prime and constants satisfy certain conditions.
Double hashing step size must be non‑zero; often h₂(k) = 1 + (hash₂(k) mod (m‑1)).
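The step-size recipe above can be checked in a few lines (illustrative sketch; Python's built-in `hash` stands in for a real second hash). With prime m = 11, every step 1 … 10 is coprime to m, so each probe sequence visits all slots:

```python
def h2(key, m):
    """Common second-hash recipe: step in 1..m-1, never zero."""
    return 1 + (hash(key) % (m - 1))

m = 11  # prime: every step 1..10 is coprime to m
for k in [0, 10, 21, 99]:
    step = h2(k, m)
    assert 1 <= step < m
    seq = [(hash(k) % m + i * step) % m for i in range(m)]
    assert len(set(seq)) == m   # full-cycle probe: all m slots visited
```

Without the `1 +`, a key with `hash(key) % (m-1) == 0` would get step 0 and probe the same slot forever.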
Cuckoo hashing can enter an infinite loop if a cycle of evictions occurs; implementation typically aborts after a fixed number of moves and triggers a rehash.
---
📍 When to Use Which
Separate chaining – preferred when memory overhead of pointers is acceptable, when deletions are frequent, or when the table may become very full (α > 1).
Open addressing (linear probing) – best for cache‑friendly workloads with moderate load (α ≤ 0.7) and when memory is tight.
Quadratic or double hashing – choose if primary clustering of linear probing hurts performance (e.g., many similar keys).
Cuckoo hashing – use when worst‑case lookup time must be guaranteed O(1) (e.g., real‑time systems).
Robin Hood hashing – useful when you want to minimize variance of probe lengths, leading to more predictable latency.
---
👀 Patterns to Recognize
“Many keys map to the same bucket” → suspect poor hash function or non‑prime m.
Long probe sequences in open addressing → α likely > 0.75 or clustering caused by linear probing.
Buckets with long linked lists → either α is high for chaining or hash distribution is non‑uniform.
Frequent rehashes → either threshold too low or load factor not being monitored.
---
🗂️ Exam Traps
Choosing the wrong m for division method – picking an even m (e.g., 1000) leads to many collisions for keys with common factors; exam answers often include “prime m”.
Confusing average‑case O(1+α) for chaining with O(1) – remember the extra α term; if α=3, expected search ≈ 4 steps.
Assuming linear probing never clusters – primary clustering is a classic distractor; expect a question about its effect on performance.
For double hashing, forgetting to make the second hash relatively prime to m – leads to probe cycles; answer choices may list a step size that can be zero.
Believing deletions automatically shrink the table – resizing down occurs only when α falls below a lower threshold, not after each delete.
---