Hash Table Study Guide
📖 Core Concepts
Hash table – data structure that implements an associative array (map) by converting a key into an array index with a hash function.
Hash function – deterministic function h(k) = hash(k) mod m that maps any key k to a bucket index 0 … m‑1. Good hash functions give a uniform distribution to keep collisions rare.
Load factor (α) – ratio of stored entries n to bucket count m: α = n / m. Controls space‑time trade‑off.
Collision – two distinct keys produce the same bucket index. Must be resolved by a collision‑resolution strategy.
Separate chaining – each bucket stores a list (or vector) of all keys that hash there.
Open addressing – all keys live in the bucket array; a probe sequence searches for the next free slot.
Expected O(1) – average cost of lookup, insertion, and deletion when α stays within the recommended range; insertion is amortized O(1) once occasional resizing is accounted for.
Worst‑case O(n) – can happen with many collisions or a poor resolution method.
---
📌 Must Remember
Average‑case cost: O(1) for lookup/insert/delete (assuming good hash & α in range).
Worst‑case cost: O(n) if many keys collide or table is full.
Division method: h(k) = k mod m (choose m prime).
Multiplication method: h(k) = ⌊ (k·A mod 1)·m ⌋ (A ≈ (√5−1)/2).
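The two formulas above can be sketched directly for integer keys. This is an illustrative sketch, not a production hash: function names are my own, and the multiplication method uses Knuth's suggested constant A = (√5−1)/2.

```python
import math

def hash_division(k: int, m: int) -> int:
    """Division method: h(k) = k mod m (m should be prime)."""
    return k % m

def hash_multiplication(k: int, m: int) -> int:
    """Multiplication method: h(k) = floor((k*A mod 1) * m)."""
    A = (math.sqrt(5) - 1) / 2   # Knuth's constant, approx 0.618
    frac = (k * A) % 1.0         # fractional part of k*A
    return int(frac * m)

hash_division(123456, 101)        # → 34 (bucket in 0..100)
hash_multiplication(123456, 128)  # bucket in 0..127
```

Note that the multiplication method tolerates a power-of-two m, which is why it pairs well with tables sized 2^p.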
Load‑factor guidelines
Chaining: 1 ≤ α ≤ 3 (performance degrades gracefully).
Open addressing: 0.6 ≤ α ≤ 0.75 (keeps probe length short).
Resize trigger: when α exceeds the upper threshold → allocate new array (usually 2·m) and rehash every entry.
Open‑addressing probe formulas
Linear: index = (h(k) + i) mod m.
Quadratic: index = (h(k) + c₁·i + c₂·i²) mod m.
Double hashing: index = (h₁(k) + i·h₂(k)) mod m.
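The three probe formulas above translate line-for-line into code. A minimal sketch (function names are illustrative; h, h₁, h₂ are assumed to be precomputed hash values):

```python
def linear_probe(h, i, m):
    """Linear probing: (h(k) + i) mod m."""
    return (h + i) % m

def quadratic_probe(h, i, m, c1=1, c2=1):
    """Quadratic probing: (h(k) + c1*i + c2*i^2) mod m."""
    return (h + c1 * i + c2 * i * i) % m

def double_hash_probe(h1, h2, i, m):
    """Double hashing: (h1(k) + i*h2(k)) mod m."""
    return (h1 + i * h2) % m

m = 11
[linear_probe(3, i, m) for i in range(4)]          # → [3, 4, 5, 6]
[quadratic_probe(3, i, m) for i in range(4)]       # → [3, 5, 9, 4]
[double_hash_probe(3, 5, i, m) for i in range(4)]  # → [3, 8, 2, 7]
```

The example sequences show the characteristic shapes: linear walks consecutive slots, quadratic jumps with growing gaps, and double hashing strides by a key-dependent step.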
Cuckoo hashing gives worst‑case O(1) lookup by using two tables and two hash functions; insertion may trigger a cascade of evictions.
Robin Hood hashing swaps a new entry with an existing one that has a smaller probe distance, equalizing probe lengths.
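The Robin Hood swap rule can be sketched on top of linear probing. This is a simplified illustration (names are my own, Python's built-in `hash` stands in for a real hash function): each slot stores its entry's probe distance, and a newcomer evicts any resident that sits closer to its home slot.

```python
def robin_hood_insert(table, key, value, hash_fn):
    """Linear probing where the new entry evicts any resident that is
    closer to its home slot; slots hold (key, value, dist) or None."""
    m = len(table)
    entry = (key, value, 0)
    idx = hash_fn(key) % m
    while True:
        slot = table[idx]
        if slot is None:
            table[idx] = entry
            return
        if slot[0] == entry[0]:               # same key: overwrite
            table[idx] = entry
            return
        if slot[2] < entry[2]:                # resident is "richer": steal its slot
            table[idx], entry = entry, slot   # evicted entry keeps probing
        idx = (idx + 1) % m
        entry = (entry[0], entry[1], entry[2] + 1)

# keys 0 and 8 share home slot 0 in a table of 8 slots
table = [None] * 8
for k, v in [(0, "a"), (1, "b"), (8, "c")]:
    robin_hood_insert(table, k, v, hash)
# key 8 reaches slot 1 at distance 1, evicts key 1 (distance 0),
# and key 1 continues on to slot 2
```

The eviction in the demo is exactly the equalizing behavior described above: no entry ends up much farther from home than its neighbors.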
---
🔄 Key Processes
Insertion (separate chaining)
Compute bucket b = h(k).
Append (k, v) to the list at bucket b.
If the average chain length (i.e., α) exceeds a constant threshold, rehash into a larger table (increase m).
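The three steps above can be sketched as a minimal class (illustrative only; Python's built-in `hash` stands in for a real hash function, and the α > 3 trigger follows the chaining guideline above):

```python
class ChainedHashTable:
    """Minimal separate-chaining map; grows when alpha = n/m exceeds 3."""
    def __init__(self, m=4):
        self.m, self.n = m, 0
        self.buckets = [[] for _ in range(m)]

    def insert(self, key, value):
        b = hash(key) % self.m                     # 1. compute bucket
        for i, (k, _) in enumerate(self.buckets[b]):
            if k == key:                           # existing key: update
                self.buckets[b][i] = (key, value)
                return
        self.buckets[b].append((key, value))       # 2. append to the chain
        self.n += 1
        if self.n / self.m > 3:                    # 3. alpha too high: grow
            self._grow()

    def _grow(self):
        old = [e for bucket in self.buckets for e in bucket]
        self.m *= 2
        self.n = 0
        self.buckets = [[] for _ in range(self.m)]
        for k, v in old:                           # rehash every entry
            self.insert(k, v)

    def get(self, key):
        for k, v in self.buckets[hash(key) % self.m]:
            if k == key:
                return v
        raise KeyError(key)
```

A quick usage check: after 20 insertions into a table that starts at m = 4, the table has grown and every key is still retrievable.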
Insertion (open addressing – linear probing)
Compute h = h(k).
For i = 0,1,2,…
idx = (h + i) mod m.
If slot idx empty or marked tombstone, place (k, v) and stop.
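The linear-probing insert loop above can be sketched over a plain slot array (a simplified illustration; the `TOMBSTONE` sentinel and function name are my own, and Python's built-in `hash` stands in for a real hash function):

```python
TOMBSTONE = ("<deleted>",)   # sentinel marking a deleted slot

def oa_insert(table, key, value):
    """Linear-probing insert into a fixed-size slot array (None = empty)."""
    m = len(table)
    h = hash(key) % m
    for i in range(m):
        idx = (h + i) % m
        slot = table[idx]
        # place into an empty slot, a tombstone, or the key's existing slot
        if slot is None or slot is TOMBSTONE or slot[0] == key:
            table[idx] = (key, value)
            return idx
    raise RuntimeError("table full; rehash needed")

table = [None] * 8
oa_insert(table, 0, "a")   # home slot 0
oa_insert(table, 8, "b")   # collides with key 0, probes to slot 1
```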
Lookup (any method)
Compute b = h(k).
Follow the collision‑resolution rule (list traversal or probing) until key found or an empty slot/list end is reached.
Deletion (open addressing)
Locate key using lookup procedure.
Mark slot as tombstone (not empty) to preserve probe chains.
Optional: periodically rebuild the table, re-inserting live entries so tombstones are purged and probe sequences stay short.
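The lookup-then-tombstone procedure above can be sketched as follows (illustrative names; Python's built-in `hash` stands in for a real hash function). The demo shows why the tombstone matters: a later key in the same probe chain stays reachable.

```python
TOMBSTONE = ("<deleted>",)   # sentinel preserving probe chains

def find_slot(table, key):
    """Return the index of key, or None if absent (stops at first empty slot)."""
    m = len(table)
    h = hash(key) % m
    for i in range(m):
        idx = (h + i) % m
        slot = table[idx]
        if slot is None:                 # truly empty slot ends the chain
            return None
        if slot is not TOMBSTONE and slot[0] == key:
            return idx
    return None

def delete(table, key):
    idx = find_slot(table, key)
    if idx is None:
        raise KeyError(key)
    table[idx] = TOMBSTONE   # tombstone, NOT None: keeps later keys reachable

# demo: keys 0 and 8 both hash to slot 0 in a table of 8 slots;
# key 8 was placed at slot 1 by probing past key 0
table = [None] * 8
table[0], table[1] = (0, "a"), (8, "b")
delete(table, 0)
assert find_slot(table, 8) == 1   # still reachable thanks to the tombstone
```

Had `delete` written `None` instead of `TOMBSTONE`, the lookup for key 8 would stop at slot 0 and wrongly report the key missing.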
Full Rehash (resize)
Allocate new bucket array size m' (often 2·m).
For each existing entry (k, v) in old table: compute new index h'(k) = hash(k) mod m' and insert into new array using the chosen resolution method.
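The resize procedure above, sketched for a linear-probing table (illustrative; Python's built-in `hash` stands in for a real hash function):

```python
def rehash(table):
    """Grow a linear-probing slot array to 2*m and re-insert every live entry."""
    old = table
    m = 2 * len(old)                     # m' = 2*m
    new = [None] * m
    for slot in old:
        if slot is None:
            continue
        key, value = slot
        h = hash(key) % m                # new index under the new modulus
        for i in range(m):               # linear probing in the new array
            idx = (h + i) % m
            if new[idx] is None:
                new[idx] = (key, value)
                break
    return new

old = [(0, "a"), (8, "b"), None, None]   # m = 4, both keys hash to slot 0
new = rehash(old)                        # m' = 8; entries land at slots 0 and 1
```

Every entry must be re-inserted because its index depends on the modulus: a key's bucket under m' generally differs from its bucket under m.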
---
🔍 Key Comparisons
Separate chaining vs. Open addressing
Space: chaining needs extra pointers/containers; open addressing stores only keys/values.
Cache: open addressing has superior locality (sequential memory).
α limit: chaining works for any α; open addressing requires α < 1.
Linear probing vs. Quadratic probing vs. Double hashing
Clustering: linear → primary clustering; quadratic → reduces primary clustering; double hashing → minimizes both primary & secondary clustering.
Complexity: linear is simplest; quadratic needs careful constants c₁,c₂; double hashing needs a second hash that is relatively prime to m.
Cuckoo hashing vs. Robin Hood hashing
Lookup guarantee: cuckoo = worst‑case O(1); Robin Hood = O(1) average, but worst‑case can be longer.
Insertion cost: cuckoo may trigger multiple evictions; Robin Hood swaps once per probe.
---
⚠️ Common Misunderstandings
“Hash tables are always O(1).”
True only in the average case, with low α and a good hash function; the worst case can be O(n).
“Higher α always saves memory without penalty.”
In open addressing, α > 0.75 dramatically increases probe length; in chaining, α > 3 makes lists long and defeats O(1) expectation.
“Any hash function works if it’s fast.”
Speed matters, but uniformity is critical; a fast but clustered hash leads to many collisions.
“Deletion in open addressing leaves a hole that can be ignored.”
Must use tombstones; otherwise later lookups may stop prematurely.
---
🧠 Mental Models / Intuition
Buckets as “rooms” – keys are guests; the hash function hands each guest a room number. Collisions mean multiple guests share a room; chaining puts a list of guests on a coat‑rack, while open addressing asks guests to walk down the hallway until they find an empty room.
Load factor α as “crowd density” – low α = spacious hallway (short walks), high α = crowded hallway (long walks, more chance of bumping into others).
Rehashing = “renovating the building” – when the hallway gets too crowded, you build a bigger building and move every guest to a new room based on the same address rules.
---
🚩 Exceptions & Edge Cases
Prime vs. power‑of‑2 table sizes – Division method works best with a prime m to avoid patterns; multiplication method works well with powers of two.
Quadratic probing may not probe all slots unless m is prime and constants satisfy certain conditions.
Double hashing step size must be non‑zero; often h₂(k) = 1 + (hash₂(k) mod (m‑1)).
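The step-size recipe above can be checked in a few lines (illustrative sketch; Python's built-in `hash` stands in for a real second hash). With prime m = 11, every step 1 … 10 is coprime to m, so each probe sequence visits all slots:

```python
def h2(key, m):
    """Common second-hash recipe: step in 1..m-1, never zero."""
    return 1 + (hash(key) % (m - 1))

m = 11  # prime: every step 1..10 is coprime to m
for k in [0, 10, 21, 99]:
    step = h2(k, m)
    assert 1 <= step < m
    seq = [(hash(k) % m + i * step) % m for i in range(m)]
    assert len(set(seq)) == m   # full-cycle probe: all m slots visited
```

Without the `1 +`, a key with `hash(key) % (m-1) == 0` would get step 0 and probe the same slot forever.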
Cuckoo hashing can enter an infinite loop if a cycle of evictions occurs; implementation typically aborts after a fixed number of moves and triggers a rehash.
---
📍 When to Use Which
Separate chaining – preferred when memory overhead of pointers is acceptable, when deletions are frequent, or when the table may become very full (α > 1).
Open addressing (linear probing) – best for cache‑friendly workloads with moderate load (α ≤ 0.7) and when memory is tight.
Quadratic or double hashing – choose if primary clustering of linear probing hurts performance (e.g., many similar keys).
Cuckoo hashing – use when worst‑case lookup time must be guaranteed O(1) (e.g., real‑time systems).
Robin Hood hashing – useful when you want to minimize variance of probe lengths, leading to more predictable latency.
---
👀 Patterns to Recognize
“Many keys map to the same bucket” → suspect poor hash function or non‑prime m.
Long probe sequences in open addressing → α likely > 0.75 or clustering caused by linear probing.
Buckets with long linked lists → either α is high for chaining or hash distribution is non‑uniform.
Frequent rehashes → either threshold too low or load factor not being monitored.
---
🗂️ Exam Traps
Choosing the wrong m for division method – picking an even m (e.g., 1000) leads to many collisions for keys with common factors; exam answers often include “prime m”.
Confusing average‑case O(1+α) for chaining with O(1) – remember the extra α term; if α=3, expected search ≈ 4 steps.
Assuming linear probing never clusters – primary clustering is a classic distractor; expect a question about its effect on performance.
For double hashing, forgetting to make the second hash relatively prime to m – leads to probe cycles; answer choices may list a step size that can be zero.
Believing deletions automatically shrink the table – resizing down occurs only when α falls below a lower threshold, not after each delete.
---