Subjects/Science/Computer and Information Science/Computer Science/Hash function

Hash function Study Guide

Study Guide

📖 Core Concepts Hash Function – deterministic algorithm that maps data of any size to a fixed‑size hash value (digest). Hash Table – array indexed by hash values; enables near‑constant‑time lookup, insertion, deletion. Collision – two distinct keys produce the same hash value; must be resolved. Uniformity – ideal hash distributes keys evenly across the table, giving each slot probability $1/m$ (where m = #slots). Determinism – same input always yields the same output; essential for retrieval. Universal Hashing – pick a hash function randomly from a family; any two keys collide with probability $1/m$. Cryptographic vs. Non‑cryptographic – cryptographic hashes emphasize security (pre‑image, collision resistance); non‑cryptographic prioritize speed and low collision rates for data structures. 📌 Must Remember Division hashing: $h(K) = K \bmod M$, choose M prime ≈ table size. Multiplicative hashing: $h(K)=\left\lfloor\frac{(a·K \bmod 2^{w})}{2^{w-m}}\right\rfloor$, a odd, w word size, m output bits. Chained hashing – each bucket holds a linked list of colliding keys. Open addressing – probe sequence (linear, quadratic, double hashing) finds the next empty slot. Universal hash collision probability: $1/m$. Rolling hash update (Rabin–Karp): $$h{new}= (h{old} - x{old}\cdot a^{k-1})·a + x{new} \pmod{p}$$ Character folding (prime‑multiplier): hash = hash P + char, where P is a sizable prime. Polynomial string hash: $H = \sum{i=0}^{k-1} xi a^{i} \pmod{p}$, a prime > alphabet size. 🔄 Key Processes Insert into a hash table Compute hash code → index. If bucket empty → store. If occupied → apply collision‑resolution (chain or probing). Open addressing probing (linear example) $i = 0$; while slot $(h(K)+i) \bmod m$ occupied, set $i \leftarrow i+1$. Insert when an empty slot is found. Universal hashing selection Choose random parameters (e.g., a, b) from the family $h{a,b}(K) = ((aK+b) \bmod p) \bmod m$. String hashing with folding Initialise hash = 0. For each character c: hash = (hash P + ord(c)) (modulo if needed). Rolling hash computation Compute initial hash for first k characters. For each slide: subtract contribution of outgoing char, multiply by a, add incoming char, reduce modulo p. 🔍 Key Comparisons Chained vs. Open Addressing Chaining: unlimited bucket size, easy deletions, extra memory for pointers. Open addressing: all data stored in table, cache‑friendly, requires careful probing to avoid clustering. Division vs. Multiplicative Hashing Division: simple mod M; quality depends heavily on choice of prime M. Multiplicative: spreads bits more uniformly, less sensitive to poor M choice; uses bit shifts instead of division. Simple character folding vs. Polynomial (radix) hashing Folding: fast, but may retain ASCII bit‑clustering. Polynomial: treats each position with a different weight, better for non‑uniform character distributions. ⚠️ Common Misunderstandings “Hash functions must be cryptographically strong.” – For hash tables you only need speed and uniformity; cryptographic strength adds unnecessary overhead. “Using only first/last characters is sufficient.” – Ignores most of the key; leads to many collisions unless ignored parts are constant. “Collisions mean failure.” – Collisions are inevitable; proper resolution guarantees correctness. “Universal hashing guarantees no collisions.” – It only reduces expected collision probability to $1/m$; collisions can still occur. 🧠 Mental Models / Intuition “Hash as a random dice roll.” – Imagine each key throws a fair m-sided die; uniform hashing means each side (slot) is equally likely. “Buckets as mailboxes.” – The hash value is the mailbox number; collisions mean two letters in the same box, requiring a list (chaining) or moving to the next box (probing). “Rolling hash as a conveyor belt.” – When the window moves one step, you drop the oldest item and add a new one, updating the total in constant time. 🚩 Exceptions & Edge Cases Dynamic resizing: When table grows, recompute hash codes or use minimal movement schemes to avoid moving most entries. Bad key distribution: If keys are already clustered (e.g., many share low bits), simple truncation (least‑significant bits) yields many collisions. All‑zero or constant fields: Can be safely ignored in custom hash functions; otherwise they waste computation. Adversarial inputs: Uniform hashing assumption may be broken; universal hashing mitigates but does not eliminate risk. 📍 When to Use Which Use chaining when: expected load factor > 0.7, deletions frequent, memory for pointers is acceptable. Use open addressing when: memory must stay compact, cache performance is critical, deletions are rare (or use lazy deletion). Division hashing for quick prototypes or when table size is a prime close to a power of two. Multiplicative hashing for high‑performance integer keys on modern CPUs (avoids costly division). Polynomial/radix hashing for strings when character distribution is highly non‑uniform. Rolling hash (Rabin–Karp) when you need to compare many substrings (e.g., plagiarism detection). 👀 Patterns to Recognize Clustered low‑order bits → suspect need for multiplicative or folding with prime multiplier. Repeated collisions in a single bucket → likely poor hash function or non‑uniform key set; consider universal hashing or a different method. Long probe sequences → table too full or poor probing strategy; resize or switch to chaining. Identical prefixes/suffixes in many keys → avoid hash functions that only look at ends; use full‑character folding. 🗂️ Exam Traps “Pick the first prime number larger than the table size for division hashing.” – The prime should be close to the table size, not necessarily larger; a larger prime can waste space. “Open addressing never needs a secondary hash.” – Double hashing (a secondary hash) is essential to avoid primary clustering in linear probing. “Universal hashing eliminates all collisions.” – It only reduces expected collision probability; collisions still occur. “Cryptographic hash functions are the best choice for hash tables.” – They are slower; non‑cryptographic functions with good uniformity are preferred. “If the average‑case lookup is $O(1)$, worst‑case can be ignored.” – In adversarial settings or poorly chosen hash functions, worst‑case $O(n)$ can appear on exams; know the distinction. --- Study tip: Memorize the core formulas (division, multiplicative, polynomial, rolling update) and when each collision‑resolution strategy shines. Practice converting a sample key set into hash indices using at least two different methods to cement the intuition. Good luck!

Or, immediately create your own study flashcards:

Upload a PDF.
Master Study Materials.

Start learning in seconds

Drop your PDFs here or