Web Search Engine Study Guide
📖 Core Concepts
Search Engine – Software that returns hyperlinks (and snippets) to web content matching a user’s query.
Crawler (Spider/Ant) – Automated agent that visits pages, respects robots.txt, and collects titles, headings, HTML, CSS, and JavaScript.
Index – Giant database linking tokens (words, phrases) to the URLs where they appear; enables fast lookup without hitting the live web.
Query Processing – Parse → normalize → look up tokens in the index → score → rank → display results.
Ranking – Ordering of results by a relevance score that mixes keyword relevance, link‑based authority, freshness, user‑engagement, and machine‑learning signals.
Inverted Index – Data structure mapping each term to the list of documents containing it (vs. pre‑defined hierarchical keyword catalogs).
Personalization / Filter Bubble – Results may be tweaked by location, past clicks, and search history, potentially limiting exposure to diverse information.
---
📌 Must Remember
Crawl Policy – Limits crawling by page count, data volume, or time to keep the process tractable.
Indexing Rule – Search engines retrieve from the index, not the live web, at query time.
Ranking Signals (high‑yield):
Keyword Frequency & Placement – Terms in the title, <h1>, and meta tags carry more weight than terms in body text.
Link‑Based Authority – Inbound links = votes; quality matters more than quantity.
Freshness – Newer pages get a temporary boost for time‑sensitive queries.
User Engagement – Click‑through rate (CTR) & dwell time influence score.
Geography – User and server location affect ordering for local queries.
Spam Penalties – Keyword stuffing, hidden text, cloaking, link farms, duplicate content → rank drop.
Advanced Query Operators – AND, OR, NOT, proximity (e.g., "climate NEAR change"), date‑range filters.
Submission – Sitemaps accelerate discovery; manual one‑page submission is optional because crawlers eventually find well‑structured sites.
---
🔄 Key Processes
Crawling
Start with seed URLs → follow outbound links → obey robots.txt → respect crawl policy → store fetched content.
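The crawl loop above can be sketched as a breadth-first traversal. This is a minimal illustration, not a real crawler: the link graph, the disallowed set (standing in for robots.txt), and the `max_pages` budget (standing in for the crawl policy) are all hypothetical stand-ins.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for the live web.
LINK_GRAPH = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": [],
    "d.com": ["a.com"],
}
DISALLOWED = {"d.com"}  # pages a robots.txt would block (hypothetical)

def crawl(seeds, max_pages=10):
    """Breadth-first crawl: follow outbound links, skip disallowed
    pages, stop when the crawl-policy page budget is exhausted."""
    frontier, visited = deque(seeds), []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited or url in DISALLOWED:
            continue
        visited.append(url)                      # "store fetched content"
        frontier.extend(LINK_GRAPH.get(url, []))  # follow outbound links
    return visited

print(crawl(["a.com"]))  # → ['a.com', 'b.com', 'c.com']
```

Note that `d.com` is discovered but never fetched — exactly how a robots.txt disallow behaves.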
Indexing
Extract tokens → map each token to document IDs (inverted index) → store metadata (title, headings, URL).
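The token-to-document mapping can be shown with a toy inverted index. The mini-corpus is invented for illustration; real indexes also store positions, metadata, and compression.

```python
from collections import defaultdict

# Hypothetical mini-corpus: document ID → page text.
DOCS = {
    1: "climate change and policy",
    2: "climate science news",
    3: "policy news digest",
}

def build_index(docs):
    """Map each token to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index(DOCS)
print(index["climate"])  # → [1, 2]
print(index["news"])     # → [2, 3]
```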
Query Processing Flow
User types query → parser normalizes (lower‑case, stop‑word removal) → lookup tokens in index → retrieve candidate docs → compute relevance score → rank → present snippets.
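The normalize → lookup → retrieve steps can be sketched as follows; the stop-word list and the prebuilt index here are hypothetical, and scoring/snippets are omitted.

```python
STOP_WORDS = {"the", "a", "and", "of"}

# Hypothetical prebuilt inverted index (term → document IDs).
INDEX = {
    "climate": {1, 2},
    "change": {1, 4},
    "policy": {1, 3},
}

def process_query(query):
    """Parse → normalize (lower-case, drop stop words) → look up each
    token in the index → intersect postings to get candidate docs."""
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    candidates = None
    for token in tokens:
        postings = INDEX.get(token, set())
        candidates = postings if candidates is None else candidates & postings
    return sorted(candidates or [])

print(process_query("Climate Change and the Policy"))  # → [1]
```

Note that the query never touches the live web — only the index (the "Indexing Rule" above).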
Ranking Algorithm (simplified)
$$\text{Score} = w_1\cdot\text{KeywordWeight} + w_2\cdot\text{LinkAuthority} + w_3\cdot\text{Freshness} + w_4\cdot\text{Engagement} + \dots$$
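The weighted sum can be made concrete with made-up weights and signal values (real engines tune hundreds of signals; these numbers are purely illustrative).

```python
# Hypothetical weights w1..w4 for the simplified scoring formula.
WEIGHTS = {"keyword": 0.4, "links": 0.3, "freshness": 0.2, "engagement": 0.1}

def score(signals):
    """Score = w1*KeywordWeight + w2*LinkAuthority + w3*Freshness + w4*Engagement."""
    return sum(WEIGHTS[name] * value for name, value in signals.items())

page_a = {"keyword": 0.9, "links": 0.2, "freshness": 0.5, "engagement": 0.4}
page_b = {"keyword": 0.5, "links": 0.9, "freshness": 0.8, "engagement": 0.6}

# A strong link profile can outrank a heavier keyword match.
print(score(page_a) < score(page_b))  # → True
```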
Spam Detection Loop
Scan new pages for stuffing, hidden text, cloaking → flag → apply penalty or de‑index.
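One stuffing check from that loop can be sketched as a simple term-frequency ratio; the threshold is a hypothetical cutoff, and real detectors combine many such signals.

```python
def keyword_stuffing_ratio(text, term):
    """Fraction of tokens equal to the given term; an abnormally
    high ratio is a classic keyword-stuffing signal."""
    tokens = text.lower().split()
    return tokens.count(term.lower()) / len(tokens) if tokens else 0.0

STUFFING_THRESHOLD = 0.3  # hypothetical cutoff

normal = "cheap flights to paris with flexible dates"
spammy = "cheap flights cheap flights cheap flights book cheap flights"

print(keyword_stuffing_ratio(normal, "cheap") > STUFFING_THRESHOLD)  # → False
print(keyword_stuffing_ratio(spammy, "cheap") > STUFFING_THRESHOLD)  # → True
```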
---
🔍 Key Comparisons
Crawler‑Based vs. Human‑Powered vs. Hybrid
Crawler‑Based: Automated spiders, continuous refresh.
Human‑Powered: Manual submissions, limited scale.
Hybrid: Combines automated crawling with curated human input.
Inverted Index vs. Predefined Keywords
Inverted Index: Built from full‑text analysis; adapts automatically.
Predefined Keywords: Fixed taxonomy programmed by humans; slower to evolve.
Organic Result vs. Paid Listing
Organic: Ranked by relevance signals.
Paid: Advertisers bid; may appear above or alongside organic results.
---
⚠️ Common Misunderstandings
“Search hits the live web” – Wrong; the engine queries its index, which may be out of date.
“More keyword repetitions = higher rank” – Over‑use triggers keyword‑stuffing penalties.
“All links are equal votes” – Links from reputable, high‑authority sites carry far more weight.
“Submitting a URL guarantees instant indexing” – Submission speeds discovery but the page still goes through crawling and indexing pipelines.
---
🧠 Mental Models / Intuition
“Library Catalog” Analogy – The index is a library’s card catalog; you look up a term, get a list of “books” (URLs), then decide which to read.
“Vote + Weight” Model – Each inbound link is a vote; the voter’s own authority acts as the weight of that vote.
“Signal Fusion” – Think of ranking as mixing several ingredients (keywords, links, freshness, behavior) into a smoothie; the more balanced the mix, the tastier (higher) the result.
---
🚩 Exceptions & Edge Cases
Dead Links – Index may still contain URLs of pages that have disappeared until the next crawl refreshes them.
Local Search Boost – A page far from the user’s location may be outranked by a closer, less‑authoritative page for location‑sensitive queries.
Temporal Boost – Freshness helps for news queries but can be a disadvantage for evergreen content if over‑emphasized.
Link‑Scheme Penalties – Even a few low‑quality links from a link farm can trigger a site‑wide rank drop.
---
📍 When to Use Which
Boolean Operators – Use AND to narrow, OR to broaden, NOT to exclude irrelevant terms.
Proximity Search – When the exact phrase order is not required but the terms must be close (e.g., “climate NEAR change”).
Date‑Range Filter – For time‑sensitive queries (news, product releases).
Site‑Specific Search (site:example.com) – When you know the domain that should contain the answer.
Image/Video Filters – When the query explicitly requests visual media.
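The Boolean operators above map directly onto set operations over postings lists. The postings data here is invented for illustration.

```python
# Hypothetical postings lists: term → set of matching document IDs.
POSTINGS = {
    "climate": {1, 2, 3},
    "change": {2, 3, 4},
    "hoax": {4, 5},
}

def and_query(a, b): return POSTINGS[a] & POSTINGS[b]  # AND narrows
def or_query(a, b):  return POSTINGS[a] | POSTINGS[b]  # OR broadens
def not_query(a, b): return POSTINGS[a] - POSTINGS[b]  # NOT excludes

print(sorted(and_query("climate", "change")))  # → [2, 3]
print(sorted(or_query("climate", "hoax")))     # → [1, 2, 3, 4, 5]
print(sorted(not_query("climate", "hoax")))    # → [1, 2, 3]
```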
---
👀 Patterns to Recognize
High‑Authority Signals – Presence of the query term in <title> or <h1> + many reputable inbound links → likely top result.
Spam Patterns – Excessive keyword repetition, hidden text (same color as background), many low‑quality inbound links from the same domain cluster.
Freshness Spike – Sudden surge of new pages for a breaking news term → expect a temporary ranking shift.
Geographic Cue – Local business listings often appear with a map snippet when the query includes a location.
---
🗂️ Exam Traps
Distractor: “The more pages a crawler visits, the better the ranking.” – Crawling volume alone does not affect rank; relevance signals do.
Distractor: “All search engines use the same ranking algorithm.” – Algorithms vary by engine and evolve over time.
Distractor: “Submitting a sitemap guarantees a higher rank.” – Sitemaps only help discovery; ranking still depends on quality signals.
Distractor: “Keyword frequency is the only factor.” – Ignoring link authority, freshness, and user‑engagement leads to an incomplete answer.
Distractor: “Personalization always improves relevance.” – Can create filter bubbles; exam may ask about the downside.
---