Search Engine Study Guide
📖 Core Concepts
Search engine – software that returns hyperlinks (and summaries) to web content that matches a user’s query.
Web crawler (spider) – automated program that visits pages, reads HTML/metadata, follows links, and respects robots.txt.
Index – large database mapping keywords (and other tokens) to document identifiers and their locations for fast retrieval.
Query processing – parsing a user’s request, matching keywords against the index, ranking results, and generating snippets.
Ranking – ordering results by relevance, authority, popularity, or paid placement using proprietary algorithms.
Personalization / filter bubble – tailoring results using a user’s location, history, and click behavior, which can limit exposure to diverse information.
---
📌 Must Remember
Crawl policy: stop after a set number of pages, a data-volume cap, or a time limit to cope with the effectively infinite web.
Inverted index: stores term → list of documents for rapid lookup (vs. static keyword hierarchies).
Ranking signals (high‑yield): keyword location (title > heading > body), backlink count/anchor text, freshness, and anti‑spam measures.
Advanced query operators: AND, OR, NOT, proximity search, date‑range filtering.
Submission vs. crawling: Webmasters can submit URLs or a sitemap to accelerate discovery, but major engines will eventually crawl a well‑structured site.
Link‑building risk: Unnatural link patterns can trigger penalties (e.g., Google’s link‑spam algorithms).
Distributed architecture: Thousands of servers, redundancy, and fault tolerance keep query latency in milliseconds.
---
🔄 Key Processes
Crawling
Fetch page → read HTML, CSS, JS, metadata → extract text & outbound links → respect robots.txt.
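The crawl loop above can be sketched as a breadth-first traversal with a page budget. This is a minimal sketch: the robots.txt rules are parsed from an in-memory example rather than fetched over the network, and the page fetch itself is stubbed out.

```python
from urllib.robotparser import RobotFileParser

# Parse an example robots.txt in memory (a real crawler would fetch it).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl that respects robots.txt and a page budget."""
    queue = list(seed_urls)
    seen = set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen or not rp.can_fetch("*", url):
            continue  # skip disallowed or already-visited pages
        seen.add(url)
        # A real crawler would fetch here, extract text and outbound
        # links, and extend the queue with the newly discovered URLs.
    return seen
```

In practice the queue is prioritized (by site importance, recrawl schedule, politeness delays per host) rather than strictly first-in, first-out.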
Indexing
Tokenize text → build inverted index (term → document IDs) → store field info (title, heading, meta).
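The indexing step above can be illustrated with a toy inverted index that also records which field each token came from, so that ranking can later weight title hits above body hits. The document shape and whitespace tokenizer are simplifying assumptions; real indexers normalize, stem, and handle punctuation.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: {"title": str, "body": str}} -> term -> postings.

    Each posting records (doc_id, field, position), so the ranker can
    weight a title match above a body match.
    """
    index = defaultdict(list)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for position, token in enumerate(text.lower().split()):
                index[token].append((doc_id, field, position))
    return index

docs = {
    1: {"title": "web crawlers", "body": "crawlers follow links"},
    2: {"title": "indexing", "body": "an index maps terms to documents"},
}
index = build_index(docs)
```

Storing positions also enables phrase and proximity queries later, at the cost of a larger index.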
Query Processing
Parse query → apply Boolean/proximity operators → retrieve posting lists → intersect/union → rank.
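The intersect/union step can be shown with posting lists as Python sets. This is a deliberate simplification: production engines intersect sorted posting lists with skip pointers rather than hash sets, and support arbitrarily nested expressions.

```python
def evaluate(query, index):
    """Evaluate a simple 'a AND b' / 'a OR b' / 'a NOT b' query
    against an index of term -> set of document IDs."""
    left, op, right = query.split()
    a, b = index.get(left, set()), index.get(right, set())
    if op == "AND":
        return a & b  # intersect posting lists
    if op == "OR":
        return a | b  # union
    if op == "NOT":
        return a - b  # difference
    raise ValueError(f"unknown operator: {op}")

index = {
    "climate": {1, 2, 3},
    "change": {2, 3, 4},
}
```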
Ranking
Score each candidate using factors (keyword location, backlinks, freshness, user signals) → sort descending.
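The scoring step can be sketched as a weighted sum. The field weights, square-root link authority, and freshness decay below are illustrative assumptions, not taken from any real engine's formula.

```python
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def score(doc, terms, backlinks, age_days):
    """Toy relevance score: weighted keyword hits + link authority
    + freshness. All weights are illustrative."""
    keyword_score = sum(
        FIELD_WEIGHTS[field] * text.lower().split().count(term)
        for term in terms
        for field, text in doc.items()
    )
    authority = backlinks ** 0.5              # diminishing returns on links
    freshness = 1.0 / (1.0 + age_days / 30)   # decays over months
    return keyword_score + authority + freshness

def rank(candidates, terms):
    """candidates: [(doc, backlinks, age_days)] -> sorted best-first."""
    return sorted(
        candidates,
        key=lambda c: score(c[0], terms, c[1], c[2]),
        reverse=True,
    )
```

Note how a single title hit (weight 3.0) outweighs two body hits (weight 1.0 each), matching the keyword-location pattern listed later in this guide.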
Result Presentation
Generate snippets showing query terms in context → display ranked list with ads/paid listings.
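Snippet generation can be approximated by excerpting a window of words around the first query-term match, with the hit emphasized the way result pages bold matched terms. The window size and punctuation stripping are assumptions for the sketch.

```python
def snippet(text, term, window=5):
    """Return a short excerpt centred on the first occurrence of
    `term`, with the matched word bolded."""
    words = text.split()
    lowered = [w.lower().strip(".,") for w in words]
    try:
        i = lowered.index(term.lower())
    except ValueError:
        return " ".join(words[: 2 * window])  # fall back to the opening
    start = max(0, i - window)
    end = i + window + 1
    out = words[start:end]
    out[i - start] = f"**{out[i - start]}**"
    return (("… " if start > 0 else "")
            + " ".join(out)
            + (" …" if end < len(words) else ""))
```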
---
🔍 Key Comparisons
Crawler‑Based vs. Human‑Powered
Crawler‑Based: Automated crawling, large coverage, updates via recrawling.
Human‑Powered: Manual URL submission, better for niche topics, limited scalability.
Inverted Index vs. Predefined Keywords
Inverted Index: Dynamic, built from full‑text analysis, supports any term.
Predefined Keywords: Fixed hierarchy, human‑curated, limited flexibility.
Organic Results vs. Paid Listings
Organic: Ranked by relevance/authority, no direct payment.
Paid: Appear higher due to advertising contracts; labeled as ads.
---
⚠️ Common Misunderstandings
“More keywords = better rank.” Over‑stuffing triggers penalties; relevance and placement matter more.
“Submitting a sitemap guarantees instant indexing.” Crawlers still schedule fetches; sitemap only speeds discovery.
“All search engines give identical results.” Different indexes & ranking algorithms produce varied result sets.
“Personalization only improves relevance.” It can create filter bubbles that hide opposing viewpoints.
---
🧠 Mental Models / Intuition
Library Analogy: The crawler is the librarian who brings books (pages) to the back‑room; the index is the card catalog; the query is a patron’s request; the ranking is the librarian’s judgment of which books are most useful now.
Web as a Graph: Pages are nodes, hyperlinks are edges. Authority flows like “vote power” through the graph (think PageRank).
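The "vote power" intuition can be made concrete with a small power-iteration PageRank over an adjacency dict; the 0.85 damping factor is the value commonly cited for the original algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on an adjacency dict {node: [outlinks]}."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if not outs:  # dangling node: spread its rank evenly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
            else:
                for m in outs:
                    new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

# "c" is linked to by both "a" and "b", so it ends up with the most
# authority, and it passes that authority on to "a".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
```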
---
🚩 Exceptions & Edge Cases
Stale Index: A deleted page may still appear in results until the next refresh cycle.
Robots.txt Disallow: Legitimate pages can be omitted from the index if blocked, even if highly relevant.
Local Search Bias: Queries with location intent may prioritize nearby businesses regardless of overall authority.
Legal/Political Censorship: Engines may suppress results to comply with local regulations, skewing relevance.
---
📍 When to Use Which
Choosing a search engine type for a project:
Crawler‑based → need broad, up‑to‑date web coverage.
Human‑powered → niche, curated directories (e.g., specialized industry listings).
Hybrid → combine automated breadth with human quality control.
When to rely on advanced operators:
Use AND/OR/NOT for precise Boolean logic.
Use proximity (e.g., "climate NEAR/5 change") when phrase order is flexible but closeness matters.
When to submit a sitemap:
After a major site redesign or for newly launched sites to accelerate initial crawling.
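The NEAR/5 proximity operator mentioned above can be approximated as "both terms occur within k words of each other, in either order." Exact NEAR semantics vary by engine; this sketch assumes whitespace tokenization and case-insensitive matching.

```python
def near(text, term_a, term_b, k=5):
    """True if term_a and term_b occur within k words of each other,
    in either order — an approximation of a NEAR/k operator."""
    words = [w.lower() for w in text.split()]
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)
```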
---
👀 Patterns to Recognize
Keyword location weighting: Title → heading → meta → body → alt‑text.
Link graph signals: High‑quality inbound links with relevant anchor text → boost authority.
Freshness cue: News queries → newer timestamps rank higher.
Spam patterns: Repeated exact‑match keyword blocks, hidden text, link farms.
---
🗂️ Exam Traps
“More backlinks always outrank.” Ignoring link quality and relevance leads to the wrong answer.
“Robots.txt blocks are never indexed.” Engines may still cache pages from other sources; the statement is too absolute.
“Sitemaps eliminate the need for crawling.” Crawling still occurs; sitemaps only guide it.
“All personalization is user‑controlled.” Many personalization factors are opaque and automatic, not always user‑chosen.
---