Search Engine Study Guide
📖 Core Concepts
Search engine – software that returns hyperlinks (and summaries) to web content that matches a user’s query.
Web crawler (spider) – automated program that visits pages, reads HTML/metadata, follows links, and respects robots.txt.
Index – large database mapping keywords (and other tokens) to document identifiers and their locations for fast retrieval.
Query processing – parsing a user’s request, matching keywords against the index, ranking results, and generating snippets.
Ranking – ordering results by relevance, authority, popularity, or paid placement using proprietary algorithms.
Personalization / filter bubble – tailoring results using a user’s location, history, and click behavior, which can limit exposure to diverse information.
---
📌 Must Remember
Crawl policy: stop after a set number of pages, a data-volume cap, or a time limit to cope with the effectively infinite web.
Inverted index: stores term → list of documents for rapid lookup (vs. static keyword hierarchies).
Ranking signals (high‑yield): keyword location (title > heading > body), backlink count/anchor text, freshness, and anti‑spam measures.
Advanced query operators: AND, OR, NOT, proximity search, date‑range filtering.
Submission vs. crawling: Webmasters can submit URLs or a sitemap to accelerate discovery, but major engines will eventually crawl a well‑structured site.
Link‑building risk: Unnatural link patterns can trigger penalties (e.g., Google’s link‑spam algorithms).
Distributed architecture: Thousands of servers, redundancy, and fault tolerance keep query latency in milliseconds.
---
🔄 Key Processes
Crawling
Fetch page → read HTML, CSS, JS, metadata → extract text & outbound links → respect robots.txt.
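The crawl loop above can be sketched as a breadth-first traversal with a page budget. This is a minimal sketch: the robots.txt rules are parsed from an in-memory example rather than fetched over the network, and the page fetch itself is stubbed out.

```python
from urllib.robotparser import RobotFileParser

# Parse an example robots.txt in memory (a real crawler would fetch it).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl that respects robots.txt and a page budget."""
    queue = list(seed_urls)
    seen = set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen or not rp.can_fetch("*", url):
            continue  # skip disallowed or already-visited pages
        seen.add(url)
        # A real crawler would fetch here, extract text and outbound
        # links, and extend the queue with the newly discovered URLs.
    return seen
```

In practice the queue is prioritized (by site importance, recrawl schedule, politeness delays per host) rather than strictly first-in, first-out.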
Indexing
Tokenize text → build inverted index (term → document IDs) → store field info (title, heading, meta).
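The indexing step above can be illustrated with a toy inverted index that also records which field each token came from, so that ranking can later weight title hits above body hits. The document shape and whitespace tokenizer are simplifying assumptions; real indexers normalize, stem, and handle punctuation.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: {"title": str, "body": str}} -> term -> postings.

    Each posting records (doc_id, field, position), so the ranker can
    weight a title match above a body match.
    """
    index = defaultdict(list)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for position, token in enumerate(text.lower().split()):
                index[token].append((doc_id, field, position))
    return index

docs = {
    1: {"title": "web crawlers", "body": "crawlers follow links"},
    2: {"title": "indexing", "body": "an index maps terms to documents"},
}
index = build_index(docs)
```

Storing positions also enables phrase and proximity queries later, at the cost of a larger index.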
Query Processing
Parse query → apply Boolean/proximity operators → retrieve posting lists → intersect/union → rank.
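The intersect/union step can be shown with posting lists as Python sets. This is a deliberate simplification: production engines intersect sorted posting lists with skip pointers rather than hash sets, and support arbitrarily nested expressions.

```python
def evaluate(query, index):
    """Evaluate a simple 'a AND b' / 'a OR b' / 'a NOT b' query
    against an index of term -> set of document IDs."""
    left, op, right = query.split()
    a, b = index.get(left, set()), index.get(right, set())
    if op == "AND":
        return a & b  # intersect posting lists
    if op == "OR":
        return a | b  # union
    if op == "NOT":
        return a - b  # difference
    raise ValueError(f"unknown operator: {op}")

index = {
    "climate": {1, 2, 3},
    "change": {2, 3, 4},
}
```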
Ranking
Score each candidate using factors (keyword location, backlinks, freshness, user signals) → sort descending.
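The scoring step can be sketched as a weighted sum. The field weights, square-root link authority, and freshness decay below are illustrative assumptions, not taken from any real engine's formula.

```python
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def score(doc, terms, backlinks, age_days):
    """Toy relevance score: weighted keyword hits + link authority
    + freshness. All weights are illustrative."""
    keyword_score = sum(
        FIELD_WEIGHTS[field] * text.lower().split().count(term)
        for term in terms
        for field, text in doc.items()
    )
    authority = backlinks ** 0.5              # diminishing returns on links
    freshness = 1.0 / (1.0 + age_days / 30)   # decays over months
    return keyword_score + authority + freshness

def rank(candidates, terms):
    """candidates: [(doc, backlinks, age_days)] -> sorted best-first."""
    return sorted(
        candidates,
        key=lambda c: score(c[0], terms, c[1], c[2]),
        reverse=True,
    )
```

Note how a single title hit (weight 3.0) outweighs two body hits (weight 1.0 each), matching the keyword-location pattern listed later in this guide.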
Result Presentation
Generate snippets showing query terms in context → display ranked list with ads/paid listings.
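Snippet generation can be approximated by excerpting a window of words around the first query-term match, with the hit emphasized the way result pages bold matched terms. The window size and punctuation stripping are assumptions for the sketch.

```python
def snippet(text, term, window=5):
    """Return a short excerpt centred on the first occurrence of
    `term`, with the matched word bolded."""
    words = text.split()
    lowered = [w.lower().strip(".,") for w in words]
    try:
        i = lowered.index(term.lower())
    except ValueError:
        return " ".join(words[: 2 * window])  # fall back to the opening
    start = max(0, i - window)
    end = i + window + 1
    out = words[start:end]
    out[i - start] = f"**{out[i - start]}**"
    return (("… " if start > 0 else "")
            + " ".join(out)
            + (" …" if end < len(words) else ""))
```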
---
🔍 Key Comparisons
Crawler‑Based vs. Human‑Powered
Crawler‑Based: Automated crawling, large coverage, updates via recrawling.
Human‑Powered: Manual URL submission, better for niche topics, limited scalability.
Inverted Index vs. Predefined Keywords
Inverted Index: Dynamic, built from full‑text analysis, supports any term.
Predefined Keywords: Fixed hierarchy, human‑curated, limited flexibility.
Organic Results vs. Paid Listings
Organic: Ranked by relevance/authority, no direct payment.
Paid: Appear higher due to advertising contracts; labeled as ads.
---
⚠️ Common Misunderstandings
“More keywords = better rank.” Over‑stuffing triggers penalties; relevance and placement matter more.
“Submitting a sitemap guarantees instant indexing.” Crawlers still schedule fetches; sitemap only speeds discovery.
“All search engines give identical results.” Different indexes & ranking algorithms produce varied result sets.
“Personalization only improves relevance.” It can create filter bubbles that hide opposing viewpoints.
---
🧠 Mental Models / Intuition
Library Analogy: The crawler is the librarian who brings books (pages) to the back‑room; the index is the card catalog; the query is a patron’s request; the ranking is the librarian’s judgment of which books are most useful now.
Web as a Graph: Pages are nodes, hyperlinks are edges. Authority flows like “vote power” through the graph (think PageRank).
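The "vote power" intuition can be made concrete with a small power-iteration PageRank over an adjacency dict; the 0.85 damping factor is the value commonly cited for the original algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on an adjacency dict {node: [outlinks]}."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if not outs:  # dangling node: spread its rank evenly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
            else:
                for m in outs:
                    new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

# "c" is linked to by both "a" and "b", so it ends up with the most
# authority, and it passes that authority on to "a".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
```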
---
🚩 Exceptions & Edge Cases
Stale Index: A deleted page may still appear in results until the next refresh cycle.
Robots.txt Disallow: Legitimate pages can be omitted from the index if blocked, even if highly relevant.
Local Search Bias: Queries with location intent may prioritize nearby businesses regardless of overall authority.
Legal/Political Censorship: Engines may suppress results to comply with local regulations, skewing relevance.
---
📍 When to Use Which
Choosing a search engine type for a project:
Crawler‑based → need broad, up‑to‑date web coverage.
Human‑powered → niche, curated directories (e.g., specialized industry listings).
Hybrid → combine automated breadth with human quality control.
When to rely on advanced operators:
Use AND/OR/NOT for precise Boolean logic.
Use proximity (e.g., "climate NEAR/5 change") when phrase order is flexible but closeness matters.
When to submit a sitemap:
After a major site redesign or for newly launched sites to accelerate initial crawling.
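The NEAR/5 proximity operator mentioned above can be approximated as "both terms occur within k words of each other, in either order." Exact NEAR semantics vary by engine; this sketch assumes whitespace tokenization and case-insensitive matching.

```python
def near(text, term_a, term_b, k=5):
    """True if term_a and term_b occur within k words of each other,
    in either order — an approximation of a NEAR/k operator."""
    words = [w.lower() for w in text.split()]
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)
```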
---
👀 Patterns to Recognize
Keyword location weighting: Title → heading → meta → body → alt‑text.
Link graph signals: High‑quality inbound links with relevant anchor text → boost authority.
Freshness cue: News queries → newer timestamps rank higher.
Spam patterns: Repeated exact‑match keyword blocks, hidden text, link farms.
---
🗂️ Exam Traps
“More backlinks always outrank.” Ignoring link quality and relevance leads to the wrong answer.
“Robots.txt blocks are never indexed.” Engines may still cache pages from other sources; the statement is too absolute.
“Sitemaps eliminate the need for crawling.” Crawling still occurs; sitemaps only guide it.
“All personalization is user‑controlled.” Many personalization factors are opaque and automatic, not always user‑chosen.
---