Web Search Engine Study Guide
📖 Core Concepts
Search Engine – Software that returns hyperlinks (and snippets) to web content matching a user’s query.
Crawler (Spider/Ant) – Automated agent that visits pages, respects robots.txt, and collects titles, headings, HTML, CSS, and JavaScript.
Index – Giant database linking tokens (words, phrases) to the URLs where they appear; enables fast lookup without hitting the live web.
Query Processing – Parse → normalize → look up tokens in the index → score → rank → display results.
Ranking – Ordering of results by a relevance score that mixes keyword relevance, link‑based authority, freshness, user‑engagement, and machine‑learning signals.
Inverted Index – Data structure mapping each term to the list of documents containing it (vs. pre‑defined hierarchical keyword catalogs).
Personalization / Filter Bubble – Results may be tweaked by location, past clicks, and search history, potentially limiting exposure to diverse information.
---
📌 Must Remember
Crawl Policy – Limits crawling by page count, data volume, or time to keep the process tractable.
Indexing Rule – Search engines retrieve from the index, not the live web, at query time.
Ranking Signals (high‑yield):
Keyword Frequency & Placement – Terms in the title, <h1>, and meta tags carry more weight than terms in body text.
Link‑Based Authority – Inbound links = votes; quality matters more than quantity.
Freshness – Newer pages get a temporary boost for time‑sensitive queries.
User Engagement – Click‑through rate (CTR) & dwell time influence score.
Geography – User and server location affect ordering for local queries.
Spam Penalties – Keyword stuffing, hidden text, cloaking, link farms, duplicate content → rank drop.
Advanced Query Operators – AND, OR, NOT, proximity (e.g., "climate NEAR change"), date‑range filters.
Submission – Sitemaps accelerate discovery; manual one‑page submission is optional because crawlers eventually find well‑structured sites.
---
🔄 Key Processes
Crawling
Start with seed URLs → follow outbound links → obey robots.txt → respect crawl policy → store fetched content.
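The crawl loop above can be sketched as a breadth-first traversal. This is a minimal illustration, not a real crawler: the link graph, the disallowed set (standing in for robots.txt), and the `max_pages` budget (standing in for the crawl policy) are all hypothetical stand-ins.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for the live web.
LINK_GRAPH = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": [],
    "d.com": ["a.com"],
}
DISALLOWED = {"d.com"}  # pages a robots.txt would block (hypothetical)

def crawl(seeds, max_pages=10):
    """Breadth-first crawl: follow outbound links, skip disallowed
    pages, stop when the crawl-policy page budget is exhausted."""
    frontier, visited = deque(seeds), []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited or url in DISALLOWED:
            continue
        visited.append(url)                      # "store fetched content"
        frontier.extend(LINK_GRAPH.get(url, []))  # follow outbound links
    return visited

print(crawl(["a.com"]))  # → ['a.com', 'b.com', 'c.com']
```

Note that `d.com` is discovered but never fetched — exactly how a robots.txt disallow behaves.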
Indexing
Extract tokens → map each token to document IDs (inverted index) → store metadata (title, headings, URL).
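The token-to-document mapping can be shown with a toy inverted index. The mini-corpus is invented for illustration; real indexes also store positions, metadata, and compression.

```python
from collections import defaultdict

# Hypothetical mini-corpus: document ID → page text.
DOCS = {
    1: "climate change and policy",
    2: "climate science news",
    3: "policy news digest",
}

def build_index(docs):
    """Map each token to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index(DOCS)
print(index["climate"])  # → [1, 2]
print(index["news"])     # → [2, 3]
```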
Query Processing Flow
User types query → parser normalizes (lower‑case, stop‑word removal) → lookup tokens in index → retrieve candidate docs → compute relevance score → rank → present snippets.
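The normalize → lookup → retrieve steps can be sketched as follows; the stop-word list and the prebuilt index here are hypothetical, and scoring/snippets are omitted.

```python
STOP_WORDS = {"the", "a", "and", "of"}

# Hypothetical prebuilt inverted index (term → document IDs).
INDEX = {
    "climate": {1, 2},
    "change": {1, 4},
    "policy": {1, 3},
}

def process_query(query):
    """Parse → normalize (lower-case, drop stop words) → look up each
    token in the index → intersect postings to get candidate docs."""
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    candidates = None
    for token in tokens:
        postings = INDEX.get(token, set())
        candidates = postings if candidates is None else candidates & postings
    return sorted(candidates or [])

print(process_query("Climate Change and the Policy"))  # → [1]
```

Note that the query never touches the live web — only the index (the "Indexing Rule" above).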
Ranking Algorithm (simplified)
$$\text{Score} = w_1\cdot\text{KeywordWeight} + w_2\cdot\text{LinkAuthority} + w_3\cdot\text{Freshness} + w_4\cdot\text{Engagement} + \dots$$
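The weighted sum can be made concrete with made-up weights and signal values (real engines tune hundreds of signals; these numbers are purely illustrative).

```python
# Hypothetical weights w1..w4 for the simplified scoring formula.
WEIGHTS = {"keyword": 0.4, "links": 0.3, "freshness": 0.2, "engagement": 0.1}

def score(signals):
    """Score = w1*KeywordWeight + w2*LinkAuthority + w3*Freshness + w4*Engagement."""
    return sum(WEIGHTS[name] * value for name, value in signals.items())

page_a = {"keyword": 0.9, "links": 0.2, "freshness": 0.5, "engagement": 0.4}
page_b = {"keyword": 0.5, "links": 0.9, "freshness": 0.8, "engagement": 0.6}

# A strong link profile can outrank a heavier keyword match.
print(score(page_a) < score(page_b))  # → True
```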
Spam Detection Loop
Scan new pages for stuffing, hidden text, cloaking → flag → apply penalty or de‑index.
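One stuffing check from that loop can be sketched as a simple term-frequency ratio; the threshold is a hypothetical cutoff, and real detectors combine many such signals.

```python
def keyword_stuffing_ratio(text, term):
    """Fraction of tokens equal to the given term; an abnormally
    high ratio is a classic keyword-stuffing signal."""
    tokens = text.lower().split()
    return tokens.count(term.lower()) / len(tokens) if tokens else 0.0

STUFFING_THRESHOLD = 0.3  # hypothetical cutoff

normal = "cheap flights to paris with flexible dates"
spammy = "cheap flights cheap flights cheap flights book cheap flights"

print(keyword_stuffing_ratio(normal, "cheap") > STUFFING_THRESHOLD)  # → False
print(keyword_stuffing_ratio(spammy, "cheap") > STUFFING_THRESHOLD)  # → True
```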
---
🔍 Key Comparisons
Crawler‑Based vs. Human‑Powered vs. Hybrid
Crawler‑Based: Automated spiders, continuous refresh.
Human‑Powered: Manual submissions, limited scale.
Hybrid: Combines automated crawling with curated human input.
Inverted Index vs. Predefined Keywords
Inverted Index: Built from full‑text analysis; adapts automatically.
Predefined Keywords: Fixed taxonomy programmed by humans; slower to evolve.
Organic Result vs. Paid Listing
Organic: Ranked by relevance signals.
Paid: Advertisers bid; may appear above or alongside organic results.
---
⚠️ Common Misunderstandings
“Search hits the live web” – Wrong; the engine queries its index, which may be out of date.
“More keyword repetitions = higher rank” – Over‑use triggers keyword‑stuffing penalties.
“All links are equal votes” – Links from reputable, high‑authority sites carry far more weight.
“Submitting a URL guarantees instant indexing” – Submission speeds discovery but the page still goes through crawling and indexing pipelines.
---
🧠 Mental Models / Intuition
“Library Catalog” Analogy – The index is a library’s card catalog; you look up a term, get a list of “books” (URLs), then decide which to read.
“Vote + Weight” Model – Each inbound link is a vote; the voter’s own authority acts as the weight of that vote.
“Signal Fusion” – Think of ranking as mixing several ingredients (keywords, links, freshness, behavior) into a smoothie; the more balanced the mix, the tastier (higher) the result.
---
🚩 Exceptions & Edge Cases
Dead Links – Index may still contain URLs of pages that have disappeared until the next crawl refreshes them.
Local Search Boost – A page far from the user’s location may be outranked by a closer, less‑authoritative page for location‑sensitive queries.
Temporal Boost – Freshness helps for news queries but can be a disadvantage for evergreen content if over‑emphasized.
Link‑Scheme Penalties – Even a few low‑quality links from a link farm can trigger a site‑wide rank drop.
---
📍 When to Use Which
Boolean Operators – Use AND to narrow, OR to broaden, NOT to exclude irrelevant terms.
Proximity Search – When the exact phrase order is not required but the terms must be close (e.g., “climate NEAR change”).
Date‑Range Filter – For time‑sensitive queries (news, product releases).
Site‑Specific Search (site:example.com) – When you know the domain that should contain the answer.
Image/Video Filters – When the query explicitly requests visual media.
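The Boolean operators above map directly onto set operations over postings lists. The postings data here is invented for illustration.

```python
# Hypothetical postings lists: term → set of matching document IDs.
POSTINGS = {
    "climate": {1, 2, 3},
    "change": {2, 3, 4},
    "hoax": {4, 5},
}

def and_query(a, b): return POSTINGS[a] & POSTINGS[b]  # AND narrows
def or_query(a, b):  return POSTINGS[a] | POSTINGS[b]  # OR broadens
def not_query(a, b): return POSTINGS[a] - POSTINGS[b]  # NOT excludes

print(sorted(and_query("climate", "change")))  # → [2, 3]
print(sorted(or_query("climate", "hoax")))     # → [1, 2, 3, 4, 5]
print(sorted(not_query("climate", "hoax")))    # → [1, 2, 3]
```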
---
👀 Patterns to Recognize
High‑Authority Signals – Presence of the query term in <title> or <h1> + many reputable inbound links → likely top result.
Spam Patterns – Excessive keyword repetition, hidden text (same color as background), many low‑quality inbound links from the same domain cluster.
Freshness Spike – Sudden surge of new pages for a breaking news term → expect a temporary ranking shift.
Geographic Cue – Local business listings often appear with a map snippet when the query includes a location.
---
🗂️ Exam Traps
Distractor: “The more pages a crawler visits, the better the ranking.” – Crawling volume alone does not affect rank; relevance signals do.
Distractor: “All search engines use the same ranking algorithm.” – Algorithms vary by engine and evolve over time.
Distractor: “Submitting a sitemap guarantees a higher rank.” – Sitemaps only help discovery; ranking still depends on quality signals.
Distractor: “Keyword frequency is the only factor.” – Ignoring link authority, freshness, and user‑engagement leads to an incomplete answer.
Distractor: “Personalization always improves relevance.” – Can create filter bubbles; exam may ask about the downside.
---