Big Data Study Guide
📖 Core Concepts
Big Data – data sets so large or complex that traditional tools (single‑machine DBMS, desktop stats packages) cannot process them efficiently.
The 4 Vs
Volume – massive scale of data (often terabytes to petabytes).
Variety – mix of structured, semi‑structured, and unstructured formats (text, images, sensor streams, logs).
Velocity – rapid generation and ingestion speed.
Veracity – reliability/quality of the data; low veracity raises false‑discovery risk.
Low Information Density – each individual record carries little insight; you need huge sample sizes to detect patterns.
Data Lake – a centralized repository that stores raw data in its native format, postponing schema design until analysis time.
Advanced Analytics Goal – move from descriptive (what happened) to predictive/prescriptive (what will happen & what to do).
---
📌 Must Remember
MapReduce = Map (apply a function to input splits in parallel) → Shuffle/Sort (group by key) → Reduce (aggregate each key's values).
Hadoop = open‑source implementation of MapReduce on a distributed file system (HDFS).
Apache Spark = in‑memory processing; 10‑100× faster for iterative/ML workloads than Hadoop’s disk‑based MapReduce.
Big Data vs. Business Intelligence
BI: high‑information‑density data, descriptive stats.
Big Data: low‑information‑density, inductive statistics, causal inference.
False Discovery Rate (FDR) rises with many attributes/variables → need correction (e.g., Benjamini‑Hochberg).
Sampling Bias occurs when you focus only on Volume/Variety/Velocity and ignore Veracity.
Privacy Risk – large‑scale personal data collection can breach anonymity even after de‑identification.
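The Benjamini–Hochberg procedure mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration of FDR control, not a production statistics routine; the example p-values are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses rejected with FDR controlled at alpha."""
    m = len(p_values)
    # Sort p-values ascending, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject the hypotheses holding the k_max smallest p-values.
    return sorted(order[:k_max])

# Five tests: only the two strongest survive correction at alpha = 0.05.
rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
```

Note that the raw cutoff p < 0.05 would have "discovered" four effects here; the correction keeps only two, which is exactly the false-discovery protection the guide calls for.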
---
🔄 Key Processes
MapReduce Job
Split input → Map function on each node → emit key/value pairs.
Shuffle: group identical keys across nodes.
Reduce aggregates each key’s values → final output.
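The three stages above can be mimicked in single-machine Python with the classic word-count example. This is a toy sketch of the MapReduce data flow, not Hadoop's actual distributed execution.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Shuffle: group values by key across all mapper outputs.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_pairs):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big insight", "data lake data"]
counts = reduce_phase(shuffle(map(map_phase, docs)))
# counts == {"big": 2, "data": 3, "insight": 1, "lake": 1}
```

In a real cluster, each `map_phase` call runs on a different node against its own split, and the shuffle moves key groups across the network before reducers run.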
Data Lake Ingestion
Capture raw data → store in distributed file system (e.g., S3, HDFS) → catalog metadata → downstream processing (Spark, SQL, ML).
Real‑Time Analytics Pipeline
Ingest (Kafka, IoT stream) → Buffer (burst buffer, direct‑attached memory) → Process (Spark Structured Streaming) → Decision (dashboard, automated actuation).
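The pipeline stages above can be simulated without Kafka or Spark to make the ingest → buffer → process → decision flow concrete. This is a deliberately simplified stand-in: a generator plays the message source, a fixed-size deque plays the buffer, and a sliding-window average plays the stream processor; the window size and threshold are invented for illustration.

```python
from collections import deque

def ingest(readings):
    # Ingest: stand-in for a Kafka/IoT source, yielding events one at a time.
    yield from readings

def process(events, window=3, threshold=30.0):
    # Buffer + process: keep a sliding window of recent events and emit an
    # alert (the "decision") whenever the windowed average crosses threshold.
    buffer = deque(maxlen=window)
    alerts = []
    for value in events:
        buffer.append(value)
        if len(buffer) == window and sum(buffer) / window > threshold:
            alerts.append(list(buffer))
    return alerts

alerts = process(ingest([20, 25, 28, 35, 40, 22]))
# alerts == [[28, 35, 40], [35, 40, 22]]
```

The key design point carries over to real systems: the processor never sees the whole stream at once, only a bounded window of recent state.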
Sampling for Massive Streams
Define sampling rate (e.g., 1 % of sensor readings).
Randomly select points → compute estimators → monitor error bounds.
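The sampling steps above amount to Bernoulli sampling plus a standard-error bound on the estimate. A minimal sketch, assuming each reading is kept independently with probability `rate` (the 1 % figure from the example):

```python
import random

def sample_stream(stream, rate=0.01, seed=42):
    """Bernoulli-sample a stream at `rate`; estimate the mean with an error bound."""
    rng = random.Random(seed)
    # Keep each reading independently with probability `rate`.
    sample = [x for x in stream if rng.random() < rate]
    n = len(sample)
    mean = sum(sample) / n
    # Standard error of the mean serves as a simple error bound to monitor.
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    std_error = (variance / n) ** 0.5
    return mean, std_error, n

# 100,000 simulated sensor readings; the 1 % sample still recovers the mean.
mean, se, n = sample_stream(range(100_000))
```

If the monitored `std_error` drifts too high, the sampling rate can be raised; that trade-off between sample size and error bound is the core of the "huge N, low signal" pattern.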
---
🔍 Key Comparisons
Big Data vs. Business Intelligence
Data density: low vs. high.
Goal: causal/predictive vs. descriptive/trend‑monitoring.
Hadoop vs. Spark
Storage: disk‑based vs. RAM‑resident.
Performance: batch‑oriented vs. iterative/real‑time.
Data Lake vs. Data Warehouse
Schema: schema‑on‑read vs. schema‑on‑write.
Flexibility: high (any format) vs. low (structured only).
MapReduce vs. Spark
Execution: two‑stage (map → reduce) on disk vs. DAG of in‑memory operations.
---
⚠️ Common Misunderstandings
“More data = better results.” → Low veracity can worsen outcomes (higher FDR).
“Big Data = only large volume.” → Variety, velocity, and veracity are equally critical.
“Data lakes are just cheaper warehouses.” → Lakes store raw, uncurated data; warehouses store curated, structured data.
“MapReduce is always the fastest parallel method.” → Spark outperforms for iterative algorithms.
“Anonymized data is risk‑free.” → Re‑identification attacks can still succeed.
---
🧠 Mental Models / Intuition
4‑V Cube – imagine a cube where each axis is a V; the “sweet spot” for true insight sits where all four intersect.
Data Lake as a River – water (raw data) flows in unfiltered; you dip a bucket (analysis) when needed, unlike a dam (warehouse) that stores only pre‑filtered water.
MapReduce = Kitchen Brigade – chefs (maps) each prepare a dish (partial result); the head chef (reduce) combines them into the final menu.
---
🚩 Exceptions & Edge Cases
High‑Info‑Density Data – small, clean datasets may be better served by traditional BI tools.
Streaming‑Only Scenarios – ultra‑low latency (e.g., fraud detection) may bypass batch MapReduce entirely.
Spark Overhead – for a single simple map operation on tiny data, Hadoop’s disk‑based approach can be faster.
Veracity‑Critical Domains (medical, finance) – even modest volumes require strict quality checks; volume alone is insufficient.
---
📍 When to Use Which
| Situation | Recommended Tool / Approach |
|-----------|------------------------------|
| Batch ETL on petabytes of static logs | Hadoop MapReduce on HDFS |
| Iterative machine‑learning or graph algorithms | Apache Spark (in‑memory) |
| Unknown schema, many source types | Data Lake + cataloging (e.g., AWS Glue) |
| Multidimensional reporting (sales by region, time, product) | OLAP cubes or tensor‑based DB |
| Real‑time alerting from IoT streams | Kafka → Spark Structured Streaming |
| Simple descriptive dashboards on clean, structured data | Traditional BI warehouse (SQL, Tableau) |
| Need to control false positives across thousands of tests | Apply multiple‑comparisons correction (FDR control) |
---
👀 Patterns to Recognize
“Huge N, low signal” → Look for low information density → require large sample & robust statistical controls.
“Rapid data influx + low latency requirement” → Real‑time pipeline (stream → in‑memory).
Repeated mention of “sampling bias” → Check whether selection criteria ignore veracity.
Multiple hypothesis testing → Spot potential spurious correlations; expect a need for FDR correction.
Privacy‑related language (de‑identification, consent) → Flag ethical review steps.
---
🗂️ Exam Traps
Distractor: “Big Data guarantees accurate predictions.” – Wrong; quality (veracity) and proper modeling matter.
Distractor: “A data lake is a synonym for a data warehouse.” – Incorrect; lakes store raw data, warehouses store curated data.
Distractor: “MapReduce is always faster than Spark.” – False; Spark’s in‑memory engine beats disk‑based MapReduce for most iterative tasks.
Distractor: “Only volume matters for big‑data projects.” – Misleading; neglects variety, velocity, and especially veracity.
Distractor: “Anonymized data eliminates privacy risk.” – Wrong; re‑identification techniques can still expose individuals.