Big Data Study Guide
📖 Core Concepts
Big Data – data sets so large or complex that traditional tools (single‑machine DBMS, desktop stats packages) cannot process them efficiently.
The 4 Vs
Volume – massive scale of data (often terabytes to petabytes).
Variety – mix of structured, semi‑structured, and unstructured formats (text, images, sensor streams, logs).
Velocity – rapid generation and ingestion speed.
Veracity – reliability/quality of the data; low veracity raises false‑discovery risk.
Low Information Density – each individual record carries little insight; you need huge sample sizes to detect patterns.
Data Lake – a centralized repository that stores raw data in its native format, postponing schema design until analysis time.
Advanced Analytics Goal – move from descriptive (what happened) to predictive/prescriptive (what will happen & what to do).
---
📌 Must Remember
MapReduce = Map (apply a function to input splits in parallel) → Shuffle/Sort (group by key) → Reduce (aggregate each key's values).
Hadoop = open‑source implementation of MapReduce on a distributed file system (HDFS).
Apache Spark = in‑memory processing; 10‑100× faster for iterative/ML workloads than Hadoop’s disk‑based MapReduce.
Big Data vs. Business Intelligence
BI: high‑information‑density data, descriptive stats.
Big Data: low‑information‑density, inductive statistics, causal inference.
False Discovery Rate (FDR) rises with many attributes/variables → need correction (e.g., Benjamini‑Hochberg).
Sampling Bias occurs when you focus only on Volume/Variety/Velocity and ignore Veracity.
Privacy Risk – large‑scale personal data collection can breach anonymity even after de‑identification.
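The Benjamini–Hochberg procedure mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration of FDR control, not a production statistics routine; the example p-values are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses rejected with FDR controlled at alpha."""
    m = len(p_values)
    # Sort p-values ascending, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject the hypotheses holding the k_max smallest p-values.
    return sorted(order[:k_max])

# Five tests: only the two strongest survive correction at alpha = 0.05.
rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
```

Note that the raw cutoff p < 0.05 would have "discovered" four effects here; the correction keeps only two, which is exactly the false-discovery protection the guide calls for.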
---
🔄 Key Processes
MapReduce Job
Split input → Map function on each node → emit key/value pairs.
Shuffle: group identical keys across nodes.
Reduce aggregates each key’s values → final output.
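The three stages above can be mimicked in single-machine Python with the classic word-count example. This is a toy sketch of the MapReduce data flow, not Hadoop's actual distributed execution.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Shuffle: group values by key across all mapper outputs.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_pairs):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big insight", "data lake data"]
counts = reduce_phase(shuffle(map(map_phase, docs)))
# counts == {"big": 2, "data": 3, "insight": 1, "lake": 1}
```

In a real cluster, each `map_phase` call runs on a different node against its own split, and the shuffle moves key groups across the network before reducers run.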
Data Lake Ingestion
Capture raw data → store in distributed file system (e.g., S3, HDFS) → catalog metadata → downstream processing (Spark, SQL, ML).
Real‑Time Analytics Pipeline
Ingest (Kafka, IoT stream) → Buffer (burst buffer, direct‑attached memory) → Process (Spark Structured Streaming) → Decision (dashboard, automated actuation).
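The pipeline stages above can be simulated without Kafka or Spark to make the ingest → buffer → process → decision flow concrete. This is a deliberately simplified stand-in: a generator plays the message source, a fixed-size deque plays the buffer, and a sliding-window average plays the stream processor; the window size and threshold are invented for illustration.

```python
from collections import deque

def ingest(readings):
    # Ingest: stand-in for a Kafka/IoT source, yielding events one at a time.
    yield from readings

def process(events, window=3, threshold=30.0):
    # Buffer + process: keep a sliding window of recent events and emit an
    # alert (the "decision") whenever the windowed average crosses threshold.
    buffer = deque(maxlen=window)
    alerts = []
    for value in events:
        buffer.append(value)
        if len(buffer) == window and sum(buffer) / window > threshold:
            alerts.append(list(buffer))
    return alerts

alerts = process(ingest([20, 25, 28, 35, 40, 22]))
# alerts == [[28, 35, 40], [35, 40, 22]]
```

The key design point carries over to real systems: the processor never sees the whole stream at once, only a bounded window of recent state.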
Sampling for Massive Streams
Define sampling rate (e.g., 1 % of sensor readings).
Randomly select points → compute estimators → monitor error bounds.
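The sampling steps above amount to Bernoulli sampling plus a standard-error bound on the estimate. A minimal sketch, assuming each reading is kept independently with probability `rate` (the 1 % figure from the example):

```python
import random

def sample_stream(stream, rate=0.01, seed=42):
    """Bernoulli-sample a stream at `rate`; estimate the mean with an error bound."""
    rng = random.Random(seed)
    # Keep each reading independently with probability `rate`.
    sample = [x for x in stream if rng.random() < rate]
    n = len(sample)
    mean = sum(sample) / n
    # Standard error of the mean serves as a simple error bound to monitor.
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    std_error = (variance / n) ** 0.5
    return mean, std_error, n

# 100,000 simulated sensor readings; the 1 % sample still recovers the mean.
mean, se, n = sample_stream(range(100_000))
```

If the monitored `std_error` drifts too high, the sampling rate can be raised; that trade-off between sample size and error bound is the core of the "huge N, low signal" pattern.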
---
🔍 Key Comparisons
Big Data vs. Business Intelligence
Data density: low vs. high.
Goal: causal/predictive vs. descriptive/trend‑monitoring.
Hadoop vs. Spark
Storage: disk‑based vs. RAM‑resident.
Performance: batch‑oriented vs. iterative/real‑time.
Data Lake vs. Data Warehouse
Schema: schema‑on‑read vs. schema‑on‑write.
Flexibility: high (any format) vs. low (structured only).
MapReduce vs. Spark
Execution: two‑stage (map → reduce) on disk vs. DAG of in‑memory operations.
---
⚠️ Common Misunderstandings
“More data = better results.” → Low veracity can worsen outcomes (higher FDR).
“Big Data = only large volume.” → Variety, velocity, and veracity are equally critical.
“Data lakes are just cheaper warehouses.” → Lakes store raw, uncurated data; warehouses store curated, structured data.
“MapReduce is always the fastest parallel method.” → Spark outperforms for iterative algorithms.
“Anonymized data is risk‑free.” → Re‑identification attacks can still succeed.
---
🧠 Mental Models / Intuition
4‑V Cube – imagine a cube where each axis is a V; the “sweet spot” for true insight sits where all four intersect.
Data Lake as a River – water (raw data) flows in unfiltered; you dip a bucket (analysis) when needed, unlike a dam (warehouse) that stores only pre‑filtered water.
MapReduce = Kitchen Brigade – chefs (maps) each prepare a dish (partial result); the head chef (reduce) combines them into the final menu.
---
🚩 Exceptions & Edge Cases
High‑Info‑Density Data – small, clean datasets may be better served by traditional BI tools.
Streaming‑Only Scenarios – ultra‑low latency (e.g., fraud detection) may bypass batch MapReduce entirely.
Spark Overhead – for a single simple map operation on tiny data, Hadoop’s disk‑based approach can be faster.
Veracity‑Critical Domains (medical, finance) – even modest volumes require strict quality checks; volume alone is insufficient.
---
📍 When to Use Which
| Situation | Recommended Tool / Approach |
|-----------|------------------------------|
| Batch ETL on petabytes of static logs | Hadoop MapReduce on HDFS |
| Iterative machine‑learning or graph algorithms | Apache Spark (in‑memory) |
| Unknown schema, many source types | Data Lake + cataloging (e.g., AWS Glue) |
| Multidimensional reporting (sales by region, time, product) | OLAP cubes or tensor‑based DB |
| Real‑time alerting from IoT streams | Kafka → Spark Structured Streaming |
| Simple descriptive dashboards on clean, structured data | Traditional BI warehouse (SQL, Tableau) |
| Need to control false positives across thousands of tests | Apply multiple‑comparisons correction (FDR control) |
---
👀 Patterns to Recognize
“Huge N, low signal” → Look for low information density → require large sample & robust statistical controls.
“Rapid data influx + low latency requirement” → Real‑time pipeline (stream → in‑memory).
Repeated mention of “sampling bias” → Check whether selection criteria ignore veracity.
Multiple hypothesis testing → Spot potential spurious correlations; expect a need for FDR correction.
Privacy‑related language (de‑identification, consent) → Flag ethical review steps.
---
🗂️ Exam Traps
Distractor: “Big Data guarantees accurate predictions.” – Wrong; quality (veracity) and proper modeling matter.
Distractor: “A data lake is a synonym for a data warehouse.” – Incorrect; lakes store raw data, warehouses store curated data.
Distractor: “MapReduce is always faster than Spark.” – False; Spark’s in‑memory engine beats disk‑based MapReduce for most iterative tasks.
Distractor: “Only volume matters for big‑data projects.” – Misleading; neglects variety, velocity, and especially veracity.
Distractor: “Anonymized data eliminates privacy risk.” – Wrong; re‑identification techniques can still expose individuals.