Data Science Study Guide
📖 Core Concepts
Data Science – interdisciplinary field that blends statistics, computing, algorithms, and domain knowledge to extract insight from noisy data, whether structured or unstructured.
Interdisciplinary Components – mathematics, statistics, computer science, information science, plus domain expertise (e.g., medicine, natural sciences).
Scope vs. Statistics – Statistics focuses on quantitative description; data science handles quantitative + qualitative data (images, text, sensor streams) and emphasizes prediction & action.
Data Scientist – a professional who writes code, applies statistical methods, and translates results into actionable decisions.
Data Science Process – a repeatable workflow: Prepare → Formulate → Analyze/Model → Develop Solution → Present & Support Decision.
Ethical Pillars – privacy, bias/fairness, and societal impact must be evaluated throughout a project.
---
📌 Must Remember
Data science = statistics + computing + domain knowledge.
Quantitative vs. Qualitative: statistics = mainly numbers; data science = numbers + text/images/etc.
Core technical skills: Python/R, statistics, data visualization, domain expertise.
Key workflow steps are sequential but iterative; you may loop back to preparation after modeling.
Ethical checklist: Privacy → Bias → Transparency → Impact.
Data science ≠ just “big data”; size is not the defining factor (per David Donoho).
---
🔄 Key Processes
Data Preparation
Clean (handle missing values and outliers) → Integrate (merge sources) → Transform (scale, encode).
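A minimal pandas sketch of these three steps, assuming hypothetical files sales.csv and customers.csv with the columns shown:

```python
import pandas as pd

# Clean: drop duplicates, impute missing numeric values with the median
sales = pd.read_csv("sales.csv").drop_duplicates()
sales["amount"] = sales["amount"].fillna(sales["amount"].median())

# Integrate: merge a second source on a shared key
customers = pd.read_csv("customers.csv")
df = sales.merge(customers, on="customer_id", how="left")

# Transform: scale a numeric column and one-hot encode a categorical one
df["amount_scaled"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()
df = pd.get_dummies(df, columns=["region"])
```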
Problem Formulation
Define objective, success metric, constraints, and required domain knowledge.
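One way to pin this down is to write the formulation as a plain spec before touching data; every detail below is illustrative:

```python
# A hypothetical problem spec; the project and thresholds are invented
problem_spec = {
    "objective": "predict 30-day customer churn",
    "success_metric": "recall >= 0.80 at precision >= 0.60",
    "constraints": ["model must be explainable", "daily batch scoring"],
    "domain_knowledge": "billing cycles drive churn timing",
}
```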
Analysis & Modeling
Choose statistical analysis or machine‑learning technique → Feature engineering → Train / validate model.
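A self-contained scikit-learn sketch of the train/validate step, with synthetic data standing in for a prepared feature set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an engineered feature matrix and target
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Train on one split, validate on the held-out split
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```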
Solution Development
Build deliverable (recommendation system, classifier, optimization routine).
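A sketch of one kind of deliverable: persist a trained classifier with joblib (installed alongside scikit-learn) and wrap it in a scoring function. The model and file name are illustrative:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a quick stand-in model, then persist it as the deliverable
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "model.joblib")

def predict_one(features):
    """Load the saved model and score a single observation."""
    clf = joblib.load("model.joblib")
    return clf.predict([features])[0]

print(predict_one([0.1, -0.5, 1.2, 0.0, 0.3]))
```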
Presentation & Decision Support
Create visualizations/dashboards → Explain model logic → Provide actionable recommendations.
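A minimal matplotlib sketch of decision support; the feature names and importance values are invented stand-ins for real model output:

```python
import matplotlib.pyplot as plt

# Invented values (in practice, e.g., model.feature_importances_)
features = ["tenure", "monthly_fee", "support_calls"]
importances = [0.45, 0.35, 0.20]

plt.barh(features, importances)
plt.xlabel("relative importance")
plt.title("What drives the churn prediction?")
plt.tight_layout()
plt.show()
```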
---
🔍 Key Comparisons
Data Analysis vs. Data Science
Dataset size: small, structured → Data Analysis; large, complex → Data Science.
Data type: structured only → Data Analysis; structured + unstructured → Data Science.
Techniques: descriptive stats & hypothesis testing → Data Analysis; predictive modeling, ML, feature engineering → Data Science.
Scope: answer specific questions → Data Analysis; end‑to‑end solution (pre‑process → deploy) → Data Science.
Statistics vs. Data Science (by focus)
Emphasis: description & inference → Statistics; prediction & action → Data Science.
Tools: classic tests, small‑sample methods → Statistics; ML libraries, big‑data platforms → Data Science.
---
⚠️ Common Misunderstandings
“Big data = data science.” Size alone doesn’t define the field; the presence of predictive modeling and domain‑driven decision making does.
“Data scientists only code.” They must also design experiments, interpret results, and communicate insights.
“Statistics is obsolete.” Data science builds on statistics; you still need hypothesis testing, confidence intervals, etc. (see the sketch after this list).
“Ethics is optional.” Ignoring privacy or bias can invalidate a model and cause legal/social fallout.
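A small SciPy sketch of the classical tools mentioned above: a two-sample t-test and a 95% confidence interval, on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=30)
b = rng.normal(loc=11.0, scale=2.0, size=30)

# Two-sample t-test: is the difference in means more than noise?
t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# 95% confidence interval for the mean of sample a
ci = stats.t.interval(0.95, df=len(a) - 1, loc=a.mean(), scale=stats.sem(a))
print("95% CI for mean(a):", ci)
```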
---
🧠 Mental Models / Intuition
“Data Science is a Funnel.” Raw data → (clean & transform) → Feature set → (model) → Insight/Action (sketched in code after this list).
“Prediction ≠ Causation.” A high‑accuracy model tells you what is likely to happen, not why it happens.
“Domain knowledge is the compass.” It guides problem framing, feature selection, and interpretation.
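The funnel maps naturally onto a scikit-learn Pipeline; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=1)

funnel = Pipeline([
    ("clean", SimpleImputer(strategy="median")),  # raw data -> cleaned
    ("transform", StandardScaler()),              # cleaned -> feature set
    ("model", LogisticRegression()),              # features -> prediction/action
])
funnel.fit(X, y)
print("training accuracy:", funnel.score(X, y))
```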
---
🚩 Exceptions & Edge Cases
Tiny datasets: traditional statistical inference may be more reliable than complex ML models.
Highly regulated domains (health, finance): privacy and fairness constraints may limit model choice or require explainability.
Unstructured data only: may need specialized pipelines (NLP for text, CNNs for images) before typical “data‑science” steps.
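A toy sketch of such a text front-end: TF-IDF converts raw strings into the numeric matrix that the usual modeling steps expect (texts and labels are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great product, works well", "broke after one day",
         "excellent value", "terrible support"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Text -> sparse numeric feature matrix, then a standard classifier
vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform(["works great"])))
```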
---
📍 When to Use Which
Descriptive stats / hypothesis testing → when the goal is to explain relationships in a relatively small, structured dataset.
Machine learning (supervised) → when you need to predict outcomes for new observations and have labeled data.
Unsupervised learning (clustering, dimensionality reduction) → when you want to discover hidden structure in unlabeled data (contrasted with the supervised case in the sketch after this list).
Big‑data frameworks (Spark, Hadoop) → when data volume/velocity exceeds single‑machine memory or CPU limits.
Visualization vs. Sonification → visual for pattern spotting; sonification when auditory cues aid insight (e.g., time‑series anomalies).
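A sketch contrasting the supervised and unsupervised choices above on the same synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: labels y are available, so we fit a predictor for new points
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: pretend y is unknown and look for hidden structure
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("first ten cluster assignments:", clusters[:10])
```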
---
👀 Patterns to Recognize
“Data → many missing values → imputation needed.”
“High‑dimensional data + few samples → risk of overfitting → regularization or dimensionality reduction required.” (Illustrated in the sketch after this list.)
“Model performance plateau after feature engineering → consider more data or different algorithm.”
“Discrepancy between training and validation metrics → data leakage or distribution shift.”
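A sketch of the high-dimensional pattern: with far more features than samples, an unregularized linear fit memorizes the training set, while Ridge regularization narrows the train/validation gap (synthetic data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))            # 40 samples, 100 features
y = X[:, 0] + 0.1 * rng.normal(size=40)   # only the first feature matters

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)

for model in (LinearRegression(), Ridge(alpha=10.0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__,
          "train R2:", round(model.score(X_tr, y_tr), 2),
          "val R2:", round(model.score(X_va, y_va), 2))
```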
---
🗂️ Exam Traps
Choosing “big data” as the defining trait – the exam will expect an answer about techniques & purpose, not just size.
Confusing “data analysis” with “data science” – watch for wording that emphasizes end‑to‑end solution vs. single question.
Assuming all data scientists must be “machine‑learning experts” – many roles focus on visualization, domain translation, or statistical inference.
Overlooking ethical considerations – questions about model deployment often ask about bias mitigation or privacy safeguards.
Mixing up “prediction” and “causation” – a model that predicts well is not proof of a causal relationship.
---