RemNote Community

Study Guide

📖 Core Concepts

- Data Analysis – Inspecting, cleaning, transforming, and modeling data to extract useful information and support decisions.
- Descriptive vs. Exploratory vs. Confirmatory – Descriptive: summarize data (mean, median, SD). Exploratory: find patterns without pre‑set hypotheses. Confirmatory: test a specific hypothesis; must be kept separate from exploratory work.
- Predictive vs. Text Analytics – Predictive builds models for forecasting/classification; text analytics extracts meaning from unstructured text.
- Data Product – An application that ingests data, runs a model/algorithm, and outputs actionable results.
- MECE Principle – Components of an analysis should be Mutually Exclusive and Collectively Exhaustive (e.g., profit = revenue − cost).

📌 Must Remember

- Key formula – Simple linear regression: $Y = aX + b + \varepsilon$, where $a$ is the slope, $b$ the intercept, and $\varepsilon$ the error term.
- Hypothesis‑testing errors – Type I: reject a true null (false positive). Type II: fail to reject a false null (false negative).
- Validation checklist – Check the raw data for anomalies, re‑calculate formulas, verify that totals equal the sum of subtotals, confirm predictable ratios, normalize to a common base.
- Iterative process – Later phases (e.g., modeling) can trigger revisiting earlier phases (e.g., data cleaning).

🔄 Key Processes

- Define requirements – Identify the experimental unit and the variables needed.
- Collect data – Sensors, interviews, downloads, data custodians, etc.
- Process & integrate – Structure the data into rows/columns; load it into a spreadsheet or analysis software.
- Clean data – Remove duplicates, handle missing values, detect outliers, perform record matching.
- Exploratory Data Analysis (EDA) – Compute descriptive statistics; create visualizations (histograms, scatterplots, bar charts).
- Modeling – Choose an appropriate algorithm (correlation, regression, classification).
- Validate – Cross‑validation, sensitivity analysis/bootstrapping; check stability.
- Communicate – Use the right chart type for the quantitative message.

🔍 Key Comparisons

- Data Mining vs.
Business Intelligence – Data mining = predictive modeling; BI = aggregation/reporting for decision support.
- Exploratory vs. Confirmatory – EDA discovers patterns; confirmatory tests a pre‑specified hypothesis (and must use a different dataset or a hold‑out).
- Bar Chart vs. Pie Chart – Bar = ranking or part‑to‑whole with clear comparison; pie = part‑to‑whole only when there are few categories and relative sizes matter.
- Cross‑validation vs. Sensitivity Analysis – CV evaluates predictive performance on held‑out data; sensitivity analysis (e.g., bootstrapping) studies how results change when assumptions or parameters vary.

⚠️ Common Misunderstandings

- "Exploratory results can be confirmed on the same data." – This inflates Type I error; separate data or a hold‑out set is needed.
- Confusing "fact" with "opinion" – Decisions must be based on verified data, not analyst beliefs.
- Assuming normality without checks – Always test skewness/kurtosis; transform (log, square root, inverse) if needed.
- Believing a single visualization tells the whole story – Match the chart type to the message (time series, ranking, correlation, etc.).

🧠 Mental Models / Intuition

- "Data as a story" – Treat the dataset like a narrative: clean (edit), explore (outline), model (plot), conclude (final chapter).
- MECE as a puzzle – Every piece must fit exactly once (no overlap) and together complete the picture (no gaps).
- Visualization as a translator – Choose the visual "language" the audience already reads best (e.g., line for trend, scatter for relationship).

🚩 Exceptions & Edge Cases

- Cross‑validation is not appropriate when data have strong internal correlations (e.g., panel data); it can give overly optimistic performance estimates.
- Pie charts become misleading with more than about 5 categories or when slice sizes are similar.
- Outlier treatment – If outliers represent a true rare event, do not automatically delete them; consider robust methods instead.

📍 When to Use Which

- Ranking message → bar chart (ordered bars).
- Part‑to‑whole → stacked bar or pie (only with few categories).
- Trend over time → line chart (continuous X‑axis).
- Distribution shape → histogram (numeric) or bar chart (categorical).
- Relationship between two variables → scatter plot (quantitative) or bubble chart (adds a third dimension).
- Geographic comparison → cartogram or choropleth map.
- Predictive modeling → regression or classification after confirming assumptions; validate with cross‑validation unless a panel structure exists.

👀 Patterns to Recognize

- U‑shaped histogram → possible bimodal distribution → may need separate subgroup analyses.
- Systematic missingness (e.g., all values missing for a specific period) → likely a data‑collection issue, not random.
- Linear trend plus seasonal spikes in a time series → model with trend + seasonal components.
- High correlation between predictors → multicollinearity → consider dimensionality reduction or dropping one predictor.

🗂️ Exam Traps

- Choosing a confirmatory test after EDA on the same data – marked wrong for inflating error risk.
- Selecting a pie chart for many categories – distractor; a bar chart is the correct answer.
- Assuming normality without transformation – the exam may present skewed data; the correct answer is to log‑transform before parametric tests.
- Confusing Type I and Type II errors – often swapped in answer choices; remember Type I = false positive.
- Using cross‑validation on panel data – flagged; the right approach respects the data hierarchy (e.g., grouped CV).
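The regression formula $Y = aX + b + \varepsilon$ can be estimated with ordinary least squares. A minimal pure-Python sketch (the data values below are invented for illustration):

```python
from statistics import mean

def fit_simple_regression(xs, ys):
    """Least-squares estimates for Y = aX + b + error:
    slope a = cov(X, Y) / var(X), intercept b = mean(Y) - a * mean(X)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # roughly y = 2x, with noise
a, b = fit_simple_regression(xs, ys)  # a ≈ 1.99, b ≈ 0.09
```

In practice a library routine (e.g., a statistics package's linear-model fit) would also report the residual error $\varepsilon$ and standard errors; the point here is only the slope/intercept definition.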
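The cleaning step (remove duplicates, handle missing values) can be sketched in plain Python. The rows below are made-up sample records, and dropping incomplete rows is just one policy; imputation or robust methods may be preferable, as the outlier bullet above notes:

```python
raw = [
    {"id": 1, "revenue": 120.0, "cost": 80.0},
    {"id": 1, "revenue": 120.0, "cost": 80.0},  # exact duplicate
    {"id": 2, "revenue": None, "cost": 60.0},   # missing value
    {"id": 3, "revenue": 300.0, "cost": 90.0},
]

seen, cleaned = set(), []
for row in raw:
    key = tuple(sorted(row.items()))  # hashable fingerprint of the record
    if key in seen:
        continue                      # drop exact duplicates
    seen.add(key)
    if any(v is None for v in row.values()):
        continue                      # drop rows with missing values
    cleaned.append(row)
# cleaned keeps the rows with id 1 and 3
```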
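The cross-validation idea in the Validate step can be hand-rolled to make the mechanics concrete. This sketch uses a deliberately trivial "model" (predict the training mean) and mean absolute error as the score; the function name and data are my own invention:

```python
import random
from statistics import mean

def k_fold_scores(ys, k=5, seed=0):
    """Shuffle, split into k folds, and for each fold score a
    trivial mean-predictor fitted on the remaining k-1 folds."""
    rng = random.Random(seed)
    shuffled = ys[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [y for j, fold in enumerate(folds) if j != i for y in fold]
        prediction = mean(train)  # the "model": grand mean of training data
        mae = mean(abs(y - prediction) for y in held_out)
        scores.append(mae)
    return scores

scores = k_fold_scores([3.1, 2.9, 3.0, 3.2, 2.8, 3.1, 3.0, 2.9, 3.3, 2.7], k=5)
```

For panel or otherwise grouped data, the edge case above applies: the split must keep each group's observations in the same fold (grouped CV), or the scores will be overly optimistic.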
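Bootstrapping, mentioned under sensitivity analysis, resamples the data with replacement and recomputes the statistic each time; the spread of the resampled values shows how stable the estimate is. A minimal sketch on an invented sample:

```python
import random
from statistics import mean, stdev

def bootstrap_means(sample, n_boot=1000, seed=42):
    """Draw n_boot resamples (with replacement, same size as the
    original sample) and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(sample)
    return [mean(rng.choices(sample, k=n)) for _ in range(n_boot)]

sample = [12.0, 15.5, 9.8, 14.2, 11.1, 13.7, 10.4, 12.9]
boot = bootstrap_means(sample)
spread = stdev(boot)  # rough standard error of the sample mean
```

A wide spread signals that the conclusion is sensitive to which observations happened to be collected, which is exactly the stability question the Validate step asks.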