Biostatistics Study Guide
📖 Core Concepts
Biostatistics – application of statistical methods to biology, clinical medicine, and public health (design, collection, analysis, interpretation).
Population vs. Sample – population = all units of interest; sample = randomly chosen subset used to infer population traits.
Hypotheses – H₀: no association/effect; H₁: there is an association/effect.
Type I error (α) – false‑positive (rejecting a true H₀).
Type II error (β) & Power – false‑negative (failing to reject a false H₀); power = $1-\beta$.
p‑value – probability of observing data as extreme as ours if H₀ is true; compare to α.
Confidence Interval (CI) – range that likely contains the true population parameter at a chosen confidence level.
Correlation (Pearson r/ρ) – linear association strength; –1 to +1.
Model selection (AIC, BIC) – trade‑off between goodness‑of‑fit and complexity; lower = better.
Multiple‑testing corrections – Bonferroni (familywise error) vs. FDR (expected false discoveries).
---
📌 Must Remember
Mean: $\displaystyle \text{Mean} = \frac{\sum x_i}{n}$
Standard Error of the Mean (SEM): $\displaystyle \text{SEM} = \frac{s}{\sqrt{n}}$ (where s = sample SD).
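The two formulas above can be sketched directly with Python's standard library (the `sem` helper and the sample values are illustrative, not from any particular dataset):

```python
import math
import statistics

def sem(data):
    """SEM = sample standard deviation / sqrt(n)."""
    return statistics.stdev(data) / math.sqrt(len(data))

values = [4.2, 5.1, 4.8, 5.5, 4.9]   # illustrative measurements
m = statistics.mean(values)           # 4.9
standard_error = sem(values)          # ~0.212
```

Note that SEM shrinks with $\sqrt{n}$: quadrupling the sample size only halves the standard error.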
Bonferroni α: $\displaystyle \alpha_{\text{Bon}} = \frac{\alpha}{m}$ (m = number of tests).
Power: $1-\beta$; increase by larger n, larger effect size, higher α.
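A minimal sketch of how those three levers affect power, using the standard two-sided one-sample z-test approximation (the function name `ztest_power` and the inputs are illustrative; `effect_size` is the standardized difference, i.e. Cohen's d):

```python
from statistics import NormalDist

def ztest_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.
    effect_size = (mu1 - mu0) / sigma."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)     # rejection threshold
    shift = effect_size * n ** 0.5         # noncentrality parameter
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)
```

For example, a medium effect (d = 0.5) with n = 32 gives roughly the textbook 80% power, and increasing n, d, or α each raises the result.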
AIC: $\displaystyle \text{AIC}=2k - 2\ln(L)$ (k = # parameters, L = likelihood).
BIC: $\displaystyle \text{BIC}=k\ln(n) - 2\ln(L)$.
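The two criteria are simple to compute once a model's log-likelihood is known; this sketch uses hypothetical log-likelihoods for two candidate models to show how they can disagree:

```python
import math

def aic(k, log_lik):
    """AIC = 2k - 2 ln(L)."""
    return 2 * k - 2 * log_lik

def bic(k, n, log_lik):
    """BIC = k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_lik

# hypothetical fits on n = 100 observations
# simple model:  k = 3, ln(L) = -210
# complex model: k = 8, ln(L) = -204
aic_simple, aic_complex = aic(3, -210.0), aic(8, -204.0)
bic_simple, bic_complex = bic(3, 100, -210.0), bic(8, 100, -204.0)
```

Here AIC prefers the complex model (424 vs. 426) while BIC prefers the simple one, because BIC's $k\ln(n)$ penalty exceeds AIC's $2k$ whenever $n > e^2 \approx 7.4$.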
FDR control – e.g., the Benjamini–Hochberg procedure ranks p‑values and rejects all hypotheses up to the largest rank i with $p_{(i)} \le \frac{i}{m}\alpha$.
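Both corrections can be sketched in a few lines of Python (the p-values below are illustrative):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0_i when p_i < alpha / m (familywise error control)."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: find the largest rank i with p_(i) <= (i/m)*alpha,
    then reject every hypothesis with rank <= i."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected

ps = [0.001, 0.008, 0.029, 0.041, 0.20]
```

On these five p-values, Bonferroni (threshold 0.05/5 = 0.01) rejects only the first two, while BH also rejects the third — the trade-off of conservativeness for power described below.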
Randomized Controlled Trial (RCT) – gold‑standard for causal inference; randomization balances known and unknown confounders across arms (in expectation).
---
🔄 Key Processes
Formulating a Research Question → concise, novel, scientifically valuable.
Defining Hypotheses → write H₀ (no effect) and H₁ (effect).
Sampling
Define target population.
Choose random sampling method.
Determine sample size (consider scope, resources, trial type).
Experimental Design Selection
Simple: completely randomized, randomized block, factorial.
Complex: lattice, split‑plot, augmented block, Latin‑square/row‑column.
Descriptive Analysis
Build frequency tables → absolute & relative frequencies.
Create appropriate graphs (line, bar, histogram, scatter, box plot).
Compute mean, median, mode, quartiles.
Inferential Steps
Estimate SE, construct CI, calculate p‑value.
Decide significance (p < α).
If multiple tests → apply Bonferroni or FDR.
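The inferential steps above can be sketched end-to-end with the normal approximation (the helper name `z_ci_and_p` and the data are illustrative; for small samples a t-based version is preferable):

```python
import math
from statistics import NormalDist, mean, stdev

def z_ci_and_p(data, mu0, alpha=0.05):
    """Normal-approximation CI for the mean and two-sided p-value
    against H0: mu = mu0."""
    n = len(data)
    xbar = mean(data)
    se = stdev(data) / math.sqrt(n)                 # step 1: estimate SE
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (xbar - z_crit * se, xbar + z_crit * se)   # step 2: construct CI
    z = (xbar - mu0) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))          # step 3: p-value
    return ci, p

data = [5.1, 4.9, 5.2, 5.0, 4.8, 5.1, 5.0, 4.9, 5.2, 5.0]
ci, p = z_ci_and_p(data, mu0=4.0)
significant = p < 0.05                              # step 4: decide
```

If this test were one of many, the α used in the decision step would then be adjusted by Bonferroni or an FDR procedure.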
Model Building & Selection
Fit candidate models.
Compute AIC/BIC → pick lowest.
Check assumptions → run robustness checks.
Validation for High‑Dimensional Data
Reduce dimensionality (PCA).
Split data → training & independent test set.
Compute residual sum of squares and $R^{2}$ on test set.
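The split-and-evaluate step can be sketched with a simple linear model (the data and the 70/30 split are illustrative; real high-dimensional workflows would fit the model after dimensionality reduction):

```python
def fit_line(x, y):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS/TSS, computed here on held-out data."""
    my = sum(y_true) / len(y_true)
    rss = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    tss = sum((y - my) ** 2 for y in y_true)
    return 1 - rss / tss

x = list(range(10))
y = [2 * a + 1 for a in x]                      # noiseless toy data
slope, intercept = fit_line(x[:7], y[:7])       # fit on training set only
preds = [slope * a + intercept for a in x[7:]]  # predict the held-out set
test_r2 = r_squared(y[7:], preds)
```

The key point is that `test_r2` is computed on observations the model never saw; a large gap between training and test R² is the signature of over-fitting.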
---
🔍 Key Comparisons
H₀ vs. H₁ – H₀ = “no association”; H₁ = “association exists”.
Type I vs. Type II Error – α = false positive; β = false negative.
Bonferroni vs. FDR – Bonferroni = strict familywise control (more conservative); FDR = allows some false positives for greater power.
AIC vs. BIC – AIC penalizes complexity less (better for predictive focus); BIC penalizes more heavily (better for true model selection).
Supervised vs. Unsupervised Learning – Supervised uses labeled outcomes (e.g., classification, regression); unsupervised finds structure without labels (e.g., k‑means clustering).
---
⚠️ Common Misunderstandings
p‑value = probability H₀ is true – false; it’s the probability of data at least as extreme as observed, assuming H₀ is true.
Statistical significance = practical importance – a tiny p‑value can correspond to a trivial effect size.
Correlation implies causation – false; an association can arise from confounding, reverse causation, or chance, and Pearson r in particular only measures linear association.
Higher R² always means a better model – can be inflated by over‑fitting; check AIC/BIC and validation performance.
Bonferroni is always the safest correction – overly conservative when many tests, leading to many false negatives.
---
🧠 Mental Models / Intuition
Sampling as a “microscope” – a random sample lets you see the whole population’s features without looking at every individual.
Confidence interval as a “net” – the net’s width reflects uncertainty; a narrow net (small SE) catches the true parameter more precisely.
AIC/BIC as “price tags” – you pay for fit (lower residuals) but also for extra parameters; the cheapest (lowest score) balances both.
Multicollinearity = “crowded hallway” – when predictors are tightly packed (highly correlated), it’s hard to see each one’s individual effect.
---
🚩 Exceptions & Edge Cases
Small sample sizes → SE estimates unreliable; use exact tests (e.g., Fisher’s exact) or resampling (bootstrapping).
Non‑normal data → Pearson correlation may be misleading; consider Spearman rank correlation.
Zero‑inflated or highly skewed outcomes → use generalized linear models with appropriate link functions.
High‑dimensional data (p ≫ n) → traditional regression fails; employ regularization (LASSO) or dimensionality reduction (PCA).
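The Pearson-vs-Spearman edge case above is easy to demonstrate: for monotone but non-linear data, Spearman's ρ (Pearson r applied to the ranks) is exactly 1 while Pearson r falls short. A minimal sketch, assuming no tied values (tie handling is omitted):

```python
import math

def pearson(x, y):
    """Pearson r: strength of linear association."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman rho: Pearson r of the ranks (no tie correction)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    return pearson(ranks(x), ranks(y))

x = list(range(1, 9))
y = [a ** 3 for a in x]   # monotone but strongly non-linear
```

Here `spearman(x, y)` is exactly 1.0 while `pearson(x, y)` is only about 0.93, understating the (perfect) monotone relationship.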
---
📍 When to Use Which
Choose R vs. Python – R for extensive statistical packages & graphics; Python for integration with machine‑learning pipelines & image analysis.
Bonferroni vs. FDR – Bonferroni when the cost of any false positive is high (e.g., clinical safety); FDR for exploratory ‘omics’ studies with many tests.
Parametric vs. Non‑parametric – parametric (t‑test, Pearson) when assumptions (normality, equal variance) hold; non‑parametric (Mann‑Whitney, Spearman) otherwise.
Randomized Block vs. Completely Randomized – block design when known sources of variation (e.g., batch, location) exist; completely randomized when no such structure.
PCA vs. Variable Selection – PCA when you need to reduce many correlated predictors while preserving variance; variable selection (stepwise, LASSO) when you need interpretable individual predictors.
---
👀 Patterns to Recognize
“Large n, tiny p‑value” → check effect size; may be statistically but not clinically important.
“High correlation + non‑linear scatter plot” → Pearson may underestimate association; consider transformation or non‑linear models.
“Many tests, many borderline p‑values” → likely need FDR control rather than Bonferroni.
“Model with lower AIC but higher BIC” – indicates modest improvement in fit that may not justify added complexity.
---
🗂️ Exam Traps
Choosing “significant” because p < 0.05 without looking at confidence interval – CI may include values of no practical relevance.
Assuming randomization eliminates all bias – still possible selection bias, measurement error, or protocol deviations.
Treating “mean = median” as evidence of normality – can also occur in symmetric but non‑normal distributions; verify with plots or normality tests.
Using Bonferroni correction for a small number of tests – over‑conservative, reduces power unnecessarily.
Interpreting a high R² from a model fitted on the training set as proof of predictive ability – ignore validation; look at test‑set $R^{2}$ or cross‑validated metrics.