Hypothesis Testing Study Guide
📖 Core Concepts
Statistical hypothesis test – a procedure that uses sample data to decide whether there is enough evidence to reject a stated null hypothesis (\(H_0\)).
Null hypothesis (\(H_0\)) – the default claim (often “no effect” or a specific parameter value).
Alternative hypothesis (\(H_1\)) – the claim researchers hope to support (an effect different from what \(H_0\) specifies).
Test statistic (\(T\)) – a single number calculated from the data; its sampling distribution under \(H0\) is known (e.g., Student’s t, normal).
p‑value – the probability, assuming \(H_0\) is true, of obtaining a test statistic at least as extreme as the observed value.
Significance level (\(\alpha\)) – pre‑chosen maximum tolerable Type I error rate (commonly 0.05).
Decision rule – reject \(H_0\) if the observed statistic falls in the critical (rejection) region (or, equivalently, if \(p \le \alpha\)).
Error types:
Type I error: falsely rejecting a true \(H_0\) (probability = \(\alpha\)).
Type II error: failing to reject a false \(H_0\) (probability = \(\beta\)).
Power – \(1-\beta\); the chance of correctly rejecting a false \(H_0\).
Fisher vs. Neyman–Pearson – Fisher emphasized p‑values as evidence, no explicit alternative; Neyman–Pearson added \(H1\), fixed \(\alpha\) and \(\beta\), and a decision‑theoretic framework.
NHST (null‑hypothesis significance testing) – the hybrid practice used today (Fisher’s p‑value + Neyman–Pearson error rates).
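The core quantities above can be sketched with a simple one‑sample z test. This is a hypothetical, stdlib‑only example (the numbers are illustrative, and σ is assumed known so the null distribution is the standard normal):

```python
from statistics import NormalDist

# Hypothetical example: H0: mu = 100 vs H1: mu != 100,
# with known sigma = 15 and n = 36 observations
sample_mean = 106.0
mu0, sigma, n = 100.0, 15.0, 36

# Test statistic: how many standard errors the sample mean is from mu0
z = (sample_mean - mu0) / (sigma / n ** 0.5)  # 2.4

# Two-sided p-value under the standard normal null distribution
p = 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
reject = p <= alpha  # decision rule: here p < 0.05, so reject H0
```

Note the interpretation: rejecting \(H_0\) says the observed mean would be rare if \(H_0\) were true, not that \(H_1\) is proven.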
📌 Must Remember
p‑value ≠ probability that \(H_0\) is true.
Rejecting \(H_0\) does not prove \(H_1\) is true; it only shows the data are unlikely under \(H_0\).
Not rejecting \(H_0\) does not prove \(H_0\) is true; it indicates insufficient evidence.
\(\Pr(p \le \alpha \mid H_0) \le \alpha\) – under the null, the rejection rate never exceeds \(\alpha\).
One‑sided test is appropriate only when theory predicts the direction of an effect.
Multiple unadjusted tests inflate the overall Type I error rate (family‑wise error).
As n → ∞, even trivial differences can become statistically significant (paradox of large samples).
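The guarantee \(\Pr(p \le \alpha \mid H_0) \le \alpha\) can be checked by simulation: when \(H_0\) is true, p‑values from a well‑calibrated test are (approximately) uniform, so the rejection rate matches \(\alpha\). A minimal sketch using a z test with known σ:

```python
import random
from statistics import NormalDist, mean

random.seed(0)
nd = NormalDist()
alpha, n, reps = 0.05, 30, 2000

rejections = 0
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]  # H0 true: mu = 0, sigma = 1
    z = mean(sample) * n ** 0.5                      # z statistic (sigma = 1 known)
    p = 2 * (1 - nd.cdf(abs(z)))
    rejections += p <= alpha

rate = rejections / reps  # should land close to 0.05
```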
🔄 Key Processes
Formulate hypotheses – write \(H_0\) (often “no difference”) and \(H_1\) (the effect of interest).
Choose test & statistic – e.g., two‑sample t test → statistic \(t\).
Derive null distribution – know the sampling distribution of \(T\) under \(H_0\).
Set \(\alpha\) – decide the acceptable Type I error rate (0.05, 0.01, etc.).
Compute observed statistic – calculate \(t_{\text{obs}}\) from the data.
Find p‑value or critical value –
p‑value: \(p = P(T \ge t_{\text{obs}} \mid H_0)\) for a one‑sided test; \(p = P(|T| \ge |t_{\text{obs}}| \mid H_0)\) for a two‑sided test.
Critical value: from \(\alpha\) and the null distribution.
Apply decision rule – reject \(H_0\) if \(p \le \alpha\) or \(t_{\text{obs}}\) lies in the rejection region.
Interpret – state the result in terms of evidence, not the truth of the hypotheses.
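The steps above can be sketched end to end. The data here are simulated (a hypothetical two‑group comparison), and to stay dependency‑free a large‑sample normal approximation stands in for the exact t distribution:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
nd = NormalDist()

# Simulated data: two groups of n = 100 with a true mean difference of 1.0
a = [random.gauss(0.0, 1.0) for _ in range(100)]
b = [random.gauss(1.0, 1.0) for _ in range(100)]

# 1. H0: mu_a = mu_b  vs  H1: mu_a != mu_b
# 2-3. Statistic: Welch-style z (large-sample normal approximation to t)
se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
z_obs = (mean(a) - mean(b)) / se

# 4. Set alpha
alpha = 0.05

# 5-6. Two-sided p-value from the null distribution
p = 2 * (1 - nd.cdf(abs(z_obs)))

# 7-8. Decision rule and interpretation
reject = p <= alpha  # True here: the data are very unlikely under H0
```

For small samples the t distribution (e.g., `scipy.stats.ttest_ind`) should replace the normal approximation.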
🔍 Key Comparisons
Fisher significance testing vs. Neyman–Pearson decision theory
Fisher: no explicit \(H_1\), no Type II error; the p‑value is a continuous measure of evidence.
Neyman–Pearson: includes \(H1\), defines \(\alpha\) and \(\beta\), uses likelihood‑ratio test for optimality.
One‑sided vs. Two‑sided tests
One‑sided: critical region only on one tail; used when direction is pre‑specified.
Two‑sided: critical regions on both tails; default when direction is unknown.
Exact test vs. Approximate (asymptotic) test
Exact: computes true null distribution (e.g., Fisher’s exact test).
Approximate: relies on large‑sample theory (e.g., normal approximation).
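The one‑sided vs. two‑sided distinction is just which tail(s) count as “extreme.” A small illustrative sketch with a hypothetical observed z statistic, showing why the choice matters at the 0.05 boundary:

```python
from statistics import NormalDist

nd = NormalDist()
z_obs = 1.8  # hypothetical observed z statistic

# One-sided (H1: mu > mu0): probability in the upper tail only
p_one = 1 - nd.cdf(z_obs)

# Two-sided (H1: mu != mu0): both tails, i.e. twice the one-sided value
p_two = 2 * (1 - nd.cdf(abs(z_obs)))

# p_one ~ 0.036 is "significant" at 0.05, while p_two ~ 0.072 is not --
# which is exactly why the direction must be pre-specified, not chosen post hoc
```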
⚠️ Common Misunderstandings
“A p‑value of .03 means there is a 3 % chance the null hypothesis is true.” – false; the p‑value is computed conditional on \(H_0\) being true.
“If the result is not significant, the effect does not exist.” – false; could be low power.
“Statistical significance equals practical importance.” – false; effect size and confidence intervals are needed.
“The test’s α automatically controls the overall error when many tests are run.” – false; need Bonferroni, Holm, FDR, etc.
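The multiple‑testing point can be made concrete. With hypothetical p‑values from five related tests, Bonferroni compares each to \(\alpha/m\), and Holm steps down through \(\alpha/m, \alpha/(m-1), \dots\):

```python
# Hypothetical p-values from m = 5 related tests
pvals = [0.003, 0.021, 0.040, 0.18, 0.65]
alpha = 0.05
m = len(pvals)

# Unadjusted: three tests look "significant"
naive = [p <= alpha for p in pvals]

# Bonferroni: compare each p-value to alpha / m (controls family-wise error)
bonferroni = [p <= alpha / m for p in pvals]

# Holm step-down: sort p-values, test against alpha/m, alpha/(m-1), ...
holm = [False] * m
for rank, (i, p) in enumerate(sorted(enumerate(pvals), key=lambda t: t[1])):
    if p <= alpha / (m - rank):
        holm[i] = True
    else:
        break  # once one fails, all larger p-values fail too
```

Here three tests pass unadjusted but only one survives either correction; Holm is never more conservative than Bonferroni while still controlling the family‑wise error rate.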
🧠 Mental Models / Intuition
Evidence as a weight: think of the p‑value as the “weight” of evidence against \(H_0\); the smaller the p‑value, the stronger the push to reject.
Error trade‑off: raising \(\alpha\) makes it easier to reject (more false positives) but reduces \(\beta\) (fewer false negatives). Visualize a seesaw between Type I and Type II errors.
Critical region as a “danger zone”: if the test statistic lands in the danger zone, we conclude the data are too unlikely under \(H_0\) and reject it.
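The seesaw can be seen directly by simulation: tightening \(\alpha\) lowers the Type I rate but raises \(\beta\), i.e. power drops. A stdlib‑only sketch with an assumed true effect of 0.4 standard deviations:

```python
import random
from statistics import NormalDist, mean

random.seed(3)
nd = NormalDist()

def power(alpha, mu=0.4, n=25, reps=2000):
    """Estimate power of a two-sided z test of H0: mu = 0 when the true mean is mu."""
    hits = 0
    for _ in range(reps):
        sample = [random.gauss(mu, 1) for _ in range(n)]
        z = mean(sample) * n ** 0.5  # sigma = 1 assumed known
        hits += 2 * (1 - nd.cdf(abs(z))) <= alpha
    return hits / reps

power_05 = power(0.05)
power_01 = power(0.01)  # smaller alpha -> lower power (higher beta)
```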
🚩 Exceptions & Edge Cases
Composite null hypotheses – the null does not specify all parameters; the test’s size is the worst‑case Type I error over all null parameter values.
Conservative tests – actual \(\alpha\) is smaller than nominal; reduces false positives but also power.
Uniformly most powerful (UMP) tests – exist only for certain families of distributions (e.g., exponential family with monotone likelihood ratio).
Bootstrap hypothesis testing – useful when parametric null distribution is unknown; relies on resampling under the null.
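One common way to resample “under the null” for a difference in means is to center both groups at the grand mean, then bootstrap each centered group. A minimal sketch with hypothetical data (centering is one of several valid ways to impose \(H_0\)):

```python
import random
from statistics import mean

random.seed(2)

# Hypothetical two-group data
a = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7, 5.3, 5.6]
b = [4.2, 4.9, 4.5, 4.0, 4.8, 4.4, 4.1, 4.6]
obs = mean(a) - mean(b)

# Impose H0 by shifting each group to the grand mean
grand = mean(a + b)
a0 = [x - mean(a) + grand for x in a]
b0 = [x - mean(b) + grand for x in b]

reps, extreme = 5000, 0
for _ in range(reps):
    ra = [random.choice(a0) for _ in a0]  # resample with replacement
    rb = [random.choice(b0) for _ in b0]
    extreme += abs(mean(ra) - mean(rb)) >= abs(obs)

p_boot = extreme / reps  # small p: the observed gap is rare under H0
```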
📍 When to Use Which
Student’s t vs. normal – use t when population variance is unknown and sample size is small; normal when variance known or n large.
Exact test vs. asymptotic – choose exact for small samples or discrete data (e.g., Fisher’s exact test).
One‑sided test – only when theory a priori predicts direction; otherwise default to two‑sided.
Bootstrap test – when assumptions (normality, equal variances) are questionable or sample size is moderate.
Likelihood‑ratio test (Neyman–Pearson) – optimal for simple vs. simple hypothesis comparison; extend to composite via generalized LR.
👀 Patterns to Recognize
p‑value tiny + large sample → may indicate a statistically significant but practically trivial effect.
Non‑significant result + small n → suspect low power; consider a power analysis.
Multiple related outcomes tested together → look for a pattern of inflated Type I error; expect a correction method in the question.
Reported “trend” (p ≈ 0.07) – often a hint that the test is under‑powered or that the author is stretching significance.
🗂️ Exam Traps
Distractor: “p = 0.04 means 4 % chance H₀ is true.” – wrong interpretation of p‑value.
Distractor: “Failing to reject H₀ proves there is no effect.” – confuses lack of evidence with evidence of no effect.
Distractor: “A one‑tailed test is always more powerful than a two‑tailed test.” – only true when the direction is truly known before looking at the data.
Distractor: “If α = 0.05, the probability of a Type I error is always 5 % regardless of the test.” – only holds when the test’s size equals the nominal α (conservative tests may be lower).
Distractor: “Multiple testing does not affect the α level if each test uses p < 0.05.” – ignores family‑wise error inflation.
---
Keep this guide handy; the bullet format makes it easy to scan quickly before the exam.