Hypothesis test - Historical Evolution and NHST Foundations
Understand the historical evolution of hypothesis testing, the contrasting Fisher and Neyman‑Pearson frameworks, and how they combine in modern NHST.
Summary
History of Hypothesis Testing
Introduction
The modern hypothesis testing framework we use today is not the product of a single genius moment, but rather a hybrid of competing philosophies that developed throughout the early 20th century. Understanding this history is essential because it explains why modern practice can seem contradictory—we're actually using a blend of two fundamentally different approaches. This context will help you understand not just what hypothesis testing is, but why it works the way it does.
The Choice of Null Hypothesis Matters
Before we dive into the history, it's important to understand a subtle but critical point: the choice of what we call the "null hypothesis" dramatically affects how an experiment tests our theory.
When a scientific theory predicts a specific outcome, we can make that outcome our null hypothesis. In this case, a more precise experiment (one with less random variation or a larger sample size) makes a more severe test of the theory. Why? Because the experiment is less likely to detect the predicted effect by accident—if we find it, it's strong evidence the theory is correct.
Conversely, when we default to the null hypothesis of "no difference" or "no effect" (which is common in modern practice), a more precise experiment makes a less severe test of our motivating theory. The precise experiment makes it easier to detect small effects, so finding an effect becomes less impressive evidence for the theory.
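The effect of precision on a test of the "no difference" null can be seen in a short simulation. This is an illustrative sketch, not from the original text: the effect size, sample sizes, and trial count are assumptions chosen to make the point visible. With a practically negligible true effect, a small experiment almost never rejects the null, while a very precise (large-sample) experiment rejects it nearly every time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
tiny_effect = 0.05   # a practically negligible true difference in means
rejection_rate = {}

for n in (20, 20000):
    rejections = 0
    for _ in range(500):
        a = rng.normal(0.0, 1.0, n)          # group with mean 0
        b = rng.normal(tiny_effect, 1.0, n)  # group with a tiny shift
        if stats.ttest_ind(a, b).pvalue < 0.05:
            rejections += 1
    rejection_rate[n] = rejections / 500
    print(f"n={n}: 'no effect' rejected in {rejection_rate[n]:.0%} of trials")
```

With the large sample, "statistically significant" is nearly guaranteed even for an effect too small to matter, which is why finding an effect there is weak evidence for the motivating theory.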
This distinction becomes important when we discuss the different philosophical approaches that shaped modern statistics.
The Pioneers: Gosset, Fisher, and Neyman–Pearson
William Sealy Gosset (Student's t-distribution)
The first major breakthrough came from an unexpected source: a chemist working for the Guinness brewery. William Sealy Gosset needed to make reliable decisions from small samples—he couldn't afford to run large experiments. In 1908, he developed what became known as Student's t-distribution, a probability distribution that accounts for the extra uncertainty when working with small samples.
The t-distribution is crucial because when sample sizes are small, the normal distribution doesn't accurately describe the variability of sample means. Gosset's work laid the practical foundation for modern hypothesis testing.
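The extra uncertainty Gosset accounted for can be seen directly by comparing critical values. A quick illustration (an assumed example, not from the text): the two-sided 95% critical value of the t-distribution is noticeably wider than the normal's for small degrees of freedom, and converges to it as samples grow.

```python
from scipy import stats

# Two-sided 95% critical values: wider for small df, converging to the normal
crit = {df: stats.t.ppf(0.975, df) for df in (4, 9, 29, 1000)}
z_crit = stats.norm.ppf(0.975)

for df, t_crit in crit.items():
    print(f"df={df:4d}: t critical = {t_crit:.3f}")
print(f"normal    : z critical = {z_crit:.3f}")
```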
Ronald Fisher (The Foundation of Modern Significance Testing)
Ronald Fisher transformed statistics with a revolutionary idea: instead of asking "What is the probability the null hypothesis is true?", ask "If the null hypothesis were true, how surprising is our observed data?" This led to the development of the p-value.
Fisher's approach emphasized:
Experimental design as the foundation of valid inference
The null hypothesis as a specific claim to test (typically "no effect" or "no difference")
Significance testing: calculating the probability of observing data as extreme as ours, assuming the null hypothesis is true
Evidential interpretation: a small p-value means the observed data would be surprising if the null hypothesis were true, and so counts as evidence against it (note: it is not the probability that the null hypothesis is true)
Critically, Fisher did not propose an alternative hypothesis. He viewed hypothesis testing as a way to detect whether an effect exists, not to evaluate competing theories.
Jerzy Neyman and Egon Pearson (Formalizing the Framework)
While Fisher was developing significance testing, Jerzy Neyman and Egon Pearson took a different approach. They created a more formal decision-making framework:
They introduced the alternative hypothesis (the claim we're considering if we reject the null)
They formalized Type I and Type II errors:
Type I error: Rejecting the null hypothesis when it's actually true (false positive)
Type II error: Failing to reject the null hypothesis when it's actually false (false negative)
They proposed a critical region: a predetermined range of test statistics where we reject the null hypothesis
They emphasized controlling both error rates before conducting the experiment
This approach is fundamentally different from Fisher's. Rather than asking "How surprising is our data?", Neyman and Pearson ask "What decision rule minimizes our risk of making a costly mistake?"
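The two error types can be made concrete with a Monte Carlo sketch. This is an illustrative simulation with assumed values (sample size, effect size, and trial count are not from the text): when the null is true, the long-run rejection rate tracks α; when the null is false, non-rejections are Type II errors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, trials, n = 0.05, 2000, 30

# Null true (both means 0): any rejection is a Type I error
type1_rate = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
    for _ in range(trials)
) / trials

# Null false (true shift of 0.8): any non-rejection is a Type II error
type2_rate = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0.8, 1, n)).pvalue >= alpha
    for _ in range(trials)
) / trials

print(f"Type I rate  = {type1_rate:.3f} (target alpha = {alpha})")
print(f"Type II rate = {type2_rate:.3f}; power = {1 - type2_rate:.3f}")
```

Fixing the decision rule in advance is what lets Neyman and Pearson guarantee the Type I rate; the Type II rate then depends on the true effect size and sample size.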
The Philosophical Clash: Fisher vs. Neyman–Pearson
The two approaches were in direct philosophical conflict. Here are the key differences:
Fisher's approach:
Focus on the p-value as a continuous measure of evidence
The null hypothesis is the theory being tested
Interpretation is flexible and depends on context
No predetermined alternative hypothesis
No explicit discussion of Type II errors (since there's no alternative to measure error against)
Neyman–Pearson approach:
Focus on controlling error rates through a fixed decision rule
The null and alternative hypotheses are equally important
Decision is binary: reject or fail to reject (no "maybe")
Both Type I and Type II error rates are controlled and specified in advance
The approach is more rigid but more prescriptive for decision-making
Fisher and Neyman even had a bitter public dispute about whose approach was correct. Fisher saw Neyman–Pearson as overly mechanical; Neyman and Pearson saw Fisher as imprecise.
The Compromise: Null Hypothesis Significance Testing (NHST)
Here's the practical reality: most researchers today use neither Fisher's nor Neyman–Pearson's approach in its pure form. Instead, we use a hybrid called Null Hypothesis Significance Testing (NHST).
What NHST Borrows from Each Approach
From Fisher:
The p-value as the primary tool for assessing evidence
The null hypothesis as the baseline claim
Informal interpretation of what counts as "significant"
From Neyman–Pearson:
Fixed error rates (typically $\alpha = 0.05$ for the Type I error rate)
A critical region (the set of test statistics that lead to rejection)
The concept of alternative hypotheses (though not always used formally)
How NHST Works
In NHST, we:
State a null hypothesis (usually "no effect") and often an alternative hypothesis
Choose a significance level $\alpha$ (usually 0.05), which controls the maximum acceptable Type I error rate
Calculate a test statistic from our data
Determine whether the test statistic falls in the critical region
If it does, we reject the null hypothesis and claim our result is "statistically significant"
The p-value tells us: "If the null hypothesis is true, what's the probability of observing a test statistic at least as extreme as the one we got?" If this probability is less than $\alpha$, we reject.
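The steps above can be sketched end to end with a two-sample t-test. The data here are invented for illustration; the point is that the p-value comparison and the critical-region check are two views of the same decision.

```python
from scipy import stats

# Hypothetical measurements for two groups (illustrative values)
control   = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.9]
treatment = [5.6, 5.8, 5.4, 5.9, 5.5, 5.7, 5.6, 5.8]

alpha = 0.05                                    # step 2: significance level
t_stat, p_value = stats.ttest_ind(control, treatment)  # step 3: test statistic

# Step 4: the critical region is |t| > t_crit for a two-sided test
df = len(control) + len(treatment) - 2
t_crit = stats.t.ppf(1 - alpha / 2, df)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, critical |t| = {t_crit:.2f}")
if p_value < alpha:                             # step 5: decide
    print("Reject the null: statistically significant.")
else:
    print("Fail to reject the null.")
```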
Why the Hybrid?
The hybrid approach became standard because it's practical: it combines Fisher's intuitive p-value interpretation with Neyman–Pearson's formal error control. However, this compromise creates philosophical inconsistencies that statisticians still debate today. You're essentially using two different frameworks simultaneously, which can lead to confusion about what a p-value actually means.
A Critical Philosophical Issue: Hypothesis Testing is Not Causal Inference
One of the most important philosophical critiques of hypothesis testing is this: finding a statistically significant relationship does not prove causation. This is often summarized as "correlation does not imply causation."
Hypothesis testing can detect whether two variables are related in your data. However, it cannot by itself determine whether one variable causes changes in another. There could be:
Reverse causality: The effect causes the supposed cause
Confounding variables: A third variable causes changes in both variables you're observing
Mere coincidence: The relationship is real but not causal
For example, suppose you find a statistically significant relationship between ice cream sales and drowning deaths. Hypothesis testing can indicate the relationship is unlikely to be due to random chance alone, but it cannot tell you whether ice cream causes drowning, whether drowning causes people to eat ice cream, or whether both are caused by warm weather.
This distinction is critical: hypothesis testing answers the question "Is there a relationship?" while causal inference (which requires careful experimental design, logical reasoning, or specialized statistical methods) answers "Does one variable cause the other?"
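The ice cream example can be demonstrated with a toy simulation (all numbers here are invented): temperature drives both variables, so they correlate strongly and "significantly" even though neither causes the other.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
temperature = rng.uniform(10, 35, 200)                  # daily temperature
ice_cream   = 2.0 * temperature + rng.normal(0, 5, 200) # sales driven by heat
drownings   = 0.3 * temperature + rng.normal(0, 1, 200) # swimming driven by heat

# Strong, highly significant correlation with no causal link between the two
r, p = stats.pearsonr(ice_cream, drownings)
print(f"correlation r = {r:.2f}, p = {p:.2g}")
```

A hypothesis test applied to these two series would reject "no relationship" decisively, yet the only causal arrows in the simulation point from temperature to each variable separately.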
Many research papers, media reports, and policy decisions unfortunately confuse these two questions. Understanding this distinction will help you interpret statistical results more critically.
Flashcards
How does a more precise experiment affect the test of a theory when the theory predicts the null hypothesis?
It makes a more severe test of the theory.
What common confusion about hypothesis testing do philosophical critics often point out?
Confusing hypothesis testing with causal inference.
Who formalized hypothesis testing by introducing explicit alternative hypotheses and error types?
Jerzy Neyman and Egon Pearson.
What specific components did Neyman and Pearson introduce to statistical testing?
Alternative hypothesis
Type I error probabilities
Type II error probabilities
Decision rules based on a critical region
Null Hypothesis Significance Testing (NHST) is a hybrid of which two historical approaches?
Fisher’s significance test and Neyman–Pearson’s decision framework.
How does NHST combine the philosophies of Fisher and Neyman–Pearson?
Retains Fisher’s informal interpretation of evidence
Adopts Neyman–Pearson’s fixed error rates
Quiz
Hypothesis test - Historical Evolution and NHST Foundations Quiz Question 1: According to historical perspectives, increasing experimental precision when a theory predicts the null hypothesis has what effect on the test of that theory?
- It makes the test more severe (correct)
- It makes the test less severe
- It has no effect on test severity
- It changes the null hypothesis
Question 2: When the null hypothesis defaults to “no difference” or “no effect,” how does a more precise experiment affect the test of the motivating theory?
- It makes the test less severe (correct)
- It makes the test more severe
- It eliminates the need for a null hypothesis
- It increases Type II error rates
Question 3: Which statistician introduced the null hypothesis, analysis of variance, and significance testing?
- Ronald Fisher (correct)
- William Sealy Gosset
- Jerzy Neyman
- Egon Pearson
Question 4: Which approach to hypothesis testing did not define Type II errors?
- Fisher’s significance testing (correct)
- Neyman–Pearson decision framework
- Bayesian inference
- Chi‑squared testing
Question 5: In NHST, what statistic is used to determine whether the observed test statistic falls within the critical region?
- p‑value (correct)
- t‑statistic
- confidence interval
- effect size
Question 6: Critics often warn that hypothesis testing is sometimes confused with what type of inference?
- Causal inference (correct)
- Descriptive statistics
- Predictive modeling
- Non‑parametric analysis
Key Concepts
Hypothesis Testing Frameworks
Null hypothesis significance testing (NHST)
Fisher’s significance test
Neyman–Pearson hypothesis testing
Alternative hypothesis
Critical region
Philosophical Considerations
Philosophy of hypothesis testing
Student’s t‑distribution
p‑value
Type I error
Type II error
Definitions
Null hypothesis significance testing (NHST)
A hybrid statistical method combining Fisher’s significance testing with Neyman–Pearson decision rules, using p‑values and fixed error rates to evaluate hypotheses.
Fisher’s significance test
An early approach to statistical inference that assesses evidence against a null hypothesis via p‑values without specifying an alternative hypothesis.
Neyman–Pearson hypothesis testing
A formal decision‑theoretic framework that defines null and alternative hypotheses, Type I and Type II error probabilities, and optimal critical regions.
Student’s t‑distribution
A probability distribution introduced by William Sealy Gosset (under the pseudonym “Student”) for inference about means when sample sizes are small and variance is unknown.
p‑value
The probability, under the assumption that the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed.
Type I error
The error of incorrectly rejecting a true null hypothesis; its probability is the significance level α.
Type II error
The error of failing to reject a false null hypothesis; its probability is denoted β, and the test’s power is 1 − β.
Alternative hypothesis
The hypothesis that contradicts the null hypothesis, representing the effect or difference a researcher seeks to detect.
Critical region
The set of values for a test statistic that leads to rejection of the null hypothesis at a pre‑specified significance level.
Philosophy of hypothesis testing
The ongoing debate over the interpretation, validity, and limitations of statistical hypothesis testing, including concerns about conflating statistical significance with causal inference.