RemNote Community

Hypothesis test - Historical Evolution and NHST Foundations

Understand the historical evolution of hypothesis testing, the contrasting Fisher and Neyman‑Pearson frameworks, and how they combine in modern NHST.


Summary

History of Hypothesis Testing

Introduction

The modern hypothesis testing framework we use today is not the product of a single genius moment, but rather a hybrid of competing philosophies that developed throughout the early 20th century. Understanding this history is essential because it explains why modern practice can seem contradictory: we're actually using a blend of two fundamentally different approaches. This context will help you understand not just what hypothesis testing is, but why it works the way it does.

The Choice of Null Hypothesis Matters

Before we dive into the history, it's important to understand a subtle but critical point: the choice of what we call the "null hypothesis" dramatically affects how an experiment tests our theory.

When a scientific theory predicts a specific outcome, we can make that outcome our null hypothesis. In this case, a more precise experiment (one with less random variation or a larger sample size) makes a more severe test of the theory. Why? Because the experiment is less likely to detect the predicted effect by accident; if we find it, it's strong evidence the theory is correct.

Conversely, when we default to the null hypothesis of "no difference" or "no effect" (which is common in modern practice), a more precise experiment makes a less severe test of our motivating theory. The precise experiment makes it easier to detect small effects, so finding an effect becomes less impressive evidence for the theory. This distinction becomes important when we discuss the different philosophical approaches that shaped modern statistics.

The Pioneers: Gosset, Fisher, and Neyman–Pearson

William Sealy Gosset (Student's t-distribution)

The first major breakthrough came from an unexpected source: a chemist working for the Guinness brewery. William Sealy Gosset needed to make reliable decisions from small samples; he couldn't afford to run large experiments.
In 1908, he developed what became known as Student's t-distribution, a probability distribution that accounts for the extra uncertainty when working with small samples. The t-distribution is crucial because when sample sizes are small, the normal distribution doesn't accurately describe the variability of sample means. Gosset's work laid the practical foundation for modern hypothesis testing.

Ronald Fisher (The Foundation of Modern Significance Testing)

Ronald Fisher transformed statistics with a revolutionary idea: instead of asking "What is the probability the null hypothesis is true?", ask "If the null hypothesis were true, how surprising is our observed data?" This led to the development of the p-value. Fisher's approach emphasized:

- Experimental design as the foundation of valid inference
- The null hypothesis as a specific claim to test (typically "no effect" or "no difference")
- Significance testing: calculating the probability of observing data as extreme as ours, assuming the null hypothesis is true
- Objective interpretation: a small p-value suggests the null hypothesis is unlikely

Critically, Fisher did not propose an alternative hypothesis. He viewed hypothesis testing as a way to detect whether an effect exists, not to evaluate competing theories.

Jerzy Neyman and Egon Pearson (Formalizing the Framework)

While Fisher was developing significance testing, Jerzy Neyman and Egon Pearson took a different approach.
They created a more formal decision-making framework:

- They introduced the alternative hypothesis (the claim we're considering if we reject the null)
- They formalized Type I and Type II errors:
  - Type I error: rejecting the null hypothesis when it's actually true (false positive)
  - Type II error: failing to reject the null hypothesis when it's actually false (false negative)
- They proposed a critical region: a predetermined range of test statistics where we reject the null hypothesis
- They emphasized controlling both error rates before conducting the experiment

This approach is fundamentally different from Fisher's. Rather than asking "How surprising is our data?", Neyman and Pearson ask "What decision rule minimizes our risk of making a costly mistake?"

The Philosophical Clash: Fisher vs. Neyman–Pearson

The two approaches were in direct philosophical conflict. Here are the key differences.

Fisher's approach:

- Focus on the p-value as a continuous measure of evidence
- The null hypothesis is the theory being tested
- Interpretation is flexible and depends on context
- No predetermined alternative hypothesis
- No explicit discussion of Type II errors (since there's no alternative to measure error against)

Neyman–Pearson approach:

- Focus on controlling error rates through a fixed decision rule
- The null and alternative hypotheses are equally important
- Decision is binary: reject or fail to reject (no "maybe")
- Both Type I and Type II error rates are controlled and specified in advance
- The approach is more rigid but more prescriptive for decision-making

Fisher and Neyman even had a bitter public dispute about whose approach was correct. Fisher saw Neyman–Pearson as overly mechanical; Neyman and Pearson saw Fisher as imprecise.

The Compromise: Null Hypothesis Significance Testing (NHST)

Here's the practical reality: most researchers today use neither Fisher's nor Neyman–Pearson's approach in its pure form. Instead, we use a hybrid called Null Hypothesis Significance Testing (NHST).
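The Neyman–Pearson idea that $\alpha$ fixes the Type I error rate in advance can be made concrete with a small simulation. The sketch below is illustrative, not from the source: it uses a simple z-test with known population sigma (computed with the standard library only), and the sample size and trial count are arbitrary choices. Since every dataset is generated under a true null hypothesis, every rejection is a Type I error, and the long-run rejection rate should sit near the chosen $\alpha$.

```python
import math
import random

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for a z-test with known population sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # Standard normal CDF via erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(42)
alpha = 0.05          # Neyman-Pearson: fix the Type I error rate in advance
trials = 20_000
rejections = 0
for _ in range(trials):
    # Each experiment is generated under the null (true mean = 0),
    # so every rejection counted here is, by definition, a Type I error.
    sample = [random.gauss(0, 1) for _ in range(30)]
    if z_test_p_value(sample, mu0=0, sigma=1) < alpha:
        rejections += 1

# The observed false-positive rate should hover near alpha.
print(rejections / trials)
```

This is the sense in which the error rate is "controlled and specified in advance": no matter what the data turn out to be, the decision rule rejects a true null about 5% of the time.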
What NHST Borrows from Each Approach

From Fisher:

- The p-value as the primary tool for assessing evidence
- The null hypothesis as the baseline claim
- Informal interpretation of what counts as "significant"

From Neyman–Pearson:

- Fixed error rates (typically $\alpha = 0.05$ for a Type I error)
- A critical region (the set of test statistics that lead to rejection)
- The concept of alternative hypotheses (though not always used formally)

How NHST Works

In NHST, we:

1. State a null hypothesis (usually "no effect") and often an alternative hypothesis
2. Choose a significance level $\alpha$ (usually 0.05), which controls the maximum acceptable Type I error rate
3. Calculate a test statistic from our data
4. Determine whether the test statistic falls in the critical region
5. If it does, reject the null hypothesis and claim the result is "statistically significant"

The p-value tells us: "If the null hypothesis is true, what's the probability of observing a test statistic at least as extreme as the one we got?" If this probability is less than $\alpha$, we reject.

Why the Hybrid?

The hybrid approach became standard because it's practical: it combines Fisher's intuitive p-value interpretation with Neyman–Pearson's formal error control. However, this compromise creates philosophical inconsistencies that statisticians still debate today. You're essentially using two different frameworks simultaneously, which can lead to confusion about what a p-value actually means.

A Critical Philosophical Issue: Hypothesis Testing is Not Causal Inference

One of the most important philosophical critiques of hypothesis testing is this: finding a statistically significant relationship does not prove causation. This is often summarized as "correlation does not imply causation." Hypothesis testing can detect whether two variables are related in your data. However, it cannot by itself determine whether one variable causes changes in another.
There could be:

- Reverse causality: the effect causes the supposed cause
- Confounding variables: a third variable causes changes in both variables you're observing
- Mere coincidence: the relationship is real but not causal

For example, suppose you find a statistically significant relationship between ice cream sales and drowning deaths. Hypothesis testing can confirm this relationship is real (not due to random chance), but it cannot tell you whether ice cream causes drowning, whether drowning causes people to eat ice cream, or whether both are caused by warm weather.

This distinction is critical: hypothesis testing answers the question "Is there a relationship?" while causal inference (which requires careful experimental design, logical reasoning, or specialized statistical methods) answers "Does one variable cause the other?" Many research papers, media reports, and policy decisions unfortunately confuse these two questions. Understanding this distinction will help you interpret statistical results more critically.
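The ice cream example can be simulated to show how a confounder manufactures a real, statistically detectable relationship with no causal link. This is an invented illustration (all variable names and numbers are made up, and the standard library alone is used): temperature drives both series, neither series influences the other, yet their correlation is strong enough that any significance test would reject "no relationship".

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(1)
# Warm weather is the confounder: it drives both variables,
# but neither variable causes the other.
temperature = [random.gauss(20, 5) for _ in range(200)]
ice_cream_sales = [t + random.gauss(0, 2) for t in temperature]
drowning_deaths = [t + random.gauss(0, 2) for t in temperature]

# The correlation is strongly positive, so a hypothesis test would
# correctly reject "no relationship" - and yet changing ice cream
# sales would do nothing to drowning deaths.
print(round(pearson_r(ice_cream_sales, drowning_deaths), 2))
```

The test is doing its job: the relationship in the data is real. What the test cannot do is tell you which of the three explanations above (reverse causality, confounding, coincidence) produced it.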
Flashcards
How does a more precise experiment affect the test of a theory when the theory predicts the null hypothesis?
It makes a more severe test of the theory.
What common confusion about hypothesis testing do philosophical critics often point out?
Confusing hypothesis testing with causal inference.
Who formalized hypothesis testing by introducing explicit alternative hypotheses and error types?
Jerzy Neyman and Egon Pearson.
What specific components did Neyman and Pearson introduce to statistical testing?
The alternative hypothesis; Type I error probabilities; Type II error probabilities; decision rules based on a critical region.
Null Hypothesis Significance Testing (NHST) is a hybrid of which two historical approaches?
Fisher’s significance test and Neyman–Pearson’s decision framework.
How does NHST combine the philosophies of Fisher and Neyman–Pearson?
It retains Fisher’s informal interpretation of evidence and adopts Neyman–Pearson’s fixed error rates.

Quiz

According to historical perspectives, increasing experimental precision when a theory predicts the null hypothesis has what effect on the test of that theory?
Key Concepts
Hypothesis Testing Frameworks
Null hypothesis significance testing (NHST)
Fisher’s significance test
Neyman–Pearson hypothesis testing
Alternative hypothesis
Critical region
Philosophical Considerations
Philosophy of hypothesis testing
Student’s t‑distribution
p‑value
Type I error
Type II error