Hypothesis test - Historical Evolution and NHST Foundations
Understand the historical evolution of hypothesis testing, the contrasting Fisher and Neyman‑Pearson frameworks, and how they combine in modern NHST.
Summary
History of Hypothesis Testing
Introduction
The modern hypothesis testing framework we use today is not the product of a single genius moment, but rather a hybrid of competing philosophies that developed throughout the early 20th century. Understanding this history is essential because it explains why modern practice can seem contradictory—we're actually using a blend of two fundamentally different approaches. This context will help you understand not just what hypothesis testing is, but why it works the way it does.
The Choice of Null Hypothesis Matters
Before we dive into the history, it's important to understand a subtle but critical point: the choice of what we call the "null hypothesis" dramatically affects how an experiment tests our theory.
When a scientific theory predicts a specific outcome, we can make that outcome our null hypothesis. In this case, a more precise experiment (one with less random variation or a larger sample size) makes a more severe test of the theory. Why? Because the experiment is less likely to detect the predicted effect by accident—if we find it, it's strong evidence the theory is correct.
Conversely, when we default to the null hypothesis of "no difference" or "no effect" (which is common in modern practice), a more precise experiment makes a less severe test of our motivating theory. The precise experiment makes it easier to detect small effects, so finding an effect becomes less impressive evidence for the theory.
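The effect of precision on a test of the "no difference" null can be seen in a short simulation. This is an illustrative sketch, not from the original text: the effect size, sample sizes, and trial count are assumptions chosen to make the point visible. With a practically negligible true effect, a small experiment almost never rejects the null, while a very precise (large-sample) experiment rejects it nearly every time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
tiny_effect = 0.05   # a practically negligible true difference in means
rejection_rate = {}

for n in (20, 20000):
    rejections = 0
    for _ in range(500):
        a = rng.normal(0.0, 1.0, n)          # group with mean 0
        b = rng.normal(tiny_effect, 1.0, n)  # group with a tiny shift
        if stats.ttest_ind(a, b).pvalue < 0.05:
            rejections += 1
    rejection_rate[n] = rejections / 500
    print(f"n={n}: 'no effect' rejected in {rejection_rate[n]:.0%} of trials")
```

With the large sample, "statistically significant" is nearly guaranteed even for an effect too small to matter, which is why finding an effect there is weak evidence for the motivating theory.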
This distinction becomes important when we discuss the different philosophical approaches that shaped modern statistics.
The Pioneers: Gosset, Fisher, and Neyman–Pearson
William Sealy Gosset (Student's t-distribution)
The first major breakthrough came from an unexpected source: a chemist working for the Guinness brewery. William Sealy Gosset needed to make reliable decisions from small samples—he couldn't afford to run large experiments. In 1908, he developed what became known as Student's t-distribution, a probability distribution that accounts for the extra uncertainty when working with small samples.
The t-distribution is crucial because when sample sizes are small, the normal distribution doesn't accurately describe the variability of sample means. Gosset's work laid the practical foundation for modern hypothesis testing.
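The extra uncertainty Gosset accounted for can be seen directly by comparing critical values. A quick illustration (an assumed example, not from the text): the two-sided 95% critical value of the t-distribution is noticeably wider than the normal's for small degrees of freedom, and converges to it as samples grow.

```python
from scipy import stats

# Two-sided 95% critical values: wider for small df, converging to the normal
crit = {df: stats.t.ppf(0.975, df) for df in (4, 9, 29, 1000)}
z_crit = stats.norm.ppf(0.975)

for df, t_crit in crit.items():
    print(f"df={df:4d}: t critical = {t_crit:.3f}")
print(f"normal    : z critical = {z_crit:.3f}")
```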
Ronald Fisher (The Foundation of Modern Significance Testing)
Ronald Fisher transformed statistics with a revolutionary idea: instead of asking "What is the probability the null hypothesis is true?", ask "If the null hypothesis were true, how surprising is our observed data?" This led to the development of the p-value.
Fisher's approach emphasized:
Experimental design as the foundation of valid inference
The null hypothesis as a specific claim to test (typically "no effect" or "no difference")
Significance testing: calculating the probability of observing data as extreme as ours, assuming the null hypothesis is true
Evidential interpretation: a small p-value means the observed data would be surprising if the null hypothesis were true, and so counts as evidence against it (note: it is not the probability that the null hypothesis is true)
Critically, Fisher did not propose an alternative hypothesis. He viewed hypothesis testing as a way to detect whether an effect exists, not to evaluate competing theories.
Jerzy Neyman and Egon Pearson (Formalizing the Framework)
While Fisher was developing significance testing, Jerzy Neyman and Egon Pearson took a different approach. They created a more formal decision-making framework:
They introduced the alternative hypothesis (the claim we're considering if we reject the null)
They formalized Type I and Type II errors:
Type I error: Rejecting the null hypothesis when it's actually true (false positive)
Type II error: Failing to reject the null hypothesis when it's actually false (false negative)
They proposed a critical region: a predetermined range of test statistics where we reject the null hypothesis
They emphasized controlling both error rates before conducting the experiment
This approach is fundamentally different from Fisher's. Rather than asking "How surprising is our data?", Neyman and Pearson ask "What decision rule minimizes our risk of making a costly mistake?"
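The two error types can be made concrete with a Monte Carlo sketch. This is an illustrative simulation with assumed values (sample size, effect size, and trial count are not from the text): when the null is true, the long-run rejection rate tracks α; when the null is false, non-rejections are Type II errors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, trials, n = 0.05, 2000, 30

# Null true (both means 0): any rejection is a Type I error
type1_rate = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
    for _ in range(trials)
) / trials

# Null false (true shift of 0.8): any non-rejection is a Type II error
type2_rate = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0.8, 1, n)).pvalue >= alpha
    for _ in range(trials)
) / trials

print(f"Type I rate  = {type1_rate:.3f} (target alpha = {alpha})")
print(f"Type II rate = {type2_rate:.3f}; power = {1 - type2_rate:.3f}")
```

Fixing the decision rule in advance is what lets Neyman and Pearson guarantee the Type I rate; the Type II rate then depends on the true effect size and sample size.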
The Philosophical Clash: Fisher vs. Neyman–Pearson
The two approaches were in direct philosophical conflict. Here are the key differences:
Fisher's approach:
Focus on the p-value as a continuous measure of evidence
The null hypothesis is the theory being tested
Interpretation is flexible and depends on context
No predetermined alternative hypothesis
No explicit discussion of Type II errors (since there's no alternative to measure error against)
Neyman–Pearson approach:
Focus on controlling error rates through a fixed decision rule
The null and alternative hypotheses are equally important
Decision is binary: reject or fail to reject (no "maybe")
Both Type I and Type II error rates are controlled and specified in advance
The approach is more rigid but more prescriptive for decision-making
Fisher and Neyman even had a bitter public dispute about whose approach was correct. Fisher saw Neyman–Pearson as overly mechanical; Neyman and Pearson saw Fisher as imprecise.
The Compromise: Null Hypothesis Significance Testing (NHST)
Here's the practical reality: most researchers today use neither Fisher's nor Neyman–Pearson's approach in its pure form. Instead, we use a hybrid called Null Hypothesis Significance Testing (NHST).
What NHST Borrows from Each Approach
From Fisher:
The p-value as the primary tool for assessing evidence
The null hypothesis as the baseline claim
Informal interpretation of what counts as "significant"
From Neyman–Pearson:
Fixed error rates (typically $\alpha = 0.05$ for the Type I error rate)
A critical region (the set of test statistics that lead to rejection)
The concept of alternative hypotheses (though not always used formally)
How NHST Works
In NHST, we:
State a null hypothesis (usually "no effect") and often an alternative hypothesis
Choose a significance level $\alpha$ (usually 0.05), which controls the maximum acceptable Type I error rate
Calculate a test statistic from our data
Determine whether the test statistic falls in the critical region
If it does, we reject the null hypothesis and claim our result is "statistically significant"
The p-value tells us: "If the null hypothesis is true, what's the probability of observing a test statistic at least as extreme as the one we got?" If this probability is less than $\alpha$, we reject.
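The steps above can be sketched end to end with a two-sample t-test. The data here are invented for illustration; the point is that the p-value comparison and the critical-region check are two views of the same decision.

```python
from scipy import stats

# Hypothetical measurements for two groups (illustrative values)
control   = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.9]
treatment = [5.6, 5.8, 5.4, 5.9, 5.5, 5.7, 5.6, 5.8]

alpha = 0.05                                    # step 2: significance level
t_stat, p_value = stats.ttest_ind(control, treatment)  # step 3: test statistic

# Step 4: the critical region is |t| > t_crit for a two-sided test
df = len(control) + len(treatment) - 2
t_crit = stats.t.ppf(1 - alpha / 2, df)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, critical |t| = {t_crit:.2f}")
if p_value < alpha:                             # step 5: decide
    print("Reject the null: statistically significant.")
else:
    print("Fail to reject the null.")
```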
Why the Hybrid?
The hybrid approach became standard because it's practical: it combines Fisher's intuitive p-value interpretation with Neyman–Pearson's formal error control. However, this compromise creates philosophical inconsistencies that statisticians still debate today. You're essentially using two different frameworks simultaneously, which can lead to confusion about what a p-value actually means.
A Critical Philosophical Issue: Hypothesis Testing is Not Causal Inference
One of the most important philosophical critiques of hypothesis testing is this: finding a statistically significant relationship does not prove causation. This is often summarized as "correlation does not imply causation."
Hypothesis testing can detect whether two variables are related in your data. However, it cannot by itself determine whether one variable causes changes in another. There could be:
Reverse causality: The effect causes the supposed cause
Confounding variables: A third variable causes changes in both variables you're observing
Mere coincidence: The relationship is real but not causal
For example, suppose you find a statistically significant relationship between ice cream sales and drowning deaths. Hypothesis testing can indicate the relationship is unlikely to be due to random chance alone, but it cannot tell you whether ice cream causes drowning, whether drowning causes people to eat ice cream, or whether both are caused by warm weather.
This distinction is critical: hypothesis testing answers the question "Is there a relationship?" while causal inference (which requires careful experimental design, logical reasoning, or specialized statistical methods) answers "Does one variable cause the other?"
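The ice cream example can be demonstrated with a toy simulation (all numbers here are invented): temperature drives both variables, so they correlate strongly and "significantly" even though neither causes the other.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
temperature = rng.uniform(10, 35, 200)                  # daily temperature
ice_cream   = 2.0 * temperature + rng.normal(0, 5, 200) # sales driven by heat
drownings   = 0.3 * temperature + rng.normal(0, 1, 200) # swimming driven by heat

# Strong, highly significant correlation with no causal link between the two
r, p = stats.pearsonr(ice_cream, drownings)
print(f"correlation r = {r:.2f}, p = {p:.2g}")
```

A hypothesis test applied to these two series would reject "no relationship" decisively, yet the only causal arrows in the simulation point from temperature to each variable separately.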
Many research papers, media reports, and policy decisions unfortunately confuse these two questions. Understanding this distinction will help you interpret statistical results more critically.
Flashcards
How does a more precise experiment affect the test of a theory when the theory predicts the null hypothesis?
It makes a more severe test of the theory.
What common confusion about hypothesis testing do philosophical critics often point out?
Confusing hypothesis testing with causal inference.
Who formalized hypothesis testing by introducing explicit alternative hypotheses and error types?
Jerzy Neyman and Egon Pearson.
What specific components did Neyman and Pearson introduce to statistical testing?
Alternative hypothesis
Type I error probabilities
Type II error probabilities
Decision rules based on a critical region
Null Hypothesis Significance Testing (NHST) is a hybrid of which two historical approaches?
Fisher’s significance test and Neyman–Pearson’s decision framework.
How does NHST combine the philosophies of Fisher and Neyman–Pearson?
Retains Fisher’s informal interpretation of evidence
Adopts Neyman–Pearson’s fixed error rates
Quiz
Hypothesis test - Historical Evolution and NHST Foundations Quiz Question 1: According to historical perspectives, increasing experimental precision when a theory predicts the null hypothesis has what effect on the test of that theory?
- It makes the test more severe (correct)
- It makes the test less severe
- It has no effect on test severity
- It changes the null hypothesis
Question 2: When the null hypothesis defaults to “no difference” or “no effect,” how does a more precise experiment affect the test of the motivating theory?
- It makes the test less severe (correct)
- It makes the test more severe
- It eliminates the need for a null hypothesis
- It increases Type II error rates
Question 3: Which statistician introduced the null hypothesis, analysis of variance, and significance testing?
- Ronald Fisher (correct)
- William Sealy Gosset
- Jerzy Neyman
- Egon Pearson
Question 4: Which approach to hypothesis testing did not define Type II errors?
- Fisher’s significance testing (correct)
- Neyman–Pearson decision framework
- Bayesian inference
- Chi‑squared testing
Question 5: In NHST, what statistic is used to determine whether the observed test statistic falls within the critical region?
- p‑value (correct)
- t‑statistic
- confidence interval
- effect size
Question 6: Critics often warn that hypothesis testing is sometimes confused with what type of inference?
- Causal inference (correct)
- Descriptive statistics
- Predictive modeling
- Non‑parametric analysis
Key Concepts
Hypothesis Testing Frameworks
Null hypothesis significance testing (NHST)
Fisher’s significance test
Neyman–Pearson hypothesis testing
Alternative hypothesis
Critical region
Philosophical Considerations
Philosophy of hypothesis testing
Student’s t‑distribution
p‑value
Type I error
Type II error
Definitions
Null hypothesis significance testing (NHST)
A hybrid statistical method combining Fisher’s significance testing with Neyman–Pearson decision rules, using p‑values and fixed error rates to evaluate hypotheses.
Fisher’s significance test
An early approach to statistical inference that assesses evidence against a null hypothesis via p‑values without specifying an alternative hypothesis.
Neyman–Pearson hypothesis testing
A formal decision‑theoretic framework that defines null and alternative hypotheses, Type I and Type II error probabilities, and optimal critical regions.
Student’s t‑distribution
A probability distribution introduced by William Sealy Gosset (under the pseudonym “Student”) for inference about means when sample sizes are small and variance is unknown.
p‑value
The probability, under the assumption that the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed.
Type I error
The error of incorrectly rejecting a true null hypothesis; its probability is the significance level α.
Type II error
The error of failing to reject a false null hypothesis; its probability is denoted β, and the test’s power is 1 − β.
Alternative hypothesis
The hypothesis that contradicts the null hypothesis, representing the effect or difference a researcher seeks to detect.
Critical region
The set of values for a test statistic that leads to rejection of the null hypothesis at a pre‑specified significance level.
Philosophy of hypothesis testing
The ongoing debate over the interpretation, validity, and limitations of statistical hypothesis testing, including concerns about conflating statistical significance with causal inference.