RemNote Community

Hypothesis test - Major Testing Frameworks and Comparisons

Understand the key differences between Fisher and Neyman‑Pearson testing, how likelihood ratios and power analysis guide decisions, and how priors and costs can be incorporated.


Summary

Variations and Sub-Classes of Hypothesis Testing

Hypothesis testing has evolved over time, and today we practice several distinct approaches that differ fundamentally in how they interpret probability and make decisions. Understanding these variations is essential because they can lead to different conclusions from the same data, and confusion between them is common in practice.

The Two Major Philosophical Approaches

Fisher's Significance Testing

Ronald Fisher developed his approach with a focus on evidence and conclusions. In Fisher's framework, you conduct a test to determine whether your data provide sufficient evidence to conclude something meaningful about the world. The key concept is the p-value: the probability of observing data as extreme as (or more extreme than) what you actually observed, assuming the null hypothesis is true.

Fisher's method is inherently conservative. You only draw a conclusion when the evidence is strong (conventionally, when the p-value is less than 0.05). If your p-value is 0.10, for instance, Fisher would argue that you simply don't have strong enough evidence yet. This approach emphasizes avoiding false conclusions by only accepting claims with substantial supporting evidence.

The image above illustrates this concept: the shaded red area represents extreme values in the tail of a distribution. If your test statistic falls in this region (typically representing p < 0.05), Fisher's approach suggests the data provide evidence against the null hypothesis.

Neyman–Pearson Hypothesis Testing

Jerzy Neyman and Egon Pearson developed a different framework focused on decisions and actions. Rather than asking "what does the evidence suggest?", they ask "what decision should we make?" Their approach explicitly acknowledges that hypothesis testing exists in a context where decisions have consequences.
The Neyman–Pearson framework considers two types of errors:

Type I Error (α): Rejecting the null hypothesis when it's actually true (false positive)
Type II Error (β): Failing to reject the null hypothesis when it's actually false (false negative)

These errors have real costs. In medical testing, a Type I error might lead to unnecessary treatment, while a Type II error might miss a genuine disease. The Neyman–Pearson approach explicitly incorporates the decision-maker's tolerance for these errors and the relative costs of each.

The Likelihood Ratio: The Core Decision Criterion

Both major approaches rely on a fundamental principle: the likelihood ratio. This is the ratio of the probability of observing your data under one hypothesis versus under another hypothesis.

$$\text{Likelihood Ratio} = \frac{L(H_1 \mid \text{data})}{L(H_0 \mid \text{data})}$$

The Neyman–Pearson Lemma establishes that the test based on the likelihood ratio is the most powerful test available: it maximizes your ability to correctly reject a false null hypothesis for a given level of Type I error. In practical terms, this ratio tells you how many times more likely your data are under one hypothesis compared to another, and basing your decision on it is optimal.

This principle is remarkable because it works for both approaches, despite their philosophical differences. The likelihood ratio directly indicates which hypothesis better explains your observed data.

Frequentist vs. Bayesian Approaches

Another important variation involves how you interpret probability itself. Frequentist approaches (both Fisher and Neyman–Pearson fall into this category) define probability as the long-run frequency of an event if an experiment were repeated infinitely. In frequentist thinking, the parameters of a distribution are fixed but unknown; your sample data are random. Bayesian approaches treat probability as a degree of belief about a statement.
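Before going further into the Bayesian view, the two quantities introduced so far, the p-value and the likelihood ratio, can be made concrete with a small numeric sketch. The data, the hypothesized means, and the known σ = 1 below are all hypothetical choices for illustration, not part of the original material:

```python
import math

def normal_loglik(data, mu, sigma=1.0):
    """Log-likelihood of the data under a Normal(mu, sigma) model."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def two_sided_p_value(z):
    """Fisher-style two-sided p-value for a standard normal test statistic:
    the probability of a value at least as extreme as |z| under H0."""
    return math.erfc(abs(z) / math.sqrt(2))

data = [0.8, 1.2, 0.5, 1.1, 0.9]          # hypothetical sample
n, xbar = len(data), sum(data) / len(data)

# Fisher: how surprising is this sample mean if H0 (mu = 0, sigma = 1) is true?
z = xbar / (1.0 / math.sqrt(n))
p = two_sided_p_value(z)

# Neyman-Pearson: how many times more likely are the data under H1 (mu = 1)
# than under H0 (mu = 0)?
log_lr = normal_loglik(data, mu=1.0) - normal_loglik(data, mu=0.0)
likelihood_ratio = math.exp(log_lr)
```

For this sample, the p-value falls below 0.05 and the likelihood ratio favors H1, so both perspectives point the same way, as the text suggests they usually do.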
In Bayesian thinking, you explicitly incorporate prior beliefs about parameters before seeing the data, then update these beliefs using the observed data to get a posterior distribution. This approach naturally accommodates prior information and allows for a more intuitive interpretation: given the data I observed, what's the probability that my hypothesis is true?

While both frequentist and Bayesian methods use hypothesis testing, they interpret results differently. A frequentist p-value answers the question "If the null hypothesis were true, how often would I see data this extreme?" A Bayesian approach answers "Given my prior beliefs and the data I observed, what's the probability the hypothesis is true?"

<extrainfo> Bayesian methods have gained popularity in recent years because they naturally incorporate prior information, are intuitive to interpret, and handle complex models well. However, frequentist methods remain standard in most fields, partly due to their long history and partly due to philosophical positions that prior information should not influence inference. </extrainfo>

Incorporating Prior Information and Decision Costs

A key advantage of the Neyman–Pearson framework (and of Bayesian approaches) is the ability to incorporate prior probabilities and decision costs directly into your analysis.

Prior probabilities reflect how likely each hypothesis is before you collect data. If you're testing whether a coin is fair, you might assign equal prior probability to fairness and bias. But if you're testing whether a new medicine works, you might assign high prior probability to "no effect" because most new drugs fail in trials.

Decision costs recognize that errors have different consequences. In quality control, failing to detect a defective batch might be catastrophically expensive, while rejecting a good batch just causes minor inconvenience. These costs should rationally influence your threshold for decision-making.
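As an illustration of how priors and costs can enter the decision rule, here is a minimal decision-theoretic sketch. Every number in it (the prior, the likelihood ratio, and both costs) is a hypothetical value chosen for illustration:

```python
# Choose the action that minimizes expected cost, combining a prior
# probability, the likelihood ratio, and the costs of each error type.
prior_h1 = 0.1           # prior probability that the effect is real
likelihood_ratio = 12.0  # P(data | H1) / P(data | H0), assumed already computed
cost_type1 = 100.0       # cost of acting when H0 is true (false positive)
cost_type2 = 500.0       # cost of not acting when H1 is true (false negative)

# Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.
posterior_odds = (prior_h1 / (1 - prior_h1)) * likelihood_ratio
posterior_h1 = posterior_odds / (1 + posterior_odds)

# Reject H0 when the expected cost of rejecting is lower than of retaining it.
expected_cost_reject = (1 - posterior_h1) * cost_type1
expected_cost_retain = posterior_h1 * cost_type2
decision = "reject H0" if expected_cost_reject < expected_cost_retain else "retain H0"
```

With these particular numbers the asymmetric costs pull the threshold toward rejecting H0 even though the prior was skeptical; changing either cost shifts the decision boundary accordingly.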
When you incorporate these elements, your decision rule changes. You might choose a different significance level or set different critical values, depending on how much you're willing to tolerate each type of error given its consequences.

Composite Hypotheses

Most real hypothesis tests involve composite hypotheses: hypotheses whose distributions include unknown parameters beyond the specific value being tested. For example, consider testing whether a population mean equals some value $\mu_0$. The null hypothesis "$\mu = \mu_0$" is simple, but the alternative "$\mu \neq \mu_0$" is composite, because it includes all values except the null. The alternative hypothesis doesn't specify what the true mean is; it only says it's not the null value.

This creates a challenge: to calculate the likelihood under a composite hypothesis, you must consider all possible values of the unknown parameters. Typically, you use the maximum likelihood estimate: the parameter values that best explain your observed data. This approach ensures you're giving the composite hypothesis its best chance to fit the data.

Power Analysis and Sample Size Determination

Before collecting data, you should ask: "If I use these test procedures and collect this many observations, how likely am I to correctly detect a real effect if one exists?"

Statistical power is the probability of correctly rejecting the null hypothesis when it's actually false. It equals $1 - \beta$, where $\beta$ is the Type II error rate. Power depends on several factors:

Effect size: How large is the difference you're trying to detect? Larger effects are easier to detect.
Sample size: More observations provide more information, increasing power.
Significance level (α): A stricter significance level (e.g., 0.01 instead of 0.05) reduces power.
Type II error tolerance (β): How often are you willing to miss a real effect?

Power analysis uses these relationships to plan your study.
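These relationships can be put to work directly. The sketch below assumes the common two-sided z-test setting (known σ, α = 0.05, target power 0.80, a hypothetical effect size of 0.5): it computes the conventional required sample size and then checks by Monte Carlo simulation that this sample size actually delivers roughly the target power:

```python
import math
import random

def required_n(sigma, delta, z_alpha_2=1.96, z_beta=0.84):
    """Sample size for a two-sided z-test at alpha = 0.05 with target power 0.80:
    n = sigma^2 * (z_{alpha/2} + z_beta)^2 / delta^2, rounded up."""
    return math.ceil(sigma ** 2 * (z_alpha_2 + z_beta) ** 2 / delta ** 2)

def estimated_power(delta, n, sigma=1.0, z_crit=1.96, trials=20000):
    """Monte Carlo power estimate: the fraction of simulated experiments with
    true mean delta (testing H0: mu = 0) whose |z| exceeds the critical value."""
    rng = random.Random(0)  # fixed seed for reproducibility
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(delta, sigma) for _ in range(n)) / n
        hits += abs(xbar / (sigma / math.sqrt(n))) > z_crit
    return hits / trials

n = required_n(sigma=1.0, delta=0.5)  # 32 observations for this scenario
power = estimated_power(0.5, n)       # should land close to the 0.80 target
```

Doubling the effect size to 1.0 drops the required n to 8, illustrating the inverse-square dependence on effect size described below.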
Typically, you specify the smallest effect size you want to be able to detect, choose a significance level, and set a target power (conventionally 0.80, meaning you want an 80% chance of detecting a real effect). Then you calculate the required sample size.

This is crucial because collecting data is expensive and time-consuming. Power analysis prevents you from collecting an inadequate sample (too little data to detect real effects) or wastefully collecting far more data than necessary.

$$n \propto \frac{\sigma^2 (Z_{\alpha/2} + Z_{\beta})^2}{(\mu_1 - \mu_0)^2}$$

This formula shows that the required sample size increases with variability ($\sigma^2$) and stricter error tolerances, but decreases with larger effect sizes.

Comparing Fisher and Neyman–Pearson Approaches

The Philosophical Distinction

The fundamental difference is summarized simply: Fisher seeks conclusions; Neyman–Pearson seeks decisions.

Fisher asks, "What does the evidence tell us?" His p-value is a measure of how surprising the data are under the null hypothesis. A small p-value means the data are surprising, suggesting the null hypothesis might be wrong. But Fisher doesn't demand that you make a binary decision; you might find moderate evidence and simply note that the question requires further investigation.

Neyman and Pearson ask, "What decision should we make?" They require you to choose: reject the null hypothesis or fail to reject it. This reflects real-world situations where you must actually do something, such as implement a policy, approve a drug, or adjust a manufacturing process. From this perspective, saying "we need more evidence" isn't an option; you must decide.

Why They Usually Agree (But Aren't Identical)

Despite their philosophical differences, the two approaches typically produce the same numerical answer and lead to the same decisions. This is because both are founded on the likelihood ratio, which is mathematically optimal.
However, they differ in interpretation:

A Fisher-based rejection means "the data provide strong evidence against the null hypothesis."
A Neyman–Pearson rejection means "we make the decision to reject the null hypothesis, knowing we're willing to accept a Type I error rate of α."

These sound similar, but they're conceptually distinct. Fisher's interpretation is about evidence; Neyman–Pearson's is about decision costs and long-run error rates.

<extrainfo> The terminology in this field is unfortunately inconsistent. The term "hypothesis testing" often refers to a hybrid of the two approaches: you calculate a p-value (Fisher) but use a fixed significance level (Neyman–Pearson) to make a decision. Most applied researchers don't carefully distinguish between these philosophies, which can lead to misinterpretations of results. Understanding this distinction helps you interpret results more carefully and communicate them more accurately. </extrainfo>

Practical Implications

In practice, when you see researchers report results like "we rejected the null hypothesis at the 0.05 significance level," they're often mixing the two frameworks without realizing it. A pure Fisherian would say "p = 0.04, suggesting moderate evidence against the null," while a pure Neyman–Pearson follower would say "using our predetermined α = 0.05 threshold, we reject the null hypothesis."

For your exam and your practice as a statistician, the important takeaway is that hypothesis testing isn't a single unified theory: it's a collection of related approaches that usually agree on the numbers but may differ on the interpretation. Understanding these differences helps you use hypothesis testing more thoughtfully and interpret others' results more accurately.
Flashcards
What does the term "hypothesis testing" often refer to in a way that leads to confusion?
It often refers to mixtures of the Fisher and Neyman–Pearson formulations.
What does a power calculation determine in statistical testing?
The probability of correctly rejecting a false null hypothesis.
When are power calculations typically used in the research process?
For planning sample sizes before data collection.
What does the Neyman–Pearson lemma identify as the optimal decision rule for selecting a hypothesis?
The ratio of the likelihoods of two hypotheses.
What is the purpose of including the costs of actions in statistical decisions?
To incorporate economic considerations into the decision-making process.
What characterizes the distributions of composite hypotheses?
They include unknown parameters.
What additional factors can be accommodated by the Neyman–Pearson decision-focused framework?
Prior probabilities, and the costs of actions resulting from decisions.
In terms of goals, how does Fisher’s method differ from the Neyman–Pearson method?
Fisher’s method seeks conclusions, while Neyman–Pearson’s method seeks decisions.
What is the typical relationship between the numerical answers produced by the Fisher and Neyman–Pearson methods?
They usually produce the same numerical answer despite having different interpretations.

Key Concepts
Statistical Inference Methods
Frequentist inference
Bayesian inference
Statistical decision theory
Hypothesis Testing Techniques
Hypothesis testing
Neyman–Pearson lemma
Likelihood ratio test
Fisher's significance testing
Composite hypothesis
Test Performance Metrics
Statistical power