
Replication crisis - Statistical Foundations and Key Concepts

Learn the fundamentals of hypothesis testing, common p‑value pitfalls, and alternative approaches to improve statistical inference.


Summary

Statistical Foundations: Hypothesis Testing and Inference

Introduction

Hypothesis testing is the foundation of modern statistical inference in research. It provides a formal framework for determining whether observed data provide evidence for or against a particular scientific claim. Understanding hypothesis testing requires knowledge of several interconnected concepts: p-values, effect sizes, statistical significance, and the types of errors that can occur. This chapter guides you through these essentials and addresses common misinterpretations that can lead to flawed conclusions.

Hypothesis Testing Basics

At its core, hypothesis testing is a method for evaluating whether data support a proposed hypothesis. A hypothesis is considered supported when two conditions are met: (1) the results match the predicted pattern, and (2) the result is statistically significant.

Statistical significance is not about whether a finding is important or meaningful; it is a specific statistical determination. A result is statistically significant when the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true, falls below a predetermined threshold called the significance level (denoted $\alpha$).

The null hypothesis is the default assumption we test against. It typically states that no effect exists or that a parameter equals zero. For example:

In drug testing: "The drug has no effect on patient outcomes"
In psychology: "There is no difference between the two groups"

The alternative hypothesis is what we hope to find evidence for: that an effect does exist.

P-Values and Significance Thresholds

The p-value is one of the most important (and most frequently misunderstood) concepts in statistics. The p-value is the probability of obtaining results at least as extreme as what was observed, assuming the null hypothesis is true. This is a crucial point: the p-value does NOT tell you the probability that the null hypothesis is true.
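The definition of a p-value ("results at least as extreme as observed, assuming the null hypothesis is true") can be made concrete with a permutation test, which computes that probability by brute force. This is an illustrative sketch with made-up data, not an example from the text:

```python
import random

random.seed(0)

# Hypothetical outcome scores for a treatment and a control group.
treatment = [7.1, 6.8, 7.9, 8.2, 7.5, 8.0, 7.3, 7.7]
control = [6.5, 6.9, 6.2, 7.0, 6.4, 6.8, 6.6, 6.1]

observed = sum(treatment) / len(treatment) - sum(control) / len(control)

# Under the null hypothesis the group labels are arbitrary, so shuffle
# them many times and count how often a mean difference at least as
# extreme as the observed one arises by chance alone.
pooled = treatment + control
n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = sum(pooled[:8]) / 8 - sum(pooled[8:]) / 8
    if abs(diff) >= abs(observed):  # two-sided: extreme in either direction
        extreme += 1

p_value = extreme / n_perm
print(f"observed difference: {observed:.2f}, p ≈ {p_value:.4f}")
```

Here the observed mean difference of 1.0 almost never appears under shuffled labels, so the p-value is very small: the data would be surprising if the null hypothesis were true.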
It tells you how likely your observed results would be if the null hypothesis were correct. Think of it this way: if the null hypothesis is true and there is really no effect, how surprising would your results be?

Common Significance Thresholds

Researchers choose a significance level $\alpha$ before analyzing data. The most common thresholds are:

$p < 0.05$: The standard in many fields. It means we are willing to accept a 5% false-positive rate: if we repeated the experiment many times and the null hypothesis were always true, we would expect to incorrectly reject it about 5% of the time.
$p < 0.01$: A stricter threshold that reduces the false-positive rate to 1%. This requires larger sample sizes to achieve the same statistical power.
$p < 0.001$: An even stricter threshold with only a 0.1% false-positive rate, but demanding even larger samples.

Understanding the P-Value Distribution

Under a true null hypothesis (when there truly is no effect), p-values follow a uniform distribution on the interval $[0, 1]$: every p-value from 0 to 1 is equally likely. This property is important to remember when considering multiple tests.

Effect Size and Test Statistics

Understanding effect size is critical for interpreting what hypothesis tests actually measure. Effect size is a real number quantifying the magnitude of a phenomenon. Key properties:

Effect size = 0 when the null hypothesis is true (no effect)
Effect size grows larger when the null hypothesis is false and the true effect is strong
Effect size is independent of sample size

Cohen's d is one of the most common estimators of effect size, particularly when comparing means between two groups:

$$d = \frac{\text{mean}_1 - \text{mean}_2}{\text{pooled standard deviation}}$$

A test statistic is a numerical value computed from your data that estimates the effect size.
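Cohen's d is straightforward to compute by hand; a common convention (assumed here) divides the mean difference by the pooled standard deviation of the two groups. A minimal sketch with hypothetical scores:

```python
import statistics


def cohens_d(group1, group2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    var1 = statistics.variance(group1)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(group2)
    pooled_sd = (((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd


# Hypothetical scores: identical spread, means exactly one unit apart.
a = [5.0, 6.0, 7.0, 8.0, 9.0]
b = [4.0, 5.0, 6.0, 7.0, 8.0]
print(round(cohens_d(a, b), 3))  # a medium-to-large standardized difference
```

Because d is standardized, it stays the same whether the raw scores are measured in points, seconds, or milligrams, which is what makes it comparable across studies.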
Test statistics (like t-statistics, z-scores, or F-statistics) are what we use to calculate p-values. Different research designs use different test statistics, but they all serve the same purpose: quantifying evidence against the null hypothesis. An important distinction: the effect size is what we care about scientifically (how big is the effect?), while the test statistic is the tool we use to test whether that effect exists.

Outcomes of Null-Hypothesis Tests

A hypothesis test produces a binary decision: either "reject the null hypothesis" or "fail to reject the null hypothesis." That decision can itself be correct or incorrect, leading to four possible outcomes depending on whether the null hypothesis is actually true and whether we reject it:

True Negatives (Correct Non-Rejection): The null hypothesis is true (no real effect), and we fail to reject it. We correctly concluded there is no effect.
False Positives (Type I Error): The null hypothesis is true (no real effect), but we reject it anyway. We incorrectly concluded an effect exists when it does not.
False Negatives (Type II Error): The null hypothesis is false (a real effect exists), but we fail to reject it. We incorrectly concluded there is no effect when one actually exists.
True Positives (Correct Detection): The null hypothesis is false (a real effect exists), and we reject it. We correctly identified the effect.

This framework is essential for understanding the errors that can occur in hypothesis testing.

Significance Level, Power, and Error Rates

Three key concepts define the performance characteristics of any hypothesis test:

Significance Level ($\alpha$): The probability of a false-positive (Type I) error. When you set $\alpha = 0.05$, you are saying: "I am willing to accept a 5% chance of incorrectly rejecting the null hypothesis when it is actually true." You choose this threshold before conducting the test.
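The four outcomes can be made tangible with a simulation: run many hypothetical experiments, half with a real effect and half without, and tally where each test decision lands. The sketch below assumes a simple two-sided z-test with known unit variance; the sample size, effect size, and experiment count are all illustrative choices, not values from the text:

```python
import math
import random

random.seed(1)

ALPHA, N, MU_EFFECT = 0.05, 30, 0.5


def p_value(sample, null_mean=0.0):
    """Two-sided z-test p-value; population SD assumed known and equal to 1."""
    z = (sum(sample) / len(sample) - null_mean) * math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
for _ in range(2000):
    effect_real = random.random() < 0.5  # half the experiments have a real effect
    mu = MU_EFFECT if effect_real else 0.0
    sample = [random.gauss(mu, 1) for _ in range(N)]
    reject = p_value(sample) < ALPHA
    if effect_real:
        counts["TP" if reject else "FN"] += 1
    else:
        counts["FP" if reject else "TN"] += 1

print(counts)
```

With these settings, the false-positive fraction among true nulls hovers near $\alpha$, while detection of real effects is limited by the test's statistical power.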
Statistical Power: Also called the true-positive rate, statistical power is the probability of correctly detecting a true effect when it actually exists. If power = 0.80, there is an 80% chance you will find the effect if it is really there.

False-Negative Rate ($\beta$): The probability of a false-negative (Type II) error. Importantly:

$$\beta = 1 - \text{Power}$$

If power is 0.80, then $\beta = 0.20$. These relationships highlight a key trade-off: you cannot minimize both error types simultaneously without changing sample size. Larger samples allow you to reduce both $\alpha$ and $\beta$.

Critical Problems with P-Values

While p-values are widely used, they are frequently misinterpreted, leading to serious errors in scientific conclusions.

Common Misinterpretations

The most dangerous misinterpretation is treating the p-value as the probability that the null hypothesis is true. This is incorrect. If your p-value is 0.03, this does NOT mean there is only a 3% chance your hypothesis is true. Rather, it means: "If the null hypothesis were true, there would be only a 3% chance of observing results this extreme." This distinction matters enormously because it touches on conditional probability. A p-value is $P(\text{data} \mid \text{null hypothesis true})$, not $P(\text{null hypothesis true} \mid \text{data})$.

False Positive Risk

Here is a crucial reality check: the false-positive risk (FPR) is often much higher than researchers expect, even at $p < 0.05$. The relationship between the p-value and the actual false-positive risk depends on:

How common true effects are in your field (the prior probability)
The sample size used in the study
Whether you tested multiple hypotheses

In fields where true effects are rare and sample sizes modest, a p-value of 0.05 might actually correspond to a 20-30% false-positive risk rather than 5%. This is a critical insight that changes how we should interpret conventional significance thresholds.
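The 20-30% figure is not hand-waving; it follows from the base rates via Bayes' rule. A minimal sketch, assuming independent tests and treating the prior probability of a true effect as a known input:

```python
def false_positive_risk(alpha, power, prior):
    """Fraction of 'significant' results that are false positives,
    given the prior probability that a tested effect is real."""
    false_pos = alpha * (1 - prior)  # true nulls that reach significance
    true_pos = power * prior         # real effects that reach significance
    return false_pos / (false_pos + true_pos)


# If only 10% of tested hypotheses are true, with alpha = 0.05 and power = 0.80:
print(round(false_positive_risk(0.05, 0.80, 0.10), 3))  # → 0.36
```

With a 10% prior, $\alpha = 0.05$, and 80% power, more than a third of "significant" results are false positives, far above the 5% many researchers assume. Raising the prior (studying plausible hypotheses) or the power (larger samples) both pull the risk down.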
Multiple Testing Problem

When researchers test many hypotheses, the false-discovery rate skyrockets. If you conduct 20 independent tests at $\alpha = 0.05$, you would expect about 1 false positive by chance alone, even if all null hypotheses are true. Many modern studies test hundreds or thousands of relationships, making this problem severe. Proper corrections (such as the Bonferroni adjustment or false-discovery-rate control) are necessary when multiple tests are performed.

Alternatives and Complements to P-Values

Given these limitations, several approaches can improve statistical inference:

Confidence Intervals: Rather than a binary significant/not-significant conclusion, report a range of plausible values for the parameter you are estimating. A 95% confidence interval has the interpretation: if you repeated your study many times and calculated the interval each time using the same method, 95% of those intervals would contain the true parameter value.

Reporting Effect Sizes: Always report the magnitude of effects, not just whether they are statistically significant. A small effect that is statistically significant may not be practically important.

Bayesian Methods: These provide direct probability statements about hypotheses given the data: $P(\text{hypothesis} \mid \text{data})$. Bayesian posterior probabilities directly answer the question researchers often want to ask. Bayesian methods also naturally incorporate prior information about which effects are likely, helping reduce the false-positive-risk problem.

Estimation Statistics: This emerging approach focuses on effect-size estimation and confidence intervals rather than binary significance testing. It emphasizes what the effect probably is, rather than whether it is "significantly" different from zero.
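The multiple-testing arithmetic above can be checked directly: across independent tests of true nulls, the probability of at least one false positive is $1 - (1 - \alpha)^{n}$. A small sketch, including the Bonferroni correction mentioned earlier:

```python
def familywise_error(alpha, n_tests):
    """P(at least one false positive) across n independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


print(round(familywise_error(0.05, 20), 3))       # 20 uncorrected tests
# Bonferroni correction: run each individual test at alpha / n_tests instead.
print(round(familywise_error(0.05 / 20, 20), 3))  # family-wide rate restored
```

At 20 uncorrected tests the chance of at least one false positive is about 64% (with an expected count of $20 \times 0.05 = 1$, matching the text); testing each hypothesis at $\alpha/20$ brings the family-wide rate back under 5%.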
Key Sources of Error and Bias in Research

Beyond the mechanics of hypothesis testing, several systematic issues can distort research conclusions:

Sampling Bias and Selection Bias: These closely related problems occur when the participants studied do not represent the target population. Sampling bias arises from how participants are selected (e.g., only recruiting volunteers). Selection bias occurs when participants self-select based on characteristics related to the outcome (e.g., people volunteering for a weight-loss study are particularly motivated). Both lead to results that do not generalize to broader populations.

Data Dredging (P-Hacking): Researchers often explore many possible analyses and report only the statistically significant ones. By testing enough hypotheses, you are virtually guaranteed to find "significant" results by chance. This might involve testing multiple outcome measures, analyzing various subgroups, trying different statistical approaches, or selectively reporting analyses. Even unconscious p-hacking inflates false-positive rates.

Correlation vs. Causation: A fundamental principle: observing a statistical association between two variables does not prove that one causes the other. Correlation requires only a statistical relationship; causation requires that changes in one variable directly produce changes in the other. Confounding variables (variables affecting both measures) can create spurious correlations.

Extension Neglect: Researchers sometimes generalize findings far beyond what their data support. A study of undergraduate psychology students may find interesting patterns, but these might not apply to the general population, different age groups, or different cultures. Always consider the boundary conditions of your findings.

<extrainfo>
Additional Concepts:

The base-rate fallacy describes errors that arise from ignoring the underlying probability of an event (the base rate).
When testing a rare hypothesis with imperfect evidence, even "significant" results may be more likely to be false positives than true discoveries.

The decline effect refers to the observation that effect sizes often appear larger in initial studies and smaller in replication attempts, a pattern central to the replication crisis. It often results from publication bias (only significant results get reported) and random variation.

The problem of induction is a philosophical issue: we can never logically prove that a general principle holds universally based on limited observations. Statistical methods help quantify uncertainty, but they cannot eliminate this fundamental logical problem.

Falsifiability is the principle that scientific theories must be testable and capable of being proven false. If a theory can explain any possible outcome, it has no scientific value. Hypothesis testing formalizes this principle by specifying in advance what evidence would count against the null hypothesis.
</extrainfo>

Summary

Hypothesis testing is a powerful framework, but it is only as good as our understanding of its mechanics and limitations. The p-value tells us about the data given a hypothesis, not about the hypothesis given the data. Effect sizes matter more than p-values for scientific interpretation. Multiple sources of bias can distort results, from sampling issues to data dredging. Modern best practice combines p-values with confidence intervals, effect sizes, and ideally Bayesian posterior probabilities to build a complete picture of the evidence. As you progress in statistical thinking, remember that the goal is not to obtain a small p-value; it is to draw true, generalizable conclusions about how the world works.
Flashcards
When is a hypothesis considered to be supported by results?
When results match the predicted pattern and are statistically significant.
How is statistical significance determined in relation to the null hypothesis?
When the probability of the result under the null hypothesis falls below a predetermined significance level.
What does the p-value represent assuming the null hypothesis is true?
The probability of obtaining results at least as extreme as observed.
What is the distribution of p-values under a true null hypothesis?
Uniformly distributed on the interval $[0,1]$.
What is a common misinterpretation of p-values regarding the null hypothesis?
Interpreting them as the probability that the null hypothesis is true.
What false-positive rate corresponds to the standard significance threshold of $p < 0.05$?
$5\%$
What are the trade-offs of using a stricter significance threshold like $p < 0.01$ or $p < 0.001$?
They reduce the false-positive rate but require larger sample sizes.
What value does the effect size take when the null hypothesis is true?
$0$
What is Cohen’s d used for in statistics?
It is a common estimator of effect size.
What is the definition of a test statistic in the context of effect sizes?
An estimator of effect size used for statistical testing.
What are the two possible outputs of a null-hypothesis test?
Reject the null hypothesis; fail to reject the null hypothesis.
What are the four possible outcomes of a null-hypothesis test?
False negative, true negative, false positive, true positive.
What specific error probability does the significance level $\alpha$ represent?
The probability of a false-positive error.
What is statistical power (also known as the true-positive rate)?
The probability of detecting a true effect.
How is the false-negative rate $\beta$ calculated using statistical power?
$\beta = 1 - \text{statistical power}$
How do Bayesian methods differ from traditional p-values in terms of the statements they provide?
They provide direct probability statements about hypotheses.
What do FPR calculators suggest about the conventional $\alpha = 0.05$ threshold?
It often yields a higher false positive risk than researchers expect.
What is the primary error involved in the base-rate fallacy?
Ignoring the underlying probability of an event.
What does the black-swan theory highlight in scientific conclusions?
The impact of rare, unpredictable events.
Does observing a statistical association between variables prove a causal relationship?
No (Correlation does not imply causation).
What practice defines data dredging (p-hacking)?
Exploiting flexible analysis decisions to obtain statistically significant results.
What is the decline effect in scientific replication?
The observation that effect sizes often appear smaller in subsequent replication attempts.
What is the primary focus of estimation statistics compared to traditional testing?
Effect-size estimation and confidence intervals (rather than binary significance testing).
What techniques does exploratory data analysis (EDA) emphasize before formal hypothesis testing begins?
Visual and informal techniques to discover patterns.
What error occurs when researchers commit extension neglect?
They fail to consider how far findings can be generalized beyond the studied sample.
What is the principle of falsifiability in scientific theories?
The requirement that theories must be testable and capable of being proven false.
What is the core concern of the problem of induction?
The logical difficulty of justifying generalizations from limited observations.
When does sampling bias arise in a study?
When the sampled participants are not representative of the target population.

Key Concepts
Statistical Testing Concepts
Hypothesis testing
p‑value
Statistical significance
Effect size
Statistical power
False discovery rate
Statistical Methodologies
Bayesian statistics
Confidence interval
p‑hacking (data dredging)
Base‑rate fallacy
Decline effect
Sampling bias