Study Guide

📖 Core Concepts

- Replication vs. Reproducibility – Replication: repeat an experiment with new data to test the original conclusion. Reproducibility: re‑run the same analysis on the original data set.
- Types of Replication – Direct: identical procedures; Systematic: intentional protocol changes; Conceptual: different methods testing the same hypothesis.
- Statistical Significance – The p‑value is the probability of data as extreme as, or more extreme than, those observed, assuming the null hypothesis is true. Conventional cut‑offs: $p<0.05$ (5 % false‑positive rate), $p<0.01$ (1 %), $p<0.001$ (0.1 %).
- Effect Size – Quantifies the magnitude of a true effect, e.g., Cohen's $d$. The null hypothesis usually posits an effect size of $0$ (see the Cohen's $d$ sketch after Key Processes).
- Power ($1-\beta$) – Probability of detecting a true effect; depends on α, effect size, and sample size. Low power (< 80 %) inflates false‑negative rates and exaggerates observed effects.
- False‑Positive Risk (FPR) – The actual probability that a "significant" finding is a false positive; often higher than the nominal α when the prior probability of a true effect is low.
- Publication Bias & File‑Drawer Effect – Journals preferentially publish significant, positive results, leaving null findings unpublished and distorting the literature.
- Questionable Research Practices (QRPs) – Data‑dredging, HARKing, selective reporting, optional stopping, and undisclosed analytic flexibility, all of which boost false‑positive rates.
- Metascience – The empirical study of research practices themselves; provides evidence‑based reforms (preregistration, result‑blind review, open data).

---

📌 Must Remember

- p‑value: $P(\text{data} \mid H_0)$. Not the probability that $H_0$ is true.
- α = 0.05 → a 5 % false‑positive rate only if all tested hypotheses are equally plausible (rare in practice).
- Power ≥ 0.80 is the widely recommended minimum for adequately powered studies.
- Typical power: psychology ≈ 33–36 %; fMRI ≈ 8–31 %; neuroscience median ≈ 21 %; economics median ≈ 18 %.
- Effect‑size bias: underpowered studies tend to over‑estimate effect sizes (the "decline effect").
- Multiple comparisons without correction raise the family‑wise error rate dramatically (e.g., 20 tests at α = 0.05 → ≈ 64 % chance of at least one false positive; see the FWER sketch below).
- Optional stopping invalidates the nominal p‑value; preregistration prevents it (simulated below).
- Base‑rate fallacy: when most tested hypotheses are false a priori, even a low α yields many false positives.

---

🔄 Key Processes

Designing a Well‑Powered Replication
1. Estimate the expected effect size (e.g., Cohen's $d$ from the original study).
2. Conduct an a priori power analysis → choose $N$ so that $1-\beta \ge 0.80$ at the chosen α (see the power sketch below).
3. Pre‑register hypotheses, sampling plan, and analysis pipeline.

Conducting a Null‑Hypothesis Test
1. Define $H_0$: effect size = 0.
2. Compute the test statistic (e.g., $t$, $F$) → obtain $p$.
3. Compare $p$ to α: if $p < α$, "reject $H_0$"; else "fail to reject $H_0$".

Multiverse / Sensitivity Analysis
1. Enumerate plausible analytic decisions (e.g., inclusion criteria, covariates).
2. Run all pipelines; examine the variation in effect sizes and $p$‑values.
3. Report the range to show robustness (or the lack thereof).

Result‑Blind Peer Review Workflow
1. Submit methods + analysis plan → the editorial decision is made before results are seen.
2. After data collection, submit results for a post‑hoc check; acceptance depends on adherence to the pre‑registered plan.
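A minimal sketch of the a priori power‑analysis step, assuming the `statsmodels` package is available; the effect size $d = 0.5$ is a hypothetical stand‑in for the estimate taken from an original study.

```python
# A priori power analysis for a two-sample t-test (statsmodels).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,          # hypothetical Cohen's d from the original study
    alpha=0.05,               # chosen significance level
    power=0.80,               # target 1 - beta
    alternative="two-sided",
)
print(f"Required n per group: {n_per_group:.0f}")  # ~64 per group
```

Because required $N$ scales with $1/d^2$, halving the assumed effect size roughly quadruples the sample size, which is why the effect‑size estimate in step 1 matters so much.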
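The null‑hypothesis testing steps, sketched with simulated data (the group means are assumptions for the demonstration, not real results):

```python
# Null-hypothesis test: compute t and p, then compare p to alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=64)    # simulated data
treatment = rng.normal(loc=0.5, scale=1.0, size=64)  # true d = 0.5 by construction

t_stat, p_value = stats.ttest_ind(treatment, control)
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.4f} -> {decision}")
```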
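A toy multiverse analysis, with outlier‑exclusion cutoffs standing in for the analytic decisions a real pipeline would enumerate (the data and cutoffs are invented for illustration):

```python
# Multiverse sketch: re-run one test under several analytic decisions
# and report the spread of p-values rather than a single chosen one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 100)
b = rng.normal(0.3, 1.0, 100)

for cutoff in (2.0, 2.5, 3.0, np.inf):  # one decision node of the multiverse
    keep_a = a[np.abs(a - a.mean()) < cutoff * a.std()]
    keep_b = b[np.abs(b - b.mean()) < cutoff * b.std()]
    _, p = stats.ttest_ind(keep_a, keep_b)
    print(f"exclusion cutoff {cutoff}: p = {p:.4f}")
# A conclusion that flips across cutoffs is fragile; report the whole range.
```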
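Cohen's $d$ from Core Concepts, computed by hand for two hypothetical groups: the difference in means divided by the pooled standard deviation.

```python
# Cohen's d: standardized mean difference between two independent groups.
import numpy as np

def cohens_d(x, y):
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(3)
treatment = rng.normal(0.5, 1.0, 50)  # simulated groups; true d = 0.5
control = rng.normal(0.0, 1.0, 50)
print(f"d = {cohens_d(treatment, control):.2f}")
```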
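The multiple‑comparisons arithmetic from Must Remember, worked out directly; the 20‑test example reproduces the ≈ 64 % figure quoted above.

```python
# Family-wise error rate (FWER) for m independent tests:
# P(at least one false positive) = 1 - (1 - alpha)^m
alpha, m = 0.05, 20
print(f"FWER for {m} tests: {1 - (1 - alpha) ** m:.2f}")   # ~0.64

# Bonferroni correction (test each at alpha/m) restores ~alpha overall.
print(f"With Bonferroni: {1 - (1 - alpha / m) ** m:.3f}")  # ~0.049
```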
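Why optional stopping invalidates the nominal p‑value: a Monte Carlo sketch in which both groups come from the same null distribution, yet "peeking" after every batch stops as soon as $p$ dips below .05. The batch size and sample ceiling are arbitrary choices for the simulation.

```python
# Optional stopping under a true null: peek after every batch of 10
# subjects per group and stop at the first p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, batch, max_n = 2000, 10, 100
false_positives = 0

for _ in range(n_sims):
    a, b = [], []
    while len(a) < max_n:
        a.extend(rng.normal(0, 1, batch))  # both groups are pure noise
        b.extend(rng.normal(0, 1, batch))
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:                       # the "peek"
            false_positives += 1
            break

print(f"False-positive rate with peeking: {false_positives / n_sims:.2f}")
# Well above the nominal 0.05; a fixed-N design would stay near 0.05.
```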
---

🔍 Key Comparisons

Direct vs. Conceptual Replication
- Direct: same procedures → tests exact reproducibility.
- Conceptual: different procedures → tests the generalizability of the underlying theory.

p‑value Thresholds
- $p<0.05$ → 5 % false‑positive rate (standard).
- $p<0.01$ → 1 % false‑positive rate (requires larger $N$).
- $p<0.001$ → 0.1 % false‑positive rate (highly conservative).

QRPs vs. Transparent Practices
- QRPs (e.g., HARKing, optional stopping) increase false positives.
- Transparent practices (preregistration, open code) reduce researcher degrees of freedom.

Frequentist vs. Bayesian Interpretation
- Frequentist: the p‑value is not the probability that the hypothesis is true.
- Bayesian: the posterior probability directly reflects the evidence, given a prior (see the FPR sketch after this section).

---

⚠️ Common Misunderstandings

- "A p‑value of .04 proves the effect is real." – It only says the data are unlikely under $H_0$; it does not quantify the probability that the effect exists.
- "Failing to reject $H_0$ means the null is true." – It may be a false negative due to low power.
- "Replication failure means the original study was sloppy." – Failures can arise from contextual differences, low power, or genuine null effects.
- "All QRPs are scientific misconduct." – QRPs are questionable but not outright fraud; they still inflate false‑positive rates.
- "Lowering α to .005 will automatically fix the crisis." – Without addressing power, QRPs, and bias, stricter thresholds alone have limited impact.

---

🧠 Mental Models / Intuition

- "Fishing Net" Model – Each analytic decision is a mesh in a net; the more meshes (degrees of freedom), the higher the chance you'll catch a "significant" fish by accident. Preregistration removes the extra meshes.
- "Signal vs. Noise" Analogy – Low‑power studies treat noise as signal; the observed effect is often an over‑inflated version of the true effect.
- "Base‑Rate Funnel" – Imagine a funnel where many hypotheses enter (most of them false). The α‑cutoff acts as a sieve; if the funnel is wide (low base rate), many false hypotheses slip through.

---

🚩 Exceptions & Edge Cases

- Very Large Samples – With huge $N$, even trivially small effects become "significant" ($p \ll .05$). Emphasize effect size and practical relevance (simulated after this section).
- Rare Events / Black‑Swan Findings – The standard α may be inappropriate; Bayesian priors help assess plausibility.
- Meta‑analyses with Heterogeneity – High $I^2$ (> 75 %) indicates that a simple fixed‑effect model is inappropriate; random‑effects or subgroup analyses are needed (see the $I^2$ sketch below).

---

📍 When to Use Which

- Choose Direct Replication when you need to verify exact methodological fidelity (e.g., clinical trial protocols).
- Choose Conceptual Replication to test generalizability across populations, settings, or measurement tools.
- Use p‑value thresholds: the default $0.05$ for exploratory work; $0.01$ or $0.005$ for high‑stakes claims (e.g., drug efficacy).
- Apply Bayesian analysis when prior information is strong or when you need a probability statement about hypotheses.
- Opt for multiverse analysis when the analytic pipeline is inherently flexible (e.g., many possible covariates).

---

👀 Patterns to Recognize

- Uniformly low power + high reported effect sizes → likely inflation and low replicability.
- A significant result reported without an effect size or confidence interval → red flag for QRPs.
- "We stopped data collection when p < .05" → optional stopping; expect inflated significance.
- Absence of preregistration or open data in a field with known QRPs → higher risk of false positives.
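The Base‑Rate Funnel and the false‑positive risk in numbers, applying Bayes' rule to the status of the hypothesis. The α and power values are the conventional ones used throughout this guide; the priors are illustrative.

```python
# False-positive risk: P(H0 true | significant result), via Bayes' rule.
# FPR = alpha*(1 - prior) / (alpha*(1 - prior) + power*prior)
alpha, power = 0.05, 0.80

for prior in (0.5, 0.1, 0.01):  # prior probability that the effect is real
    fpr = alpha * (1 - prior) / (alpha * (1 - prior) + power * prior)
    print(f"prior = {prior:4.2f} -> FPR = {fpr:.2f}")
# 0.50 -> 0.06, 0.10 -> 0.36, 0.01 -> 0.86: the lower the base rate,
# the more of the "significant" findings are false positives.
```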
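The very‑large‑samples edge case, simulated: a true effect of $d = 0.02$ is practically negligible, yet with 200,000 observations per group it comes out decisively "significant" (sample size and effect are invented for the demonstration):

```python
# Huge N makes a trivial effect "significant"; report d, not just p.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 200_000
a = rng.normal(0.00, 1.0, n)
b = rng.normal(0.02, 1.0, n)  # trivially small true effect

_, p = stats.ttest_ind(b, a)
d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
print(f"p = {p:.1e}, d = {d:.3f}")  # p << .05, yet d is negligible
```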
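A toy computation of Cochran's $Q$ and $I^2$ for the heterogeneity edge case. The study effects and variances are made up; the inverse‑variance weighting is the standard fixed‑effect scheme.

```python
# Heterogeneity in a toy meta-analysis: Cochran's Q and I^2.
import numpy as np

effects = np.array([0.10, 0.45, 0.80, 0.05, 0.60])    # hypothetical studies
variances = np.array([0.02, 0.03, 0.02, 0.04, 0.03])  # hypothetical variances

w = 1 / variances                          # fixed-effect (inverse-variance) weights
pooled = np.sum(w * effects) / np.sum(w)   # fixed-effect pooled estimate
Q = np.sum(w * (effects - pooled) ** 2)    # Cochran's Q
df = len(effects) - 1
I2 = max(0.0, (Q - df) / Q) * 100          # % of variation due to heterogeneity

print(f"pooled = {pooled:.2f}, Q = {Q:.1f}, I^2 = {I2:.0f}%")
# I^2 around or above 75% argues for a random-effects model here.
```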
---

🗂️ Exam Traps

- Distractor: "A p‑value of .03 guarantees the null hypothesis is false." – Incorrect; it only indicates incompatibility with $H_0$ at the chosen α, not certainty.
- Distractor: "Low power only increases false negatives, not false positives." – Wrong; low power also inflates effect‑size estimates and, via publication bias, can increase false positives (see the simulation below).
- Distractor: "Replication failures automatically invalidate the original theory." – Misleading; failures may reflect contextual differences or insufficient power.
- Distractor: "All QRPs are illegal." – Not true; they are questionable but not necessarily fraudulent.
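A closing simulation for the low‑power exam trap: among many underpowered studies of a small true effect, the ones that happen to reach significance report effect sizes far above the truth. All parameters here are illustrative.

```python
# Effect-size inflation (the "winner's curse") under low power.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
true_d, n = 0.2, 20           # n = 20 per group -> power ~ 0.10 here
significant_ds = []

for _ in range(5000):
    a = rng.normal(0, 1, n)
    b = rng.normal(true_d, 1, n)
    t, p = stats.ttest_ind(b, a)
    if p < 0.05 and t > 0:    # keep only the "publishable" results
        d_obs = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        significant_ds.append(d_obs)

print(f"true d = {true_d}, mean significant d = {np.mean(significant_ds):.2f}")
# The published subset overshoots the true effect by roughly a factor of 3.
```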