RemNote Community

Applied Econometrics and Critical Issues

Understand causal inference methods in applied econometrics, common pitfalls such as omitted variable bias, and alternative estimation techniques.


Summary

Methods in Applied Econometrics

Introduction

Applied econometrics aims to understand causal economic relationships using real-world data. However, unlike laboratory scientists who can conduct controlled experiments, economists typically work with observational data—data generated from naturally occurring economic activity rather than from randomized trials. This fundamental constraint creates both challenges and opportunities in economic analysis. To overcome these challenges, econometricians have developed a suite of techniques for identifying and estimating causal effects from observational data. Understanding these methods, their assumptions, and their limitations is essential for conducting credible empirical economic research.

The Core Challenge: From Correlation to Causation

The central problem in applied econometrics is that observational data shows us correlations, but we want to learn about causal effects. Why is this distinction crucial? Consider a simple question: does education cause higher wages? We might observe that people with more education earn more money, but this correlation could arise from many sources. Perhaps more educated people have wealthier parents who also helped them get better jobs. Perhaps education selects for talented individuals who would earn more regardless. Without careful analysis, we cannot distinguish the causal effect of education itself from these confounding factors.

The econometric solution involves developing identification strategies—methods that cleverly use the structure of data or special circumstances to isolate causal effects. These strategies rely on making assumptions about the data-generating process and then using those assumptions to extract causal information.

A Concrete Example: Wage and Education

Let's make this concrete with an example that appears frequently in econometrics: estimating the effect of education on wages.
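Before writing down a model, the correlation-versus-causation problem can be seen in a tiny simulation. This is an illustrative sketch with invented numbers: an unobserved "background" confounder raises both education and wages, education itself has no causal effect here, and yet the raw correlation is strongly positive.

```python
# Illustrative simulation (invented numbers): family background raises both
# education and wages. Education has NO causal effect on wages in this data,
# yet the raw education-wage correlation is strongly positive.
import numpy as np

rng = np.random.default_rng(0)
n = 10000
background = rng.normal(0, 1, size=n)                 # unobserved confounder
education = 12 + 2.0 * background + rng.normal(0, 1, size=n)
log_wage = 2.0 + 0.30 * background + rng.normal(0, 0.3, size=n)  # no education term

corr = np.corrcoef(education, log_wage)[0, 1]
print(round(corr, 2))  # strongly positive despite zero causal effect
```

A naive reading of this correlation would attribute the wage gap to education; the data-generating code shows it comes entirely from the confounder.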
The Basic Model

We might model the relationship as:

$$\ln(\text{Wage}) = \alpha + \beta \cdot \text{Education} + \varepsilon$$

Here, the natural logarithm of wage is expressed as a linear function of years of education. The coefficient $\beta$ represents the percentage increase in wages associated with one additional year of education. For instance, if $\beta = 0.10$, this means an extra year of education is associated with roughly a 10% wage increase.

The Role of the Error Term

The error term $\varepsilon$ is critical but often misunderstood. It represents all factors other than education that influence wages: ability, work ethic, family background, social connections, timing of labor market entry, luck, and countless other factors. The error term isn't simply "noise" or "measurement error"—it contains genuine economic factors we haven't measured.

The Key Assumption for OLS

When we estimate this model using ordinary least squares (OLS), we rely on a crucial assumption: the error term $\varepsilon$ must be uncorrelated with years of education. Why is this assumption so important? If our unmeasured factors (the error term) correlate with education, then we cannot isolate the effect of education itself from the effect of these other factors.

For example, suppose wealthier families both encourage their children to pursue more education and provide social connections that lead to higher wages. In this case, part of the wage increase we observe from more education is actually caused by these family connections (part of $\varepsilon$), not education itself. Because family wealth influences both education and wages, the error term is correlated with education, and OLS will overestimate $\beta$.

Omitted Variable Bias

The problem described above is called omitted variable bias.
When we fail to include a variable that:

- Affects the outcome we're studying (wages), and
- Is correlated with our regressor of interest (education),

then our estimate of the effect of education will be biased. It will not accurately reflect the true causal effect.

Example: Suppose birthplace influences both education and wages (perhaps due to regional economic differences or cultural factors). If we omit birthplace from our regression, we have omitted variable bias. The estimated effect of education will be contaminated by the effect of birthplace.

Controlling for Additional Variables

One natural approach is to add measured variables to the regression. If we observe birthplace and include it in our model, we can reduce or eliminate the bias from that particular omitted variable:

$$\ln(\text{Wage}) = \alpha + \beta \cdot \text{Education} + \gamma \cdot \text{Birthplace} + \varepsilon$$

However, this solution is limited. We can only control for variables we've measured. If important factors remain unmeasured—and they almost always do—omitted variable bias persists. Moreover, simply adding more variables is not a solution to all problems. Variables must be chosen thoughtfully based on economic theory, not mechanically added to improve fit.

Identification Strategies for Causal Inference

Because controlling for variables alone is insufficient, econometricians have developed more sophisticated quasi-experimental techniques that exploit special features of real-world data to identify causal effects.

Instrumental Variables (IV)

When the relationship between a regressor and the error term cannot be broken by including additional variables, we can use an instrumental variable.
An instrumental variable $Z$ is a variable that:

- Is correlated with the endogenous regressor (the variable whose causal effect we want to measure), but
- Is uncorrelated with the error term (exogenous to the outcome).

The intuition is that the instrument creates "exogenous variation" in the regressor, variation not driven by the factors in the error term. We can then use this clean variation to estimate the causal effect.

Example: Consider estimating the effect of education on wages when ability (in the error term) correlates with both. A valid instrument might be a policy that randomly increased school funding in some regions but not others during the years when certain cohorts were in school. School funding would influence educational attainment but would be uncorrelated with individual ability, allowing us to estimate the true effect of education.

Finding valid instruments is one of the deepest challenges in applied econometrics. An instrument must be theoretically justified and empirically defensible.

Regression Discontinuity Design

Regression discontinuity design (RDD) exploits sharp cutoff rules that determine treatment. When individuals just above a cutoff receive a treatment that individuals just below the cutoff don't receive, we can compare outcomes near the cutoff to estimate the treatment effect.

Example: Universities might admit students based on a test score cutoff (e.g., only students scoring 80+ are admitted). By comparing outcomes of students just barely admitted versus just barely rejected, we can estimate the effect of college attendance. Students on either side of the cutoff are likely very similar in ability, so this comparison is relatively free from confounding.

Difference-in-Differences

Difference-in-differences (DiD) compares changes over time between a treated group and a control group. The identifying assumption is that these two groups would have followed parallel trends absent treatment.
Example: A new job training program is implemented in some states but not others. DiD would compare how employment changed before and after the program in treated states versus control states. The difference between these two changes (the "double difference") estimates the program's causal effect.

Natural Experiments

In the absence of controlled experiments, econometricians actively search for natural experiments—events that exogenously affect some people or regions but not others, in a way that researchers did not engineer. Policy changes, natural disasters, economic shocks, or historical events can serve as natural experiments if they provide variation we can exploit. Natural experiments are valuable because they provide transparent, theory-motivated sources of variation that are plausibly unrelated to unobserved confounders.

Common Pitfalls and Limitations

Even with these sophisticated techniques, applied econometrics faces serious challenges and potential misuses.

Model Misspecification and Spurious Correlation

Badly specified models can show strong statistical relationships that have no economic meaning. When researchers include many variables without theoretical guidance, correlations often emerge by chance rather than representing true relationships. The fact that a coefficient is statistically significant does not mean it represents a substantive economic effect.

Confusing Statistical Significance with Economic Importance

A related error is treating statistical significance (p-values) as the primary criterion for judging results. Statistical significance depends on both the true effect size and the sample size. A very small effect can be statistically significant with enough data, yet economically negligible. Conversely, a large effect can fail to be statistically significant in a small sample due to high uncertainty. Researchers should always report effect sizes and discuss their practical importance alongside p-values.
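The omitted variable bias described earlier can also be sketched numerically. This is a hedged simulation with invented coefficients, using a stand-in "wealth" confounder: the short regression that omits wealth overstates the education effect, while controlling for wealth recovers the true coefficient of 0.10.

```python
# Hedged illustration of omitted variable bias (all numbers invented):
# unobserved "wealth" raises both education and wages. Omitting it inflates
# the OLS education coefficient; controlling for it recovers the truth.
import numpy as np

rng = np.random.default_rng(1)
n = 20000
wealth = rng.normal(0, 1, size=n)
education = 12 + 2.0 * wealth + rng.normal(0, 1, size=n)
log_wage = 1.5 + 0.10 * education + 0.20 * wealth + rng.normal(0, 0.3, size=n)

def ols(y, *regressors):
    """OLS coefficients via least squares; first entry is the intercept."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_short = ols(log_wage, education)[1]         # wealth omitted: biased upward
beta_long = ols(log_wage, education, wealth)[1]  # wealth controlled for
print(beta_short, beta_long)  # beta_short exceeds beta_long, which is near 0.10
```

The gap between the two estimates is exactly the bias term: the effect of wealth routed through its correlation with education.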
P-Hacking and Multiple Testing

A serious problem in empirical research is "p-hacking": running many different model specifications and reporting only those that yield statistically significant results. If a researcher tries 100 specifications, roughly 5 will show significance purely by chance, even if no true relationship exists. Modern econometrics strongly discourages this practice. Researchers should specify their analysis plan in advance and report results transparently, including non-significant findings.

Two-Way Causality

When two variables can causally affect each other, regression analysis alone cannot disentangle the directions of causation. For example, does education increase earnings, or do higher-earning individuals invest more in education? Both could be true simultaneously. In such cases, the researcher must carefully discuss the theoretical mechanisms and use identification strategies (such as instrumental variables) to isolate specific causal directions.

Summary: Key Concepts and Estimators

Applied econometrics rests on several foundational principles.

Core Principles:

- Multiple linear regression serves as the starting point, but violations of its assumptions require alternative techniques
- Good estimators should be unbiased (in expectation, they equal the truth), consistent (they converge to the truth as sample size grows), and efficient (precise relative to other estimators)
- Observational data requires careful design and strong identifying assumptions to support causal claims
- Model misspecification, omitted variable bias, and misuse of significance tests are pervasive pitfalls

When Standard OLS Fails:

When the core assumption that errors are uncorrelated with regressors is violated, alternative estimators are needed.
These include:

- Maximum likelihood estimation: useful when the error distribution is non-normal or the model is nonlinear
- Generalized method of moments (GMM): flexible approach for moment-based estimation when standard assumptions fail
- Bayesian estimation: incorporates prior information and is increasingly used in applied work

Quasi-Experimental Designs:

When correlational analysis is insufficient, researchers employ:

- Regression discontinuity design: exploits sharp cutoff rules to create comparable treated and control groups
- Instrumental variables: uses exogenous variation correlated with the treatment but uncorrelated with confounders
- Difference-in-differences: compares trends between treated and control groups to isolate causal effects

The modern econometrician is thus part scientist (developing identification strategies and testing assumptions), part detective (searching for natural experiments and valid instruments), and part skeptic (questioning correlations and remaining alert to bias). Success requires both technical skill in estimation and careful economic reasoning about what the data can and cannot tell us about causal relationships.
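As a worked illustration of the instrumental-variables strategy summarized above, here is a hedged two-stage least squares (2SLS) sketch on simulated data. The "funding" instrument and all coefficients are invented for illustration: funding shifts education but is independent of unobserved ability, so the second-stage estimate recovers the true education effect of 0.10 while plain OLS is biased.

```python
# Hedged sketch of instrumental variables via two-stage least squares (2SLS)
# on simulated data (all numbers invented; true education effect is 0.10).
import numpy as np

rng = np.random.default_rng(2)
n = 50000
ability = rng.normal(0, 1, size=n)   # unobserved; ends up in the error term
funding = rng.normal(0, 1, size=n)   # hypothetical instrument: exogenous policy variation
education = 12 + 1.0 * funding + 1.0 * ability + rng.normal(0, 1, size=n)
log_wage = 1.5 + 0.10 * education + 0.30 * ability + rng.normal(0, 0.3, size=n)

def ols(y, *regressors):
    """OLS coefficients via least squares; first entry is the intercept."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_ols = ols(log_wage, education)[1]   # biased: ability sits in the error

# First stage: predict education from the instrument alone.
a1, b1 = ols(education, funding)
education_hat = a1 + b1 * funding
# Second stage: regress log wages on the predicted (exogenous) variation.
beta_2sls = ols(log_wage, education_hat)[1]
print(beta_ols, beta_2sls)  # OLS is biased upward; 2SLS lands near 0.10
```

Note the design choice: only the variation in education induced by the instrument is used in the second stage, which is precisely what purges the ability-driven correlation.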
Flashcards
What type of data does econometrics primarily rely on instead of controlled experiments?
Observational data
What do econometricians search for to obtain credible sources of variation when controlled experiments are unavailable?
Natural experiments
In the model $\ln(\text{Wage}) = \alpha + \beta \cdot \text{Education} + \varepsilon$, what does the coefficient $\beta$ represent?
The effect of one additional year of education on log wages
What is the role of the error term $\varepsilon$ in a wage regression model?
It captures all other factors influencing wages not included in the regression
Under what condition can coefficients in a wage-education model be consistently estimated by ordinary least squares (OLS)?
If the error term $\varepsilon$ is uncorrelated with years of education
What phenomenon occurs if an excluded variable, such as birthplace, influences both education and wages?
Omitted variable bias
What can badly specified econometric models display that are not actually causal?
Spurious correlations
What is the term for the discouraged practice of running many specifications and only selecting those with significant results?
P-hacking
When two variables can each cause the other, what must researchers discuss to disentangle the directions of causation?
The underlying theoretical mechanisms (supported by identification strategies such as instrumental variables)
What are the three desired properties of an econometric estimator?
Unbiasedness, consistency, and efficiency
How does a regression discontinuity design (RDD) create comparable groups for analysis?
By exploiting a cutoff rule
What two conditions must an instrumental variable satisfy to achieve identification?
Correlated with the endogenous regressor and uncorrelated with the error term
How does the difference-in-differences (DiD) method isolate causal effects?
By comparing changes over time between treated and control groups
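The difference-in-differences flashcard above can be made concrete with a minimal numeric sketch. The employment rates below are hypothetical, invented purely to show the "double difference" arithmetic.

```python
# Minimal difference-in-differences arithmetic with hypothetical state-level
# employment rates (all numbers invented for illustration).
treated_before, treated_after = 60.0, 66.0   # treated states, % employed
control_before, control_after = 58.0, 61.0   # control states, same periods

# Double difference: change in treated states minus change in control states.
did_estimate = (treated_after - treated_before) - (control_after - control_before)
print(did_estimate)  # 6.0 - 3.0 = 3.0 percentage points
```

Under the parallel-trends assumption, the control-state change (3 points) stands in for what would have happened to treated states without the program, so the remaining 3 points are attributed to the program.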

Key Concepts
Causal Inference Methods
Quasi‑experimental design
Natural experiment
Regression discontinuity design
Difference‑in‑differences
Instrumental variable
Modeling Issues
Omitted variable bias
Model misspecification
P‑hacking
Research Approaches
Observational study
Simultaneous equations model