Introduction to Econometrics
Understand the fundamentals of econometrics, covering data types, OLS regression, hypothesis testing, model diagnostics, and basic causal inference techniques.
Summary
Introduction to Econometrics: Connecting Theory, Data, and Methods
What is Econometrics?
Econometrics is the field that brings economic theory to life through statistical analysis. At its core, econometrics uses statistical tools to transform abstract economic predictions into testable, quantitative statements. Instead of saying "lower prices should increase quantity demanded," econometrics lets us estimate how much quantity increases when price falls by one unit, and whether that relationship is statistically meaningful or could be due to chance.
The Three Pillars: Theory, Data, and Methods
Econometrics operates at the intersection of three essential components:
Economic Theory guides what we should measure. Theory tells us which variables matter and how we expect them to be related. For example, economic theory predicts that a firm's output depends on the amount of labor and capital it employs. Without theory, we wouldn't know where to start.
Data provides the empirical observations we need to test predictions. Data are the actual measurements we collect from the real world—wage rates, employment levels, stock prices, and so on. Theory without data is just speculation; data without theory is just numbers.
Statistical Methods give us the tools to extract meaningful conclusions from data. These methods help us estimate relationships, quantify uncertainty, and test whether observed patterns reflect true relationships or mere chance.
Understanding Types of Economic Data
Before we can analyze economic data, we need to recognize what form that data takes. The structure of your data fundamentally affects how you conduct your analysis.
Cross-Sectional Data
Cross-sectional data consist of observations on different individuals, firms, or other entities at a single point in time. Think of a snapshot: you measure many different things, but all at once.
Example: A survey of 1,000 households in 2023 measuring income, education, and hours worked per week for each household. Each household is observed once.
Cross-sectional data are useful for understanding differences across units at a given moment. They work well for studying how characteristics differ between regions, people, or firms.
Time-Series Data
Time-series data consist of observations on one or a few variables collected repeatedly over many time periods. Instead of many units at one time, you have one (or few) units observed repeatedly.
Example: The monthly unemployment rate in the United States from January 2010 through December 2023. You have one variable (unemployment rate) measured many times (168 observations).
Time-series data reveal how variables evolve over time. They're essential for studying economic trends, business cycles, and the effects of policy changes that unfold gradually.
Panel Data
Panel data combine both dimensions: you observe the same units repeatedly over time. This gives you the richest structure.
Example: You track 500 firms' revenue, employment, and R&D spending from 2015 through 2023. You have 500 units × 9 years = 4,500 observations, but they're not independent—you're watching the same firms over time.
Panel data are powerful because they let you control for fixed characteristics of each unit that don't change over time. If some firms are naturally more efficient than others, panel data help you account for that.
Ordinary Least Squares Regression: The Foundation of Econometric Analysis
Ordinary Least Squares (OLS) regression is the most fundamental econometric tool. It's the method you'll use constantly, so understanding it deeply is essential.
Setting Up a Linear Model
In OLS regression, we model how one variable (the dependent variable, denoted $Y$) is related to one or more other variables (independent variables, denoted $X$). We assume this relationship is linear, meaning each independent variable's effect doesn't depend on the values of other variables.
The basic linear model is:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + u$$
where:
$Y$ is the dependent variable (what we want to explain)
$X_1, X_2, \ldots, X_k$ are independent variables (factors we think influence $Y$)
$\beta_0$ is the intercept (the expected value of $Y$ when all $X$ variables equal zero)
$\beta_1, \beta_2, \ldots, \beta_k$ are slope coefficients (the effect of each $X$ variable on $Y$)
$u$ is the error term (the part of $Y$ we cannot explain)
Example: To model how house prices depend on square footage and the number of bedrooms:
$$\text{Price} = \beta_0 + \beta_1 \times \text{Square Feet} + \beta_2 \times \text{Bedrooms} + u$$
We expect $\beta_1 > 0$ (more square feet → higher price) and $\beta_2 > 0$ (more bedrooms → higher price). But theory doesn't tell us the exact values—that's what the data reveal.
The OLS Principle: Minimizing Squared Errors
OLS finds the best-fitting line (or hyperplane in multiple dimensions) by minimizing the sum of squared residuals. A residual is the difference between the actual value of $Y$ and the predicted value:
$$\text{Residual} = Y - \widehat{Y}$$
where $\widehat{Y}$ is the predicted value based on our estimated model.
OLS minimizes:
$$\sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \cdots - \hat{\beta}_k X_{ki})^2$$
Why square the residuals rather than just sum them? Because positive and negative errors would cancel out, giving a misleading picture. Squaring also penalizes large errors more heavily, encouraging the line to avoid extreme outliers.
This produces estimated coefficients $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ that represent our best guess about the true relationships, given the data we have.
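The minimization above can be sketched numerically. The example below simulates the house-price model from earlier and recovers the coefficients with NumPy's least-squares solver; all the numbers (square footages, bedroom counts, the "true" coefficients) are invented for illustration.

```python
# A minimal OLS sketch: simulate the house-price model, then estimate it by
# minimizing the sum of squared residuals (NumPy's lstsq does this for us).
import numpy as np

rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(800, 3000, size=n)          # hypothetical square footage
beds = rng.integers(1, 6, size=n)              # hypothetical bedroom counts
u = rng.normal(0, 20_000, size=n)              # unobserved error term
price = 50_000 + 120 * sqft + 10_000 * beds + u  # invented "true" model

X = np.column_stack([np.ones(n), sqft, beds])  # first column = intercept
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)
residuals = price - X @ beta_hat
# beta_hat should land near the true values [50000, 120, 10000], and the
# residuals sum to (numerically) zero because the model includes an intercept.
```

With enough data and well-behaved errors, the estimates cluster tightly around the true coefficients, which is exactly the sense in which OLS gives a "best guess."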
Interpreting Estimated Coefficients
This is where econometrics becomes directly useful. Each estimated coefficient $\hat{\beta}_j$ tells you the average effect of variable $X_j$ on $Y$, holding other variables constant.
More precisely: A coefficient $\hat{\beta}_j$ means that a one-unit increase in $X_j$ is associated with a $\hat{\beta}_j$-unit change in $Y$, on average, assuming we hold all other independent variables fixed.
Example: If we estimate:
$$\text{Wage} = 10,000 + 2,000 \times \text{Years of Education} + 500 \times \text{Years of Experience} + u$$
Then $\hat{\beta}_1 = 2,000$ means: one additional year of education is associated with $2,000 more in annual wages (holding experience constant). Similarly, $\hat{\beta}_2 = 500$ means: one additional year of experience is associated with $500 more in annual wages (holding education constant).
A critical caveat: These are associations, not necessarily causal effects. We'll address how to identify true causal effects later. For now, recognize that the coefficient tells you the direction and magnitude of the observed relationship.
Measuring Uncertainty: Standard Errors, Confidence Intervals, and t-Statistics
Our coefficient estimates are based on a sample of data. If we collected a different sample, we'd get slightly different estimates. Standard errors quantify this sampling variation.
The standard error of an estimated coefficient $\hat{\beta}_j$ is the estimated standard deviation of that coefficient across repeated samples. A smaller standard error means we can estimate the coefficient more precisely.
From the standard error, we construct a confidence interval, typically a 95% confidence interval:
$$\hat{\beta}_j \pm 1.96 \times SE(\hat{\beta}_j)$$
This interval should contain the true parameter $\beta_j$ roughly 95% of the time across repeated samples.
The t-statistic tests whether a coefficient is meaningfully different from zero:
$$t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$$
A large t-statistic (in absolute value) suggests the estimated effect is large relative to the uncertainty around it.
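These three quantities can be sketched directly from a fitted model using the classical (homoscedastic) variance formula. The data below are simulated, with an invented true slope of 0.5, so the confidence interval and t-statistic can be checked against what we know generated the data.

```python
# A sketch of standard errors, 95% confidence intervals, and t-statistics
# for an OLS fit, using the classical formula Var(beta) = sigma^2 (X'X)^{-1}.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(0, 1, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)        # invented true slope = 0.5

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
k = X.shape[1]
sigma2 = resid @ resid / (n - k)               # unbiased error-variance estimate
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta_hat / se
ci_low, ci_high = beta_hat - 1.96 * se, beta_hat + 1.96 * se
# With n = 500 the slope's t-statistic is far above 1.96: the effect is
# large relative to the uncertainty around it.
```

The same numbers appear in any regression software's output table; computing them once by hand makes the printed columns much less mysterious.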
Hypothesis Testing: Making Statistical Inferences
Hypothesis testing is how we use sample data to make conclusions about population parameters—the true relationships in the broader world.
Setting Up Hypotheses
A null hypothesis ($H_0$) is a claim we're skeptical of and want to test. In econometrics, the null is typically that a coefficient equals zero (no effect):
$$H_0: \beta_j = 0$$
The alternative hypothesis ($H_1$ or $H_a$) is what we'd conclude if the null hypothesis is false. Often:
$$H_1: \beta_j \neq 0$$
Example: We test whether education affects wages. The null says education has no effect on wages ($\beta = 0$). The alternative says it does have some effect ($\beta \neq 0$).
Test Statistics and Decision Rules
We use the t-statistic as our test statistic. Here's the logic:
If the null hypothesis is true ($\beta_j = 0$), then $\hat{\beta}_j$ should be close to zero, and the t-statistic should be small.
If the null is false ($\beta_j \neq 0$), then $\hat{\beta}_j$ should be noticeably different from zero, and the t-statistic should be large.
We compare the calculated t-statistic to a critical value. For a two-tailed test with a typical significance level of 0.05:
If $|t| > 1.96$, we reject the null hypothesis
If $|t| \leq 1.96$, we fail to reject the null hypothesis
"Failing to reject" doesn't mean the null is true—it means the data don't provide strong enough evidence against it.
p-Values and Significance Levels
The p-value translates the t-statistic into a more intuitive measure: it's the probability of observing a t-statistic as extreme as (or more extreme than) the one we calculated, assuming the null hypothesis is true.
A small p-value (like 0.01) means our observed result would be very unlikely if the null were true—strong evidence against the null
A large p-value (like 0.50) means our result is quite plausible even if the null is true—weak evidence against the null
The significance level (often called $\alpha$) is our decision threshold, typically set at 0.05. We reject the null if the p-value $<$ significance level.
Interpretation: If $p = 0.03$, we say the result is "statistically significant at the 5% level" because $0.03 < 0.05$. This means: if there were truly no effect, we'd see a result this extreme only 3% of the time by chance.
Important distinction: Statistical significance ≠ practical significance. A large sample can make even tiny effects statistically significant, while a small sample might miss large effects.
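The decision rule above can be sketched with a large-sample normal approximation (with many degrees of freedom the t distribution is close to standard normal); the two t-values tested below are arbitrary examples, not from any real regression.

```python
# Convert a t-statistic into a two-sided p-value and apply the alpha = 0.05
# decision rule, using the standard-normal (large-sample) approximation.
import math

def two_sided_p(t_stat: float) -> float:
    """P(|Z| >= |t|) under the standard normal, via the complementary error function."""
    return math.erfc(abs(t_stat) / math.sqrt(2))

alpha = 0.05
for t in (0.8, 2.5):
    p = two_sided_p(t)
    verdict = "reject H0" if p < alpha else "fail to reject H0"
    print(f"t = {t}: p = {p:.3f} -> {verdict}")
# t = 0.8 gives a large p-value (fail to reject); t = 2.5 gives a small one
# (reject). Note that two_sided_p(1.96) is almost exactly 0.05, which is why
# 1.96 is the usual critical value.
```

For small samples one would use the t distribution with $n - k$ degrees of freedom instead; statistical libraries handle that detail automatically.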
Model Diagnostics: Ensuring Your Results are Valid
OLS regression relies on several assumptions. When these fail, your estimates might be misleading. Smart econometricians always check these assumptions.
The Linearity Assumption
The linearity assumption requires that the relationship between each independent variable and the dependent variable be linear in parameters. This means each coefficient multiplies its corresponding variable in a straightforward way:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u \quad \text{✓ Linear}$$
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + u \quad \text{✓ Linear (in parameters)}$$
$$Y = \beta_0 + \beta_1^{X_1} + u \quad \text{✗ Not linear}$$
The second example is still linear in parameters—even though we include $X_1^2$, the relationship is still linear in $\beta_1$ and $\beta_2$. The third is not, because the parameter $\beta_1$ appears as an exponent.
When it's violated: Real relationships might curve (diminishing returns) or have other nonlinear patterns.
Solution: Transform variables. Instead of $Y = \beta_0 + \beta_1 X + u$, use $Y = \beta_0 + \beta_1 \ln(X) + u$ or $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + u$ to capture curvature.
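Such a transformation stays entirely within OLS: the squared term is just another column in the design matrix. A minimal sketch on simulated data with diminishing returns (all coefficients invented):

```python
# Capturing curvature while staying linear in parameters: include both X and
# X^2 as regressors and fit by ordinary least squares.
import numpy as np

rng = np.random.default_rng(5)
n = 300
x = rng.uniform(0, 10, n)
y = 5 + 3 * x - 0.25 * x**2 + rng.normal(0, 1, n)  # diminishing returns

X = np.column_stack([np.ones(n), x, x**2])          # still linear in the betas
beta = np.linalg.lstsq(X, y, rcond=None)[0]
# beta recovers approximately [5, 3, -0.25]: the negative coefficient on
# x^2 is the curvature a straight line would have missed.
```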
No Perfect Multicollinearity
Multicollinearity occurs when independent variables are highly correlated with each other. Perfect multicollinearity means one variable is an exact linear combination of others—this makes estimation mathematically impossible.
Example of perfect multicollinearity:
You include both annual income and monthly income in the same model
Monthly income = Annual income / 12, so one is a perfect linear combination of the other
The regression cannot estimate separate effects for both
When it's violated: Your software either refuses to run the regression or produces results that are numerically unstable.
Less problematic (but still worth noting): Variables are highly correlated but not perfectly. This doesn't violate the assumption but does make coefficient estimates less precise (larger standard errors).
Solution: Remove redundant variables. Include annual income or monthly income, not both.
Homoscedasticity: Constant Error Variance
The homoscedasticity assumption requires that the variance of the error term $u$ be constant across all observations. In other words, the scatter around your regression line should be roughly the same whether you're looking at low values of $X$ or high values.
Homoscedastic errors: The spread of points around the line is consistent.
Heteroscedastic errors: The spread widens or narrows as $X$ changes.
When it's violated: Standard errors and confidence intervals become incorrect (usually too small), making your t-statistics unreliable.
Solution: Use robust standard errors (also called heteroscedasticity-robust standard errors), which adjust for non-constant variance. This preserves the validity of hypothesis tests without changing your coefficient estimates.
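A sketch of the idea on simulated heteroscedastic data, comparing the classical formula with HC0 robust ("sandwich") standard errors; the data-generating process here is invented purely to make the error spread grow with $X$.

```python
# Classical vs. heteroscedasticity-robust (HC0) standard errors. The robust
# "sandwich" estimator uses each observation's own squared residual instead
# of assuming one common error variance.
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n) * (0.2 * x**2)  # error sd grows with x

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical SEs: one pooled error-variance estimate.
se_classic = np.sqrt(np.diag(resid @ resid / (n - 2) * XtX_inv))
# HC0 robust SEs: the (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1} sandwich.
meat = X.T @ (X * (resid**2)[:, None])
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
# Here the classical slope SE understates the true uncertainty, so the
# robust SE comes out noticeably larger.
```

In practice you would rarely code this by hand; for example, statsmodels exposes the same correction via `fit(cov_type='HC0')` (with HC1–HC3 variants). The coefficient estimates are unchanged either way, exactly as the text says.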
Normality of Error Terms
This assumption requires that the error term $u$ be normally distributed. This is most important for inference (hypothesis tests, confidence intervals) when your sample size is small.
When it's violated: With small samples, t-statistics and p-values become unreliable. With large samples (generally $n > 30$), violating normality matters less because of the Central Limit Theorem.
Solution: Use variable transformations (like taking logarithms) or use nonparametric statistical tests. For large samples, mild violations are usually not critical.
Checking Assumptions: Diagnostic Tests
Good practice involves checking these assumptions:
Linearity: Examine scatter plots of the dependent variable against each independent variable. Look for curved patterns.
Multicollinearity: Calculate correlation coefficients between independent variables. Look for very high correlations (above 0.8 or 0.9).
Homoscedasticity: Create a scatter plot of residuals versus fitted values. The spread should be roughly constant across the plot.
Normality: Create a histogram or Q-Q plot of residuals. They should roughly follow a bell curve.
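Two of these checks can be sketched numerically on simulated data. Below, one pair of regressors is made deliberately near-collinear, while the errors are kept homoscedastic so the residual-spread ratio should come out near 1; all variables are invented.

```python
# Two quick diagnostics: (1) pairwise correlation between regressors to flag
# multicollinearity; (2) comparing residual spread at low vs. high fitted
# values as a crude homoscedasticity check.
import numpy as np

rng = np.random.default_rng(2)
n = 400
x1 = rng.normal(0, 1, n)
x2 = 0.95 * x1 + 0.05 * rng.normal(0, 1, n)   # deliberately near-collinear
y = 1.0 + x1 + rng.normal(0, 1, n)            # homoscedastic errors

# Check 1: correlation between candidate regressors.
r12 = np.corrcoef(x1, x2)[0, 1]
collinearity_flag = abs(r12) > 0.8            # the rule of thumb from the text

# Check 2: residual spread across the fitted-value range.
X = np.column_stack([np.ones(n), x1])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta
resid = y - fitted
low = fitted < np.median(fitted)
spread_ratio = resid[~low].std() / resid[low].std()  # ~1 if homoscedastic
```

A spread ratio far from 1 (or a fan shape in the residuals-vs-fitted plot) would point toward robust standard errors; a flagged correlation would prompt dropping one of the redundant regressors.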
Basic Causal Inference: From Correlation to Causation
One of econometrics' greatest challenges is moving beyond "X and Y are correlated" to "X causes Y." This matters enormously because policy decisions should be based on causal effects, not mere associations.
Why Correlation Isn't Causation
Consider this example: Ice cream consumption and drowning deaths are highly correlated (both peak in summer). But ice cream doesn't cause drowning—both increase because of warm weather. This is confounding: a third variable (temperature) causes both.
More generally, three problems prevent correlation from establishing causation:
Confounding: An omitted variable affects both $X$ and $Y$
Reverse causality: $Y$ causes $X$ rather than the other way around
Selection bias: The sample is not representative of the population we care about
OLS regression alone cannot distinguish these cases. You need to think carefully about the mechanism and structure of your data.
Natural Experiments
A natural experiment exploits an exogenous shock or event that mimics random assignment. The key word is exogenous—something that happens for reasons outside the system you're studying.
Example: A new law raising the school enrollment age affects some cohorts of students but not others. By comparing students just above and just below the age cutoff, you can estimate the effect of schooling on earnings, since the "treatment" (who is affected by the new law) is essentially random.
Another classic example: A large plant or military base closes in one city but not in nearby cities. This provides a natural control group, letting you estimate the effect on local employment by comparing the affected city to similar cities.
The power of natural experiments is that they make the assignment to treatment roughly random, eliminating confounding. You're not worried that people who chose more schooling are different in other ways—the law forced some people to get more schooling, regardless of their preferences.
Instrumental Variables
An instrumental variable (IV) is a variable that affects your independent variable $X$ but doesn't directly affect your dependent variable $Y$ (except through $X$). This lets you estimate the causal effect even when direct confounding exists.
Example: Suppose you want to estimate the effect of education on earnings, but ability affects both (confounding). Family education level (whether your parents are educated) is likely to influence your own education but shouldn't directly affect your earning ability beyond its effect on your education. So "parents' education" could be an instrumental variable.
The IV approach works in two stages:
Use the instrument to estimate the exogenous variation in $X$
Use this exogenous variation to estimate the effect on $Y$
Key requirement: The instrument must be correlated with $X$ (strong first stage) and must not directly affect $Y$ except through $X$ (exclusion restriction). Both are empirical questions that require careful justification based on economic theory and institutional knowledge.
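A sketch of the two stages on simulated data, where an unobserved "ability" confounder biases OLS upward while the instrument recovers the true effect. All names and coefficients are invented, and a real application would also compute proper 2SLS standard errors rather than reusing the second-stage OLS ones.

```python
# Two-stage least squares (2SLS) by hand on simulated data. The instrument z
# shifts x but affects y only through x, so it isolates exogenous variation.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
ability = rng.normal(0, 1, n)                 # unobserved confounder
z = rng.normal(0, 1, n)                       # instrument (e.g., parents' education)
x = 0.8 * z + ability + rng.normal(0, 1, n)   # "education": moved by z AND ability
y = 2.0 + 1.0 * x + ability + rng.normal(0, 1, n)  # invented true effect = 1.0

def ols_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
beta_ols = ols_fit(np.column_stack([ones, x]), y)[1]   # biased upward by the confounder

# Stage 1: predict x from the instrument alone.
x_hat = np.column_stack([ones, z]) @ ols_fit(np.column_stack([ones, z]), x)
# Stage 2: regress y on the predicted (exogenous) part of x.
beta_iv = ols_fit(np.column_stack([ones, x_hat]), y)[1]  # close to the true 1.0
```

The gap between `beta_ols` and `beta_iv` is exactly the confounding bias the IV strategy is designed to remove, which only works if the exclusion restriction really holds.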
Difference-in-Differences Designs
A difference-in-differences (DiD) design compares changes over time between a treated group and a control group. The key insight: if the treated and control groups follow parallel trends absent treatment, then the difference in their trends after treatment represents the causal effect.
Example: A state implements a new job training program in Year 2, but a neighboring state does not. You observe employment in both states before and after Year 2.
Without the program: both states should show similar employment trends
With the program: the treated state should show different employment changes
The difference in these changes estimates the program's causal effect
Why this works: Unlike simple before-after comparisons, DiD controls for time trends that affect everyone. Unlike simple comparisons of treated vs. untreated, DiD accounts for pre-existing differences between groups.
The identifying assumption is parallel trends: absent treatment, the treated and control groups would follow the same trend. This is unprovable but can be checked by examining trends in the pre-treatment period.
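The DiD arithmetic itself is simple: two differences, then their difference. A sketch with four invented group means (employment rates, say):

```python
# Difference-in-differences from four group means. All numbers are made up.
treated_before, treated_after = 60.0, 66.0   # treated state, pre/post program
control_before, control_after = 58.0, 61.0   # control state, pre/post

change_treated = treated_after - treated_before   # 6.0: common trend + program effect
change_control = control_after - control_before   # 3.0: common trend alone
did_estimate = change_treated - change_control    # 3.0: estimated program effect
print(did_estimate)
```

In regression form this is the coefficient on a treated-group indicator interacted with a post-period indicator, which also yields standard errors, but the point estimate is this same double difference.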
The Practical Workflow: Bringing It All Together
Empirical economic research doesn't unfold as a series of isolated steps. It's an iterative process where theory, method, and data constantly interact.
From Theory to Specification
Begin with economic theory. What does theory predict about relationships? This guides your choice of dependent variable, which independent variables to include, and their expected signs. Theory answers: "What should I measure?"
From Specification to Estimation
Next, specify your model and estimate it using OLS (or another appropriate method). This gives you coefficient estimates and standard errors. Theory and data together answer: "What are the estimated relationships?"
From Estimation to Diagnostics
Always check the assumptions. Do the diagnostic tests suggest your model is reliable? Are residuals reasonably normally distributed? Is there concerning multicollinearity? Is the relationship truly linear? These questions determine whether you trust your results.
From Diagnostics to Interpretation
If diagnostics pass, interpret your results. Are the estimated effects reasonable in magnitude? Do their signs match theory? Are they statistically significant? Interpretation connects numbers back to economic meaning.
From Interpretation to Robustness Checking
Responsible econometrics involves checking whether conclusions hold under different specifications. What if you include additional variables? Use a different data sample? Include squared terms to allow nonlinearity? If results hold across specifications, they're robust. If they change dramatically, that's a red flag.
Addressing Causality
Finally, consider whether your estimates have a causal interpretation. If correlation isn't enough for your purpose, you might need natural experiments, instrumental variables, or difference-in-differences designs. Often this requires rethinking your entire empirical strategy.
This workflow is iterative. You might discover through diagnostics that your model needs transformation, which sends you back to estimation. You might notice in robustness checks that results depend critically on one specification, which sends you back to theory to reconsider what should matter.
The goal is not to prove your theory right—it's to honestly assess what the data reveal and to report results transparently, acknowledging limitations.
Flashcards
What is the primary function of econometrics in relation to economic theory?
It uses statistical tools to transform economic theory into testable quantitative statements.
Which three elements intersect to form the field of econometrics?
Economic theory
Data
Statistical methods
In econometric model specification, what is the role of economic theory?
It determines which variables should matter and how they are expected to be related.
How is cross-sectional data defined?
Observations on different individuals or entities at a single point in time.
How is time-series data defined?
Observations on one variable collected over many time periods.
How is panel data defined?
The same units observed repeatedly over time, combining cross-sectional and time-series dimensions.
How is the dependent variable $Y$ modeled in OLS regression?
As a linear function of one or more independent variables $X$.
What is the estimation principle used to find the line or hyper-plane in OLS?
Minimizing the sum of squared differences between observed and predicted values.
What does each estimated coefficient in an OLS model quantify?
The average effect of its corresponding independent variable on the dependent variable.
Which three values are computed from estimated coefficients to assess uncertainty?
Standard errors
Confidence intervals
t-statistics
How do researchers indicate whether an estimated effect is statistically significant or due to chance?
By using a t-statistic and its associated p-value.
What are the two types of hypotheses formulated during testing?
The null hypothesis (claim holds) and the alternative hypothesis (claim does not hold).
What is the function of test statistics in the decision-making process?
They compare sample evidence to the null hypothesis to guide its rejection or non-rejection.
What does a p-value measure in the context of a null hypothesis?
The probability of observing data as extreme as those obtained if the null hypothesis were true.
In hypothesis testing, what is a significance level?
The threshold set for the rejection of the null hypothesis.
What does the linearity assumption require regarding independent variables?
That the relationship between each independent variable and the dependent variable is linear in parameters.
What is required by the "no perfect multicollinearity" assumption?
That no independent variable is an exact linear combination of the others.
What does the homoscedasticity assumption require regarding error terms?
That the variance of the error term is constant across all observations.
What does the normality assumption require for valid inference?
That the error terms are normally distributed.
What is the primary goal of basic causal inference techniques?
To separate genuine causal effects from mere correlations.
How do natural experiments identify causal effects?
By exploiting exogenous shocks or events that mimic random assignment.
How do instrumental variables achieve identification of causal effects?
They use external variables that affect the independent variable but not the dependent variable directly.
How do difference-in-differences designs isolate causal impacts?
By comparing changes over time between a treated group and a control group.
Through which three actions do researchers assess the reliability of their findings?
Checking model diagnostics
Testing hypotheses
Ensuring assumptions hold
Quiz
Introduction to Econometrics Quiz Question 1: Cross‑sectional data are characterized by which of the following?
- Observations on different individuals or entities at a single point in time (correct)
- Observations on one variable over many time periods
- Observations on the same units repeatedly over time
- Observations that combine both cross‑section and time dimensions
Introduction to Econometrics Quiz Question 2: In ordinary least squares regression, the dependent variable $Y$ is modeled as what type of function of the independent variables $X$?
- A linear function (correct)
- A logarithmic function
- A quadratic function
- A non‑parametric function
Introduction to Econometrics Quiz Question 3: The linearity assumption in OLS requires that the relationship between each independent variable and the dependent variable be what?
- Linear in parameters (correct)
- Constant across observations
- Normally distributed
- Heteroskedastic
Introduction to Econometrics Quiz Question 4: Econometrics is situated at the intersection of which three elements?
- Economic theory, data, and statistical methods (correct)
- Economic policy, accounting, and mathematics
- Finance, marketing, and operations research
- History, philosophy, and sociology
Introduction to Econometrics Quiz Question 5: Time‑series data consist of observations on a single variable recorded over what?
- Multiple time periods (correct)
- Different individuals at one point in time
- Various countries in a single year
- Multiple variables across several regions
Introduction to Econometrics Quiz Question 6: Panel data combine which two dimensions?
- Cross‑sectional units and time periods (correct)
- Geographic and demographic characteristics
- Experimental and control groups
- Qualitative and quantitative data types
Introduction to Econometrics Quiz Question 7: A p‑value measures the probability of observing data as extreme as those obtained under which condition?
- Assuming the null hypothesis is true (correct)
- Assuming the alternative hypothesis is true
- Assuming perfect multicollinearity
- Assuming normality of the error terms
Introduction to Econometrics Quiz Question 8: The no perfect multicollinearity assumption requires that no independent variable be …
- An exact linear combination of the others (correct)
- Correlated at any level with another variable
- Measured without error
- Qualitative rather than quantitative
Introduction to Econometrics Quiz Question 9: Homoscedasticity means the variance of the error term is …
- Constant across all observations (correct)
- Increasing with time
- Dependent on the level of the dependent variable
- Zero for all observations
Introduction to Econometrics Quiz Question 10: A key advantage of natural experiments in causal identification is that they …
- Mimic random assignment through exogenous shocks (correct)
- Allow researchers to directly control the treatment
- Eliminate the need for any data collection
- Guarantee perfect measurement of variables
Introduction to Econometrics Quiz Question 11: Difference‑in‑differences designs compare what to isolate causal impacts?
- Changes over time between treated and control groups (correct)
- Cross‑sectional differences at a single point in time
- Differences in means across unrelated groups
- Variations within the treatment group only
Key Concepts
Data Types
Cross‑sectional data
Time‑series data
Panel data
Econometric Methods
Ordinary least squares regression
Instrumental variable
Difference‑in‑differences
Natural experiment
Model Evaluation
Econometrics
Hypothesis testing
Model diagnostics
Definitions
Econometrics
The application of statistical methods to economic data for testing theories and estimating relationships.
Cross‑sectional data
Observations on multiple entities (individuals, firms, etc.) collected at a single point in time.
Time‑series data
Sequential observations of a single variable recorded over successive time periods.
Panel data
Data that track the same entities over time, combining cross‑sectional and time‑series dimensions.
Ordinary least squares regression
A linear estimation technique that minimizes the sum of squared residuals to fit a model.
Hypothesis testing
A statistical procedure for evaluating whether observed data are consistent with a specified null hypothesis.
Model diagnostics
Techniques for checking whether the assumptions underlying an econometric model are satisfied.
Instrumental variable
An external variable used to isolate causal effects when the explanatory variable is endogenous.
Difference‑in‑differences
A quasi‑experimental method that compares changes over time between a treatment and a control group.
Natural experiment
An observational study where external shocks approximate random assignment, enabling causal inference.