📌 "OLS is the default, but IV is your solution when causality is hidden." Both are fundamental tools for finding relationships in data, but they answer different questions under different conditions. This guide breaks down the key differences in plain English.
The Core Problem: Endogeneity
Before choosing a method, you must understand the problem of endogeneity. This occurs when an explanatory variable is correlated with the error term in your regression model. It's like trying to measure the effect of studying on test scores, but smarter students both study more and naturally score higher. The 'smartness' mixes with your variable, making the effect of 'studying' unclear.
Model: Wage = β₀ + β₁(Education) + ε
Goal: Find the true effect of one more year of education (Education) on wage.
Problem: Ability (talent, motivation) affects both Education and Wage, but it is unobserved and so ends up in the error term ε. Education is therefore correlated with ε (endogeneity). If ability raises both education and wages, the OLS estimate of β₁ is biased upward.
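To make the bias concrete, here is a minimal simulation sketch of the wage example. Every number in it (the true effect of 0.10, the strength of ability, the sample size) is an illustrative assumption, not an empirical estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
true_beta = 0.10  # assumed true effect of one year of education (illustrative)

ability = rng.normal(size=n)                          # unobserved in real data
education = 12 + 2.0 * ability + rng.normal(size=n)   # ability raises education
wage = 1.0 + true_beta * education + 0.3 * ability + rng.normal(scale=0.5, size=n)

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Ability is left out of the regression, so it sits in the error term.
beta_ols = ols_slope(education, wage)
print(f"true beta = {true_beta:.3f}, OLS estimate = {beta_ols:.3f}")
```

In this setup the OLS slope lands well above the true value because education picks up part of the ability effect.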
Model: Quantity Sold = β₀ + β₁(Price) + ε
Goal: Estimate how price changes affect quantity sold (demand elasticity).
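Problem: Price is not set randomly. When demand is high (ε is positive), both Price and Quantity Sold go up. This creates a positive correlation between Price and ε, which pushes the OLS estimate of β₁ upward (toward zero, or even positive), so demand looks less price-sensitive than it really is.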
Ordinary Least Squares (OLS) Regression
OLS is the standard, workhorse method. It finds the line (or hyperplane) that minimizes the sum of squared vertical distances between the observed data points and the fitted line.
- Assumption: All explanatory variables are exogenous (uncorrelated with the error term).
- When it works: In controlled experiments, or when you are confident there are no hidden, correlated factors.
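- Output: Unbiased estimates under exogeneity, and efficient ones (lowest variance among linear unbiased estimators) when the errors are also homoskedastic and uncorrelated.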
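As a small sketch of what "minimizing squared distances" means in practice, the snippet below fits OLS with NumPy's least-squares solver on simulated data where x is exogenous by construction. The coefficients 2.0 and 3.0 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)   # error is independent of x

X = np.column_stack([np.ones(n), x])     # design matrix with an intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
# At the least-squares solution the residuals are orthogonal to every regressor.
print("coefficients:", beta_hat)
print("X'residuals :", X.T @ residuals)
```

The orthogonality of residuals and regressors is the sample analogue of the exogeneity assumption: OLS imposes it on the data whether or not it holds in the population.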
Instrumental Variables (IV) Regression
IV is a clever fix for endogeneity. It uses a third variable, called an instrument, to isolate the variation in your problematic variable that is unrelated to the error term.
- Core Idea: Find an instrument (Z) that: 1) Affects your endogenous variable (X) (relevance), and 2) Affects the outcome (Y) only through X, so that Z is uncorrelated with the error term (exclusion).
- When to use it: When you suspect or have evidence of endogeneity (e.g., omitted variable bias, measurement error, simultaneity).
- Trade-off: IV removes endogeneity bias in large samples (it is consistent, not unbiased), but its estimates have larger standard errors (are less precise) than OLS.
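For the simplest case of one endogenous variable X and one instrument Z, the two conditions lead directly to the IV estimator. The short derivation below is a sketch in the notation of the models above; the symbol Z and the covariance notation are the only additions.

```latex
\begin{aligned}
Y &= \beta_0 + \beta_1 X + \varepsilon \\
\operatorname{Cov}(Z, Y) &= \beta_1 \operatorname{Cov}(Z, X) + \operatorname{Cov}(Z, \varepsilon) \\
\beta_1 &= \frac{\operatorname{Cov}(Z, Y)}{\operatorname{Cov}(Z, X)}
\quad \text{if } \operatorname{Cov}(Z, \varepsilon) = 0 \text{ and } \operatorname{Cov}(Z, X) \neq 0
\end{aligned}
```

The denominator explains the trade-off: when Z moves X only a little, Cov(Z, X) is close to zero and the ratio becomes very noisy.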
Problem: OLS is biased for Education's effect on Wage due to omitted Ability.
Instrument (Z): Distance to the nearest college.
Logic:
1. Relevance: Living farther from college reduces education (affects X).
2. Exclusion: Distance to college affects wages only by changing education level, not directly (no effect on Y except through X).
Result: IV uses only the variation in education that is driven by differences in distance to estimate the wage effect, purging the 'Ability' bias.
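The logic above can be sketched as a hand-rolled two-stage least squares (2SLS) on simulated data. The data-generating process is an assumption made for illustration: distance is drawn independently of ability, which is exactly the exclusion condition that cannot be verified in real data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
true_beta = 0.10  # assumed true effect of education (illustrative)

ability = rng.normal(size=n)                    # unobserved confounder
distance = rng.exponential(scale=1.0, size=n)   # hypothetical instrument, independent of ability
education = 13 - 1.0 * distance + 2.0 * ability + rng.normal(size=n)
wage = 1.0 + true_beta * education + 0.3 * ability + rng.normal(scale=0.5, size=n)

def fit(X, y):
    """Least-squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

const = np.ones(n)

# Stage 1: predict education from the instrument.
Z = np.column_stack([const, distance])
education_hat = Z @ fit(Z, education)

# Stage 2: regress wage on the predicted (instrument-driven) education.
beta_2sls = fit(np.column_stack([const, education_hat]), wage)[1]

# Plain OLS for comparison.
beta_ols = fit(np.column_stack([const, education]), wage)[1]

print(f"true = {true_beta:.3f}, OLS = {beta_ols:.3f}, 2SLS = {beta_2sls:.3f}")
```

Running the two stages by hand gives the right point estimate, but the standard errors reported by a naive second-stage regression are not valid. For inference, a dedicated IV routine is the safer choice.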
Problem: OLS is biased for Price's effect on Quantity due to supply-demand simultaneity.
Instrument (Z): Cost of raw materials for the seller.
Logic:
1. Relevance: Higher material costs shift the supply curve, changing the market Price (affects X).
2. Exclusion: Material costs don't affect consumer demand directly, only through the resulting price change (no effect on Y except through X).
Result: IV isolates price changes that come from supply shocks (cost changes), allowing us to cleanly trace the demand curve.
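A minimal sketch of the supply-and-demand case, using the ratio form of the IV estimator, Cov(Z, Y) / Cov(Z, X). The linear demand and supply curves and all their coefficients are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
true_slope = -1.0  # assumed true demand slope (illustrative)

demand_shock = rng.normal(size=n)   # the error term in the demand equation
supply_shock = rng.normal(size=n)
cost = rng.normal(size=n)           # raw material cost: shifts supply only

# Demand: Q = 10 + true_slope * P + demand_shock
# Supply: Q = 2 + 1.0 * P - 1.0 * cost + supply_shock
# Setting demand equal to supply and solving for the market price:
price = (8.0 + cost + demand_shock - supply_shock) / (1.0 - true_slope)
quantity = 10.0 + true_slope * price + demand_shock

def cov(a, b):
    return np.cov(a, b)[0, 1]

beta_ols = cov(price, quantity) / cov(price, price)   # contaminated by demand shocks
beta_iv = cov(cost, quantity) / cov(cost, price)      # uses cost-driven price changes only

print(f"true = {true_slope:.3f}, OLS = {beta_ols:.3f}, IV = {beta_iv:.3f}")
```

Because cost moves price without touching the demand shock, the ratio recovers the demand slope, while OLS understates how strongly quantity responds to price.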
⚠️ Critical Pitfalls & Common Mistakes
- Using IV when OLS is fine: IV estimates are less precise. If no endogeneity exists, OLS is better. Test for endogeneity first (e.g., with a Hausman test, which itself relies on having a valid instrument).
- Weak Instruments: If your instrument is only weakly correlated with the endogenous variable, IV estimates can be badly biased (toward the OLS estimate) and their standard errors unreliable. A common rule of thumb: the first-stage F-statistic should exceed 10.
- Invalid Instruments: An instrument that affects Y through any channel other than X (violating exclusion) is fatal. With a single instrument this cannot be tested; with more instruments than endogenous variables, overidentification tests (e.g., Sargan-Hansen) offer only a partial check. Validity must be justified by economic theory and logic.
- Misinterpreting IV: When effects differ across individuals, IV estimates a Local Average Treatment Effect (LATE): the effect for the subpopulation whose behavior was changed by the instrument (the compliers). It may not equal the effect for everyone.
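The first-stage F-statistic mentioned above can be computed by comparing a first-stage regression with and without the instrument. The helper below, `first_stage_f`, is a hypothetical name and a simplified sketch: it assumes no additional control variables and homoskedastic errors.

```python
import numpy as np

def first_stage_f(x, z):
    """F-statistic for H0: the instrument(s) z have no effect on x.

    Compares an intercept-only model of x with a model that adds z.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = np.asarray(z, dtype=float).reshape(n, -1)
    Z = np.column_stack([np.ones(n), z])
    coef, *_ = np.linalg.lstsq(Z, x, rcond=None)
    rss_unrestricted = np.sum((x - Z @ coef) ** 2)
    rss_restricted = np.sum((x - x.mean()) ** 2)
    q = z.shape[1]  # number of instruments
    return ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / (n - q - 1))

rng = np.random.default_rng(4)
n = 500
confounder = rng.normal(size=n)
z = rng.normal(size=n)
x_strong = 1.00 * z + confounder + rng.normal(size=n)   # instrument moves x a lot
x_weak = 0.02 * z + confounder + rng.normal(size=n)     # instrument barely moves x

f_strong = first_stage_f(x_strong, z)
f_weak = first_stage_f(x_weak, z)
print(f"strong instrument F = {f_strong:.1f}, weak instrument F = {f_weak:.1f}")
```

With control variables in the model, the F-statistic should test only the excluded instruments, holding the controls fixed in both the restricted and unrestricted regressions.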
Direct Comparison Table
| Aspect | OLS Regression | IV Regression |
|---|---|---|
| Main Goal | Find best linear fit to data | Correct for endogeneity bias |
| Key Assumption | Exogeneity: Cov(X, ε)=0 | Instrument Validity & Relevance |
| When to Use | No suspected hidden variable bias; experimental data | Suspected omitted variable, measurement error, or simultaneity |
| Estimate Property | Best Linear Unbiased Estimator (BLUE) if assumptions hold | Consistent but less efficient (higher variance) |
| Interpretation | Average effect across all data | Effect for the 'complier' subpopulation (LATE) |
| Primary Risk | Bias from violated exogeneity | Bias from weak or invalid instruments |
The Practical Decision Flow
Follow these steps to choose your method:
- Start with Economic Theory: Ask, "Could there be a hidden factor affecting both my X and Y?" If yes, suspect endogeneity.
- Use OLS as a Baseline: Always run OLS first. It's your benchmark.
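- Test for Endogeneity: Use statistical tests (e.g., Durbin-Wu-Hausman). These tests compare OLS against IV, so they need the candidate instrument from the next step. If you reject exogeneity, treat OLS as biased.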
- Find a Valid Instrument: This is the hardest part. The instrument must be strongly correlated with X and have a solid theoretical argument for affecting Y only through X.
- Run IV and Compare: If the IV estimate is statistically different from OLS and your instrument is strong and plausibly valid, prefer IV. Report both results.
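The statistical steps of this flow can be sketched end to end on simulated data. This is a simplified illustration under stated assumptions: one endogenous variable, one instrument, homoskedastic errors, and a regression-based version of the Durbin-Wu-Hausman test (add the first-stage residual to the outcome regression and check its t-statistic). The helper `ols` and all coefficients are illustrative.

```python
import numpy as np

def ols(X, y):
    """Coefficients and conventional (homoskedastic) standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, se

rng = np.random.default_rng(5)
n = 20_000
u = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                       # instrument
x = 0.8 * z + u + rng.normal(size=n)         # endogenous regressor
y = 1.0 + 0.5 * x + u + rng.normal(size=n)   # assumed true effect of x is 0.5

const = np.ones(n)

# OLS baseline.
b_ols, _ = ols(np.column_stack([const, x]), y)

# Instrument strength: with one instrument, first-stage F equals t squared.
Z = np.column_stack([const, z])
gamma, gamma_se = ols(Z, x)
first_stage_F = (gamma[1] / gamma_se[1]) ** 2

# Regression-based Durbin-Wu-Hausman: is the first-stage residual significant?
v_hat = x - Z @ gamma
c, c_se = ols(np.column_stack([const, x, v_hat]), y)
dwh_t = c[2] / c_se[2]

# IV estimate (ratio form, valid for a single instrument).
b_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS = {b_ols[1]:.3f}, IV = {b_iv:.3f}")
print(f"first-stage F = {first_stage_F:.1f}, DWH t-statistic = {dwh_t:.2f}")
```

A large first-stage F together with a DWH t-statistic far from zero points toward IV here. None of these numbers says anything about the exclusion restriction, which still has to be argued on theoretical grounds.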