📌 "R-squared always increases when you add more variables, even useless ones. Adjusted R-squared fixes this flaw." This single sentence captures the entire purpose of Adjusted R-squared. This article breaks down both metrics with clear examples.
In econometrics and quantitative research, regression models help us understand relationships between variables. Two common metrics to evaluate how well a model fits the data are R-squared and Adjusted R-squared. While they look similar, they serve different purposes. R-squared tells you the proportion of variance explained, but it has a critical weakness: it never decreases when you add more predictors. Adjusted R-squared corrects this by penalizing model complexity.
What is R-Squared?
R-squared, also known as the coefficient of determination, is a statistical measure of the proportion of the dependent variable's variation that is explained by the independent variables in the model. For OLS models fitted with an intercept, its value ranges from 0 to 1 (0% to 100%); a higher R-squared generally indicates a better in-sample fit.
Example 1 (simple regression)
Model: Salary = β₀ + β₁(Experience) + ε
Result: R-squared = 0.65, meaning experience explains 65% of the variation in salary.
Example 2 (multiple regression)
Model: GDP Growth = β₀ + β₁(Investment) + β₂(Labor Force) + ε
Result: R-squared = 0.82, meaning investment and labor force together explain 82% of the variation in GDP growth.
The Problem with R-Squared: It Always Increases
The formula for R-squared is: R² = 1 - (SSres / SStot). Adding a new variable, even pure noise, can never increase the residual sum of squares (SSres): at worst, OLS sets the new coefficient to zero, and in practice SSres almost always falls at least slightly. As a result, R-squared never decreases when you add predictors, which makes it unreliable for comparing models with different numbers of variables.
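You can see this behavior directly. Below is a minimal numpy-only sketch using made-up salary data (the variable names and numbers are illustrative, not from the article): R² is computed as 1 - SSres/SStot from an OLS fit, and appending a pure-noise predictor still does not lower it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data (assumed): salary driven by experience plus noise.
n = 100
experience = rng.uniform(0, 20, n)
salary = 30_000 + 2_000 * experience + rng.normal(0, 5_000, n)

def r_squared(X, y):
    """R^2 = 1 - SSres / SStot for an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS coefficients
    ss_res = np.sum((y - X @ beta) ** 2)           # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
    return 1 - ss_res / ss_tot

r2_base = r_squared(experience.reshape(-1, 1), salary)

# Append a predictor of pure noise: R^2 still does not decrease.
noise = rng.normal(size=n)
r2_noisy = r_squared(np.column_stack([experience, noise]), salary)

print(r2_base, r2_noisy)
assert r2_noisy >= r2_base - 1e-12  # never falls (tolerance for float error)
```

The nested-model property guarantees `r2_noisy >= r2_base` mathematically; the tiny tolerance only guards against floating-point rounding.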
⚠️ The Key Pitfall of R-Squared
- Flaw: R-squared mechanically increases with every added variable, encouraging overfitting.
- Consequence: You could add irrelevant predictors (like "number of pets owned" to a salary model) and still see R-squared rise slightly, misleading you into thinking the model improved.
- Solution: Use Adjusted R-squared for model comparison.
What is Adjusted R-Squared?
Adjusted R-squared corrects R-squared for the number of predictors (k, excluding the intercept) and the sample size (n). Its formula is: Adjusted R² = 1 - [(1 - R²) * (n - 1) / (n - k - 1)]. Unlike R-squared, Adjusted R-squared can decrease if a new predictor doesn't improve the fit enough to justify the added complexity.
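The formula is a one-liner in code. This sketch plugs in the article's R² = 0.75 with an assumed sample size of n = 50 and k = 2 predictors (the article does not state n, so the exact adjusted value differs slightly from its figures):

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# R^2 = 0.75, n = 50 observations (assumed), k = 2 predictors:
print(round(adjusted_r_squared(0.75, 50, 2), 3))  # → 0.739
```

Note how the penalty grows with k and shrinks as n grows: with a large sample, a couple of extra predictors barely move the adjusted value.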
Original Model (2 predictors): House Price = β₀ + β₁(Size) + β₂(Bedrooms) + ε
R-squared: 0.75
Adjusted R-squared: 0.743
New Model (3 predictors): House Price = β₀ + β₁(Size) + β₂(Bedrooms) + β₃(Random Noise) + ε
R-squared: 0.751 (increased slightly)
Adjusted R-squared: 0.741 (decreased!)
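The random-noise effect shown above can be reproduced by simulation. This numpy-only sketch uses hypothetical house-price data (all numbers assumed, not the article's): it repeatedly appends a pure-noise predictor and tallies how each metric reacts. R² never falls, while Adjusted R² usually does, because a noise variable rarely clears the complexity penalty.

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_metrics(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit with an intercept."""
    n, k = len(y), X.shape[1]
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    ss_res = np.sum((y - Xd @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

# Hypothetical data in the spirit of the house-price example.
n = 40
size = rng.uniform(50, 250, n)
bedrooms = rng.integers(1, 6, n).astype(float)
price = 50 + 2.0 * size + 10 * bedrooms + rng.normal(0, 40, n)
base = np.column_stack([size, bedrooms])
r2_base, adj_base = fit_metrics(base, price)

# Add a pure-noise predictor 200 times and tally what each metric does.
r2_up = adj_down = 0
trials = 200
for _ in range(trials):
    r2, adj = fit_metrics(np.column_stack([base, rng.normal(size=n)]), price)
    r2_up += r2 >= r2_base - 1e-12   # R^2 never falls
    adj_down += adj < adj_base       # adjusted R^2 usually falls

print(f"R^2 rose or held in {r2_up}/{trials} trials")
print(f"Adjusted R^2 fell in {adj_down}/{trials} trials")
```

Adjusted R² falls in roughly two-thirds of trials (it rises only when the noise variable's t-statistic happens to exceed 1 in absolute value), whereas R² holds or rises in every single one.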
Original Model (1 predictor): Car Fuel Efficiency = β₀ + β₁(Engine Size) + ε
R-squared: 0.60
Adjusted R-squared: 0.595
New Model (2 predictors): Car Fuel Efficiency = β₀ + β₁(Engine Size) + β₂(Weight) + ε
R-squared: 0.78 (increased)
Adjusted R-squared: 0.775 (also increased)
When to Use Which Metric?
| Scenario | Use R-Squared | Use Adjusted R-Squared |
|---|---|---|
| Describing a single model's fit | ✅ Yes. "This model explains 70% of the variance." | Optional, but Adjusted R-squared is more honest. |
| Comparing models with different # of predictors | ❌ No. It will mislead you. | ✅ Yes. This is its primary purpose. |
| Checking if a new variable improves the model | ❌ No. It will always say yes. | ✅ Yes. It will only increase if the variable adds sufficient explanatory power. |
| Reporting results in academic papers | Often reported alongside. | ✅ Almost always required as the primary fit statistic. |
Key Takeaways
1. R-squared measures fit, Adjusted R-squared measures fit per predictor. R-squared tells you how good the model is. Adjusted R-squared tells you how good it is given how many variables you used.
2. Adjusted R-squared is always lower than or equal to R-squared. The gap widens as you add more variables relative to your sample size, and for very poor fits Adjusted R-squared can even turn negative.
3. The rule for model selection is simple: When comparing models, choose the one with the highest Adjusted R-squared. This automatically balances explanatory power with model simplicity.
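The selection rule in takeaway 3 can be sketched as a simple loop over candidate feature sets: score each by Adjusted R². All data and variable names here are hypothetical, constructed so that x1 and x2 matter and x3 is noise.

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R^2 of an OLS fit with an intercept (numpy-only sketch)."""
    n, k = len(y), X.shape[1]
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r2 = 1 - np.sum((y - Xd @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(7)
n = 60
x1, x2, x3 = rng.normal(size=(3, n))   # x3 will be irrelevant noise
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1.0, n)

candidates = {
    "x1 only":      np.column_stack([x1]),
    "x1 + x2":      np.column_stack([x1, x2]),
    "x1 + x2 + x3": np.column_stack([x1, x2, x3]),
}
scores = {name: adj_r2(X, y) for name, X in candidates.items()}
for name, s in scores.items():
    print(f"{name:14s} adjusted R^2 = {s:.4f}")
```

Because x2 carries real signal, adding it raises Adjusted R² decisively over the x1-only model; whether the noise variable x3 nudges the score up or down depends on the random draw, which is exactly why the metric, not intuition, should arbitrate.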
In practice, serious quantitative analysis always uses Adjusted R-squared for model comparison. R-squared alone is an incomplete and potentially misleading statistic.