
I have a model like y = mx. Since the adjusted R² tells you the percentage of variation explained by only the independent variables that actually affect the dependent variable, and I have only one independent variable, do I need to consider my adjusted R² value? Or is R² good enough for this type of model?

Peter

3 Answers


They're going to be very similar (practically the same) for a model with only one independent variable. So I'd say it doesn't matter much, at least without understanding better what you want to use R² / adjusted R² for.
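For a quick sanity check, here is a minimal sketch in R using the built-in cars dataset (any single-predictor model would do); the two values printed at the end should come out nearly identical:

# One-predictor model on R's built-in cars data (illustrative only)
fit = lm(dist ~ speed, data = cars)

# With a single regressor, R^2 and adjusted R^2 barely differ
summary(fit)$r.squared
summary(fit)$adj.r.squared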


Your interpretation of R² is not correct.

R2 tells you the percentage of variation explained by only the independent variables that actually affect the dependent variable

R² does not perform any variable selection. It is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

However, there is a common misconception about R²: it does not tell you whether your model is correctly specified (e.g. homoscedasticity, no autocorrelation, etc.), nor does it tell you whether your regressor is significant.

An extremely high R² can also be the result of a spurious regression (i.e. the model is not correctly specified).
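As a rough illustration of that point, regressing one random walk on another, completely independent random walk often produces a surprisingly large R² (the exact value depends on the seed), even though there is no true relationship:

# Spurious regression sketch: two independent random walks (illustrative only)
set.seed(1)
x = cumsum(rnorm(200))   # random walk
y = cumsum(rnorm(200))   # independent random walk, unrelated to x

# R^2 is often sizeable here despite x having no real effect on y
summary(lm(y ~ x))$r.squared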

Nonetheless, whether to use adjusted R² or R² depends somewhat on your sample size. If you have enough observations (and only a small number of regressors, i.e. you lose few degrees of freedom), then adjusted R² and R² are almost identical. Prefer adjusted R² if you have only a few data points to estimate your model.
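To see the small-sample case, here is a minimal sketch with purely simulated noise (the variable names are made up for illustration): with only a dozen observations and three regressors, R² and adjusted R² can diverge noticeably.

# Few observations, several pure-noise regressors (illustrative only)
set.seed(2)
dat = data.frame(y = rnorm(12), x1 = rnorm(12), x2 = rnorm(12), x3 = rnorm(12))
fit = lm(y ~ x1 + x2 + x3, data = dat)

# R^2 picks up noise; adjusted R^2 is typically much lower, often near zero or negative
summary(fit)$r.squared
summary(fit)$adj.r.squared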

Maeaex1

Your question boils down to what the difference between $R^2$ and $\bar{R^2}$ is.

R-squared is given by: $$ R^2 = 1 - \frac{SSR/n}{SST/n}. $$

The adjusted R-squared is given by: $$ \bar{R^2} = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)}. $$

  • $SSR$ is the sum of squared residuals $\sum u_i^2$,

  • $SST$ is the total sum of squares $\sum_i (y_i-\bar{y})^2$,

  • $n$ is the number of observations,

  • and $k$ is the number of independent variables (the number of $x$ variables).

So essentially, the adjusted R-squared "adjusts" for the degrees of freedom in your model. This is done by introducing a "penalty" for adding more independent variables $k$.
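Combining the two formulas above, the adjustment can also be written as an explicit penalty on $R^2$: $$ \bar{R^2} = 1-(1-R^2)\,\frac{n-1}{n-k-1}, $$ so for fixed $n$, every additional regressor increases $k$ and pulls $\bar{R^2}$ down unless $R^2$ rises enough to compensate.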

It is easy to write this in R:

# Regression using mtcars data
reg = lm(mpg ~ cyl, data = mtcars)

# Define n (number of observations) and k (number of regressors)
n = length(mtcars$mpg)
k = n - 1 - df.residual(reg)   # df.residual(reg) = n - k - 1, so solve for k

# Calculate SSR, SST
ssr = sum(resid(reg)^2)
sst = sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Calculate r2, r2_bar
r2  = 1-(ssr/n)/(sst/n)
r2_bar = 1-(ssr/(n-k-1))/(sst/(n-1))

# Compare results
r2
summary(reg)$r.squared
r2_bar
summary(reg)$adj.r.squared

The adjustment for degrees of freedom matters because, when you add more $x$ variables to your model, the new variables may well not help to explain $y$ (so there is no real improvement). Yet adding variables never increases $SSR$: $SSR$ falls (or at worst stays the same), and the residual degrees of freedom fall along with it.

So $R^2$ can be a little misleading, while $\bar{R^2}$, thanks to the degrees-of-freedom adjustment, provides better guidance when comparing (nested) models with different $k$.

In the little exercise below, I add a "noisy" variable ($x_2$) which does not help much to explain $y$. After adding $x_2$, $R^2$ goes up while $\bar{R^2}$ goes down. This is essentially what $\bar{R^2}$ is supposed to do: show whether the loss of degrees of freedom is worth the improvement from adding a new variable.

# Use simulated data to compare r2, r2_bar
# Set seed for reproducible results
set.seed(81)

# Draw y, x1 from normal distribution
y = rnorm(100, mean = 0, sd = 1)
x1 = rnorm(100, mean = 0, sd = 1)

# Draw from uniform distribution 
# Lot of noise, little explanatory power
x2 = runif(100, min = 0, max = 1)

# Compare r2, r2_bar
summary(lm(y~x1))$r.squared
summary(lm(y~x1))$adj.r.squared
summary(lm(y~x1+x2))$r.squared
summary(lm(y~x1+x2))$adj.r.squared
Peter