I was running a linear regression on the Wooldridge dataset GPA2, which ships with the Python library named wooldridge.
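For completeness, here is roughly how I load the data (I'm assuming the package's data() loader here; the dataset name may be listed in lowercase by the package):

import wooldridge
import statsmodels.formula.api as smf

# Load the GPA2 dataset as a pandas DataFrame
gpa = wooldridge.data('gpa2')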
I tried two linear regressions. The first:
results = smf.ols('colgpa ~ hsperc + sat', data=gpa).fit()
And the second:
results = smf.ols('colgpa ~ hsperc + sat - 1', data=gpa).fit()
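For reference, here are both fits together; the "- 1" in the patsy formula is what drops the intercept (the variable names below are mine):

results_with = smf.ols('colgpa ~ hsperc + sat', data=gpa).fit()
results_without = smf.ols('colgpa ~ hsperc + sat - 1', data=gpa).fit()

# Confirm the second model really has no intercept term
print(results_with.params.index.tolist())     # ['Intercept', 'hsperc', 'sat']
print(results_without.params.index.tolist())  # ['hsperc', 'sat']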
As you can see, the only difference is that I removed the intercept from the second equation. However, several things changed: (I) the high-multicollinearity warning disappeared when I removed the intercept; (II) both the R-squared and the adjusted R-squared went from 0.273 to 0.954; (III) the F-statistic changed dramatically as well: in the first model its p-value (the Prob (F-statistic) line) was 1.77e-287, while in the second model the statistic itself is 4.284e+04.
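To be concrete, these are the values I'm comparing, read off each fitted result (using the names from the snippet above):

for name, res in [('with intercept', results_with), ('without intercept', results_without)]:
    # R-squared, adjusted R-squared, F-statistic, and its p-value
    print(name, res.rsquared, res.rsquared_adj, res.fvalue, res.f_pvalue)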
Why would this happen just from removing the intercept? Shouldn't the two models really be pretty similar?
Also, when computing variance inflation factors, I got a pretty high value for the constant. How is that possible?
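The VIF check looked roughly like this (a sketch using statsmodels' variance_inflation_factor; the variable names are mine):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

# Design matrix with an explicit constant column
X = add_constant(gpa[['hsperc', 'sat']])
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)  # the 'const' row is the surprisingly large one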
Thanks