I have a dataset with high collinearity among the variables. When I built the linear regression model, I could not include more than five variables (I eliminated a feature whenever its VIF exceeded 5). But I need to keep all the variables in the model and find their relative importance. Is there any way around this? I was thinking about doing PCA and building the model on the principal components. Does that help?
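For reference, here is a minimal sketch of the elimination procedure described above, assuming a pandas DataFrame `X` containing only the numeric predictors; the helper name `drop_high_vif` is just illustrative, and the VIFs come from statsmodels:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Iteratively drop the feature with the largest VIF until all VIFs <= threshold."""
    X = X.copy()
    while True:
        design = sm.add_constant(X)  # include an intercept in the auxiliary regressions
        vifs = pd.Series(
            [variance_inflation_factor(design.values, i)
             for i in range(1, design.shape[1])],  # skip the constant column
            index=X.columns,
        )
        if vifs.max() <= threshold:
            return X
        X = X.drop(columns=[vifs.idxmax()])  # drop the worst offender and repeat
```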
-
Why can’t you include more than five variables? – Dave Oct 31 '21 at 02:30
-
Because VIF increases beyond 5 when I use more than 5 features. – NAS_2339 Oct 31 '21 at 17:50
-
So VIF exceeds $5$…how does that impact your analysis? – Dave Oct 31 '21 at 20:24
-
Doesn't that mean there is high collinearity in the data, so I can't keep those features? – NAS_2339 Nov 01 '21 at 01:17
-
But a VIF of 4.5 also means that there is (multi)collinearity. How does VIF $>5$ impact your analysis? – Dave Nov 01 '21 at 03:05
-
I set the threshold at 5. Isn't a VIF of 3-5 usually specified as the threshold? – NAS_2339 Nov 01 '21 at 05:18
-
Why have a threshold at all? – Dave Nov 01 '21 at 10:33
-
What are you suggesting? I'm not very clear – NAS_2339 Nov 01 '21 at 12:06
-
Why not just include all of your variables? Why set a cutoff based on VIF? – Dave Nov 01 '21 at 12:25
-
Wouldn't that make the coefficients unstable if multicollinearity exists? I intend to get feature importance from the model. What do you suggest? – NAS_2339 Nov 01 '21 at 12:32
-
I suggest you be very clear about your goals. // Yes, multicollinearity can result in coefficient instability (variance), but omitting variables can result in bias. Are you familiar with the bias-variance decomposition of mean squared error? // Figuring out which five (or four, or six) variables you will include in your model can invalidate downstream results. This is at least evocative of the [myriad issues with stepwise regression](https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/). (The math does not depend on Stata software.) – Dave Nov 01 '21 at 12:49
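For reference, the decomposition referred to above, written for a prediction $\hat{f}(x)$ of a response $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, is
$$\operatorname{E}\big[(y-\hat{f}(x))^2\big] \;=\; \underbrace{\big(\operatorname{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} \;+\; \underbrace{\operatorname{Var}\big[\hat{f}(x)\big]}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{irreducible error}}.$$
Dropping correlated predictors can shrink the variance term, but it can also inflate the bias term, which is why a hard VIF cutoff is not free.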
-
Thanks, Dave! I need to read up on this. But I'm still unclear about the way around this problem. How do I find the feature importance if I have high collinearity in the data? Is it possible to find feature importance in a meaningful way with this data? – NAS_2339 Nov 01 '21 at 13:04
-
It is a hard problem to untangle feature influence when the features are related. The gist is: when the features are related, how can you attribute changes in $y$ to one of the features rather than the other? – Dave Nov 01 '21 at 14:32
-
A [post](https://stats.stackexchange.com/questions/555145/ridge-regression-for-multicollinearity-and-outliers/555163#555163) of mine on the statistics Stack Exchange, Cross Validated, is worth a read. Correlated features get an undeserved bad rap. – Dave May 26 '22 at 00:35
2 Answers
When using PCA, you should no longer try to interpret the individual original features. The principal components are linear combinations of your variables; they are uncorrelated with one another, but each component generally mixes several of the original features, so it cannot be identified with any single one of them.
If you want feature importance, you can use random forests or decision trees instead, as described in the other answer. You can also do it with neural networks (or any other model) by shuffling one feature at a time and comparing the resulting performance (permutation importance), or by dropping the feature and retraining the model.
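A minimal sketch of permutation importance with scikit-learn, assuming `X` is a pandas DataFrame of predictors and `y` the response; the random forest here is just an example estimator, any fitted model works:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature several times and record the drop in held-out score.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, mean_imp in sorted(zip(X.columns, result.importances_mean),
                             key=lambda t: -t[1]):
    print(f"{name}: {mean_imp:.3f}")
```

Note that with highly correlated features the importance can be shared or diluted across the correlated group, which is exactly the attribution problem discussed in the comments.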
PCA will generate "new" (transformed) features which are orthogonal (uncorrelated). However, since the original features are transformed, you can hardly say much about the importance of the original features based on PCA.
One obvious alternative is to use a random forest (RF) to determine feature importance. With tree-based models (like RF or tree-based boosting) you do not need to worry about collinearity in the feature space.
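A minimal sketch of the random-forest route, assuming a pandas DataFrame `X` of predictors and response `y`; `feature_importances_` gives the impurity-based importances:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Fit on all features; trees do not require decorrelated inputs.
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Keep in mind that impurity-based importances can still be split somewhat arbitrarily among strongly correlated features, so permutation importance (as in the other answer) is a useful cross-check.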
-
But my principal components are still a linear combination of the original variables, right? Can I distribute the feature importance of the principal components to the original variables somehow? – NAS_2339 Oct 26 '21 at 05:49
-