This question is quite long; if you already know how feature importance for tree-based methods works, I suggest you skip to the text below the image.

Feature importance (FI) in tree-based methods is computed by looking at how much each variable decreases the impurity of the tree (for single trees) or the mean impurity decrease across trees (for ensemble methods). I'm almost sure the FI for single trees is not reliable, due to the high variance of trees, mainly in how the terminal regions are built. XGBoost is empirically better than a single tree and arguably "the best" ensemble learning algorithm, so I will focus on it. One advantage of XGBoost is its regularization to avoid overfitting; XGBoost can also learn linear functions as well as linear regression or linear classifiers can (see Didrik Nielsen).
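To make the impurity-decrease definition concrete, here is a minimal sketch on synthetic scikit-learn data (nothing here is from my actual dataset): sklearn exposes exactly this normalized impurity decrease as `feature_importances_`, both for a single tree and averaged over an ensemble.

```python
# Minimal sketch of impurity-based (MDI) feature importance; the data here
# is synthetic, not the dataset from the question.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Single tree: normalized total impurity decrease from splits on each
# feature -- high variance from run to run.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.feature_importances_)

# Ensemble: the same quantity averaged over many trees, which is more stable.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(forest.feature_importances_)
```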

[Image: feature importance per variable, XGBoost FI on top and logistic regression coefficients below]

My trouble with XGBoost's interpretation came up due to the image above. In the upper plot I have the FI from XGBoost for each variable, and below it the FI (the coefficients) from a logistic regression model. I know the FI for XGBoost is normalized to the 0-1 range and the logistic regression coefficients are not, but the functions usually used for normalization are bijective, so that shouldn't compromise the comparison between the FIs of the two models. Logistic regression reaches the same accuracy (~90%) as XGBoost under cross-validation and on the test set. Note that the three most important variables for XGBoost are v5, v6, v8, while for the logistic model they are v1, v2, v3 (each list is in order of importance), so the two models disagree completely. I'm sure the interpretation of the logistic model is reliable, so does this difference mean the XGBoost interpretation is unreliable? And if so, would that hold only for linear situations, or in the general case?
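For reference, this is roughly how I obtained the two sets of importances, sketched here on placeholder data (X, y and the variable names are stand-ins for my actual dataset):

```python
# Hedged reconstruction of the comparison above; X, y and the v1..v8 names
# are placeholders, not the real dataset.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
feature_names = [f"v{i}" for i in range(1, 9)]

xgb = XGBClassifier().fit(X, y)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

# XGBoost's sklearn wrapper reports importances normalized to sum to 1.
xgb_fi = pd.Series(xgb.feature_importances_, index=feature_names)
# Raw logistic coefficients, by contrast, live on the scale of each feature.
logreg_fi = pd.Series(abs(logreg.coef_[0]), index=feature_names)

print(xgb_fi.nlargest(3).index.tolist())     # top three by XGBoost importance
print(logreg_fi.nlargest(3).index.tolist())  # top three by |coefficient|
```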

  • If you admins think I should put this post on Stats StackExchange to get some feedback, please tell me; I'm still not sure where tree-method questions would best fit. – Davi Américo Jul 15 '21 at 22:08
  • I'd like to first confirm that your logreg model's importances are computed correctly; could you provide the code? (In particular, are the features scaled in advance, and if not, did you scale the coefficients to produce importances?) – Ben Reiniger Jul 15 '21 at 23:11
  • I've used the XGBoost FI from the sklearn API (see https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn), so its FI has been evaluated by MDI (see https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html). I haven't used any transformation on the data. In the logistic model I used the coefficients of the sigmoid function (see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). – Davi Américo Jul 16 '21 at 00:38
  • You need to scale your data, or postprocess the coefficients, if you want to use the coefficients as a measure of importance. See e.g. https://datascience.stackexchange.com/q/30302/55122 and https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html#interpreting-coefficients-scale-matters – Ben Reiniger Jul 16 '21 at 01:04
  • I didn't think that would solve my problem at all, but now I've got v5, v6, v8 for XGBoost and v6, v7, v8 for logistic regression, and if I take the five most important ones I get v6, v7, v8, v3, v5 for both. I'm still in doubt whether scaling the data compromises "extending" the interpretation of the scaled variables back to the original variables; what do you think? Please post your last comment as an answer; I will consider it. Thank you a lot. – Davi Américo Jul 16 '21 at 01:50

1 Answer


Your main problem (it turns out, thanks for following up in the comments) is that you used the raw coefficients from the logistic regression as a measure of importance, but the scale of the features makes such comparisons invalid. You should either scale the features before training, or process the coefficients after.
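A minimal sketch of both options, on placeholder data (none of the names below come from your setup):

```python
# Two ways to make logistic coefficients comparable across features;
# synthetic data stands in for the real dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Option 1: standardize first, so every coefficient is "per standard
# deviation" and the magnitudes can be compared directly.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
scaled_importances = abs(pipe[-1].coef_[0])

# Option 2: fit on the raw features, then rescale each coefficient by the
# standard deviation of its feature.
logreg = LogisticRegression(max_iter=1000).fit(X, y)
rescaled_importances = abs(logreg.coef_[0]) * X.std(axis=0)
```

The two give broadly similar rankings (regularization interacts with scaling, so they need not match exactly); the point is that a coefficient on a feature measured in thousands is not comparable to one on a feature measured in fractions until you account for the scales.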

I find it helpful to emphasize that feature importances are generally about interpreting your model, which hopefully but not necessarily in turn tells you about the data. So in this case, it could be that some set of features has predictive interaction, or that some feature's relationship with the target is nonlinear; these will be found important for the xgboost model, but not for the linear one.
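To illustrate the nonlinear case: a feature that acts on the target only through its square typically gets a near-zero logistic coefficient but a large tree-ensemble importance. A sketch on synthetic data (nothing here comes from your dataset):

```python
# Synthetic illustration (not the questioner's data): v1 acts linearly on
# the log-odds, v2 acts only through its square.
import numpy as np
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
logits = 1.5 * X[:, 0] + 2.0 * (X[:, 1] ** 2 - 1)
y = (logits + rng.normal(size=2000) > 0).astype(int)

logreg = LogisticRegression(max_iter=1000).fit(X, y)
xgb = XGBClassifier().fit(X, y)

print(abs(logreg.coef_[0]))       # coefficient for v2 is typically near zero
print(xgb.feature_importances_)   # xgboost typically ranks v2 highly
```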

Aside from that, impurity-based feature importances for tree models have received some criticism: they are computed on the training data and tend to inflate features with many possible split points.
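One commonly suggested alternative is permutation importance: shuffle one feature at a time on held-out data and measure the drop in score. A quick sketch with scikit-learn's implementation, again on placeholder data:

```python
# Permutation importance as an alternative to impurity-based importance;
# the data and model here are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Importance = mean drop in test-set score when one feature is shuffled,
# so it is evaluated on held-out data rather than on the training splits.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```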

Ben Reiniger