
I am currently building a decision tree with 10 features, setting max_depth = 2 in sklearn.tree.DecisionTreeClassifier.

Since only three features are explicitly used to make predictions (at depth 2 the tree has at most three internal nodes, hence at most three split features), I wondered about dropping the unneeded columns.

Counter-intuitively, removing just one of the seven variables not used in any split is enough to change the accuracy of the predictions, albeit marginally.
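For concreteness, here is a minimal sketch of what I am doing (make_classification is just a stand-in for my actual dataset, and the column indices are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Placeholder data standing in for my real dataset (10 features).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Features that actually appear in a split (tree_.feature is negative for leaves).
used = sorted(set(f for f in clf.tree_.feature if f >= 0))
print("features used in splits:", used)
print("accuracy with all 10 columns:", accuracy_score(y_test, clf.predict(X_test)))

# Drop one column that is NOT used in any split and refit on the remaining columns.
unused = [f for f in range(X.shape[1]) if f not in used]
keep = [f for f in range(X.shape[1]) if f != unused[0]]
clf2 = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train[:, keep], y_train)
print("accuracy without column", unused[0], ":",
      accuracy_score(y_test, clf2.predict(X_test[:, keep])))
```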

Looking around, I found the following:

Variables that are not used in any split can still affect the decision tree [...] It is possible for a variable to be used in a split, but the subtree that contained that split might have been pruned.

However, according to one answer on this site, max_depth does not technically prune the tree.

So what could be the reason?

user1627466

1 Answer


My guess is multicollinearity: some of the predictors are correlated, and removing one may change the relationships between the remaining predictors and the response variable. Also, when choosing the split at each node the algorithm uses a metric such as Gini impurity or information gain (see analyticsvidhya: Simple Ways to Split a Decision Tree, towardsdatascience: How Decision Trees Split Nodes). Every feature is evaluated as a candidate in this calculation, so dropping a column can change which split is selected when two correlated candidates give nearly the same improvement, even if the dropped column never appears in the final tree.
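As a rough illustration of why near-ties matter, here is a sketch that computes, for each feature, the best Gini impurity decrease a single threshold can achieve at the root. It uses make_classification with redundant features as a stand-in for your data, and the helper functions are my own, not part of sklearn:

```python
import numpy as np
from sklearn.datasets import make_classification

def gini(labels):
    """Gini impurity of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_gini_gain(x, y):
    """Best impurity decrease obtainable with a single threshold on feature x."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y)
    parent = gini(y)
    best = 0.0
    for i in range(1, n):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no valid threshold between equal values
        left, right = y_sorted[:i], y_sorted[i:]
        child = (i / n) * gini(left) + ((n - i) / n) * gini(right)
        best = max(best, parent - child)
    return best

# Toy data with redundant (correlated) predictors, standing in for the OP's dataset.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           n_redundant=4, random_state=0)

for j in range(X.shape[1]):
    print(f"feature {j}: best root Gini gain = {best_gini_gain(X[:, j], y):.4f}")
```

Correlated features tend to produce nearly identical gains here, which is why the split chosen by DecisionTreeClassifier(max_depth=2) can flip when a column is removed, and with it the downstream splits and the predictions, even though the removed column was never used in the fitted tree.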

Memristor