I was looking at a notebook someone posted for a Kaggle competition. They use LightGBM with num_leaves set to 40. If I understand correctly, that sets a limit on the size of the weak learners (decision trees) used in the boosting: no tree can have more than 40 leaves.

However, after training, the feature with the greatest feature importance turns out to be a categorical variable with 1000+ categories! If a tree ever split on that variable, wouldn't it necessarily need at least 1000+ leaves?

How is this situation handled when the number of leaves allowed on the weak learners is smaller than the number of categories in one of the variables?

Nick Koprowicz

1 Answer

There are different ways to include categorical features, and in many of them a single leaf can combine multiple categories:

  • With label, target, or frequency encoding, the categorical feature is effectively replaced by a numeric one, so a single leaf can naturally cover multiple original categories. Conversely, any numeric feature can be thought of as an ordinal categorical feature with an unbounded number of possible values.
  • With hash or binary encoding, several categories activate the same bit (or fall into the same hash bucket), so splits on that column group them together.
  • LightGBM also supports categorical features natively. If they are marked as categorical in the configuration, LightGBM considers ways to partition all the categories of a given feature into two subsets at each split, so a single split can separate many categories at once (see the sketch after this list).
  • If one-hot encoding is used, LightGBM can still combine multiple categories via the Exclusive Feature Bundling (EFB) algorithm, which is discussed in a related question.
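
As a minimal sketch of the native handling (not the notebook in question; the synthetic data, column names, and cardinality are all made up for illustration): a column with a pandas 'category' dtype is treated as categorical by LightGBM's scikit-learn API, and each split partitions its categories into two subsets, so a tree limited to num_leaves=40 never needs one leaf per category.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
n = 10_000

df = pd.DataFrame({
    # hypothetical high-cardinality categorical: ~1500 distinct values
    "big_cat": pd.Categorical(rng.integers(0, 1500, size=n).astype(str)),
    "num_feat": rng.normal(size=n),
})

# target depends on a handful of categories plus the numeric feature
important = {"3", "42", "777"}
y = (df["num_feat"]
     + df["big_cat"].astype(str).isin(important) * 2.0
     + rng.normal(scale=0.1, size=n))

# 'category'-dtype columns are picked up as categorical automatically;
# each split sends a subset of categories left and the rest right,
# so 40 leaves are enough even with 1500 categories
model = lgb.LGBMRegressor(num_leaves=40, n_estimators=100)
model.fit(df, y)

print(dict(zip(df.columns, model.feature_importances_)))
```

In a sketch like this, big_cat can still come out with the highest feature importance, because importance counts how often (or how profitably) a feature is split on, not how many leaves its categories would need individually.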

Finally, it is also possible that only a few of these 1000+ categories are actually important (either genuinely or only seemingly, due to overfitting).

Andrey Popov