I was looking at a notebook someone posted for a Kaggle competition. They use LightGBM with num_leaves set to 40. If I understand correctly, that sets a limit on the size of the weak learners (decision trees) used in the boosting: no tree can have more than 40 leaves.

However, after training, the feature with the greatest feature importance turns out to be a categorical variable with 1000+ categories! If a tree ever split on that variable, wouldn't it necessarily need at least 1000+ leaves?

How is this situation handled when the number of leaves allowed on the weak learners is smaller than the number of categories in one of the variables?

Nick Koprowicz

1 Answer

There are different ways to include categorical features, and in many of them a single leaf can combine multiple categories:

  • With label, target, or frequency encoding, the categorical feature is effectively replaced by a numeric one, so a single leaf can naturally cover multiple original categories. Conversely, any numeric feature can be thought of as an ordinal categorical feature with an unbounded number of possible values.
  • With hash or binary encoding, several categories activate the same bit (or fall into the same hash bucket), so splits on that column group them together.
  • LightGBM also supports categorical features natively. If they are marked as categorical in the configuration, LightGBM considers ways to partition all the categories of a given feature into two subsets at each split, so a single split can separate many categories at once (see the sketch after this list).
  • If one-hot encoding is used, LightGBM can still combine multiple categories via the Exclusive Feature Bundling (EFB) algorithm, which is discussed in a related question.
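
As a minimal sketch of the native handling (not the notebook in question; the synthetic data, column names, and cardinality are all made up for illustration): a column with a pandas 'category' dtype is treated as categorical by LightGBM's scikit-learn API, and each split partitions its categories into two subsets, so a tree limited to num_leaves=40 never needs one leaf per category.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
n = 10_000

df = pd.DataFrame({
    # hypothetical high-cardinality categorical: ~1500 distinct values
    "big_cat": pd.Categorical(rng.integers(0, 1500, size=n).astype(str)),
    "num_feat": rng.normal(size=n),
})

# target depends on a handful of categories plus the numeric feature
important = {"3", "42", "777"}
y = (df["num_feat"]
     + df["big_cat"].astype(str).isin(important) * 2.0
     + rng.normal(scale=0.1, size=n))

# 'category'-dtype columns are picked up as categorical automatically;
# each split sends a subset of categories left and the rest right,
# so 40 leaves are enough even with 1500 categories
model = lgb.LGBMRegressor(num_leaves=40, n_estimators=100)
model.fit(df, y)

print(dict(zip(df.columns, model.feature_importances_)))
```

In a sketch like this, big_cat can still come out with the highest feature importance, because importance counts how often (or how profitably) a feature is split on, not how many leaves its categories would need individually.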

Finally, it is also possible that only a few of these 1000+ categories are actually important (either genuinely or only seemingly, due to overfitting).

Andrey Popov