I have a binary classification task with a class imbalance problem (99% negative, 1% positive). I have built a decision tree that is carefully tuned, class-weighted, and post-pruned. Call this tree1; it has high recall and medium-high precision, so it performs well at detecting positive instances.
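For concreteness, tree1 is built roughly like this (scikit-learn; the hyperparameter values are only placeholders, and the synthetic data just mimics the 99/1 split):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for my data: ~99% negative, ~1% positive.
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# tree1: class-weighted and cost-complexity (post-)pruned.
# All parameter values below are illustrative, not my actual settings.
tree1 = DecisionTreeClassifier(
    class_weight="balanced",  # up-weight the rare positive class
    min_samples_leaf=50,      # placeholder regularization
    ccp_alpha=1e-4,           # post-pruning strength, placeholder
    random_state=0,
).fit(X_train, y_train)
```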
I wonder how I can improve its performance by incorporating ideas from ensemble methods (bagging, boosting, stacking, etc.).
One important constraint: using a large number of trees (e.g., a Random Forest with 100+ trees) is not allowed in our production environment because of the real-time serving requirement. I am looking for an incremental performance gain by adding only one or two trees at most. Is that possible?
I do know that ensemble methods usually start with a large group of weak learners (default or lightly tuned) and then take a majority vote, assuming all trees are weighted roughly equally. In my case, however, I have a fine-tuned decision tree as a "strong" base learner, so I would probably need soft voting (with tree1 weighted more heavily). Does ensembling still make sense with only two or three trees?
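To make the soft-voting idea concrete, this is roughly what I have in mind (a sketch using scikit-learn's `VotingClassifier`; tree2's settings and the 2:1 weights are placeholders I made up, not tuned values):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier

# tree1 is the carefully tuned tree from above; tree2 is a second,
# differently tuned tree (placeholder settings).
tree2 = DecisionTreeClassifier(class_weight={0: 1, 1: 5},
                               min_samples_leaf=200, random_state=1)

ensemble = VotingClassifier(
    estimators=[("tree1", tree1), ("tree2", tree2)],
    voting="soft",    # average predicted probabilities instead of hard votes
    weights=[2, 1],   # tree1 counts twice as much as tree2 (illustrative)
).fit(X_train, y_train)
```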
Let me ask from another perspective: if tree1 has high recall but low precision, how can I build a tree2 that improves precision while keeping the high recall? If tree2 is tuned for high precision and low recall, would it be possible to use ensemble learning to balance out the weaknesses of both trees and obtain a final model with both high recall and high precision?
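Something along these lines is what I imagine, where I average the two trees' positive-class probabilities and then sweep the decision threshold to see whether any operating point keeps tree1's recall while improving precision (just a sketch, assuming tree1, tree2, and the data splits from the snippets above):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Weighted average of the two trees' positive-class probabilities.
w1, w2 = 2.0, 1.0  # placeholder weights
p1 = tree1.predict_proba(X_test)[:, 1]
p2 = tree2.fit(X_train, y_train).predict_proba(X_test)[:, 1]
p = (w1 * p1 + w2 * p2) / (w1 + w2)

# Sweep the decision threshold to trade precision against recall.
for t in np.linspace(0.1, 0.9, 9):
    pred = (p >= t).astype(int)
    print(f"threshold={t:.1f}  "
          f"precision={precision_score(y_test, pred, zero_division=0):.3f}  "
          f"recall={recall_score(y_test, pred):.3f}")
```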