Decision trees, categorization and oversampling

Question

I want to create a model to predict the propensity to buy a certain product. As my proportion of 1's is very low, I decided to apply oversampling (to get 10% 1's and 90% 0's).

Now I want to discretize some of the variables. To do so, I fit a tree for each variable against the target.

Should I define the prior probabilities when I fit these trees, or does it not matter, so that I can use the oversampled dataset as it is?

Using a tree is certainly one way to discretize, i.e. to find break points. Quantiles are another option. I am not sure why you need to discretize; it is a valid thing to do, but check its effect on your accuracy. You could then build the model on the transformed dataset. I am also not sure what you mean by "prior" in this context. It might be better to subsample the majority class (the one that is easy to predict) rather than oversample the low-volume class, so as not to overfit the model. (comment posted Nov 03 '15 at 02:38)
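The two ways of finding break points mentioned in this comment could be sketched roughly as follows. This is only an illustration on synthetic data: the helper names `tree_break_points` and `quantile_break_points`, the leaf limits, and the data itself are assumptions, not anything from the question.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def tree_break_points(x, y, max_leaf_nodes=4, min_samples_leaf=50):
    """Sorted split thresholds of a one-variable tree fitted against y (hypothetical helper)."""
    tree = DecisionTreeClassifier(
        max_leaf_nodes=max_leaf_nodes,
        min_samples_leaf=min_samples_leaf,
        random_state=0,
    )
    tree.fit(np.asarray(x).reshape(-1, 1), y)
    # Leaves are marked with a negative feature index; internal nodes hold the thresholds.
    internal = tree.tree_.feature >= 0
    return np.sort(tree.tree_.threshold[internal])


def quantile_break_points(x, n_bins=4):
    """Interior quantiles that cut x into n_bins roughly equal-sized groups (hypothetical helper)."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(x, qs)


# Synthetic example: roughly 10% positives overall, more frequent when x > 70.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=5000)
y = (rng.uniform(size=5000) < np.where(x > 70, 0.25, 0.04)).astype(int)

tree_cuts = tree_break_points(x, y)
quantile_cuts = quantile_break_points(x)

# Map each value to the index of the interval it falls into.
x_binned = np.digitize(x, tree_cuts)
```

The tree chooses cuts where the target rate changes, while the quantile cuts ignore the target entirely, so the two can give quite different bins for the same variable.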

score 1 · Answer 1 · answered Jul 03 '20 at 14:25

Do you use Python? scikit-learn's DecisionTreeClassifier has a class_weight parameter for this purpose, so you do not need to adjust the class balance manually. See https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
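A minimal sketch of how class_weight might be used, on synthetic data. The data, the tree depth, and the 10:1 weight are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a rare positive class (illustrative only).
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
logit = -4.0 + 1.5 * X[:, 0]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# No reweighting: the tree sees the classes at their observed frequencies.
plain_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# "balanced" weights each class inversely to its frequency in y.
balanced_tree = DecisionTreeClassifier(
    max_depth=3, class_weight="balanced", random_state=0
).fit(X, y)

# Explicit weights per class label; the 10:1 ratio is an arbitrary example.
manual_tree = DecisionTreeClassifier(
    max_depth=3, class_weight={0: 1.0, 1: 10.0}, random_state=0
).fit(X, y)

proba_plain = plain_tree.predict_proba(X)
proba_balanced = balanced_tree.predict_proba(X)
```

As far as I know, the leaf probabilities of a weighted tree are computed from the weighted class counts, so, as with an oversampled dataset, predict_proba is then not on the scale of the original class proportions.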

As for discretizing versus encoding, it is hard to say which is better without knowing your data. Unless you are really sure that one is the better choice, you can fit the model with encoding instead of discretizing and compare the quality.
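One way such a comparison might look, as a sketch under several assumptions: synthetic data, logistic regression as a stand-in for the final propensity model, quantile bins from KBinsDiscretizer as the discretized variant, and cross-validated ROC AUC as the quality measure. None of these choices come from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Synthetic data with a rare positive class (illustrative only).
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 3))
logit = -3.0 + 1.2 * X[:, 0] - 0.8 * np.abs(X[:, 1])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Stratified folds keep the rare class present in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    # Variables kept continuous, only standardized.
    "raw": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    # Variables cut into 5 quantile bins and one-hot encoded.
    "binned": make_pipeline(
        KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile"),
        LogisticRegression(max_iter=1000),
    ),
}

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
for name, auc in scores.items():
    print(f"{name}: mean ROC AUC = {auc:.3f}")
```

Putting the discretizer inside the pipeline means the bin edges are learned on the training folds only, which keeps the comparison fair.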
