Making Use of the Target Values for Regression

Question

Problem:

I have a regression problem and decided to use Gradient Boosting Regression Trees to solve it. After all the preprocessing, I end up with around 130 attributes and 70K rows, and my cross-validated R-squared is 0.62.

Work So Far:

To increase my R-squared (and tighten my prediction intervals), I tried using a hint feature that divides the target into linear groups. I manually chose 4 groups and placed the target into 4 bins. Why 4? Not too small, not too big, and not cleverly decided.
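For concreteness, here is a minimal sketch of what such a linear (equal-width) binning might look like. The data is synthetic (a log-normal sample standing in for the right-skewed target), and the names `y`, `edges`, and `hint` are illustrative assumptions, not part of my actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed target (assumption: stands in for the real target).
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

n_bins = 4
# Equal-width ("linear") edges between the smallest and largest target value.
edges = np.linspace(y.min(), y.max(), n_bins + 1)
# Digitize against the interior edges only, so labels fall in 0..n_bins-1.
hint = np.digitize(y, edges[1:-1])

# How many rows land in each group.
counts = np.bincount(hint, minlength=n_bins)
print(counts)
```

With a right-skewed target, equal-width edges put most rows into the lowest group, which is one reason the boundaries probably should not be chosen this way.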

I am aware that this is not the usual way to go, but it is doable for my application, since I will eventually be collecting these hints from the user. (I will ask each user for a target value and then suggest another target value to him/her depending on his/her data.)

So it would be possible for me to place each user into such a group. It seems that not all users behave the same, and the ones with extremely high or low target values need separate modeling. In addition, the distribution is extremely right-skewed.

After adding this "cheating" hint feature, my cross-validated R-squared became 0.81. However, dividing the target into linear groups like that has obvious disadvantages. The model becomes step-like, with a clear separation between the groups, although this is not necessarily always the case: a user who gives a target value most probably, but not necessarily, belongs to the corresponding group. The boundary values for such binning should also not be decided manually, and the model would definitely need to be smoothed.

Next Steps?

Having seen that such an approach can be helpful (and that I could really make use of the user-given values), I am now thinking about smarter ways to separate the users based on these hints. I am considering applying hierarchical clustering and using the cluster assignment as a hint feature, but I am not sure whether this is the best way to proceed.

Would you have suggestions for such a method (papers would be great), or for a statistical approach to figure out such boundary values (other than density-based methods)?

score 1 · Answer 1 · answered Nov 08 '17 at 18:31

Clustering is definitely something that can help. As you describe, the issue is that you can see that not all instances "behave" the same way. So if you can cluster them into groups that do behave more similarly, you would probably improve.

Any clustering technique would work in theory. However, techniques like K-means tend to favour clusters of even size, so if your problem isn't "balanced", I would be cautious.

What you could also do to improve your clustering is to look at which features are more discriminative (perhaps look at the information gain) and use a weighted distance to favour these features.
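A rough sketch of that idea, under some assumptions: mutual information with the target (scikit-learn's `mutual_info_regression`) is used here as the regression analogue of information gain, and hierarchical clustering is used as the clustering step. Scaling each standardized feature by the square root of its weight makes ordinary Euclidean distance behave like a weighted distance. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data (assumption): only feature 0 actually drives the target.
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=500)

# Put all features on the same scale before weighting them.
X_std = StandardScaler().fit_transform(X)

# Score each feature by its mutual information with the target.
mi = mutual_info_regression(X_std, y, random_state=0)
weights = mi / mi.sum()

# Multiplying by sqrt(w_j) turns squared Euclidean distance into
# sum_j w_j * (a_j - b_j)^2, i.e. a weighted distance.
X_weighted = X_std * np.sqrt(weights)

# Cluster in the weighted space; the labels can serve as a hint feature.
labels = AgglomerativeClustering(n_clusters=4).fit_predict(X_weighted)
print(np.round(weights, 3), np.bincount(labels))
```

The number of clusters (4 here) is arbitrary and would still need to be chosen, for example by cross-validating the downstream regressor.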

Another approach I feel could help you is meta learning/stacking. By that I mean that you could use decision trees to split your data in a supervised way (as opposed to clustering) and, at the leaves, use a more elaborate regressor than the plain leaf average (the regression counterpart of majority voting).
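As a minimal sketch of this, assuming scikit-learn: a shallow `DecisionTreeRegressor` partitions the data, and a separate `LinearRegression` is fitted on the rows of each leaf. The synthetic data, the tree depth, and the helper `predict_with_leaf_models` are illustrative choices, not a recommendation for your dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic data (assumption): two regimes with different linear behaviour.
X = rng.uniform(-3, 3, size=(600, 2))
y = np.where(X[:, 0] > 0, 5.0 + 2.0 * X[:, 1], -5.0 - X[:, 1])
y = y + rng.normal(scale=0.1, size=600)
X_train, X_test, y_train, y_test = X[:450], X[450:], y[:450], y[450:]

# Step 1: a shallow tree splits the data in a supervised way.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=30, random_state=0)
tree.fit(X_train, y_train)
train_leaves = tree.apply(X_train)  # leaf index of every training row

# Step 2: fit one linear model per leaf instead of using the leaf mean.
leaf_models = {
    leaf: LinearRegression().fit(X_train[train_leaves == leaf],
                                 y_train[train_leaves == leaf])
    for leaf in np.unique(train_leaves)
}

def predict_with_leaf_models(X_new):
    """Route each row to its leaf and predict with that leaf's model."""
    leaves = tree.apply(X_new)
    out = np.empty(len(X_new))
    for leaf, model in leaf_models.items():
        mask = leaves == leaf
        if mask.any():
            out[mask] = model.predict(X_new[mask])
    return out

r2_tree = r2_score(y_test, tree.predict(X_test))
r2_leaf = r2_score(y_test, predict_with_leaf_models(X_test))
print(r2_tree, r2_leaf)
```

Because each leaf model is continuous within its leaf, the prediction is less step-like than the tree alone, although it can still jump at the leaf boundaries.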

Thanks for the answer. I tried some clustering methods combined with feature selection. However, it seems that whatever these clusters (built on those features) can bring is already detected by the GBR, so they won't give any additional advantage. Now I am trying to ensemble the GBR with the group feature and the GBR without it. I will write here once I see some results. — mari, Nov 10 '17 at 14:59
Okay, ensembling both made the model even more step-like and also further improved the R-squared. However, my issue with the "smoothness" is still there. — mari, Nov 14 '17 at 15:39
