Problem:
I have a regression problem and decided to use Gradient Boosting Regression Trees to solve it. After preprocessing, I have around 130 attributes and 70K rows, and my cross-validated R-squared is 0.62.
Work So Far:
To increase my R-squared (and tighten my prediction intervals), I tried adding a hint feature that divides the target into linear groups. I manually chose 4 groups and placed the target into 4 bins. Why 4? It is neither too small nor too big, and it was not chosen in any principled way.
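For concreteness, here is a minimal sketch of the kind of binned hint feature described above. It runs on synthetic data, and it assumes that "linear groups" means equal-width bins over the target range. The data, the helper name `add_target_bin_hint`, and the model settings are all illustrative assumptions, not my actual pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (hypothetical): a right-skewed target.
n_rows, n_features = 600, 5
X = rng.normal(size=(n_rows, n_features))
y = np.exp(0.8 * X[:, 0] + rng.normal(scale=0.8, size=n_rows))


def add_target_bin_hint(X, y, n_bins=4):
    """Append the equal-width bin index of the target as an extra column."""
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    # Use interior edges only, so bin indices run from 0 to n_bins - 1.
    hint = np.digitize(y, edges[1:-1])
    return np.column_stack([X, hint]), edges


# Note: the hint is derived from the target itself, which is why it "cheats".
X_hint, edges = add_target_bin_hint(X, y, n_bins=4)

model = GradientBoostingRegressor(n_estimators=50, random_state=0)
r2_plain = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
r2_hint = cross_val_score(model, X_hint, y, cv=3, scoring="r2").mean()
print(f"R^2 without hint: {r2_plain:.2f}, with hint: {r2_hint:.2f}")
```

In the real application the hint would come from the value the user reports rather than from the true target, so the gain measured this way is probably an optimistic upper bound.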
I am aware that this is not the usual way to go, but it is feasible for my application since I will eventually be collecting these hints from the user. (I will ask each user for their target value and suggest another target value based on their data.)
So it would be possible for me to place each user into such a group. Not all users seem to behave the same, and those with extremely high or low target values seem to need separate modeling. In addition, the distribution of the target is extremely right-skewed.
After adding this cheating hint feature, my cross-validated R-squared became 0.81. However, dividing the target into linear groups like this has clear disadvantages. It produces a step-like model with a hard separation between the groups, which is not always appropriate: a user who gives a target value most probably, but not necessarily, belongs to the corresponding group. The boundary values for the bins should also not be decided manually, and the model would definitely need to be smoothed.
Next Steps?
After seeing that such an approach can be helpful (and that I could really make use of the user-given values), I am now thinking about smarter ways to separate the users based on these hints. I am considering applying hierarchical clustering and using the cluster membership as a hint feature, but I am not sure this is the best way to proceed.
Would you have suggestions for such a method (papers would be great), or for a statistical approach to determining such boundary values (other than density-based methods)?
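To make the clustering idea concrete, here is a rough sketch of what I have in mind. It assumes the clustering is done on the user-given values alone, after a log transform to reduce the skew. The variable names (`user_hint`, `cluster_hint`), the choice of Ward linkage, and the 4 clusters are illustrative assumptions only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Hypothetical user-supplied hints: right-skewed, like the real target.
user_hint = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Log-transform to reduce the skew before clustering (an assumption).
z = np.log1p(user_hint).reshape(-1, 1)

n_clusters = 4
clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
labels = clusterer.fit_predict(z)

# Cluster ids are arbitrary, so relabel them by mean hint value:
# 0 is the lowest group and n_clusters - 1 is the highest.
means = [user_hint[labels == k].mean() for k in range(n_clusters)]
order = np.argsort(means)
rank_of = np.empty(n_clusters, dtype=int)
rank_of[order] = np.arange(n_clusters)
cluster_hint = rank_of[labels]  # ordinal hint feature, one value per user
```

Two caveats with this sketch: `AgglomerativeClustering` has no `predict` method, so new users would need a separate assignment rule, and without a connectivity constraint the memory cost grows roughly quadratically with the number of rows, which may be a problem at 70K rows.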