
I want to predict real estate prices using several Machine Learning algorithms. My dataset contains numerical and categorical predictors. I have already eliminated the outliers of the numerical variables. Now I'm wondering how to deal with the "outliers" (i.e., imbalanced classes) of the categorical variables, but I could not find anything on this topic. Do I have to deal with the imbalanced classes (outliers) at all, or is that only relevant for classification tasks?

Side note, if important: I encoded the categorical variables using one-hot encoding.
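
For illustration, here is a minimal sketch of my setup, assuming pandas and hypothetical column names (`neighborhood`, `sqm`, `price`) rather than my real ones:

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative, not my real ones.
df = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B", "C"],   # categorical predictor
    "sqm": [55, 80, 120, 60, 300],               # numerical predictor
    "price": [200_000, 310_000, 450_000, 180_000, 900_000],
})

# How unbalanced are the category frequencies (the "outliers" I mean)?
print(df["neighborhood"].value_counts(normalize=True))

# One-hot encoding of the categorical predictor, as in the side note.
X = pd.get_dummies(df.drop(columns="price"), columns=["neighborhood"])
y = df["price"]
```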

moby1209
  • You're not supposed to eliminate "outliers"! Regardless of whether it's regression or classification, or whether the variables are numerical or categorical. – user2974951 Jul 15 '22 at 10:24
  • @user2974951 could you explain this please? Why should I not eliminate outliers? – moby1209 Jul 15 '22 at 10:32
  • Why would you remove data in the first place? What are you trying to achieve with this? And how do you define "outliers"? – user2974951 Jul 15 '22 at 10:43
  • I define outliers as values that deviate more than three standard deviations from the mean. I want to remove them because I do not want the ML models to learn from unusual observations to improve the predictive out-of-sample performance. – moby1209 Jul 15 '22 at 10:48
  • Well, the rule that anything above 3 SD is an outlier is arbitrary... you could just as well use 2 SD, or 4, or the IQR; the point is that this is completely arbitrary. Anyway, in regards to `I want to remove them ... to improve the predictive out-of-sample performance` ... that's cheating, you are purposefully removing hard-to-predict data so that your metrics look nicer, that's not very scientific. In any case, whatever you do on your training set in regards to "outlier removal", you should definitely not do this on your test set (see the sketch after these comments). – user2974951 Jul 15 '22 at 10:54
  • [cont] If your data has some extreme values which cause massive biases, then you should consider using some robust models. – user2974951 Jul 15 '22 at 10:58
  • So you're saying my analysis is just not a suitable case for outlier elimination? Or is outlier elimination in general not desirable for Machine Learning? – moby1209 Jul 15 '22 at 11:44
  • I am saying that removing data simply because it causes problems is not a solution in general (UNLESS you have good reasons to remove them, such as erroneous measurements, i.e. incorrect data). See also [Is it appropriate to identify and remove outliers because they cause problems?](https://stats.stackexchange.com/questions/15497/is-it-appropriate-to-identify-and-remove-outliers-because-they-cause-problems/) – user2974951 Jul 15 '22 at 11:51
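
For concreteness, a minimal sketch of the 3-standard-deviation filtering discussed in the comments, applied to the training split only; the data here is a synthetic stand-in, and whether to do this at all is exactly what the comments dispute:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in practice X would be the encoded predictors and
# y the sale prices from the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.lognormal(mean=12.0, sigma=0.5, size=200)  # skewed, price-like target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The 3-standard-deviation rule, applied to the training split only;
# the test set is left untouched, as the comments insist.
z = (y_train - y_train.mean()) / y_train.std()
keep = np.abs(z) <= 3
X_train_trim, y_train_trim = X_train[keep], y_train[keep]

print(f"dropped {len(y_train) - keep.sum()} of {len(y_train)} training rows")
```

If the extreme prices are genuine rather than data errors, a robust estimator such as `sklearn.linear_model.HuberRegressor` is one example of the "robust models" the comments suggest instead of dropping rows.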

1 Answer


EDITED:

You should not remove outliers: if you do, the model you have built will not be able to give good predictions when the unseen data you feed it contains similar 'outliers'. One way to make the model generalize even when a few categories account for most of the data is to resample your data (with replacement), which is also called bootstrapping.

Bootstrapping will help the model learn from more data.
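
Here is a minimal sketch of what that could look like with scikit-learn; the data is a synthetic stand-in for your encoded predictors and prices, and bagging is one common way to combine many bootstrap samples:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

# Synthetic stand-in for the encoded predictors and prices.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = rng.lognormal(mean=12.0, sigma=0.4, size=100)

# One bootstrap sample: draw len(y) rows with replacement.
X_boot, y_boot = resample(X, y, replace=True, n_samples=len(y), random_state=0)

# Bagging repeats this resampling internally and averages the fitted models.
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))
```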

otaku