
I want to use XGBoost regression. The dataframe is conceptually similar to this table:


index   feature 1   feature 2   feature 3   encoded_1   encoded_2   encoded_3   y
0       0.213       0.542       0.125       0           0           1           0.432
1       0.495       0.114       0.234       1           0           0           0.775
2       0.521       0.323       0.887       1           0           0           0.691
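
For concreteness, here is a minimal sketch of this setup (column names follow the table; the values are illustrative only):

```python
# Toy version of the dataframe above, fed to an XGBoost regressor.
import pandas as pd
from xgboost import XGBRegressor

df = pd.DataFrame({
    "feature 1": [0.213, 0.495, 0.521],
    "feature 2": [0.542, 0.114, 0.323],
    "feature 3": [0.125, 0.234, 0.887],
    "encoded_1": [0, 1, 1],
    "encoded_2": [0, 0, 0],
    "encoded_3": [1, 0, 0],
    "y":         [0.432, 0.775, 0.691],
})

X = df.drop(columns="y")
y = df["y"]

model = XGBRegressor(objective="reg:squarederror")
model.fit(X, y)
```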

My question is: what is the influence of having imbalanced observations of the encoded features? For example, what if I have many more observations where encoded_1 is 1 compared to encoded_2 or encoded_3? Just to make it clear, I want to use regression, not classification.

If there is any material to read about this, please let me know.

Reut

2 Answers


It doesn't matter; the features are simply what the data is.

I assume you're thinking about issues related to an "imbalanced dataset", but this term refers only to imbalance in the values of the target variable (it's more commonly used in classification, but technically it's relevant to regression as well).

Features don't need to be balanced in any way; they just need to be good indicators for the target variable.
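
A quick way to see the distinction (a sketch, assuming a DataFrame `df` shaped like the one in the question):

```python
# Frequency of each one-hot category -- this is "feature imbalance",
# which by itself is not a problem for the model.
print(df[["encoded_1", "encoded_2", "encoded_3"]].sum())

# The target distribution is what the "imbalanced dataset" literature is about
# (mostly discussed for classification; for regression, look at skew instead).
print(df["y"].describe())
print(df["y"].skew())
```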

Erwan

As Erwan said, the imbalanced dataset problem is about the target variable, not the features.

But if your model favors one part of the range of your regression target, you can study the distribution of the target variable and then, depending on that distribution, apply a transformation (e.g. square root or exp) to get a more uniform output.
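
For example (a sketch, not part of the original answer): scikit-learn's TransformedTargetRegressor can fit XGBoost on a transformed target and automatically map predictions back to the original scale. The square-root transform below assumes a non-negative target; the right choice depends on what the distribution actually looks like.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from xgboost import XGBRegressor

# First inspect the target distribution (e.g. skewness, histogram)
# to decide whether a transform is warranted at all.
print(df["y"].skew())

# Fit on sqrt(y); predictions are squared back to the original scale.
model = TransformedTargetRegressor(
    regressor=XGBRegressor(objective="reg:squarederror"),
    func=np.sqrt,            # applied to y before fitting
    inverse_func=np.square,  # applied to predictions
)
model.fit(df.drop(columns="y"), df["y"])
```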

Also, underfitting can be mistakenly attributed to feature imbalance when the real issue is the representativeness of your features. You can add new features, or transformed versions of your current features, to capture non-linearity in your data.
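
A minimal sketch of that idea, with hypothetical derived columns (whether they actually help should be checked with validation):

```python
import numpy as np

# Add transformed / interaction versions of existing features as new columns.
df["feature 1 squared"] = df["feature 1"] ** 2
df["feature 1 x 2"]     = df["feature 1"] * df["feature 2"]
df["feature 3 log"]     = np.log1p(df["feature 3"])
```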

SimplyFarzad
  • Is an imbalanced target even a problem, though? https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Nov 09 '21 at 13:10
  • It depends on the model. For a well-defined model, there should be no problem. But when there is a high price for predicting a false positive in a highly imbalanced dataset, I think there needs to be a study before fitting a model. – SimplyFarzad Nov 09 '21 at 15:32