I recently ran into a problem where I want to fit a regression model on data whose target variable is about 75% zeroes, with the rest being continuous. That makes it a regression problem; however, the non-zero values also have very high variance: they can range anywhere from 1 to 105 million.

What would be an effective approach to such a problem? Because of the high variance, I keep getting regressors that fit the zeroes too closely, and as a result I get a very high MAE. I understand that in classification you can use balanced class weighting, for example in random forests, but what is the equivalent for regression problems? Does scikit-learn have anything similar?

lte__

1 Answer

Zero-inflated models (https://en.wikipedia.org/wiki/Zero-inflated_model) first predict whether an individual's response will be zero, and then, among the non-zero responses, predict the numeric value.
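
The two-step idea above can also be approximated with plain scikit-learn estimators. The following is only a rough sketch on synthetic data, not a true zero-inflated model: the data-generating process, the helper names `fit_two_part` and `predict_two_part`, and the choice of random forests with a `log1p` transform on the non-zero targets are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data (assumption): mostly zeros, heavy-tailed positive values.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
is_nonzero = rng.random(n) < 1.0 / (1.0 + np.exp(-(X[:, 0] - 1.1)))
positive = np.exp(5 + 2 * X[:, 1] + rng.normal(scale=0.5, size=n))
y = np.where(is_nonzero, positive, 0.0)


def fit_two_part(X_train, y_train):
    """Hypothetical helper: fit a zero/non-zero classifier and a regressor."""
    nonzero = y_train > 0
    # Stage 1: is the target non-zero? Balanced weights offset the many zeros.
    clf = RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ).fit(X_train, nonzero)
    # Stage 2: regress log1p(y) on the non-zero rows only, to tame the variance.
    reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(
        X_train[nonzero], np.log1p(y_train[nonzero])
    )
    return clf, reg


def predict_two_part(clf, reg, X_new):
    """Hypothetical helper: predict 0 where stage 1 says zero, else stage 2."""
    value = np.expm1(reg.predict(X_new))  # undo the log1p transform
    return np.where(clf.predict(X_new), value, 0.0)


# Simple holdout split: first 1500 rows for training, the rest for evaluation.
clf, reg = fit_two_part(X[:1500], y[:1500])
y_pred = predict_two_part(clf, reg, X[1500:])
mae = np.mean(np.abs(y_pred - y[1500:]))
print(f"holdout MAE: {mae:.1f}")
```

Back-transforming a prediction made on the log scale tends to track the median rather than the mean of the non-zero values, which is usually acceptable when the metric is MAE but would need a correction if the goal were the expected value.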

If your non-zero values could be considered count or rate data, you might use:

statsmodels.discrete.count_model.ZeroInflatedPoisson
Ethan
clementzach