First, DeviceType and DeviceInfo don't sound like naturally numeric values. If you're going to need to encode them anyway, then missingness isn't a special problem: just encode "missing" as another level (or as the all-zeros baseline). And if the non-missing values are nearly unique, they may not be very useful anyway; perhaps just the fact that a value exists at all is the informative part?
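For example, both options might look like this in pandas (the column names come from the question; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "DeviceType": ["mobile", None, "desktop", "mobile"],
    "DeviceInfo": ["Windows", "iOS Device", None, None],
})

# Option 1: treat "missing" as just another category before one-hot encoding.
encoded = pd.get_dummies(df.fillna("missing"), columns=["DeviceType", "DeviceInfo"])

# Option 2: if the non-missing values are nearly unique, keep only the
# fact that a value exists at all.
df["has_DeviceInfo"] = df["DeviceInfo"].notna().astype(int)
```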
Tree models can deal with missing values implicitly (splitting the non-missing data into two subsets, then examining which side the rows with that feature missing should go to); however, not all implementations allow for that. E.g., sklearn doesn't allow missing values at all yet (though that's being worked on); xgboost and lightgbm do what I've described above; catboost only sends missing values in one fixed direction (https://github.com/catboost/catboost/issues/588). Quinlan-family trees actually send missing rows down all possible paths, and return a weighted sum of the resulting predictions, with weights given by the proportion of the node's training data that went down each path (https://stats.stackexchange.com/a/98967/232706).
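As a concrete illustration of the implicit handling, lightgbm accepts NaN in the feature matrix directly and learns a "default direction" for the missing rows at each split (the toy data here is made up; xgboost behaves similarly when trained on NaN-containing arrays):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=500)

# Knock out ~20% of the informative feature after computing the target.
X[rng.random(500) < 0.2, 0] = np.nan

# At each split on feature 0, lightgbm decides which child the NaN rows
# should join, as described above; no explicit imputation is needed.
model = lgb.LGBMRegressor(n_estimators=50).fit(X, y)
```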
In a tree-based model, imputing with $-999$ (or any value smaller than all of your data) is the next best choice: it lets the tree split the rest of the data however it normally would, while always sending the "missing" rows to the left. You may also want to test imputing with a very large value (which instead always sends the missing rows to the right); again, see the catboost GitHub issue above.
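A minimal sketch of the constant-fill approach using sklearn's SimpleImputer (the array here is made up, and $-999$ assumes your real data never goes that low):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan]])

# Fill with a value below the observed range, so every split sends the
# formerly-missing rows to the left child; swapping in a very large
# fill_value instead tests sending them to the right.
imputer = SimpleImputer(strategy="constant", fill_value=-999)
X_imp = imputer.fit_transform(X)
```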
For a linear model, imputing with anything will distort the distribution and therefore the fit. However, adding a missingness indicator alongside the imputation (with any constant) protects the model: the coefficient on the imputed feature can fit the "real" slope, while the coefficient on the indicator stops the imputed value from pulling that slope away from its true value. An indicator variable may also help in a tree-based model, though the benefit there is less certain.
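Here's a sketch of the indicator-plus-impute setup, using sklearn's `add_indicator` option (the synthetic data and the fill value of 0 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)
X[rng.random(200) < 0.3, 0] = np.nan

# add_indicator=True appends a 0/1 "was missing" column, so the slope on
# the imputed feature can stay near its true value while the indicator's
# coefficient absorbs the offset introduced by the fill value.
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0.0, add_indicator=True),
    LinearRegression(),
)
model.fit(X, y)
```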