
I have a regression problem with a very large dataset: >50 million rows, 81 features and 1 target, all positive float values unevenly distributed between 0 and 1 million. I've trained an XGBoost model on the data and got a relatively good R^2 score of 0.7.

Around 70% of the dataset's values are missing. I've read up on how XGBoost handles missing values: rather than imputing them, its sparsity-aware split finding learns a default direction at each split and sends missing values down whichever branch minimizes the training loss.
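For context, this is roughly how I'm training (a minimal sketch, not my actual pipeline; the file path, column names and hyperparameters are placeholders):

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# df has 81 feature columns plus a "target" column; ~70% of the
# feature cells are NaN. XGBoost consumes the NaNs directly: at each
# split it learns a default direction for missing values, so there is
# no explicit imputation step anywhere in this pipeline.
df = pd.read_parquet("data.parquet")  # placeholder path
X = df.drop(columns=["target"])
y = df["target"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = xgb.XGBRegressor(
    n_estimators=500,      # placeholder hyperparameters
    max_depth=8,
    learning_rate=0.05,
    tree_method="hist",
)
model.fit(X_train, y_train)

print("R^2:", r2_score(y_val, model.predict(X_val)))  # ~0.7 in my case
```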

However, I can't find anything on how to quantify the error that arises from this treatment of missing values.

The question I want to answer is: given an unseen row where all features except one are missing, how can I quantify the uncertainty/error in the resulting prediction from the model? Surely the prediction will be less "accurate" than if all features were provided.
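Concretely, this is the kind of prediction I mean (continuing from the sketch above; "feature_7" and the value 1234.5 are just placeholders):

```python
import numpy as np
import pandas as pd

# A single unseen row where only one feature is observed and the other
# 80 are NaN. XGBoost still returns a point prediction, routing the
# NaNs down each tree's learned default branches.
row = pd.DataFrame([{c: np.nan for c in X.columns}])
row.loc[0, "feature_7"] = 1234.5  # the one observed value (placeholder)

point_pred = model.predict(row)[0]
print("prediction with one feature present:", point_pred)
```

I could mask a held-out set the same way and compare R^2 scores, but that only gives me an aggregate degradation rather than an uncertainty attached to an individual prediction, which is what I'm after.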

Thanks in advance.

