
I have a regression problem with a very large dataset: >50 million rows, 81 features and 1 target, all positive float values unevenly distributed between 0 and 1 million. I've trained an XGBoost model on the data and got a relatively good R^2 score of 0.7.

Around 70% of the dataset's values are missing. I've read up on how XGBoost handles missing values: rather than imputing them, its sparsity-aware split finding learns a default direction at each split and sends missing values down whichever branch minimizes the training loss.
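For context, this is roughly how I'm training (a minimal sketch, not my actual pipeline; the file path, column names and hyperparameters are placeholders):

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# df has 81 feature columns plus a "target" column; ~70% of the
# feature cells are NaN. XGBoost consumes the NaNs directly: at each
# split it learns a default direction for missing values, so there is
# no explicit imputation step anywhere in this pipeline.
df = pd.read_parquet("data.parquet")  # placeholder path
X = df.drop(columns=["target"])
y = df["target"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = xgb.XGBRegressor(
    n_estimators=500,      # placeholder hyperparameters
    max_depth=8,
    learning_rate=0.05,
    tree_method="hist",
)
model.fit(X_train, y_train)

print("R^2:", r2_score(y_val, model.predict(X_val)))  # ~0.7 in my case
```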

However, I can't find anything on how to quantify the error that arises from this treatment of missing values.

The question I want to answer is: given an unseen row where all features except one are missing, how can I quantify the uncertainty/error in the resulting prediction from the model? Surely the prediction will be less "accurate" than if all features were provided.
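Concretely, this is the kind of prediction I mean (continuing from the sketch above; "feature_7" and the value 1234.5 are just placeholders):

```python
import numpy as np
import pandas as pd

# A single unseen row where only one feature is observed and the other
# 80 are NaN. XGBoost still returns a point prediction, routing the
# NaNs down each tree's learned default branches.
row = pd.DataFrame([{c: np.nan for c in X.columns}])
row.loc[0, "feature_7"] = 1234.5  # the one observed value (placeholder)

point_pred = model.predict(row)[0]
print("prediction with one feature present:", point_pred)
```

I could mask a held-out set the same way and compare R^2 scores, but that only gives me an aggregate degradation rather than an uncertainty attached to an individual prediction, which is what I'm after.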

Thanks in advance.

