I know this is a fairly broad question, but I have scoured both this forum and the internet in general without finding anything on this particular situation. Imagine I have a trained model for which, although the data were initially incomplete and messy, I took steps to make them compliant with the model's requirements (outliers removed where appropriate, de-skewed if necessary, normalized if necessary, null values imputed appropriately). All of this was done within a cross-validation framework. It works absolutely fine when tuning the model, but I run into problems when I try to make a single prediction (meaning I have a single "test" record - think of a web service with some fields that can be null). Null values generally need a dataset to refer to for imputation, and the same holds for the normalization and outlier procedures.
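To make the setup concrete, here is a minimal sketch of what I mean, assuming scikit-learn (the toy data, column count, and missing-value rate are made up for illustration): tuning with preprocessing inside the CV loop works fine, and the question is what the imputer/scaler should refer to when only one record arrives.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy training data with ~10% missing values (stand-in for my real set)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
X_train[rng.random((100, 3)) < 0.1] = np.nan
y_train = rng.integers(0, 2, size=100)

# Tuning works fine: imputation and scaling happen inside each CV fold,
# so they always have a training portion to compute their statistics from
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
scores = cross_val_score(pipe, X_train, y_train, cv=5)

# At serving time I receive a single record, possibly with nulls --
# on its own it gives the imputer/scaler no statistics to work with
single_record = np.array([[0.5, np.nan, -1.2]])
```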
Initially I thought about attaching such a "test" record to a portion of the "train" dataset so that these problems would be resolved, but then other issues arise: how would I choose that portion? If I used the most recent data, would I introduce some bias? And using the whole dataset is impractical, as well as potentially unfeasible, when dealing with "big" data.
Do you happen to know whether there are best practices on this topic, or could you point me to the themes/keywords that deal with these issues?
P.S.: regarding the relevance of the problem, the null values will most likely remain (to keep the user experience smooth, the web application cannot force users to fill in those fields beforehand).