I know this is a fairly broad question, but I have scoured both this forum and the internet in general without finding anything on this particular situation. Imagine I have a trained model for which, although the data were initially incomplete and messy, I took steps to make them compliant with the model's requirements (outliers removed where appropriate, de-skewed if necessary, normalized if necessary, null values imputed appropriately). All of this was done within a cross-validation framework. It works absolutely fine when tuning the model, but I run into problems when I try to make a single prediction (meaning I have a single "test" record - think of a web service with some fields that can be null). Null values generally need a dataset to refer to for imputation, and the same holds for the normalization and outlier procedures.
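To make the setup concrete, here is a minimal sketch of what I mean, assuming scikit-learn (the toy data, column count, and missing-value rate are made up for illustration): tuning with preprocessing inside the CV loop works fine, and the question is what the imputer/scaler should refer to when only one record arrives.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy training data with ~10% missing values (stand-in for my real set)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
X_train[rng.random((100, 3)) < 0.1] = np.nan
y_train = rng.integers(0, 2, size=100)

# Tuning works fine: imputation and scaling happen inside each CV fold,
# so they always have a training portion to compute their statistics from
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
scores = cross_val_score(pipe, X_train, y_train, cv=5)

# At serving time I receive a single record, possibly with nulls --
# on its own it gives the imputer/scaler no statistics to work with
single_record = np.array([[0.5, np.nan, -1.2]])
```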
Initially I thought about attaching such a "test" record to a portion of the "train" dataset so that these problems would be resolved, but then other issues arise: how would I choose that portion? If I used the most recent data, would I introduce some bias? And using the whole dataset is impractical, as well as potentially unfeasible, when dealing with "big" data.
Do you happen to know whether there are best practices on this topic, or could you point me to the themes/keywords that deal with these issues?
P.S.: regarding the relevance of the problem, the null values will most likely remain (to keep the user experience smooth, the web application cannot force users to fill in those fields beforehand).