I have some data X on which I want to do the following:
- Train two models: an SVM and a logistic regression.
- Train a stacking classifier on top of the models from (1).
- Calibrate the stacker from (2).
The stacking classifier should be trained on data the base models have not seen, i.e. we could have X_train and X_stack, where the models in (1) are trained on X_train and their predictions on X_stack are used to train (2).
Then we want to calibrate, so we need yet another dataset, X_cal. As you can see, a lot of data is now lost to the (crucial) training step, since each stage needs its own split.
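To make the splits concrete, here is a minimal sketch of the three-way split I have in mind (the 60/20/20 proportions and the variable names are just placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data standing in for X and y.
X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)

# First carve off the calibration set, then split the remainder into
# base-model training data vs. stacker training data.
X_rest, X_cal, y_rest, y_cal = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_stack, y_train, y_stack = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)
# Result: 60% X_train, 20% X_stack, 20% X_cal
```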
I'm thinking about using cross-validation for this, but I'm not sure how exactly it should be done. Note that my models take a fairly long time to train, so I was hoping to train the two models only once on X_train and then have a single dataset X_other_stuff for all the other "stuff", e.g. training the stacking classifier (another logistic regression) and training the calibrator.
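The "train once, reuse" idea above could look roughly like this (all names are placeholders, and note that reusing the same held-out set for both the stacker and the calibrator may leak information between those two steps):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 5)), rng.integers(0, 2, 200)
X_other, y_other = rng.random((100, 5)), rng.integers(0, 2, 100)

# (1) Fit each expensive base model exactly once.
svm = SVC(probability=True, random_state=0).fit(X_train, y_train)
lr = LogisticRegression().fit(X_train, y_train)

# (2) Meta-features for the stacker: the base models' predicted
# probabilities on data they were NOT trained on.
meta = np.column_stack([
    svm.predict_proba(X_other)[:, 1],
    lr.predict_proba(X_other)[:, 1],
])

# Train the stacker (another logistic regression) on the meta-features.
stacker = LogisticRegression().fit(meta, y_other)
```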
I know sklearn provides a calibrator and a stacker that can do this via CV, but as mentioned my models take quite some time to train; a 5-fold scheme (i.e. training each model at least 5 times) would simply be too time-consuming, so I'm trying to write it myself.
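For what it's worth, sklearn can reuse already-fitted estimators without the internal CV refits: CalibratedClassifierCV accepts cv="prefit", and I believe StackingClassifier also accepts cv="prefit" in recent versions (worth checking against your installed version; newer releases may prefer wrapping the model in FrozenEstimator instead). A sketch of the calibration half, with placeholder data:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_fit, y_fit = rng.random((200, 4)), rng.integers(0, 2, 200)
X_cal, y_cal = rng.random((100, 4)), rng.integers(0, 2, 100)

model = LogisticRegression().fit(X_fit, y_fit)           # fitted once
calibrated = CalibratedClassifierCV(model, cv="prefit")  # does not refit `model`
calibrated.fit(X_cal, y_cal)                             # fits only the calibrator
probs = calibrated.predict_proba(X_cal)
```

This avoids retraining the expensive model entirely: fit() here only learns the calibration mapping on X_cal.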