
The question is pretty simple.

In stacking, the predictions of level 0 models are being used as features to train a level 1 model.

However, predictions on what data? Intuitively, it makes more sense to me to predict the test set and use those results to train the final classifier.

I am not sure whether this results in data leakage. I don't think it does, since the final classifier only has the information that the initial ones do (i.e., only from the training data); it doesn't know whether those predictions are good or not.

Is this reasoning correct?


2 Answers


I'm not sure if there's any standard about this, but I usually proceed by splitting the training set into two parts A and B:

  • A is used as training set for level 0 models
  • B is used as test set for the level 0 models and as training set for the level 1 model.

As usual, the final test set, made of fresh instances, is used to evaluate the full stacked model (the level 0 models plus the level 1 model).
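For concreteness, here is a minimal sketch of this setup with scikit-learn (the data, models and split sizes are just placeholders, not a recommendation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data and a held-out test set of fresh instances
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the training set into part A (level 0 training) and part B (level 1 training)
X_a, X_b, y_a, y_b = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

# Level 0 models are fit on part A only
level0 = [RandomForestClassifier(random_state=0), SVC(probability=True, random_state=0)]
for model in level0:
    model.fit(X_a, y_a)

# Their predictions on part B become the training features of the level 1 model
meta_train = np.column_stack([m.predict_proba(X_b)[:, 1] for m in level0])
level1 = LogisticRegression().fit(meta_train, y_b)

# The stacked model is evaluated on the untouched test set
meta_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in level0])
accuracy = level1.score(meta_test, y_test)
```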

[added] You're right that there would be data leakage if one were using the same data for training and testing the level 0 models. This would be especially bad: the level 1 model would expect 'very good' level 0 predictions (since those instances were seen during training), but the 'production' level 0 predictions would not be as good, so the level 1 model would be completely overfit.

One can also use nested cross-validation to the same effect.

  • I am afraid that my dataset is not very big in terms of samples, so if I further split the training set the samples would be too few. That is why I was wondering if I could avoid it. CV might be better with little data, but it would add significant overhead, something that I care about. – liakoyras Sep 26 '22 at 11:01
  • 2
    @liakoyras I understand the problem, but I can't think of any good solution... good luck anyway! – Erwan Sep 26 '22 at 14:30

Since my dataset was not very big, I didn't want to split it further (as @Erwan suggested), so I ended up doing the same thing sklearn does:

  • Train the level 0 classifiers on the entire training data
  • Extract cross-validated predictions from them (using sklearn.model_selection.cross_val_predict), so that the predictions are more robust than simply predicting on all of the training data
  • Use those predictions to train the level 1 classifier

Then, for the test data, extract the predictions of the level 0 classifiers (without cross-validation, of course, since we don't want to learn anything from the test dataset) and use those as inputs to the level 1 classifier to get the final predictions.
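Roughly, a sketch of what this looks like with scikit-learn (the data and models here are only placeholders, not the ones from my actual problem):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Placeholder data and train/test split
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

level0 = [RandomForestClassifier(random_state=0), GradientBoostingClassifier(random_state=0)]

# Out-of-fold predictions on the whole training set become the level 1 training features
meta_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in level0
])

# The level 0 classifiers themselves are fit on the entire training data
for m in level0:
    m.fit(X_train, y_train)

level1 = LogisticRegression().fit(meta_train, y_train)

# Test time: plain level 0 predictions (no cross-validation), fed to the level 1 classifier
meta_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in level0])
final_predictions = level1.predict(meta_test)
```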
