
The question is pretty simple.

In stacking, the predictions of level 0 models are being used as features to train a level 1 model.

However, predictions on what data? Intuitively, it makes more sense to me to predict the test set and use those results to train the final classifier.

I am not sure whether this results in data leakage. I don't think it does, since the final classifier only has the information that the initial ones do (i.e., only from the training data); it doesn't know whether those predictions are good or not.

Is this reasoning correct?


2 Answers


I'm not sure if there's any standard about this, but I usually proceed by splitting the training set into two parts A and B:

  • A is used as training set for level 0 models
  • B is used as test set for the level 0 models and as training set for the level 1 model.

As usual, the final test set, made of fresh instances, is used to evaluate the full stacked model (the level 0 models plus the level 1 model).
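For concreteness, here is a minimal sketch of this setup with scikit-learn (the data, models and split sizes are just placeholders, not a recommendation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data and a held-out test set of fresh instances
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the training set into part A (level 0 training) and part B (level 1 training)
X_a, X_b, y_a, y_b = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

# Level 0 models are fit on part A only
level0 = [RandomForestClassifier(random_state=0), SVC(probability=True, random_state=0)]
for model in level0:
    model.fit(X_a, y_a)

# Their predictions on part B become the training features of the level 1 model
meta_train = np.column_stack([m.predict_proba(X_b)[:, 1] for m in level0])
level1 = LogisticRegression().fit(meta_train, y_b)

# The stacked model is evaluated on the untouched test set
meta_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in level0])
accuracy = level1.score(meta_test, y_test)
```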

[added] You're right that there would be data leakage if one were using the same data for training and testing the level 0 models. This would be especially bad: the level 1 model would expect 'very good' level 0 predictions (since those instances were seen during training), but the 'production' level 0 predictions would not be as good, so the level 1 model would be completely overfit.

One can also use nested cross-validation to the same effect.

  • I am afraid that my dataset is not very big in terms of samples, so if I further split the training set the samples would be too few. That is why I was wondering if I could avoid it. CV might be better with little data, but it would add significant overhead, something that I care about. – liakoyras Sep 26 '22 at 11:01
  • 2
    @liakoyras I understand the problem, but I can't think of any good solution... good luck anyway! – Erwan Sep 26 '22 at 14:30

Since my dataset was not very big, I didn't want to split it further (as @Erwan suggested), so I ended up doing the same thing sklearn does:

  • Train the level 0 classifiers on the entire training data
  • Extract cross-validated predictions from them (using sklearn.model_selection.cross_val_predict), so that the predictions are more robust than simply predicting on all of the training data
  • Use those predictions to train the level 1 classifier

Then, for the test data, extract the predictions of the level 0 classifiers (without cross-validation, of course, since we don't want to learn anything from the test dataset) and use those as inputs to the level 1 classifier to get the final predictions.
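Roughly, a sketch of what this looks like with scikit-learn (the data and models here are only placeholders, not the ones from my actual problem):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Placeholder data and train/test split
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

level0 = [RandomForestClassifier(random_state=0), GradientBoostingClassifier(random_state=0)]

# Out-of-fold predictions on the whole training set become the level 1 training features
meta_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in level0
])

# The level 0 classifiers themselves are fit on the entire training data
for m in level0:
    m.fit(X_train, y_train)

level1 = LogisticRegression().fit(meta_train, y_train)

# Test time: plain level 0 predictions (no cross-validation), fed to the level 1 classifier
meta_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in level0])
final_predictions = level1.predict(meta_test)
```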
