I have two DataFrames: x_train with features obtained from base models via cross-validation, and y_train with the ground-truth labels for those features.
x_train
f1 f2 f3 f4 f5
0 False False False True True
1 True False False False True
2 False False False True False
3 False True True False True
4 False True False False True
y_train
t1 t2 t3 t4 t5
0 False False True True True
1 False False False True True
2 False False False False True
3 False True True False False
4 True True False False True
Question: how do I train a Random Forest on this data? We cannot pass the data in this form, because then we only get this:
clf.predict(x_train)
array([[False, False, False, True, True],
[ True, False, False, False, True],
[False, False, False, True, False],
...,
[False, False, False, True, True],
[ True, False, True, False, False],
[False, False, False, False, True]])
But what we want is, obviously, a single 1-D array of predictions.
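Here is a minimal sketch of what happens (assuming clf is scikit-learn's RandomForestClassifier and x_train / y_train are the DataFrames shown above):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(x_train, y_train)          # y_train has 5 columns, so this becomes a multi-output forest
print(clf.predict(x_train).shape)  # (n_samples, 5) -- a 2-D array, not the 1-D one we want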
A little explanation. First we have 5 models trained on 5 folds. For example, the first model is trained on tabular data x1[param1, param2, ..., param10] with ground-truth labels t1. Then each model makes predictions on its out-of-fold (OOF) set (the features, f) and we store them in a table, one column per model. Then we combine these into one big table. Below is how we do this.
model.fit(x1, t1)            # train the base model on fold 1
f1 = model.predict(x1_OOF)   # predictions on its out-of-fold set
We do this 5 times to get all the f's and t's.
x_train = pd.concat([f1, f2, f3, f4, f5], axis=1)
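Roughly, the whole procedure looks like this (a sketch only: X and y are toy stand-ins for the original features and labels, and LogisticRegression is just a placeholder for my base learners):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression  # placeholder for the base models

# toy stand-ins for the original data: 25 rows, params param1..param10, boolean labels
X = pd.DataFrame(np.random.rand(25, 10), columns=[f'param{j}' for j in range(1, 11)])
y = pd.Series(np.random.rand(25) > 0.5)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
f_cols, t_cols = [], []
for i, (train_idx, oof_idx) in enumerate(kf.split(X), start=1):
    model = LogisticRegression()
    model.fit(X.iloc[train_idx], y.iloc[train_idx])                  # train base model on the fold
    preds = pd.Series(model.predict(X.iloc[oof_idx]), name=f'f{i}')  # OOF predictions -> column fi
    labels = y.iloc[oof_idx].reset_index(drop=True).rename(f't{i}')  # matching OOF labels -> column ti
    f_cols.append(preds)
    t_cols.append(labels)

# the folds are equal-sized, so the per-fold columns line up side by side
x_train = pd.concat(f_cols, axis=1)   # columns f1..f5, as in the table above
y_train = pd.concat(t_cols, axis=1)   # columns t1..t5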
t1 is False, False, False, False, True because those are the labels that belong to the features f1 from base learning. If the OOF data were the same for every x, then yes, y_train could be a 1-D array, but each x has its own x_OOF, and we need to train the meta-model on all of them to prevent overfitting. I tried converting it with reshape, but then the number of samples is, obviously, inconsistent.
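For reference, this is roughly the reshape I tried (a sketch): flattening y_train produces 5 times as many labels as x_train has rows, so fitting fails.

flat_labels = y_train.values.reshape(-1)   # shape (5 * n_samples,)
clf.fit(x_train, flat_labels)              # fails: x_train still has n_samples rows,
                                           # so sklearn complains about inconsistent numbers of samples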