I have two DataFrames: x_train with features obtained from base models via cross-validation, and y_train with the ground-truth labels for those features.
x_train
f1 f2 f3 f4 f5
0 False False False True True
1 True False False False True
2 False False False True False
3 False True True False True
4 False True False False True
y_train
t1 t2 t3 t4 t5
0 False False True True True
1 False False False True True
2 False False False False True
3 False True True False False
4 True True False False True
Question: how do I train a Random Forest on this data? We cannot pass the data in this form, because then we only get this:
clf.predict(x_train)
array([[False, False, False, True, True],
[ True, False, False, False, True],
[False, False, False, True, False],
...,
[False, False, False, True, True],
[ True, False, True, False, False],
[False, False, False, False, True]])
But what we want is, obviously, a single 1-D array of predictions.
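Here is a minimal sketch of what happens (assuming clf is scikit-learn's RandomForestClassifier and x_train / y_train are the DataFrames shown above):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(x_train, y_train)          # y_train has 5 columns, so this becomes a multi-output forest
print(clf.predict(x_train).shape)  # (n_samples, 5) -- a 2-D array, not the 1-D one we want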
A little explanation. First we have 5 models trained on 5 folds. For example, the first model is trained on tabular data x1[param1, param2, ..., param10] with ground-truth labels t1. Then each model makes predictions on its out-of-fold (OOF) set (the features, f) and we store them in a table, one column per model. Then we combine these into one big table. Below is how we do this.
model.fit(x1, t1)            # train the base model on fold 1
f1 = model.predict(x1_OOF)   # predictions on its out-of-fold set
We do this 5 times to get all the f's and t's.
x_train = pd.concat([f1, f2, f3, f4, f5], axis=1)
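Roughly, the whole procedure looks like this (a sketch only: X and y are toy stand-ins for the original features and labels, and LogisticRegression is just a placeholder for my base learners):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression  # placeholder for the base models

# toy stand-ins for the original data: 25 rows, params param1..param10, boolean labels
X = pd.DataFrame(np.random.rand(25, 10), columns=[f'param{j}' for j in range(1, 11)])
y = pd.Series(np.random.rand(25) > 0.5)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
f_cols, t_cols = [], []
for i, (train_idx, oof_idx) in enumerate(kf.split(X), start=1):
    model = LogisticRegression()
    model.fit(X.iloc[train_idx], y.iloc[train_idx])                  # train base model on the fold
    preds = pd.Series(model.predict(X.iloc[oof_idx]), name=f'f{i}')  # OOF predictions -> column fi
    labels = y.iloc[oof_idx].reset_index(drop=True).rename(f't{i}')  # matching OOF labels -> column ti
    f_cols.append(preds)
    t_cols.append(labels)

# the folds are equal-sized, so the per-fold columns line up side by side
x_train = pd.concat(f_cols, axis=1)   # columns f1..f5, as in the table above
y_train = pd.concat(t_cols, axis=1)   # columns t1..t5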
t1 is False, False, False, False, True because those are the labels that belong to the features f1 from base learning. If the OOF data were the same for every x, then yes, y_train could be a 1-D array, but each x has its own x_OOF, and we need to train the meta-model on all of them to prevent overfitting. I tried converting it with reshape, but then the number of samples is, obviously, inconsistent.
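For reference, this is roughly the reshape I tried (a sketch): flattening y_train produces 5 times as many labels as x_train has rows, so fitting fails.

flat_labels = y_train.values.reshape(-1)   # shape (5 * n_samples,)
clf.fit(x_train, flat_labels)              # fails: x_train still has n_samples rows,
                                           # so sklearn complains about inconsistent numbers of samples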