With stacking, several (diverse) base learners are used to predict the dependent variable, $\hat{y}_{b,m} = \beta_{b,m} X$, on a hold-out set, where $m = 1, \dots, n$ indexes the base learner models. In a second step, these predictions are used as explanatory variable(s) in a meta learner, $y = \beta_1 X + \beta_2 \hat{y}_b + u$.

I wonder how to best treat $\hat{y}_{b,m}$ in practice. There are basically two options:

  • Use each base learner's prediction $\hat{y}_{b,m}$ as a separate feature (column) in the meta learner model.
  • Take the row mean over the different base learners' predictions, $\frac{1}{n} \sum_{m=1}^{n} \hat{y}_{b,m}$, and use it as a single feature (column) in the meta learner (both options are sketched in code after this list).
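For concreteness, here is a minimal sketch of the two options. Everything concrete in it is my assumption rather than part of the question: scikit-learn, a synthetic problem from `make_regression`, three illustrative base learners, and 5-fold `cross_val_predict` to generate the hold-out predictions.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_predict
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

    # Synthetic data as a stand-in for the real problem.
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    base_learners = [
        LinearRegression(),
        RandomForestRegressor(n_estimators=100, random_state=0),
        GradientBoostingRegressor(random_state=0),
    ]

    # Out-of-fold predictions, so the meta learner never sees fitted values
    # produced on the same observations a base learner was trained on
    # (the usual stacking safeguard).
    oof = np.column_stack(
        [cross_val_predict(m, X, y, cv=5) for m in base_learners]
    )

    # Option 1: each base learner's prediction as a separate feature.
    Z_separate = oof                          # shape (n_samples, n_base_learners)

    # Option 2: the row mean as a single feature.
    Z_mean = oof.mean(axis=1, keepdims=True)  # shape (n_samples, 1)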

My intuition is that both approaches might work, depending on the choice of meta learner. For example, when the meta learner uses shrinkage (e.g. ridge regression), this may help to "shrink" the coefficients on the "not so useful" $\hat{y}_{b,m}$ when each base learner's predictions are kept as a separate feature (although correlation between the predictions might be an issue in linear models). A similar logic might apply to meta learners such as boosted trees, for which correlation is not a big issue. Keeping each prediction as a separate feature may also provide more information (variation in the data) that the meta learner can exploit.
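To illustrate the shrinkage point, a sketch continuing from the snippet above (RidgeCV and the alpha grid are my choices, not something given in the question): a ridge meta learner fitted on the separate-feature version gets one coefficient per base learner, and stronger regularisation can pull the weights on less useful prediction columns towards zero.

    from sklearn.linear_model import RidgeCV

    # Ridge meta learner on Option 1. The original X could be appended via
    # np.hstack([X, Z_separate]) to mirror the formula y = b1*X + b2*yhat + u.
    meta = RidgeCV(alphas=np.logspace(-3, 3, 13))
    meta.fit(Z_separate, y)
    print(meta.coef_)  # one weight per base learner's prediction column

Note that Option 2 hard-codes equal weights of $1/n$ for all base learners, so it is the special case the ridge fit would recover only if every base learner deserved the same weight.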

Nevertheless, averaging seems to be used quite often (if I'm not mistaken) to create a single feature from the different base learners' predictions. I can't really pin down which is the best approach here.

Are there any insights, theory-based or from practical experience, which help to decide what the best approach is?

Peter
  • I've never actually seen the second approach (except when the meta-learner is trivial, i.e. the actual ensemble is just the average of the base models' predictions). Can you provide a reference? – Ben Reiniger Sep 10 '20 at 20:44
  • Actually, I can't. By reading several online sources I was under the impression that averaging takes place. But I did not quite understand why. – Peter Sep 11 '20 at 10:53
