
I have a regression problem where I need to predict three dependent variables ($y$) based on a set of independent variables ($x$): $$ y_k = \beta_{k,0} + \beta_{k,1} x_1 + \beta_{k,2} x_2 + \dots + \beta_{k,n} x_n + u_k, \qquad k = 1, 2, 3. $$

To solve this problem, I would prefer to use tree-based models (e.g. gradient boosting or random forests), since the independent variables ($x$) are correlated and the problem is non-linear with an ex-ante unknown parameterization.

I know that I could use sklearn's MultiOutputRegressor() or RegressorChain() as a meta-estimator.
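For concreteness, here is a minimal sketch of both meta-estimators on synthetic data (the data, names, and shapes are purely illustrative, not from my actual problem):

```python
# A minimal sketch of both meta-estimators; data and names are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor, RegressorChain

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))    # columns: x_1, ..., x_5
Y = rng.normal(size=(500, 3))    # columns: y_1, y_2, y_3

# One independent GradientBoostingRegressor per target.
multi = MultiOutputRegressor(GradientBoostingRegressor()).fit(X, Y)

# Targets fitted in order; each prediction is fed to the next model as a feature.
chain = RegressorChain(GradientBoostingRegressor(), order=[0, 1, 2]).fit(X, Y)

Y_hat = chain.predict(X)         # shape (500, 3); no constraint enforced
```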

However, there is an additional twist to my problem, namely that I do know that $y_1 + y_2 - y_3 = x_1$.

In other words, there is a fixed relation between the three $y$ and one of the independent variables. So essentially, the value of $x_1$ needs to be "distributed" in a first-best manner to the (unknown) targets $(y_1,y_2,y_3)$ for each observation, contingent on the remaining independent variables $x_2,\dots,x_n$.

Of course, a naive approach would be to squeeze the predicted values together somehow, so as to satisfy $\hat{y_1} + \hat{y_2} - \hat{y_3} = x_1$. However, I wonder if there are any other options to introduce a "hard constraint" such as $\hat{y_1} + \hat{y_2} - \hat{y_3} = x_1$ to some (tree-based) estimator.
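To make the naive approach concrete: the simplest such squeeze is the orthogonal (least-squares) projection onto the constraint hyperplane. With $a = (1, 1, -1)$, replace $\hat{y}$ by $\hat{y} - \frac{a^\top \hat{y} - x_1}{a^\top a}\,a$, which is the smallest Euclidean adjustment that enforces the constraint exactly. A minimal sketch (names illustrative):

```python
import numpy as np

def project_onto_constraint(Y_hat, x1):
    """Orthogonally project each row of Y_hat onto a @ y = x1, a = (1, 1, -1).

    This is the closest (least-squares) adjustment that enforces
    y1 + y2 - y3 = x1 exactly for every observation.
    """
    a = np.array([1.0, 1.0, -1.0])
    residual = Y_hat @ a - x1                  # constraint violation per row
    return Y_hat - np.outer(residual / (a @ a), a)

# Example: two raw prediction rows and the matching x_1 values.
Y_hat = np.array([[1.0, 2.0, 0.5], [0.0, 1.0, 1.0]])
x1 = np.array([2.0, 0.5])
Y_adj = project_onto_constraint(Y_hat, x1)
assert np.allclose(Y_adj @ np.array([1.0, 1.0, -1.0]), x1)
```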

I noticed this post. However, I would prefer a tree-based method for the reasons mentioned above.

Peter
  • Is the relationship exact in the training data, or noisy? – Ben Reiniger Aug 24 '21 at 13:32
  • The relation $y_1+y_2-y_3=x_1$ is almost exact (a few minor residual values), while the effect of the remaining $x$ on $y$, in the sense of $y_1,y_2,y_3(x_2,...,x_n)$, is rather noisy. – Peter Aug 24 '21 at 14:33
  • I'd need to think through it some more before upgrading this to an answer, but some things to think about: (1) model just $y_1, y_2$ and then predict $\hat{y}_3 = \hat{y}_1 + \hat{y}_2 - x_1$ (a sketch of this appears after these comments). (2) `RegressorChain`, with the $y$s in order, will do essentially that but with some flexibility to change the $\hat{y}_3$. (3) Trees already do multi-output regression in a single tree, so `MultiOutputRegressor` shouldn't be needed. (4) If the relationship were just in the $y$s, that would be captured automatically by trees, since the leaf values are averages and the relationship is linear. – Ben Reiniger Aug 25 '21 at 14:40
  • Thanks for your comment: I'm currently using `RegressorChain()` and I wonder if I would benefit from using option (1) in a stacking process where each of the $y_i$ is determined as a "residual" for parts of the data. So, first stage: estimate two of the $\hat{y}_i$ in a chain and determine the last $\hat{\hat{y}}_i$ as the residual to $x_1$. Second stage: use the $\hat{\hat{y}}_i$ in a further modeling step to see if this information helps to reduce MSE and MAE and ensures that $\hat{y}_1+\hat{y}_2-\hat{y}_3=x_1$ is met. Do you think something like this could work? – Peter Aug 26 '21 at 09:52
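A minimal, self-contained sketch of option (1) from the comments above: fit only $y_1$ and $y_2$, then derive $\hat{y}_3$ from the identity so the constraint holds by construction (data and names are illustrative):

```python
# Sketch of option (1) from the comments: model only y1 and y2, then set
# y3_hat = y1_hat + y2_hat - x1 so the constraint holds exactly.
# Data and names are synthetic and purely illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import RegressorChain

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                      # columns: x_1, ..., x_5
Y = rng.normal(size=(500, 3))                      # columns: y_1, y_2, y_3

chain12 = RegressorChain(GradientBoostingRegressor()).fit(X, Y[:, :2])
Y12_hat = chain12.predict(X)                       # predictions for y1, y2
y3_hat = Y12_hat[:, 0] + Y12_hat[:, 1] - X[:, 0]   # y3 as residual to x_1
```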

1 Answer


There does not seem to be anything out of the box that is ready for this, but I found an example of someone doing something similar to what you want to do with a random forest. Here is the link: http://astrohackweek.org/blog/multi-output-random-forests.html
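For reference, scikit-learn's `RandomForestRegressor` already accepts a 2-D target directly, so a single forest can predict all three $y$ at once, as in the linked post (no constraint enforced; the data and names below are illustrative):

```python
# A single random forest fitted on all three targets at once; sklearn's
# RandomForestRegressor accepts a 2-D y natively. Data are synthetic and
# purely illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))              # columns: x_1, ..., x_5
Y = rng.normal(size=(500, 3))              # columns: y_1, y_2, y_3

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, Y)
Y_hat = forest.predict(X)                  # shape (500, 3)
```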

  • Thanks for the hint. However, if I see this correctly, the approach is essentially a regression chain. I'll have a closer look, but I guess it will not be possible to model a constraint like the one mentioned in my post. https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.RegressorChain.html – Peter Aug 21 '21 at 15:04