Suppose I am trying to build a random forest by subsampling the data and choosing a single feature at random for each tree. Concretely, suppose we have a dataset
$\mathcal{D} = \{(x_{1},y_{1}), \dots, (x_{N},y_{N})\}$ where $x_{i} \in \mathbb{R}^{D}$ and $y_{i} \in \mathbb{R}$ for $i = 1,\dots,N$. (I write $\mathcal{D}$ for the dataset to avoid a clash with the feature dimension $D$.) We construct each tree as follows:
- First we randomly sample one feature index $j \in \{1,\dots,D\}$.
- Then we draw a bootstrap sample $\tilde{\mathcal{D}}$ of size $M \le N$ with replacement; denote the sampled indices by $k_{1},\dots,k_{M}$.
- Keep only the $j$-th feature of the $M$ sampled points: $\tilde{\mathcal{D}}^{(j)} = \{(x^{(j)}_{k_{1}}, y_{k_{1}}), \dots, (x^{(j)}_{k_{M}}, y_{k_{M}})\}$.
- Then we build a decision tree on $\tilde{\mathcal{D}}^{(j)}$.
- Finally, we average $R$ such trees to form the random forest (see the sketch after this list).
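To make the construction concrete, here is a minimal Python sketch of the procedure above. The function names are my own, and I use scikit-learn's `DecisionTreeRegressor` as the base learner purely for illustration; any regression tree would do:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_single_feature_forest(X, y, R, M, seed=None):
    """Fit R trees, each on a bootstrap sample of size M
    restricted to one randomly chosen feature."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    trees = []
    for _ in range(R):
        j = rng.integers(D)                # random feature index j in {0, ..., D-1}
        idx = rng.integers(N, size=M)      # M indices drawn with replacement
        tree = DecisionTreeRegressor()
        tree.fit(X[idx, j].reshape(-1, 1), y[idx])  # fit on the j-th feature only
        trees.append((j, tree))
    return trees

def predict_forest(trees, X):
    """Average the predictions of the R single-feature trees."""
    preds = [tree.predict(X[:, [j]]) for j, tree in trees]
    return np.mean(preds, axis=0)
```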
We were asked: for which class of conditional distributions $Y \mid X = x$ are such "very random" forests unbiased? I am wondering what is meant by a "class" of conditional distributions here. Could someone shed some light on this?
Also, how do the bias and variance of this random forest compare with those of a traditional random forest? I assume I will need to look at generalization bounds, but I am not sure. Could someone please shed some light on this as well?