
Suppose that I am trying to build a random forest by subsampling the data and randomly choosing a single feature per tree. For example, suppose there is some dataset

$D = \{(x_{1},y_{1}), \dots, (x_{N},y_{N})\}$, where $x_{i} \in \mathbb{R}^{D}$ and $y_{i} \in \mathbb{R}$ for $i = 1, \dots, N$. We construct each tree as follows:

  1. First, randomly sample one feature index $j \in \{1, \dots, D\}$.
  2. Draw a sample $\tilde D_{k}$ of the data, of size $M \le N$, with replacement; the sampled datapoints have indices $k = k_{1}, \dots, k_{M}$.
  3. Keep only the $j^{\text{th}}$ feature of the $M$ samples: $\tilde D^{(j)}_{k} = \{(\tilde x^{(j)}_{k_{1}}, y_{k_{1}}), \dots, (\tilde x^{(j)}_{k_{M}}, y_{k_{M}})\}$.
  4. Build a decision tree on $\tilde D^{(j)}_{k}$.
  5. Repeat steps 1-4 and average $R$ such random trees to create a random forest (a minimal sketch of the procedure is given below).
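
For concreteness, here is a minimal sketch of this construction in Python. The function names are mine, and I assume scikit-learn's `DecisionTreeRegressor` as the base learner; this is only meant to illustrate the procedure, not a definitive implementation:

```python
# Sketch of the single-feature bagged-tree procedure described above.
# Assumes numpy and scikit-learn; names like build_single_feature_forest are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def build_single_feature_forest(X, y, R=100, M=None, random_state=0):
    """Fit R trees, each on one randomly chosen feature of a bootstrap sample."""
    rng = np.random.default_rng(random_state)
    N, D = X.shape
    M = N if M is None else M
    trees = []
    for _ in range(R):
        j = rng.integers(D)                    # step 1: sample a feature index
        rows = rng.integers(N, size=M)         # step 2: draw M rows with replacement
        X_j = X[rows, j].reshape(-1, 1)        # step 3: keep only the j-th feature
        tree = DecisionTreeRegressor().fit(X_j, y[rows])  # step 4: fit a tree on it
        trees.append((j, tree))
    return trees

def predict_forest(trees, X):
    """Step 5: average the predictions of the R single-feature trees."""
    preds = [tree.predict(X[:, [j]]) for j, tree in trees]
    return np.mean(preds, axis=0)
```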

We were asked for which class of conditional distributions $Y \mid X = x$ such very random forests are unbiased. I am wondering what is meant by the "class" of conditional distributions here. Could someone shed some light on this, please?

Also, how do the bias and variance of this RF compare with those of a traditional RF? I assume that I will need to look at generalization bounds, but I am not sure. Could someone please shed some light on this?
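
For reference, by "unbiased" I mean the bias term in the usual pointwise bias-variance decomposition of the squared error (stated here only to fix notation, with $\hat f_{RF}(x)$ denoting the forest's prediction at $x$):

$$\mathbb{E}\big[(Y - \hat f_{RF}(x))^{2} \mid X = x\big] = \underbrace{\big(\mathbb{E}[Y \mid X = x] - \mathbb{E}[\hat f_{RF}(x)]\big)^{2}}_{\text{bias}^{2}} + \underbrace{\operatorname{Var}\!\big(\hat f_{RF}(x)\big)}_{\text{variance}} + \underbrace{\operatorname{Var}(Y \mid X = x)}_{\text{noise}},$$

which holds when the forest's training randomness (sample and feature choices) is independent of the noise in $Y$ at the test point.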

  • This is a special case of the [Random Subspace Method for Ensemble Decision Trees](https://en.wikipedia.org/wiki/Random_subspace_method#Algorithm) – Nikos M. Mar 28 '21 at 13:18
  • Unbiased in the sense that $E[RandomForest(X)] = Y$? In other words, the expected classification is the same as the actual classification? Or unbiased in the sense that the sample space is sampled uniformly? – Nikos M. Mar 28 '21 at 13:21
  • @NikosM. Unbiasedness in terms of bias-variance – user1234 Mar 28 '21 at 15:32
  • A partial answer (unless more background is added to the question about what is supposed to be used) is that, for the averaged random forest to be unbiased, and since each tree in the forest reflects only one feature, the features should be independent, so that at least the trees for each feature have independent errors which on average sum to zero, hence unbiasedness – Nikos M. Mar 28 '21 at 17:12
  • @NikosM. What additional background might we need? – user1234 Mar 28 '21 at 18:22
  • For example, what are we supposed to rely on or use for this question? My guess is that this question is part of a class on classification trees, so some background on how we are meant to tackle it should be given – Nikos M. Mar 28 '21 at 18:28
  • @NikosM. Yes it is. It is based on random forests. We are supposed to use generalization bounds. – user1234 Mar 28 '21 at 18:30
