
I have heard that decision trees can have a high amount of variance, and that for a data set $D$, split into test/train, the decision tree could be quite different depending on how the data was split. Apparently, this provides motivation for algorithms such as random forest.

Is this correct? Why does a decision tree suffer from high variability?

Edit:

Just a note - I do not really follow the current answer and have not been able to resolve that in the comments.

Ethan
baxx

4 Answers


It is relatively simple if you understand what variance refers to in this context. A model has high variance if it is very sensitive to (small) changes in the training data.

A decision tree has high variance because, if you imagine a very large tree, it can basically adjust its predictions to every single input.

Consider you wanted to predict the outcome of a soccer game. A decision tree could make decisions like:

IF

  1. player X is on the field AND
  2. team A has a home game AND
  3. the weather is sunny AND
  4. the number of attending fans >= 26000 AND
  5. it is past 3pm

THEN team A wins.

If the tree is very deep, it will get very specific and you may only have one such game in your training data. It probably would not be appropriate to base your predictions on just one example.

Now, if you make a small change e.g. set the number of attending fans to 25999, a decision tree might give you a completely different answer (because the game now doesn't meet the 4th condition).
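For illustration, here is a minimal sketch of that sensitivity with scikit-learn. The match features and values are invented; the point is only that an unrestricted tree learns hard thresholds, so a one-unit change in a feature can flip the prediction:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Invented match features: [home_game, sunny, fans_attending, past_3pm]
X_train = rng.integers(0, 2, size=(200, 4)).astype(float)
X_train[:, 2] = rng.integers(10_000, 60_000, size=200)  # attendance
y_train = rng.integers(0, 2, size=200)                  # 1 = team A wins

# An unrestricted tree keeps splitting until (almost) every game is isolated.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

game = np.array([[1, 1, 26_000, 1]])
nudged = np.array([[1, 1, 25_999, 1]])  # one fan fewer

print(deep_tree.predict(game), deep_tree.predict(nudged))
# The two predictions may differ even though the inputs are nearly identical,
# because the tree has learned a hard split somewhere on the attendance feature.
```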

Linear regression, for example, would not be so sensitive to a small change because it is limited ("biased" -> see bias-variance tradeoff) to linear relationships and cannot represent sudden changes from 25999 to 26000 fans.

That's why it is important not to make decision trees arbitrarily large/deep. This limits their variance.

(See e.g. here for more on how random forests can help with this further.)
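To make that concrete, here is a rough sketch that refits each model on bootstrap resamples of the same data and measures how often its predictions on a fixed test set disagree with its own majority vote, a simple proxy for variance (the dataset and settings are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_pool, y_pool, X_test = X[:1500], y[:1500], X[1500:]

def prediction_instability(make_model, n_resamples=30):
    """Refit on bootstrap resamples; return how often the predicted class
    for a test point deviates from the majority vote across refits."""
    rng = np.random.default_rng(0)
    preds = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(X_pool), size=len(X_pool))
        preds.append(make_model().fit(X_pool[idx], y_pool[idx]).predict(X_test))
    preds = np.array(preds)
    majority = (preds.mean(axis=0) > 0.5).astype(int)
    return np.mean(preds != majority)

print("full-depth tree:", prediction_instability(lambda: DecisionTreeClassifier(random_state=0)))
print("max_depth=3    :", prediction_instability(lambda: DecisionTreeClassifier(max_depth=3, random_state=0)))
print("random forest  :", prediction_instability(lambda: RandomForestClassifier(n_estimators=100, random_state=0)))
```

The full-depth tree typically disagrees with itself far more often across resamples than the depth-limited tree or the random forest.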

oW_

The point is that if your training data does not contain identical inputs with different labels (i.e. the Bayes error on the training set is $0$), a decision tree can learn the training set entirely, and that leads to overfitting, also known as high variance. This is why people usually prune trees, choosing the pruning level by cross-validation, to keep them from overfitting the training data.
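As a concrete sketch of pruning with cross-validation, one common approach is cost-complexity pruning, where the pruning strength `ccp_alpha` is chosen by cross-validated accuracy (this is only one way to do it, not necessarily the one meant here):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate pruning strengths come from the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # the last alpha prunes the tree down to a single node

# Pick the alpha with the best cross-validated accuracy.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
          for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print("best ccp_alpha:", best_alpha, "cv accuracy:", max(scores))
```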

Decision trees are powerful classifiers. Algorithms such as bagging use such powerful classifiers for ensemble learning, in order to obtain a combined classifier that does not have high variance. One ingredient is to ignore some features and use the others at each split, as in random forests, in order to find the features that generalize well. Another is to train each decision tree on a random sample of the training data drawn with replacement, known as bootstrapping.
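A short sketch of both ideas with scikit-learn, on an arbitrary synthetic dataset (the exact settings are only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: each tree is trained on a bootstrap sample (drawn with replacement).
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, random_state=0)

# Random forest: bootstrap samples plus a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

for name, model in [("single tree  ", DecisionTreeClassifier(random_state=0)),
                    ("bagged trees ", bagging),
                    ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```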

The reason that decision trees can overfit is their VC dimension. Although it is not infinite, unlike that of 1-NN, it is very large, which leads to overfitting. In practice this means you have to provide a large amount of data in order not to overfit. For more on the VC dimension of decision trees, take a look at Are decision tree algorithms linear or nonlinear.
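To get a feeling for how large that capacity is in practice, a quick sketch: an unrestricted tree can fit even completely random labels perfectly on the training data, while of course doing no better than chance on fresh data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, size=500)  # labels carry no signal at all

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("training accuracy:", tree.score(X, y))  # 1.0: the tree memorizes the noise
print("accuracy on fresh random data:",
      tree.score(rng.normal(size=(500, 5)), rng.integers(0, 2, size=500)))  # ~0.5
```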

Green Falcon
  • "**the same input features with different labels which leads to 0 Bayes error**", I'm not sure what you mean by this. – baxx Mar 28 '19 at 18:29
  • @baxx I meant your training data of different classes in the current feature space do not have intersection; namely, in the current space the distribution of each class does not have intersection with the others. – Green Falcon Mar 28 '19 at 19:34
  • if there's a way to explain this in more "plain english" then I can follow, currently this answer is a bit abstract though. The data of different classes (are you just referring to variables here?) in the current feature space (is this the data set?) do not have intersection (not sure what you're referring to there, they're mutually exclusive? Why wouldn't they be if they're different variables?) – baxx Mar 28 '19 at 20:08
  • Suppose your input feature space is $\mathbb{R}$, i.e. you have a variable that can take any real value. This number can be temperature, for instance; suppose it can take any value (and forget about -273). Now you have an output label which can be cold or hot. Suppose you have a training set which consists of the opinions of different people. Different people may have different opinions. Consequently, in the current feature space, which consists of only the temperature, you may have the value 4 labeled both cold and hot. This means that if you plot the histogram of each class, cold and hot, you have – Green Falcon Mar 28 '19 at 20:13
  • intersection. This means even the best possible brain cannot have $100\%$ accuracy, let alone an ML algorithm. – Green Falcon Mar 28 '19 at 20:14
  • Anyway, in the answer I tried to use common jargon which is usually discussed in ML and DL textbooks. What I've referred to can be found in pattern recognition textbooks too. – Green Falcon Mar 28 '19 at 20:15
  • "temperature for instance, suppose it can take any value and forget about -273", yeah I have no idea what you're talking about unfortunately. No worries though, thanks. – baxx Mar 28 '19 at 20:17
  • Less than -273 degrees is not possible. – Green Falcon Mar 28 '19 at 21:41
  • Which part can you not understand? Feel free to ask, it's normal :) – Green Falcon Mar 28 '19 at 21:44
  • The Bayes error is not defined over the training set. You may want to change this to training error or something similar. – oW_ Mar 28 '19 at 21:57
  • @oW_ the Bayes error is defined on the real distribution. Since we do not have the real distribution, we can consider the histogram of the training set as such a thing. The point is that if you have the same input patterns with contradictory labels, where the situation is mutually exclusive and exhaustive, this leads to Bayes error. – Green Falcon Mar 28 '19 at 22:01
  • This means you cannot have $100\%$ accuracy in the current feature space. – Green Falcon Mar 28 '19 at 22:01
  • I understand the point but it still doesn't make it the Bayes error. Also I don't see why this is necessary to answer the question. There can be overfitting regardless of whether there is zero Bayes error (and what about regression?!). This only confuses the OP. – oW_ Mar 28 '19 at 22:12
  • @oW_ As far as I know regression, i.e. real-number output, cannot be used for decision trees. About the Bayes error: if you have the same input pattern with contradictory labels, your tree has to decide what to put as the label in the leaf node. This is why it's important, because from the leaf to the root, all branches are the same for the same inputs with contradictory labels. – Green Falcon Mar 28 '19 at 22:17
  • I'm referring to the question from a pattern recognition perspective. – Green Falcon Mar 28 '19 at 22:19

I may be a bit late to the party. Yes, I would say this is correct.

Decision trees are prone to overfitting. Models that exhibit overfitting are usually non-linear and have low bias as well as high variance (see the bias-variance trade-off). Decision trees are non-linear; the question now is why they should have high variance.

In order to illustrate this, consider yourself in a time-series regression setting. The usual procedure is to choose your features and split your data chronologically into training and test data. As features we will use time-lagged values of the target variable. Moreover, let's assume that the data exhibits a trend, meaning that the values are constantly growing. Unavoidably, the test set will then contain larger values than the training data.

Now, you train a decision tree regressor on the training set. The decision tree will find splits based on the training data, i.e. partition the feature space into regions. In this sense, the feature space depends on the range of each individual feature, and that range is necessarily smaller for the training set. Subsequently, when testing the model on the test data, the decision tree is not able to predict the larger values: it will "cut off" at the highest values encountered in the training data. This shows that decision trees are sensitive to changes in the data and therefore have high variability.
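A small sketch of this cut-off behaviour on a toy trending series (the setup here is invented purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy trending series: the values grow steadily over time.
t = np.arange(200, dtype=float)
series = t + np.random.default_rng(0).normal(scale=2.0, size=200)

X = series[:-1].reshape(-1, 1)  # lag-1 value as the only feature
y = series[1:]                  # next value as the target

# Chronological split: the later (and therefore larger) values form the test set.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

print("largest target seen in training :", y_train.max())
print("largest tree prediction on test :", tree.predict(X_test).max())    # capped near the training maximum
print("largest linear prediction on test:", linear.predict(X_test).max()) # keeps following the trend
```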

Chipsen

I do not fully agree with the answer from @oW_. Let's use linear regression as an example (and also assume the true model is a linear regression with Gaussian error).

Based on the normal equation, we could easily infer the model variance such as in this post: https://stats.stackexchange.com/a/306790

The variance of the model here means: given an input row $X^{*}$, how much the prediction would vary due to the true error. In layman's terms, if we collected the same data again ($X$ stays the same), the outputs would differ because of the irreducible error; how would my fitted model then change?

I think we should keep the features unchanged, but introduce small errors into the labels $y$ and see how the model's predictions change. I generated a simple simulation to demonstrate the idea: https://github.com/raingstar/BiasVarianceAnalysis/blob/main/Bias-Variance%20analysis.ipynb
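The linked notebook is not reproduced here, but a minimal version of that experiment might look like the following: keep $X$ fixed, redraw the Gaussian noise in $y$ many times, refit, and look at how the prediction at a fixed point $X^{*}$ spreads for a linear model versus a fully grown tree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, sigma = 100, 1.0
X = rng.uniform(-3, 3, size=(n, 1))    # the features stay fixed across repetitions
true_f = lambda x: 2.0 * x[:, 0] + 1.0 # assumed true linear model
x_star = np.array([[1.5]])             # the query point X*

lin_preds, tree_preds = [], []
for _ in range(500):
    y = true_f(X) + rng.normal(scale=sigma, size=n)  # redraw only the irreducible error
    lin_preds.append(LinearRegression().fit(X, y).predict(x_star)[0])
    tree_preds.append(DecisionTreeRegressor(random_state=0).fit(X, y).predict(x_star)[0])

print("variance of linear prediction at X*:", np.var(lin_preds))
print("variance of tree prediction at X*  :", np.var(tree_preds))
# The fully grown tree essentially returns the (noisy) label of the nearest
# training point, so its prediction varies far more across noise redraws.
```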

rain keyu