
I have two questions related to decision trees:

  1. If we have a continuous attribute, how do we choose the splitting value?

    Example: Age=(20,29,50,40....)

  2. Imagine that we have a continuous attribute $f$ that takes values in $\mathbb{R}$. How can I write an algorithm that finds the split point $v$ such that, when we split $f$ at $v$, we have a minimum gain for $f>v$?

Stephen Rauch
WALID BELRHALMIA

1 Answer


In order to come up with a split point, the values are sorted, and the midpoints between adjacent values are evaluated in terms of some metric, usually information gain or Gini impurity. For your example, let's say we have four examples and the values of the age variable are $(20, 29, 40, 50)$. The midpoints between adjacent values, $(24.5, 34.5, 45)$, are evaluated, and whichever split gives the best information gain (or whatever metric you're using) on the training data is used.
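A minimal Python sketch of this procedure, assuming entropy-based information gain and binary class labels (the function names and toy labels are illustrative, not from any particular library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, information_gain) for the best midpoint split.

    Candidate splits are the midpoints between adjacent sorted values.
    """
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    base = entropy(ys)          # entropy before splitting
    n = len(ys)
    best = (None, -1.0)
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue            # no midpoint between identical values
        mid = (xs[i - 1] + xs[i]) / 2
        left, right = ys[:i], ys[i:]
        # weighted entropy of the two children
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        if gain > best[1]:
            best = (mid, gain)
    return best

# Toy labels for the four ages: the split at 34.5 separates the classes
# perfectly, so it achieves the full information gain of 1 bit.
split, gain = best_split([20, 29, 40, 50], ['yes', 'yes', 'no', 'no'])
```

Real implementations do the same search but compute the child entropies incrementally while sweeping left to right, so the whole scan is $O(n)$ after sorting rather than $O(n^2)$.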

You can save some computation time by only checking split points that lie between examples of different classes, because only these splits can be optimal for information gain.

timleathart
  • @timleathart but normally, for an attribute $f$, we choose the split $v$ that gives the biggest information gain for $f>v$; but here, look at the question: they asked for a minimum gain. – WALID BELRHALMIA Nov 04 '17 at 09:35
  • @timleathart, can you explain more? I need to know the best optimized way of identifying such splits and checking the information gain. Let's say one variable has a lot of variation and the other is almost constant. How many such splits should there be? – Arpit Sisodia Jul 03 '19 at 11:32
  • @timleathart, extending your answer, this split will not be optimal when the values are (20, 21, 22, 23, 45, 67, 80). Shouldn't a min-to-max iteration be used here? Please correct me if I am wrong in my assumption :) – Arpit Sisodia Jul 03 '19 at 11:35
  • This clarifies my confusions! – Jinhua Wang Nov 16 '19 at 12:10
  • I think the use of midpoints is an implementation detail; IIRC, some may split as <= each value itself. And there are additional shortcuts for efficiency, e.g. histogram binning. – Ben Reiniger May 13 '20 at 19:44
  • @timleathart: can you please elaborate or give pointers to a proof of "checking split points that lie between examples of different classes, because only these splits can be optimal for information gain"? – user179156 May 18 '21 at 07:02
  • @user179156 Very informally: information gain is maximised when the two subsets created by the split have the highest possible class purity. So if you choose a split point that has examples with the same class on either side, this results in lower class purity compared to if both of the examples were on the same side of the split. – timleathart May 19 '21 at 00:43