In the previous exercise, and on the page before that, we saw how a simple entropy measure can be used to recursively build a decision tree. This technique is flawed, though, because it favours attributes that have many possible values over those that have few. Why does this matter?
Take an example where we add a new attribute, Date, recording the date on which we sent out the promotion. Our modified dataset might look something like this:
This attribute will be given a high Information Gain value by our tree-construction algorithm, so it will seem like a good choice for the root node. A tree built using the Information Gain measure on the dataset above will look like this:
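To see why Date wins, we can compute Information Gain directly. The sketch below uses a small hypothetical dataset (the attribute names and values are illustrative, not the page's actual table): because every row has a unique Date, each split branch is perfectly pure, so Date's gain equals the full entropy of the Outcome, the maximum possible.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    total = len(rows)
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return base - remainder

# Hypothetical toy dataset: Date is unique per row, District is a coarser attribute.
rows = [
    {"Date": "01-01", "District": "Urban",    "Outcome": "Responded"},
    {"Date": "01-02", "District": "Urban",    "Outcome": "Nothing"},
    {"Date": "01-03", "District": "Rural",    "Outcome": "Responded"},
    {"Date": "01-04", "District": "Rural",    "Outcome": "Responded"},
    {"Date": "01-05", "District": "Suburban", "Outcome": "Nothing"},
    {"Date": "01-06", "District": "Suburban", "Outcome": "Nothing"},
]

print(information_gain(rows, "Date", "Outcome"))      # → 1.0, the maximum
print(information_gain(rows, "District", "Outcome"))  # → ~0.667
```

Date is chosen even though its purity is an artefact of every row having a distinct value, exactly the bias described above.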
What a monster!
Now this tree might classify the training data we used to build it, but as well as being complex, it won't be much good at predicting new instances. The Date attribute is the first thing the tree filters on when we want a prediction for a new instance. It bases its predictions on Date just because the training data happened to split nicely on it, with many instances classified from their Date alone. It should be obvious that, in reality, this attribute is not a good predictor of the Outcome; it was only selected because of the way the entropy calculations work. Why prefer a tree like the one above, when we could build the following tree from the same data:
This is not even the shortest tree! Clearly, in some cases it would be better to use a different method of splitting the data in our algorithm. Have a go at building some trees with the modified dataset on the next page, and see how the Information Gain measure always favours the Date attribute.
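One common alternative splitting criterion (the one used by C4.5) is the gain ratio: Information Gain divided by the "split information", the entropy of the attribute's own value distribution. Many-valued attributes like Date have high split information, which penalises them. A minimal sketch, reusing the same hypothetical toy dataset as before (names and values are illustrative assumptions, not the page's actual table):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of values."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, attr, target):
    """Information gain divided by the split information (C4.5-style)."""
    total = len(rows)
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    gain = base - remainder
    # Split information: entropy of the attribute's values themselves.
    split_info = entropy([r[attr] for r in rows])
    return gain / split_info if split_info else 0.0

rows = [
    {"Date": "01-01", "District": "Urban",    "Outcome": "Responded"},
    {"Date": "01-02", "District": "Urban",    "Outcome": "Nothing"},
    {"Date": "01-03", "District": "Rural",    "Outcome": "Responded"},
    {"Date": "01-04", "District": "Rural",    "Outcome": "Responded"},
    {"Date": "01-05", "District": "Suburban", "Outcome": "Nothing"},
    {"Date": "01-06", "District": "Suburban", "Outcome": "Nothing"},
]

print(gain_ratio(rows, "Date", "Outcome"))      # → ~0.387
print(gain_ratio(rows, "District", "Outcome"))  # → ~0.421
```

With the ratio, Date's perfect gain is divided by its large split information (log2 of the number of rows), so the coarser attribute now wins the root.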