Decision Trees Tutorial 6: Entropy Bias

In the previous exercise, and in the page before that, we saw how a simple entropy measure can be used to recursively build a decision tree. This technique is flawed, though, because it favours attributes that have many possible values over those that have few. An attribute with many values splits the data into many small subsets, and small subsets are easily pure, so such an attribute earns a high Information Gain whether or not it actually tells us anything useful. Why is this a problem?
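
A minimal Python sketch makes this concrete (the helper functions and the six-instance toy data are our own illustration, not part of the tutorial's dataset). An attribute with a distinct value on every instance splits the data into pure single-instance subsets and so scores the maximum possible gain, even though it predicts nothing:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy of a list of class labels: -sum of p * log2(p)."""
        total = len(labels)
        return -sum((n / total) * log2(n / total)
                    for n in Counter(labels).values())

    def information_gain(values, labels):
        """Reduction in entropy from splitting the labels by attribute value."""
        split = {}
        for value, label in zip(values, labels):
            split.setdefault(value, []).append(label)
        remainder = sum(len(subset) / len(labels) * entropy(subset)
                        for subset in split.values())
        return entropy(labels) - remainder

    outcomes = ["Responded", "Nothing"] * 3   # six instances, three of each class

    # An attribute with a distinct value per instance (like Date) puts every
    # instance in its own pure subset, so it scores the maximum possible gain.
    print(information_gain(["d1", "d2", "d3", "d4", "d5", "d6"], outcomes))  # 1.0

    # An equally uninformative two-valued attribute scores close to zero.
    print(information_gain(["x", "x", "x", "y", "y", "y"], outcomes))  # ~0.08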

An Example

Take an example where we have a new attribute, Date, which records the date on which we sent out the promotion. Our modified dataset might look something like this:

Date       District   House Type      Income   Previous Customer   Outcome
3/10/03    Suburban   Detached        High     No                  Nothing
14/9/03    Suburban   Detached        High     Yes                 Nothing
2/4/02     Rural      Detached        High     No                  Responded
18/1/03    Urban      Semi-detached   High     No                  Responded
3/4/03     Urban      Semi-detached   Low      No                  Responded
15/10/02   Urban      Semi-detached   Low      Yes                 Nothing
15/10/02   Rural      Semi-detached   Low      Yes                 Responded
2/3/01     Suburban   Terrace         High     No                  Nothing
4/5/03     Suburban   Semi-detached   Low      No                  Responded
2/1/03     Urban      Terrace         Low      No                  Responded
3/10/03    Suburban   Terrace         Low      Yes                 Responded
3/10/03    Rural      Terrace         High     Yes                 Responded
8/4/03     Rural      Detached        Low      No                  Responded
6/5/02     Urban      Terrace         High     Yes                 Nothing
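
Before looking at the tree this produces, we can check the scores directly. Here is a minimal Python sketch of the Information Gain calculation applied to the dataset above (the code is our own illustration; the printed values are approximate):

    from collections import Counter
    from math import log2

    # The modified dataset: (Date, District, House Type, Income,
    # Previous Customer, Outcome), one tuple per row of the table above.
    rows = [
        ("3/10/03",  "Suburban", "Detached",      "High", "No",  "Nothing"),
        ("14/9/03",  "Suburban", "Detached",      "High", "Yes", "Nothing"),
        ("2/4/02",   "Rural",    "Detached",      "High", "No",  "Responded"),
        ("18/1/03",  "Urban",    "Semi-detached", "High", "No",  "Responded"),
        ("3/4/03",   "Urban",    "Semi-detached", "Low",  "No",  "Responded"),
        ("15/10/02", "Urban",    "Semi-detached", "Low",  "Yes", "Nothing"),
        ("15/10/02", "Rural",    "Semi-detached", "Low",  "Yes", "Responded"),
        ("2/3/01",   "Suburban", "Terrace",       "High", "No",  "Nothing"),
        ("4/5/03",   "Suburban", "Semi-detached", "Low",  "No",  "Responded"),
        ("2/1/03",   "Urban",    "Terrace",       "Low",  "No",  "Responded"),
        ("3/10/03",  "Suburban", "Terrace",       "Low",  "Yes", "Responded"),
        ("3/10/03",  "Rural",    "Terrace",       "High", "Yes", "Responded"),
        ("8/4/03",   "Rural",    "Detached",      "Low",  "No",  "Responded"),
        ("6/5/02",   "Urban",    "Terrace",       "High", "Yes", "Nothing"),
    ]
    attributes = ["Date", "District", "House Type", "Income", "Previous Customer"]

    def entropy(labels):
        total = len(labels)
        return -sum((n / total) * log2(n / total)
                    for n in Counter(labels).values())

    def information_gain(column):
        """Gain from splitting the outcomes on the attribute in `column`."""
        outcomes = [row[-1] for row in rows]
        split = {}
        for row in rows:
            split.setdefault(row[column], []).append(row[-1])
        remainder = sum(len(subset) / len(rows) * entropy(subset)
                        for subset in split.values())
        return entropy(outcomes) - remainder

    for column, name in enumerate(attributes):
        print(f"{name}: {information_gain(column):.3f}")

    # Prints (approximately): Date 0.601, District 0.247, House Type 0.050,
    # Income 0.152, Previous Customer 0.048 -- Date wins by a wide margin.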

This attribute will be given a very high Information Gain value by our tree construction algorithm, and so it will seem like a good idea to use it at the root node. A tree built from the above dataset using the Information Gain measure will look like this (each branch is labelled with the count of the majority outcome over the total number of instances reaching it; a branch marked null received no training instances):

Date
|-- 3/10/03 (2/3)
|   `-- House Type
|       |-- Detached (1/1) -> Nothing
|       |-- Semi-detached (0/0) -> null
|       `-- Terrace (2/2) -> Responded
|-- 14/9/03 (1/1) -> Nothing
|-- 2/4/02 (1/1) -> Responded
|-- 18/1/03 (1/1) -> Responded
|-- 3/4/03 (1/1) -> Responded
|-- 15/10/02 (1/2)
|   `-- District
|       |-- Suburban (0/0) -> null
|       |-- Rural (1/1) -> Responded
|       `-- Urban (1/1) -> Nothing
|-- 2/3/01 (1/1) -> Nothing
|-- 4/5/03 (1/1) -> Responded
|-- 2/1/03 (1/1) -> Responded
|-- 8/4/03 (1/1) -> Responded
`-- 6/5/02 (1/1) -> Nothing

What a monster!

Now, this tree will classify the training data we used to build it, but as well as being complex, it won't be much good for predicting new instances. Date is the first attribute the tree tests when classifying a new instance, and almost any new instance will carry a date the tree has never seen. The tree tries to base its predictions on Date simply because the training data happened to split so neatly on it, with most instances classified by their Date alone. It should be obvious that in reality this attribute is not a good predictor of the Outcome; it has been selected only because of the way the entropy calculations work. Why prefer a tree like the above when we could build the following tree from the same data:

District
|-- Suburban (3/5)
|   `-- House Type
|       |-- Detached (2/2) -> Nothing
|       |-- Semi-detached (1/1) -> Responded
|       `-- Terrace (1/2)
|           `-- Income
|               |-- High (1/1) -> Nothing
|               `-- Low (1/1) -> Responded
|-- Rural (4/4) -> Responded
`-- Urban (3/5)
    `-- Previous Customer
        |-- No (3/3) -> Responded
        `-- Yes (2/2) -> Nothing

This is not even the shortest tree we could build! Have a go on the next page at building some trees from the modified dataset, and see how the Information Gain measure always favours the Date attribute. Clearly, in some cases it would be better for our algorithm to have a different method of splitting the data.
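
One well-known alternative, used by Quinlan's C4.5 algorithm, is the gain ratio: the Information Gain is divided by the entropy of the split itself, which penalises attributes that fragment the data into many tiny subsets. Here is a minimal sketch of the idea, reusing the approximate gain values computed earlier:

    from math import log2

    def split_information(subset_sizes):
        """Entropy of the partition itself; splitting the data into many
        small subsets yields a high value, so this acts as a penalty."""
        total = sum(subset_sizes)
        return -sum((n / total) * log2(n / total) for n in subset_sizes)

    def gain_ratio(info_gain, subset_sizes):
        """Information Gain divided by the split information."""
        return info_gain / split_information(subset_sizes)

    # Date splits the 14 instances into 11 subsets (sizes 3, 2 and nine 1s),
    # District into just 3 subsets (sizes 5, 4 and 5).
    print(gain_ratio(0.601, [3, 2] + [1] * 9))  # ~0.18
    print(gain_ratio(0.247, [5, 4, 5]))         # ~0.16

Notice how Date's huge raw advantage (0.601 against 0.247) almost vanishes under the penalty. On a dataset this small it still edges ahead, which is one reason C4.5 combines the gain ratio with other safeguards, but the bias towards many-valued attributes is largely corrected.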