Tutorial (9): Exercise 4
Exercise 4: Using Gain Ratio as a Splitting Criterion
The dataset:
| Date | District | House Type | Income | Previous Customer | Outcome |
| 3/10/03 | Suburban | Detached | High | No | Nothing |
| 14/9/03 | Suburban | Detached | High | Responded | Nothing |
| 2/4/02 | Rural | Detached | High | No | Responded |
| 18/1/03 | Urban | Semi-detached | High | No | Responded |
| 3/4/03 | Urban | Semi-detached | Low | No | Responded |
| 15/10/02 | Urban | Semi-detached | Low | Responded | Nothing |
| 15/10/02 | Rural | Semi-detached | Low | Responded | Responded |
| 2/3/01 | Suburban | Terrace | High | No | Nothing |
| 4/5/03 | Suburban | Semi-detached | Low | No | Responded |
| 2/1/03 | Urban | Terrace | Low | No | Responded |
| 3/10/03 | Suburban | Terrace | Low | Responded | Responded |
| 3/10/03 | Rural | Terrace | High | Responded | Responded |
| 8/4/03 | Rural | Detached | Low | No | Responded |
| 6/5/02 | Urban | Terrace | High | Responded | Nothing |
The Decision Tree: Interactively build it
- Click on the root node below and start building the tree.
- Non-leaf nodes can be "pruned" once they have been chosen (by clicking on the node and selecting "prune node completely").
- The ratios on the branches indicate how well the chosen attribute at a node splits the remaining data based on the target attribute ('Outcome').
- Click on any node to highlight the rows in the data table that are covered by the rule leading down to that node.
- At each node, the entropy of the data at that point in the tree will be given.
- Information gain (entropy reduction) is specified for each attribute.
- The tree is built here by choosing splits that reduce the entropy towards zero. When no more nodes can be expanded, the tree has classified all the training data.
- Notice that the Date attribute is calculated as having a high information gain.
- The gain ratio of an attribute is now also shown at each node construction phase, after the Information Gain value.
- See how the two differ and explore the types of trees that each produces.
- If we assume that Date has no bearing on Outcome, which method produces the smaller trees?
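The values shown at the root node can be reproduced offline. What follows is a minimal Python sketch, not the tutorial applet's own code: it encodes the table above as tuples and computes the information gain, split information and gain ratio of each attribute at the root. The names `ROWS`, `ATTRIBUTES`, `entropy` and `gain_and_ratio` are illustrative, and the sketch assumes the usual definition of gain ratio as information gain divided by the entropy of the attribute's own value distribution.

```python
from collections import Counter
from math import log2

# Rows copied from the table above:
# (Date, District, House Type, Income, Previous Customer, Outcome)
ROWS = [
    ("3/10/03", "Suburban", "Detached", "High", "No", "Nothing"),
    ("14/9/03", "Suburban", "Detached", "High", "Responded", "Nothing"),
    ("2/4/02", "Rural", "Detached", "High", "No", "Responded"),
    ("18/1/03", "Urban", "Semi-detached", "High", "No", "Responded"),
    ("3/4/03", "Urban", "Semi-detached", "Low", "No", "Responded"),
    ("15/10/02", "Urban", "Semi-detached", "Low", "Responded", "Nothing"),
    ("15/10/02", "Rural", "Semi-detached", "Low", "Responded", "Responded"),
    ("2/3/01", "Suburban", "Terrace", "High", "No", "Nothing"),
    ("4/5/03", "Suburban", "Semi-detached", "Low", "No", "Responded"),
    ("2/1/03", "Urban", "Terrace", "Low", "No", "Responded"),
    ("3/10/03", "Suburban", "Terrace", "Low", "Responded", "Responded"),
    ("3/10/03", "Rural", "Terrace", "High", "Responded", "Responded"),
    ("8/4/03", "Rural", "Detached", "Low", "No", "Responded"),
    ("6/5/02", "Urban", "Terrace", "High", "Responded", "Nothing"),
]
ATTRIBUTES = ["Date", "District", "House Type", "Income", "Previous Customer"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_and_ratio(rows, col):
    """Return (information gain, split info, gain ratio) for column `col`."""
    outcomes = [r[-1] for r in rows]
    # Group the outcome labels by the value of the chosen attribute.
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    # Weighted entropy that remains after splitting on the attribute.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy(outcomes) - remainder
    # Split info is the entropy of the attribute's own value distribution.
    split_info = entropy([r[col] for r in rows])
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, split_info, ratio


RESULTS = {name: gain_and_ratio(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, (gain, split_info, ratio) in RESULTS.items():
    print(f"{name:18s} gain={gain:.3f} split_info={split_info:.3f} gain_ratio={ratio:.3f}")
```

Running it shows Date with by far the largest information gain, because most dates occur only once, and also with the largest split information, which is the term gain ratio uses to penalise such many-valued attributes.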
| root node |
Error in example?
First off: Thanks a lot for this tutorial...
But I think there is a mistake in the example above:
When calculating the intrinsic split info for the attribute Previous Customer, you seem to find exactly 1 (since the gain ratio is equal to the gain)... But that doesn't seem possible: -(8/14*log2(8/14) + 6/14*log2(6/14)) != 1... Even after rounding, the value of the Gain Ratio for 'Previous Customer' should be 0.049, not 0.048...
Also, the definition of VI on the previous page is rather obscure... shouldn't it simply be: -sum(pi * log2(pi))?
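The arithmetic in this comment can be checked with a short calculation. The sketch below is illustrative only: the helper name `entropy_from_counts` is made up here, and the counts are tallied by hand from the dataset table (Previous Customer splits the 14 rows 8 to 6, with outcome counts of 6/2 and 3/3 in the two branches, and 9/5 overall). Split info is taken as the entropy of the split proportions, i.e. the negated sum of pi * log2(pi).

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


# Counts tallied from the dataset table (an assumption of this sketch).
root = entropy_from_counts([9, 5])  # Outcome over all 14 rows
remainder = (8 / 14) * entropy_from_counts([6, 2]) + (6 / 14) * entropy_from_counts([3, 3])
gain = root - remainder
split_info = entropy_from_counts([8, 6])  # 8 rows in one branch, 6 in the other
gain_ratio = gain / split_info

print(round(gain, 3), round(split_info, 3), round(gain_ratio, 3))
# prints: 0.048 0.985 0.049
```

Under these assumptions the split info is about 0.985 rather than 1, so the gain ratio rounds to 0.049 while the gain rounds to 0.048.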