Pruning to avoid overfitting
Decision trees are usually constructed with greedy heuristics (such as entropy reduction) that fit the training data closely; the resulting trees often overfit, leading to poor accuracy on future predictions.
In response to the problem of overfitting, nearly all modern decision tree algorithms adopt some form of pruning strategy. Many use a technique known as post-pruning or backward pruning. This involves growing the tree from a dataset until all possible leaf nodes have been reached (i.e. purity) and then removing particular subtrees. Studies have shown that post-pruning can produce smaller trees while improving accuracy by up to 25%. Quite a few different pruning techniques have been developed and compared in several papers, and, as with the different splitting criteria, it has been found that there is not much variation in their performance (e.g. see Mingers 1989 and Esposito et al. 1997). We'll look at one of the basic methods here.
An example: Reduced Error Pruning (Quinlan 86)
At each node in a tree we can count the number of instances that are misclassified on a testing set by propagating errors upwards from the leaf nodes. This count can be compared with the error rate that would result if the node were replaced by a leaf predicting the most common class of the instances reaching it. If the replacement reduces the error, the subtree rooted at that node is a candidate for pruning. This calculation is performed for every node in the tree, and whichever one gives the largest error reduction is pruned. The procedure is then repeated on the freshly pruned tree until no further reduction in error rate is possible at any node.
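The procedure above can be sketched in code. This is a minimal illustration, not a production implementation: the Node class, the 'Income' attribute and the tiny validation set are all invented for this example, and it uses a common bottom-up formulation of reduced-error pruning (prune the subtrees of a node first, then replace the node by a leaf whenever that does at least as well on the held-out set).

```python
# Minimal sketch of reduced-error pruning. The Node class, the 'Income'
# attribute and the tiny validation set are invented for illustration.

class Node:
    def __init__(self, attr=None, children=None, label=None):
        self.attr = attr                  # attribute tested here (None for a leaf)
        self.children = children or {}    # attribute value -> child Node
        self.label = label                # majority class at this node

    def is_leaf(self):
        return not self.children

    def classify(self, x):
        if self.is_leaf():
            return self.label
        child = self.children.get(x[self.attr])
        return child.classify(x) if child else self.label

def reduced_error_prune(node, val_data):
    """Bottom-up reduced-error pruning: prune the children first, then
    replace this node by a leaf if that does not increase the error on
    the held-out validation instances that reach it."""
    if node.is_leaf():
        return
    for value, child in node.children.items():
        subset = [(x, y) for x, y in val_data if x[node.attr] == value]
        reduced_error_prune(child, subset)
    subtree_errors = sum(node.classify(x) != y for x, y in val_data)
    leaf_errors = sum(node.label != y for _, y in val_data)
    if leaf_errors <= subtree_errors:     # ties favour the smaller tree
        node.children = {}

# A two-leaf tree splitting on 'Income', echoing the example in the text.
tree = Node(attr='Income', label='Responded', children={
    'High': Node(label='Responded'),
    'Low': Node(label='Not responded'),
})

# Hypothetical validation set: the 'Low' branch misclassifies two instances.
val = [({'Income': 'High'}, 'Responded'),
       ({'Income': 'Low'}, 'Responded'),
       ({'Income': 'Low'}, 'Responded'),
       ({'Income': 'Low'}, 'Not responded')]

reduced_error_prune(tree, val)
print(tree.is_leaf())  # True: replacing the split by a leaf cuts errors from 2 to 1
```

The text describes repeatedly pruning the single best node and re-scanning the tree; the single bottom-up pass shown here reaches the same kind of result on this toy tree, since a node is only collapsed when the leaf replacement is no worse on the validation data.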
The node labelled ‘Income’ above can be seen to produce branches that lead to a total of 3 misclassifications on the testing data (the second red number on the leaf nodes). Replacing this node with a leaf node labelled ‘Responded’ would result in:
Although we now incur a re-substitution error on the training set, the errors on the testing set are reduced, which is desirable in terms of both accuracy and tree size.
Other strategies to avoid overfitting
Pruning, or post-pruning as we have described it here, is not the only method that has been used to avoid overfitting. Various other post-pruning methods exist, including strategies that convert the tree to rules before pruning. More recent work has tried to incorporate an overfitting-prevention bias into the splitting part of the algorithm itself. One example is based on the minimum description length (MDL) principle (Rissanen 85), which states that the best hypothesis is the one that minimises the combined length of the encoding of the hypothesis and of the data given the hypothesis. This approach has been shown to produce accurate trees of small size (e.g. see Quinlan and Rivest 89 and Mehta et al. 95).
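To make the MDL idea concrete, here is a hedged numerical sketch. The 4-bits-per-node cost and the node/error counts are invented numbers, not the actual coding scheme of Quinlan and Rivest; the point is only that a smaller tree making a few more errors can still have a shorter total description than a large tree that fits the data almost perfectly.

```python
import math

# Toy MDL-style score: bits to describe the tree plus bits to describe
# which training instances it misclassifies (the "exceptions").
# The 4-bits-per-node cost and all counts below are invented for illustration.

def description_length(n_nodes, n_errors, n_instances, bits_per_node=4):
    model_cost = n_nodes * bits_per_node
    # Exceptions are identified by their positions among the instances,
    # so there are C(n_instances, n_errors) possibilities to encode.
    data_cost = math.log2(math.comb(n_instances, n_errors)) if n_errors else 0.0
    return model_cost + data_cost

big_tree = description_length(n_nodes=15, n_errors=2, n_instances=100)
small_tree = description_length(n_nodes=5, n_errors=6, n_instances=100)
print(small_tree < big_tree)  # True: the smaller tree wins despite more errors
```

Under this scoring, growing the tree only pays off while each extra node saves more bits in the data encoding than it costs in the model encoding, which is the bias the MDL-based splitting criteria exploit.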
Try it out
The next exercise allows you to prune branches of any tree you build: you can replace any non-leaf node with a leaf. See how this works in practice, and see if you can reduce the testing misclassification errors by pruning.
Mingers, J., 1989. An empirical comparison of pruning methods for decision tree induction. Machine Learning 4 (2), 227–243.
Esposito, F., Malerba, D., Semeraro, G., 1997. A comparative analysis of methods for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (5), 476–491.
Rissanen, J., 1985. The minimum description length principle. In: Kotz, S., Johnson, N. (Eds.), Encyclopedia of Statistical Sciences. Vol. 5. John Wiley and Sons, New York, pp. 523–527.
Quinlan, J., Rivest, R., 1989. Inferring decision trees using the minimum description length principle. Information and Computation.
Mehta, M., Rissanen, J., Agrawal, R., 1995. MDL-based decision tree pruning. In: Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD’95), pp. 216–221.