Tutorial (13): Overfitting
Decision Trees Tutorial > Overfitting
What did you notice from the last exercise? The main idea was to create a holdout (testing) set
from the training set and then construct the decision tree. The tree is built from only the training data but classification accuracy measures are applied to both sets. It is important to see that sometimes with complex trees the classification errors are high on the testing set compared to more simple trees. The error on the training set, known as the re-substitution error, is always zero (as the tree is grown and so it classifies this data perfectly). This is the problem. In order for our tree to generalise well it must not be built around the training data too closely. Let's look at an example that can be obtained from the exercise on the previous page.
Say we use our dataset:
| District | House Type | Income | Previous Customer |
Outcome |
| Suburban | Detached | High | No | Nothing |
| Suburban | Detached | High | Yes | Nothing |
| Rural | Detached | High | No | Responded |
| Urban | Semi-detached | High | No | Responded |
| Urban | Semi-detached | Low | No | Responded |
| Urban | Semi-detached | Low | Yes | Nothing |
| Rural | Semi-detached | Low | Yes | Responded |
| Suburban | Terrace | Low | Yes | Responded |
| Rural | Terrace | High | Yes | Responded |
| Rural | Detached | Low | No | Responded |
| Urban | Terrace | High | Yes | Nothing |
But we set aside the following data for testing:
| District | House Type | Income | Previous Customer |
Outcome |
| Urban | Semi-detached | Low | No | Responded |
| Suburban | Detached | High | Yes | Nothing |
| Rural | Semi-detached | Low | Yes | Responded |
| Suburban | Terrace | Low | Yes | Responded |
We then built this tree:
| District | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Training set misclassifications: 0
Testing set misclassfications: 0
Whereas with this tree:
| Previous Customer | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Training set misclassifications: 0
Testing set misclassfications: 3
Basically, the second tree misclassifies 3 of the 4 testing set instances whereas the first tree is accurate on all examples. Why is this? What's going on here? Maybe it's just my peculiar example? It could just be the way I have set this up with particular attributes and so on. Well yes, to some extent this is quite contrived and everything but you should get the idea. Also try constructing a tree with not many instances of training data (compared to testing data). You should see that the error on the testing set is high; in some cases higher than the error if you were to just use a decision stump of Outcome=Responded (try this). The phenomenen is known as overfitting. This is a central theme in machine learning.
What is Overfitting and why does it occur?
As you have seen in constructing decision trees we use training data. We do this because we want to capture some general underlying functions or trends in that data, usually to be used in prediction. General trends, because we are not interested in capturing all the exact nuances and extremities of the training data. These are normally the result of errors or peculiarities that we are not likely to come across again. It is important that we can use our decision tree model to predict or generalise over future instances that we might obtain. Overfitting occurs when our decision tree characterises too much detail, or noise in our training data. More formally this can be stated as:
Two hypothese, H1 and H2 over some data exist with the following relationship:
Training set errors(H1) < Training set errors(H2) AND
Testing set errors(H1) > Testing set errors(H2)
As well as noise in the training data, this can happen when we don't have much training data and are trying to extrapolate an underlying hypothesis from it.
We want our decision tree to generalise well, but unfortunately if we build a decision tree until all the training data has been classified perfectly and all leaf nodes are reached, then chances are that it won't and that we'll have a lot of misclassifications when we try and use it. There are methods that we can use to avoid overfitting such as "pruning" which we will look at on the next page.
Recent comments
55 min 13 sec ago
56 min 4 sec ago
1 hour 26 min ago
3 hours 15 min ago
3 hours 15 min ago
3 hours 15 min ago
3 hours 15 min ago
3 hours 15 min ago
3 hours 15 min ago
3 hours 15 min ago