Decision Trees Tutorial 13: Overfitting


What did you notice from the last exercise? The main idea was to create a holdout (testing) set from the training set and then construct the decision tree. The tree is built from the training data only, but classification accuracy is measured on both sets. Notice that a complex tree can have a high classification error on the testing set compared with a simpler tree. The error on the training set, known as the re-substitution error, is zero here because the tree is grown until it classifies the training data perfectly. This is the problem: for our tree to generalise well, it must not be fitted too closely to the training data. Let's look at an example that can be obtained from the exercise on the previous page.
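The holdout idea can be sketched in a few lines of Python (a minimal illustration, not the applet used in the exercise; the stand-in records, split fraction and seed are all made up):

```python
# Minimal holdout split: shuffle the records, hold a fraction out for
# testing, and train only on the rest.
import random

def holdout_split(rows, test_fraction=0.3, seed=42):
    rows = rows[:]                          # don't mutate the caller's list
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * test_fraction)
    return rows[cut:], rows[:cut]           # (training set, testing set)

data = list(range(10))                      # stand-in records
train, test = holdout_split(data)
assert len(train) == 7 and len(test) == 3
assert sorted(train + test) == data         # nothing lost, nothing duplicated
```

The tree would then be built from `train` alone, with misclassification counts reported for both `train` and `test`.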

Say we use our dataset:

District   House Type      Income   Previous Customer   Outcome
--------   -------------   ------   -----------------   ---------
Suburban   Detached        High     No                  Nothing
Suburban   Detached        High     Yes                 Nothing
Rural      Detached        High     No                  Responded
Urban      Semi-detached   High     No                  Responded
Urban      Semi-detached   Low      No                  Responded
Urban      Semi-detached   Low      Yes                 Nothing
Rural      Semi-detached   Low      Yes                 Responded
Suburban   Terrace         Low      Yes                 Responded
Rural      Terrace         High     Yes                 Responded
Rural      Detached        Low      No                  Responded
Urban      Terrace         High     Yes                 Nothing

But we set aside the following data for testing:

District   House Type      Income   Previous Customer   Outcome
--------   -------------   ------   -----------------   ---------
Urban      Semi-detached   Low      No                  Responded
Suburban   Detached        High     Yes                 Nothing
Rural      Semi-detached   Low      Yes                 Responded
Suburban   Terrace         Low      Yes                 Responded

We then built this tree:

District
|-- Suburban (2/3)
|   `-- Income
|       |-- High (2/2) -> Nothing  [2:1, 0:0]
|       `-- Low (1/1) -> Responded  [1:1, 0:0]
|-- Rural (3/3) -> Responded  [3:1, 0:0]
`-- Urban (2/4)
    `-- Previous Customer
        |-- No (2/2) -> Responded  [2:1, 0:0]
        `-- Yes (2/2) -> Nothing  [2:0, 0:0]
Training set misclassifications: 0
Testing set misclassifications: 0

Whereas with this tree:

Previous Customer
|-- No (5/7)
|   `-- House Type
|       |-- Detached (2/3)
|       |   `-- District
|       |       |-- Suburban (1/1) -> Nothing  [1:0, 0:0]
|       |       |-- Rural (2/2) -> Responded  [2:0, 0:0]
|       |       `-- Urban (0/0) -> null  [0:0, 0:0]
|       |-- Semi-detached (2/2) -> Responded  [2:1, 0:0]
|       `-- Terrace (1/2)
|           `-- Income
|               |-- High (1/1) -> Nothing  [1:0, 0:0]
|               `-- Low (1/1) -> Responded  [1:0, 0:0]
`-- Yes (2/3)
    `-- House Type
        |-- Detached (0/0) -> null  [0:0, 0:1]
        |-- Semi-detached (1/1) -> Nothing  [1:0, 0:1]
        `-- Terrace (1/2)
            `-- District
                |-- Suburban (0/0) -> null  [0:0, 0:1]
                |-- Rural (1/1) -> Responded  [1:0, 0:0]
                `-- Urban (1/1) -> Nothing  [1:0, 0:0]
Training set misclassifications: 0
Testing set misclassifications: 3
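These counts can be double-checked mechanically. Below is a sketch (the nested-dict encoding is my own choice for illustration; the empty "null" leaves become None) that pushes the four testing rows through each tree and tallies the misclassifications:

```python
# Both trees encoded as nested dicts: an internal node maps one
# attribute name to its branches; a string is a leaf prediction and
# None stands for an empty "null" leaf.
ATTRS = ["District", "House Type", "Income", "Previous Customer"]

tree1 = {"District": {
    "Suburban": {"Income": {"High": "Nothing", "Low": "Responded"}},
    "Rural": "Responded",
    "Urban": {"Previous Customer": {"No": "Responded", "Yes": "Nothing"}},
}}

tree2 = {"Previous Customer": {
    "No": {"House Type": {
        "Detached": {"District": {"Suburban": "Nothing",
                                  "Rural": "Responded",
                                  "Urban": None}},
        "Semi-detached": "Responded",
        "Terrace": {"Income": {"High": "Nothing", "Low": "Responded"}},
    }},
    "Yes": {"House Type": {
        "Detached": None,
        "Semi-detached": "Nothing",
        "Terrace": {"District": {"Suburban": None,
                                 "Rural": "Responded",
                                 "Urban": "Nothing"}},
    }},
}}

TEST = [  # rows of the testing table: attributes, then Outcome
    ("Urban", "Semi-detached", "Low", "No", "Responded"),
    ("Suburban", "Detached", "High", "Yes", "Nothing"),
    ("Rural", "Semi-detached", "Low", "Yes", "Responded"),
    ("Suburban", "Terrace", "Low", "Yes", "Responded"),
]

def classify(node, row):
    # Walk down until we reach a leaf (a label string, or None).
    while isinstance(node, dict):
        (attr, branches), = node.items()
        node = branches[row[ATTRS.index(attr)]]
    return node

errors1 = sum(classify(tree1, row) != row[-1] for row in TEST)
errors2 = sum(classify(tree2, row) != row[-1] for row in TEST)
print(errors1, errors2)  # 0 3, matching the counts reported above
```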

Basically, the second tree misclassifies 3 of the 4 testing set instances, whereas the first tree is accurate on all of them. Why is this? Is it just a quirk of my peculiar example? To some extent the setup is contrived, with these particular attributes and so on, but it illustrates the general point. Also try constructing a tree with only a few training instances (compared to the testing instances). You should see that the error on the testing set is high; in some cases higher than if you were to just use a decision stump that always predicts Outcome=Responded (try this). This phenomenon is known as overfitting, and it is a central theme in machine learning.
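That baseline is easy to check by hand: a stump that always answers Responded gets only one of the four testing rows wrong (the outcomes below are copied from the testing table):

```python
# Outcomes of the four held-out rows, in table order.
test_outcomes = ["Responded", "Nothing", "Responded", "Responded"]

# A stump that always predicts "Responded" misclassifies only the
# single "Nothing" row.
baseline_errors = sum(outcome != "Responded" for outcome in test_outcomes)
print(baseline_errors)  # 1 of 4, versus 3 of 4 for the second tree
```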

What is Overfitting and why does it occur?

As you have seen, we construct decision trees from training data. We do this because we want to capture some general underlying function or trend in that data, usually to be used in prediction. General trends, because we are not interested in capturing every exact nuance and extremity of the training data; these are normally the result of errors or peculiarities that we are unlikely to come across again. It is important that we can use our decision tree model to predict, or generalise over, future instances that we might obtain. Overfitting occurs when our decision tree characterises too much detail, or noise, in the training data. More formally, this can be stated as follows:

Two hypotheses, H1 and H2, exist over some data with the following relationship:

Training set errors(H1) < Training set errors(H2) AND
Testing set errors(H1) > Testing set errors(H2)

In this case H1 is said to overfit the training data.
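The relationship reads directly as a predicate; a small sketch (the error counts passed in below are made-up numbers, not taken from this page):

```python
def overfits(train_err_h1, test_err_h1, train_err_h2, test_err_h2):
    """True when H1 fits the training data more closely than H2 yet
    does worse on the testing set -- the situation described above."""
    return train_err_h1 < train_err_h2 and test_err_h1 > test_err_h2

print(overfits(0, 3, 1, 1))  # True: H1 has memorised the training set
print(overfits(2, 1, 3, 2))  # False: H1 is better on both sets
```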

As well as noise in the training data, this can happen when we don’t have much training data and are trying to extrapolate an underlying hypothesis from it.

We want our decision tree to generalise well. Unfortunately, if we build it until all the training data has been classified perfectly and every leaf node is pure, then chances are it won't, and we will see many misclassifications when we try to use it on new instances. There are methods we can use to avoid overfitting, such as "pruning", which we will look at on the next page.
