An important technique in machine learning, decision trees are used extensively in data mining. They are able to produce human-readable descriptions of trends in the underlying relationships of a dataset and can be used for classfication and prediction tasks. The technique has been used successfully in many different areas, such as medical diagnosis, plant classification, and customer marketing strategies.
The following pages in this Tutorial will present the basic ideas and techniques involved in Decision Tree Learning with references to research in the area. They should be worked through in sequential order, or use the navigation to skip to a particular page. Every now and then, there will be interactive pages. These are intended to be used as aids in understanding the various algorithms and concepts. Once you feel that you have a good understanding of the material, please use the resources section or contribute to the site with comments or additions. Thanks for looking - have fun!
Imagine we have the following data about a fictitious marketing strategy. Say some company sent out some promotion to various houses and recorded a few facts about each house and also whether the people responded or not:
| District | House Type | Income | Previous Customer |
Outcome |
| Suburban | Detached | High | No | Nothing |
| Suburban | Detached | High | Yes | Nothing |
| Rural | Detached | High | No | Responded |
| Urban | Semi-detached | High | No | Responded |
| Urban | Semi-detached | Low | No | Responded |
| Urban | Semi-detached | Low | Yes | Nothing |
| Rural | Semi-detached | Low | Yes | Responded |
| Suburban | Terrace | High | No | Nothing |
| Suburban | Semi-detached | Low | No | Responded |
| Urban | Terrace | Low | No | Responded |
| Suburban | Terrace | Low | Yes | Responded |
| Rural | Terrace | High | Yes | Responded |
| Rural | Detached | Low | No | Responded |
| Urban | Terrace | High | Yes | Nothing |
Imagine that we had thousands and thousands of instances (records) of this stuff. Here we have only 14, but if we had a lot, then it would be reasonable to assume that there would be some patterns in it. What sort of patterns? What could we find out? Well, we might discover some underlying relationships between some of the attributes, in particular it would be good to know which factors influence whether someone responds or not. That is, which factors most strongly affect a household's response to the promotion. In the data above for example, we can see that all rural households responded. This would be useful to know, as next time we might only have so many promotional brochures and so we would like to be selective as to where we send them in order to get the most responses. The example above is pretty trivial and we could probable analyse it manually just by looking, but the general idea if we had more instances would be to build some sort of classifier which could be used to examine the underlying relationships and make future predictions about the target concept (in this case the outcome of a mailed promotion). This is where automated building of decision trees comes in - a technique that can be used to generate some rules about data and then perform generalisation and prediction tasks.
We'll stick with the example data above in this tutorial and use it to show how we could build a decision tree to analyse it. Of course we're not going to 'build' any trees ourselves-we'd get some software to do it, or some seeds and a pot- but it's important to examine the techniques involved.
How can a tree help here? Well, in order to generate a set of rules we can construct a decision tree. This is done top-down from a root node and involves partitioning the data into subsets that contain instances that have similar values. Doing this for the dataset above can result in such a tree:
| District | |||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||
N.B. If you click on the green boxes such as "Responded" that are the leaf nodes of the tree you will see an option to highlight the matches in the data. Selecting this will highlight rows in the data that are covered by the rules of the tree ending in this leaf node. Try it!
Ok, so the nodes in brown in the tree correspond to attributes. At each node the dataset is split into subsets based on the value of the attribute at the node. For instance, at the root node, we split the entire dataset into three subsets. One that contains only instances (rows, tuples, whatever) that have the value 'Suburban' for the 'District' attribute, one that that contains only instances where the District attribute is 'Urban', and one where the all the instances are 'Rural' for that attribute. The numbers on the branches are important here: They correspond to the number of instances in each subset that have one and only one value for the target attribute ('Outcome'). This basically says how well the given value of the attribute we split on relates to the target attribute. What? Look at the tree - the middle branch of the first node. '4/4' below 'Rural' indicates that all four of the instances with District=Rural have the same value for 'Outcome' (in this case 'Responded'). This is good, because we have split the data using this attribute=value pairing to perfectly classify all instances that have this pairing. In other cases the value of the District attribute does not lead to a perfect, or pure subset. These ideas are related to entropy which we shall examine later. Anyway, continuing with the above tree - look at the first branch of the first node. This tells us that when District=Suburban, only 3 of 5 instances have the same value of the target attribute. In this case, it is necessary to continue splitting up this subset using other attribute tests until we have only pure subsets. The 5 instances which have District=Suburban on the left-most branch are then tested with 'House-Type' and are split into further subsets. The tree construction continues until purity, or until all the subsets are pure (with respect to the target attribute). When this occurs the branch terminates in a green leaf node that specifies what value the target attribute takes for all instances that have been filtered down this branch.
Ok, so we can represent the data with a tree? So what? Well we can extract rules from the tree quite easily. Just read off the paths of all the leaf nodes. This gives us (from left to right in the tree):
A disjunction of conjunctions. This is useful for summarising the data and extracting the underlying relationships.
How can this be used for classification and prediction?
Well, say for example that we wanted to predict the outcome of mailing to a certain house. We could just of course do a look-up on our dataset to see if the characteristics of this new house matched any we had mailed to before with the assumption that the new house will respond in the same way. This won't always be possible though, as our dataset here doesn't represent all the possible combinations. Instead we use the decision tree to generalise.
E.g. If we know the District we're going to mail to is Urban and the person was a previous customer, then the tree predicts that the person will not respond (Follow the attributes and values down the tree).
Ok, so this illustrates the basic idea how we can use Decision Tree Learning. You might be thinking that this is a very small, contrived example and that really it's all fairly random what happened. Well, yes, yes and yes, but the same basic idea is used in practical situations. Imagine if we had thousands of records of data for a concept like the one above and maybe lots more attributes, perhaps some of them with numeric attributes. We wouldn't be able to analyse such data by just looking at it and so constructing a decision tree would help. Furthermore, the more data we have, then the more chance that we can get a real insight into the any underlying function or relationship between the attributes. This is because the tree generalises when used for predictions and we would be more confident about it's accuracy if it had been constructed from many examples instead of just a few. There are many other complications and details to worry about, but all that can be looked at later. Right now, we are going to look at the tree building process a bit more.
For any given dataset there are a lot of possible trees that we could construct. Instead of having the root node as 'District', it could have been 'Income' for example. Likewise, the second child node could have been 'House-Type' instead of 'Previous-Customer'. Have a go building a tree from this dataset on the next page. You will see that there are many possible trees that can be built to model this data. They are all perfectly legitimate decision trees. Try to find the shortest (least number of nodes).
The dataset:
| District | House Type | Income | Previous Customer | Outcome |
| Suburban | Detached | High | No | Nothing |
| Suburban | Detached | High | Yes | Nothing |
| Rural | Detached | High | No | Responded |
| Urban | Semi-detached | High | No | Responded |
| Urban | Semi-detached | Low | No | Responded |
| Urban | Semi-detached | Low | Yes | Nothing |
| Rural | Semi-detached | Low | Yes | Responded |
| Suburban | Terrace | High | No | Nothing |
| Suburban | Semi-detached | Low | No | Responded |
| Urban | Terrace | Low | No | Responded |
| Suburban | Terrace | Low | Yes | Responded |
| Rural | Terrace | High | Yes | Responded |
| Rural | Detached | Low | No | Responded |
| Urban | Terrace | High | Yes | Nothing |
The Decision Tree: Interactively build it
| Click here to begin |
If you did the practical exercise on the previous page, you will notice that the smallest complete tree that it is possible to build from the is this:
| District | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
There were other trees that you could build, some of them much larger and more complex than this one. The interesting thing is though, is that each of the trees describes the data perfectly - no instances are left unaccounted for. Each tree has no exceptions and is a perfectly valid hypothesis over the data. In that case, the shorter hypothesis is preferred. Why?
This principle thought up a long time ago by William Occam while shaving(!) states that the shortest hypothesis, or solution to a problem, should be the one we should prefer (over longer more complicated ones). This is one of the fundamental tenet's of the way western science works and has received much debate and controversy. Let's just say that most of the stuff concerning decision trees follows the idea of Occam's razor and looks to always contstruct small, simple, easily-understandable tree structures.
Why not just construct the shortest tree each time each time then for every dataset? As simple as it was for our dataset, in the general case this isn't so simple and can take a very long time on some datasets. Formally, this has shown to be an NP-complete problem which means that no computer algorithm can solve this in a sensible amount of time (in the general case). Therefore algorithms have been developed which find a good tree, usually very close to the best. It's now necessary to start examining some of these decision tree algorithms in more detail.
An early technique by the influential Ross Quinlan that influenced a large part of the research on Decision Trees is useful to look at in order to understand basic decision tree construction.
A fundamental part of any algorithm that constructs a decision tree from a dataset is the method in which it selects attributes at each node of the tree. In the previous exercise, you may have found it was better to place certain attributes (such as 'District') higher up the tree in order to produce a short tree. Q. Why is this?
A. Some attributes split the data up more purely than others. That means that their values correspond more consistently with instances that have particular values of the target attribute (the one we want to predict) than those of another attribute. Therefore, we might say that such attributes have some underlying relationship with the target attribute . But how can this be quantified in some way? Essentially, we would like some sort of measure that enables us to compare attributes with each other and then be able to decide to put ones that split the data more purely higher up the tree.
A measure used from Information Theory in the ID3 algorithm and many others used in decision tree construction is that of Entropy. Informally, the entropy of a dataset can be considered to be how disordered it is. It has been shown that entropy is related to information, in the sense that the higher the entropy, or uncertainty, of some data, then the more information is required in order to completely describe that data. In building a decision tree, we aim to decrease the entropy of the dataset until we reach leaf nodes at which point the subset that we are left with is pure, or has zero entropy and represents instances all of one class (all instances have the same value for the target attribute).
We measure the entropy of a dataset,S, with respect to one attribute, in this case the target attribute, with the following calculation:

where Pi is the proportion of instances in the dataset that take the ith value of the target attribute, which has C different values
This probability measures give us an indication of how uncertain we are about the data. And we use a log2 measure as this represents how many bits we would need to use in order to specify what the class (value of the target attribute) is of a random instance.
Using the example of the marketing data, we know that there are two classes in the data and so we use the fractions that each class represents in an entropy calculation:
Entropy S = -([9/14 responses, 5/14 no responses])
= -9/14 log2 9/14 - 5/14 log2 5/14 = 0.94 bits
Ok now we want a quantitative way of seeing the effect of splitting the dataset by using a particular attribute (which is part of the tree building process). We can use a measure called Information Gain, which calculates the reduction in entropy (Gain in information) that would result on splitting the data on an attribute, A.

where v is a value of A
, |Sv| is the subset of instances of S where A takes the value v,
and |S| is the number of instances
Continuing with our example dataset, let's name it S just for convenience, let's work out the Information Gain that splitting on the attribute District would result in over the entire dataset:

So by calculating this value for each attribute that remains, we can see which attribute splits the data more purely. Let's imagine we want to select an attribute for the root node, then performing the above calcualtion for all attributes gives:
We can clearly see that District results in the highest reduction in entropy or the highest information gain. We would therefore choose this at the root node splitting the data up into subsets corresponding to all the different values for the District attribute.
With this node evaluation technique we can procede recursively through the subsets we create until leaf nodes have been reached throughout and all subsets are pure with zero entropy. This is exactly how ID3 and other variants work.
We can now present the basic ID3 algorithm in pseudo-code:
Input: A data set, S
Output: A decision tree
Ok, so go and play about with this method on the next page interactively. You can see how easy it now is to construct a small decision tree with this entropy measure as a guide.
The dataset:
| District | House Type | Income | Previous Customer | Outcome |
| Suburban | Detached | High | No | Nothing |
| Suburban | Detached | High | Yes | Nothing |
| Rural | Detached | High | No | Responded |
| Urban | Semi-detached | High | No | Responded |
| Urban | Semi-detached | Low | No | Responded |
| Urban | Semi-detached | Low | Yes | Nothing |
| Rural | Semi-detached | Low | Yes | Responded |
| Suburban | Terrace | High | No | Nothing |
| Suburban | Semi-detached | Low | No | Responded |
| Urban | Terrace | Low | No | Responded |
| Suburban | Terrace | Low | Yes | Responded |
| Rural | Terrace | High | Yes | Responded |
| Rural | Detached | Low | No | Responded |
| Urban | Terrace | High | Yes | Nothing |
The Decision Tree: Interactively build it
| root node |
In the previous exercise and in the page before that we have seen how a simple entropy measure can be used to recursively build a decision tree. This technique is flawed though because it favours attributes that have many possible values over those that have few. Why is this a problem? Why does this matter?
Take an example where we have a new attribute Date, that records the date when we sent out the promotion. Our modified dataset might look something like this:
| Date | District | House Type | Income | Previous Customer |
Outcome |
| 3/10/03 | Suburban | Detached | High | No | Nothing |
| 14/9/03 | Suburban | Detached | High | Responded | Nothing |
| 2/4/02 | Rural | Detached | High | No | Responded |
| 18/1/03 | Urban | Semi-detached | High | No | Responded |
| 3/4/03 | Urban | Semi-detached | Low | No | Responded |
| 15/10/02 | Urban | Semi-detached | Low | Responded | Nothing |
| 15/10/02 | Rural | Semi-detached | Low | Responded | Responded |
| 2/3/01 | Suburban | Terrace | High | No | Nothing |
| 4/5/03 | Suburban | Semi-detached | Low | No | Responded |
| 2/1/03 | Urban | Terrace | Low | No | Responded |
| 3/10/03 | Suburban | Terrace | Low | Responded | Responded |
| 3/10/03 | Rural | Terrace | High | Responded | Responded |
| 8/4/03 | Rural | Detached | Low | No | Responded |
| 6/5/02 | Urban | Terrace | High | Responded | Nothing |
This attribute will be given a high Information Gain value in our tree construction algorithm. And so it will seem like a good idea to use this at the root node. A tree built using the Information Gain measure on the above dataset will look like this:
| Date | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
What a monster!
Now this tree might classify the training data which we used to build it, but as well as being complex, it won't be much good for predicting new instances. The date attribute will be the first thing the tree filters for when we want a new prediction for some new instance. It will try to base it's predications on this just because the training data was split nicely with a lot of instances being classified just from their Date. It should be obvious that this attribute is not a good predictor in reality for the Outcome. It has only been selected because of the way the entropy calculations work. Why prefer a tree like the above, when we could build the following tree from the same data:
| District | |||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||
This is not even the shortest tree! Clearly, in some cases it would be better to have a different method of splitting the data in our algorithm. Have a go on the next page at building some trees with the modified dataset. See how the Information Gain measure always favours the Date attribute.
The dataset:
| Date | District | House Type | Income | Previous Customer |
Outcome |
| 3/10/03 | Suburban | Detached | High | No | Nothing |
| 14/9/03 | Suburban | Detached | High | Responded | Nothing |
| 2/4/02 | Rural | Detached | High | No | Responded |
| 18/1/03 | Urban | Semi-detached | High | No | Responded |
| 3/4/03 | Urban | Semi-detached | Low | No | Responded |
| 15/10/02 | Urban | Semi-detached | Low | Responded | Nothing |
| 15/10/02 | Rural | Semi-detached | Low | Responded | Responded |
| 2/3/01 | Suburban | Terrace | High | No | Nothing |
| 4/5/03 | Suburban | Semi-detached | Low | No | Responded |
| 2/1/03 | Urban | Terrace | Low | No | Responded |
| 3/10/03 | Suburban | Terrace | Low | Responded | Responded |
| 3/10/03 | Rural | Terrace | High | Responded | Responded |
| 8/4/03 | Rural | Detached | Low | No | Responded |
| 6/5/02 | Urban | Terrace | High | Responded | Nothing |
The Decision Tree: Interactively build it
| root node |
The observation that Information Gain is unfairly biased has led to other splitting criteria being proposed.
An obvious way to negate the bias or "greediness" of Information Gain is to take into account the number of values of an attribute. This is exactly the approach that can be used. A new, improved calculation for attribute A over data S is:


The second equation for IV (A) measures the information content for the attribute A by looking at each proportion, p, of instances that take value i for the attribute.
Try the practical exercise on the next page to see how gain ratio differs from information gain.
A method used by the influential CART system that was developed in the 1980s (see Breiman, L., Freidman, J., Olshen, R., and Stone, C., 1984. Classification and regression trees. Wadsworth International, CA).
A lot of the modern algorithms use some information-based measures adapted from Information Gain. However other methods involve the use of statistical approaches which base their tree induction algorithms on the distribution of the data. In particular, measures adopted from the classic statistical technique of chi-squared are used. This measures the amount of correlation between two variables. We use correlation tables to calculate this. As an example, here is the correlation table for the attribute Income from the Marketing data:
| Class | ||||||
|
Income |
Responded | Nothing | Totals | |||
| High | 3 | 4 | 7 | |||
| Low | 6 | 1 | 7 | |||
| Totals | 9 | 5 | 14 | |||
This table shows the class distributions for each instance according to their Income value. The totals are in the last column and row. We calculate the chi-squared distribution from the following formula:

Calculating this for our Income attribute gives:
For the other attributes the (normalised) chi-squared values are:
The higher the value, the more correlation there is between the target attribute and the attribute used in the calculation.
Although the Date attribute is still favoured, it is not quite as highly favoured (relatively) compared with the evaulation technique of information gain (see the last exercise).
Several other variations have been proposed such as permutation tests (Frank & Witten 98) and multivariate splits.
References
Mingers, J. 1989. An Empirical Comparison of Selection Measures for Decision-Tree Induction. Machine Learning 3: 319-342.
Frank, E., Witten, I., 1998. Using a permutation test for attribute selection in decision trees.
The dataset:
| Date | District | House Type | Income | Previous Customer |
Outcome |
| 3/10/03 | Suburban | Detached | High | No | Nothing |
| 14/9/03 | Suburban | Detached | High | Responded | Nothing |
| 2/4/02 | Rural | Detached | High | No | Responded |
| 18/1/03 | Urban | Semi-detached | High | No | Responded |
| 3/4/03 | Urban | Semi-detached | Low | No | Responded |
| 15/10/02 | Urban | Semi-detached | Low | Responded | Nothing |
| 15/10/02 | Rural | Semi-detached | Low | Responded | Responded |
| 2/3/01 | Suburban | Terrace | High | No | Nothing |
| 4/5/03 | Suburban | Semi-detached | Low | No | Responded |
| 2/1/03 | Urban | Terrace | Low | No | Responded |
| 3/10/03 | Suburban | Terrace | Low | Responded | Responded |
| 3/10/03 | Rural | Terrace | High | Responded | Responded |
| 8/4/03 | Rural | Detached | Low | No | Responded |
| 6/5/02 | Urban | Terrace | High | Responded | Nothing |
The Decision Tree: Interactively build it
| root node |
So far we have considered simple approaches on basic datasets. The examples have been quite contrived and basic. We need to look at more complex situations that are dealt with in modern decision tree algorithms such as C4.5 and CART.
In real-life datasets it is common to find that there are missing values here and there. This may be because someone has forgot and not recorded the data properly, or maybe even because there simply was no value for a particular attribute. There are several different approaches that modern decision tree algorithms take to deal with this:
Our example dataset only including categorical attributes. More realistically, we might have attributes which have numeric values from a continuous domain. For example, we could include the income of the househld, not simply as either High or Low, but as a numeric value. Splitting on this attribute in the tree-building process then involves using inequalities to partition the data, such that each branch follows some expression such as if value of attribute, A, is less than or equal to X, where X is some interval threshold that we have set (usually based on the range of values for the numeric attribute). A lot of other methods also exist for dealing with numeric attributes (see Fayyad & Irani 92), but mainly algorithms that can deal with them look to discretise them in some way. This means that they can then be treated just as categorical attributes.
So far we have detailed decision tree learning with respect to predicting categories. In the case of predicting continuous numeric quantities, ie. where the target attribute is from a continuous-valued domain, leaf nodes are generated which correspond to the averages of instances from the training set that correspond to the path down to the leaf node. When all the attributes are numeric then the procedure is similar to that of classical regression in which a linear weighted combination of values is used. Of course a number of other machine learning tools such as Artificial Neural Networks are adept at dealing with numeric datasets.
The method looked at so far involves batch learning. This means that we obtain all our data and then build a decision tree that acts as a classifier. If we get some more data after this, then we have to build the tree all over again with our old dataset together with our new data. Utgoff(88,94) has developed an ID4 algorithm which could modify a decision tree with new data dynamically, without having to reconstruct the tree from scratch (also see Kalles & Morris 95). In a situation whereby it takes a long time to construct a tree (e.g. if we had a massive dataset intially) then building the tree from scratch again when just a few more data instances arrive, might not be a good idea. Incremental deicison tree induction is a solution in such situations.
Future misclassifications that the tree might make, may have varying consequences. One type of misclassification might be more serious than another. It all depends on the problem and the data and the things we are trying to predict. For example, in medical diagnosis some misclassifications may be more serious in some way than others. Some approaches to decision tree induction allow cost measures to be attached to certain outcomes.
References
Fayyad, U. M., Irani, K. B., 1992. On the handling of continuous-valued attributes
in decision tree generation.
Machine Learning 8, 87–102.
Utgoff, P., 1988. Id5: An incremental id3.
In: Proceedings of the 5th National Conference on Machine Learning. Ann Arbor,MI, pp. 107–120.
Utgoff, P., 1994. An improved algorithm for incremental induction of decision trees.
In: Cohen, W., Hirsh, H. (Eds.), Proceedings of the 11th International Conference on Machine Learning. Morgan Kaufmann, New Brunswick, NJ, pp. 318–325.
Kalles, D., Morris, T., 1995. Efficient incremental induction of decision trees.
Machine Learning, 1–13.
We have seen so far that there are differing approaches in constructing decision trees which lead to quite different trees being produced. There are several criteria that we can use in order to evaluate one decision tree algorithm for a given dataset over another algorithm on the same dataset. We might be interested in how concise the decision tree is that the algorithm produces- as we may want to be able to easily understand the rules from the tree. We might also look at how fast an algorithm is. Generally though, we will be interested in an algorithm's learning ability or how accurate it's predictions are. In this case, we need to look at some measures that evaluate a given tree's performance in classifiying previously unseen instances from a given problem. We cannot base a measure of the tree's accuracy on how well it performs on the training set, as this has been used in constructing the tree! There are several techniques that we can use.
In some cases where the dataset is relatively small, it is necessary to set some of the training data aside in order to use it as test data. This allows us to see the accuracy of a tree on hitherto unseen samples. In this situation we must be careful to ensure that the data used for training is representative of the underlying function or distribution. For instance, we may have a situation whereby one class is totally underrepresented in the training data and so error rates will be high when we come to use the test data on the tree built from this training data. In order to avoid this, random sampling can be used to ensure a more balanced representation of all possible classes seen in the dataset. This technique is known as stratified holdout and can be used to lessen the effect of any bias in the data of one particular class. This method can be explored in the next exercise. Here you can specify a holdout testing set and then see misclassfication rates on this set.
In order to confidently lessen the effects of algorithmic bias, a way of performing repeated training and testing is possible. This often used procedure involves splitting the training data into equally sized mutually exclusive subsets or folds. Each one of the subsets is then used in turn as a testing set after all the other sets combined have been the training set on which a tree has been built. This cross validation procedure allows mean error rates to be calculated which gives a useful insight into tree accuracy.
This technique is simply k-fold cross validation whereby k is the number of data instances. This has the advantage of allowing the largest amount of training data to be used in each run and conversely means that the testing procedure is deterministic. With large datasets this is computationally infeasible however and in certain situations the deterministic nature of the testing results in weird errors.
The central idea behind this method of validation is one underpinned by the statistical technique of sampling with replacement. Here a testing set is chosen from instances of the training set that are replaced once chosen. Error rates for training and testing sets are then combined to give a truer representation of the actual error. This method has been proposed in the form of the .632 bootstrap which involves taking 63.2% of the training data (which could include duplicates that have already been chosen) and using it in testing.
The next exercise will use the technique of holdout sets to check classification errors. See what type of trees will result in the fewest classification errors.
The dataset
| District | House Type | Income | Previous Customer |
Outcome |
| Suburban | Detached | High | No | Nothing |
| Suburban | Detached | High | Responded | Nothing |
| Rural | Detached | High | No | Responded |
| Urban | Semi-detached | High | No | Responded |
| Urban | Semi-detached | Low | No | Responded |
| Urban | Semi-detached | Low | Responded | Nothing |
| Rural | Semi-detached | Low | Responded | Responded |
| Suburban | Terrace | High | No | Nothing |
| Suburban | Semi-detached | Low | No | Responded |
| Urban | Terrace | Low | No | Responded |
| Suburban | Terrace | Low | Responded | Responded |
| Rural | Terrace | High | Responded | Responded |
| Rural | Detached | Low | No | Responded |
| Urban | Terrace | High | Responded | Nothing |
The Holdout (validation) data
Select rows by clicking to move data instances (rows) between tables
| District | House Type | Income | Previous Customer |
Outcome |
The Decision Tree: Interactively build it
| root node |
Classification Errors (Totals)
|
Correct |
Incorrect |
|
|
Training set |
||
|
Validation set |
What did you notice from the last exercise? The main idea was to create a holdout (testing) set
from the training set and then construct the decision tree. The tree is built from only the training data but classification accuracy measures are applied to both sets. It is important to see that sometimes with complex trees the classification errors are high on the testing set compared to more simple trees. The error on the training set, known as the re-substitution error, is always zero (as the tree is grown and so it classifies this data perfectly). This is the problem. In order for our tree to generalise well it must not be built around the training data too closely. Let's look at an example that can be obtained from the exercise on the previous page.
Say we use our dataset:
| District | House Type | Income | Previous Customer |
Outcome |
| Suburban | Detached | High | No | Nothing |
| Suburban | Detached | High | Yes | Nothing |
| Rural | Detached | High | No | Responded |
| Urban | Semi-detached | High | No | Responded |
| Urban | Semi-detached | Low | No | Responded |
| Urban | Semi-detached | Low | Yes | Nothing |
| Rural | Semi-detached | Low | Yes | Responded |
| Suburban | Terrace | Low | Yes | Responded |
| Rural | Terrace | High | Yes | Responded |
| Rural | Detached | Low | No | Responded |
| Urban | Terrace | High | Yes | Nothing |
But we set aside the following data for testing:
| District | House Type | Income | Previous Customer |
Outcome |
| Urban | Semi-detached | Low | No | Responded |
| Suburban | Detached | High | Yes | Nothing |
| Rural | Semi-detached | Low | Yes | Responded |
| Suburban | Terrace | Low | Yes | Responded |
We then built this tree:
| District | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Training set misclassifications: 0
Testing set misclassfications: 0
Whereas with this tree:
| Previous Customer | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Training set misclassifications: 0
Testing set misclassfications: 3
Basically, the second tree misclassifies 3 of the 4 testing set instances whereas the first tree is accurate on all examples. Why is this? What's going on here? Maybe it's just my peculiar example? It could just be the way I have set this up with particular attributes and so on. Well yes, to some extent this is quite contrived and everything but you should get the idea. Also try constructing a tree with not many instances of training data (compared to testing data). You should see that the error on the testing set is high; in some cases higher than the error if you were to just use a decision stump of Outcome=Responded (try this). The phenomenen is known as overfitting. This is a central theme in machine learning.
As you have seen in constructing decision trees we use training data. We do this because we want to capture some general underlying functions or trends in that data, usually to be used in prediction. General trends, because we are not interested in capturing all the exact nuances and extremities of the training data. These are normally the result of errors or peculiarities that we are not likely to come across again. It is important that we can use our decision tree model to predict or generalise over future instances that we might obtain. Overfitting occurs when our decision tree characterises too much detail, or noise in our training data. More formally this can be stated as:
Two hypothese, H1 and H2 over some data exist with the following relationship:
Training set errors(H1) < Training set errors(H2) AND
Testing set errors(H1) > Testing set errors(H2)
As well as noise in the training data, this can happen when we don't have much training data and are trying to extrapolate an underlying hypothesis from it.
We want our decision tree to generalise well, but unfortunately if we build a decision tree until all the training data has been classified perfectly and all leaf nodes are reached, then chances are that it won't and that we'll have a lot of misclassifications when we try and use it. There are methods that we can use to avoid overfitting such as "pruning" which we will look at on the next page.
The approach to constructing decision trees usually involves using greedy heuristics (such as Entropy reduction) that overfit the training data and lead to poor accuracy in future predictions.
In response to the problem of overfitting nearly all modern decision tree algorithms adopt a pruning strategy of some sort. Many algorithms use a technique known as postpruning or backward pruning. This essentially involves growing the tree from a dataset until all possible leaf nodes have been reached (i.e. purity) and then removing particular substrees. Studies have shown that post-pruning will result in smaller and more accurate trees by up to 25%. Different pruning techniques have been developed which have been compared in several papers and like with the different splitting criteria it has been found that there is not much variation in terms of performance (e.g. see Mingers89 and Esposito et. al. 97). There are quite a few methods that have been developed. We'll look at one of the basic ones here.
An example: Reduced Error Pruning (Quinlan 86)
At each node in a tree it is possible to see the number of instances that are misclassified on a testing set by propagating errors upwards from leaf nodes. This can be compared to the error-rate if the node was replaced by the most common class resulting from that node. If the difference is a reduction in error, then the subtree at the node can be considered for pruning. This calculation is performed for all nodes in the tree and whichever one has the highest reduced-error rate is pruned. The procedure is then recursed over the freshly pruned tree until there is no possible reduction in error rate at any node.
An example
| ... | 1/2 | |
|||||||||||||||||||||
| Income | |||||||||||||||||||||
|
|||||||||||||||||||||
The above node labelled 'Income' can be seen to produce branches that lead to a total of 3 misclassifications on the testing data (the second red number on leaf nodes). Replacing this node just with a leaf node labelled 'Responded' would result in:
| ... | 1/2 | |
| Responded 1:2 1:1 |
Although now we have a re-substitution error on the training set, the errors on the testing set are reduced. This is desirable in terms of accuracy and tree size.
Pruning, or post-pruning as we have described here, is not the only method that has been used to avoid overfitting. Various other post-pruning methods exist, including strategies that convert the tree to rules before pruning. Recent work has involved trying to incorporate some overfitting-prevention bias into the splitting part of the algorithm. One example of this is based on the minimum-description length principle (Risanen 85) that states that the best hypothesis is the one that minimises length of encoding of the hypothesis and data. This has been shown to produce accurate trees with small size (e.g. see Quinlan 89 and Mehta et. al. 95).
The next exercise allows you to prune branches of any trees that you can build. You can replace any non-leaf node with a leaf. See how this works in practice. See if you can you reduce the testing misclassification errors by pruning.
References
Mingers, J., 1989. An empirical comparison of pruning methods for decisiontree induction.
Machine Learning 4 (2), 227–243.
Esposito, F., Malerba, D., Semeraro, G., 1997. A comparative analysis of methods
for pruning decision trees.
IEEE Transactions on Pattern Machine Intelligence 19 (5), 476–491.
URL: http://citeseer.nj.nec.com/esposito97comparative.html
Rissanen, J., 1985. The minimum description length principle. In: Kotz, S.,
Johnson, N. (Eds.), Encyclopedia of Statistical Sciences. Vol. 5. John Wiley
and sons, New York, pp. 523–527.
Quinlan, J., Rivest, R., 1989. Inferring decision trees using minimum description length principle.
Information and Computation.
Mehta, M., Rissanen, J., Agrawal, R., 1995. MDL-based decision tree pruning.
In: Proceedings of the First International Conference on Knowledge Discovery
and Data Mining (KDD’95). pp. 216–221.
URL: http://citeseer.nj.nec.com/mehta95mdlbased.html
The dataset:
| District | House Type | Income | Previous Customer |
Outcome |
| Suburban | Detached | High | No | Nothing |
| Suburban | Detached | High | Yes | Nothing |
| Rural | Detached | High | No | Responded |
| Urban | Semi-detached | High | No | Responded |
| Urban | Semi-detached | Low | No | Responded |
| Urban | Semi-detached | Low | Yes | Nothing |
| Rural | Semi-detached | Low | Yes | Responded |
| Suburban | Terrace | High | No | Nothing |
| Suburban | Semi-detached | Low | No | Responded |
| Urban | Terrace | Low | No | Responded |
| Suburban | Terrace | Low | Yes | Responded |
| Rural | Terrace | High | Yes | Responded |
| Rural | Detached | Low | No | Responded |
| Urban | Terrace | High | Yes | Nothing |
The Holdout (validation) data
Select rows by clicking to move data instances (rows) between tables
| District | House Type | Income | Previous Customer |
Outcome |
The Decision Tree: Interactively build it
| root node |
Classification Errors (Totals)
|
Correct |
Incorrect |
|
|
Training set |
||
|
Validation set |
The inherent instability of top-down decision tree induction means that different training datasets from a given problem domain will produce quite different trees. One of the main directions in decision tree learning in the late 1990s was to combine multiple trees, or committees of trees in order to obtain better results than those that can be obtained from a single model. This is an approach not just restricted to decision tree learning but one that has been applied to most other Machine Learning methods.
One technique, known as Bagging, or Boostrap-Aggregating, involves producing different trees from a different sample of the problem domain (or from the training set itself) and then aggregating the results on a set of test instances. This helps to reduce the instability and variance in the decison tree learning process by combining mutlitple models. Similarly to Boostrapping, Bagging usually involves sampling (with replacement) the training data in order to produce test data which can be used to evaluate each model. Several of the other methods described previously, such as cross-validation for example, can also be used in bagging however.
Boosting is another relatively new technique that involves generating a sequence of decision trees from the same dataset whereby attention in is paid to the misclassifications of the previous tree. This results in a fixed number of decision trees which can then be used to classify new instances more accurately by aggregating all the different tree's predictions and choosing the target class that occurs most often. Freund et al (1996) have developed the widely used AdaBoost.M1 which weights instances depending whether or not they were misclassified by previous trees. This allows attention to be directed to the instances that have caused errors in previous iterations. Although trees that use boosting are more accurate on new instances of data, they have been shown to overfit in some situations. Friedman99 has developed a way of reducing this aspect of its performance. Most high-performance modern decision tree learners use some form of boosting to generate more accurate trees. Studies have shown that Bagging and boosting can improve performance significantly (Quinlan96, Bauer99, Dietterich99 etc).
Several approaches have been examined which attempt to hybridise decision tree-learners with some other type of learner or technique. For instance, Carvalho (2000) has shown how a hybrid genetic algorithm decision tree learner can be used to find rules that are small disjuncts more effectively than a decision tree alone. This involves using Quinlan's C4.5 learner to produce an initial tree which is then decomposed into rules which act as chromosomes in a genetic algorithm. Other researchers have also looked at this type of hybridisation. Work has also been done on combing fuzzy classifiers and neural networks with decision trees (e.g. Sethi90, Leow99, Ping02). This usually involves building a tree using one of the greedy induction algorithms, and then refining it to be as close to optimal as possible using another technique. This method may be beneficial in certain problem domains where a weighted consideration of rules is necessary, such as with medical diagnosis. Neural networks trained with back-propagation can represent a larger number of concepts than a decision tree. A decision tree though, with it's greedy heuristics has the advantage of being able to quickly map out the general form of the concept space. A lot of the work in this area has of course been concerned with how to map the representations of decision trees to the necessary neural network representations. More recently, hybrid systems combining several different techniques have been developed such as the Neuro-fuzzy ICART algorithm (Li03).
Decision Tree Induction is just one method in the field of machine learning. Comparisons with other techniques such as Neural Networks have been done (Atlas90, Brown93, Talmon94), but it has been shown that the accuracy of the techniques is similar. Why then use decision trees? Two main reasons can be explained for it's popularity:
References
Freund, Y., Schapire, R. E., 1996. Experiments with a new boosting algorithm.
In: Proceedings of International Conference on Machine Learning, 148–156.
URL: http://citeseer.nj.nec.com/freund96experiments.html
Friedman, J., 1999. Greedy function approximation: A gradient boosting machine.
Technical Report, Stanford University, Department of Statistics, Palo Alto, CA.
Quinlan, J., 1996. Bagging, boosting, and C4.5.
In: Proceedings of the Thirteenth National Conference on Artificial Intelliegnce, 725–730.
Bauer, E., Kohavi, R., 1999. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants.
Machine Learning 36 (1/2), 105–139
Dietterich, T. G., 1999. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization.
Machine Learning 40 (2), 1–22.
URL http://citeseer.nj.nec.com/dietterich98experimental.html
Sethi, I., 1990. Entropy nets: from decision trees to neural networks.
In: Proceedings of the IEEE. Vol. 78, 1605–1613.
Leow, W., Setiono, R., 1999. On mapping decision trees and neural networks.
Knowledge Based Systems 12, 95–99.
Ping, Z., Lihui, C., January 2002. A novel feature extraction method and hybrid tree classification for handwritten numeral recognition.
Pattern Recognition Letters 23, 45.
Li, J., Li, E. and Yu, J. 2003. Designing neurofuzzy system based on ICART algorithm and its application for modeling jet fuel endpoint of hydrocracking process.
Engineering Applications of Artificial Intelligence 16, 11–19
Atlas, L., Cole, R., Muthuswarmy, Y., Lipman, A., Connor, J., Park, D., El-Sharkawi, M., Marks II, R., 1990. A performance comparison of trained multilayer perceptrons and trained classification trees.
In: Proceedings of the IEEE 78 (10), 1614–1619.
Brown.D.E., Corruble, V., Pittard, C., 1993. A comparison of decision tree classifiers with backpropagation neural networks for multimodal classification problems.
Pattern Recognition 26 (6), 953–961.
Talmon, J., Dassen, R., Karthaus, V., 1994. Neural nets and classification trees: A comparison in the domain of ecg analysis.
In: Gelsema, E., Kanal, L. (Eds.), Pattern Recognition in Practice IV: Multiple paradigms, Comparative studies and hybrid systems. Vol. 16 of Machine Intelligence and Pattern Recognition, 415–423.
Other Recommended Reading
Dietterich, T. G., 1998. Machine-learning research: Four current directions.
The AI Magazine 18 (4), 97–136.
URL: http://citeseer.nj.nec.com/dietterich97machine.html
Fleischer, R., June 1999. Decision trees: old and new results.
Information and Computation 152, 44.
Hopefully you will have grasped the concepts and techniques without being too bored :) Thanks!