Decision Trees Tutorial

What is Decision Tree Learning?

An important technique in machine learning, decision trees are used extensively in data mining. They are able to produce human-readable descriptions of trends in the underlying relationships of a dataset and can be used for classfication and prediction tasks. The technique has been used successfully in many different areas, such as medical diagnosis, plant classification, and customer marketing strategies.

How to Use These Pages

The following pages in this Tutorial will present the basic ideas and techniques involved in Decision Tree Learning with references to research in the area. They should be worked through in sequential order, or use the navigation to skip to a particular page. Every now and then, there will be interactive pages. These are intended to be used as aids in understanding the various algorithms and concepts. Once you feel that you have a good understanding of the material, please use the resources section or contribute to the site with comments or additions. Thanks for looking - have fun!

Tutorial (1): A simple decision tree

An example dataset

Imagine we have the following data about a fictitious marketing strategy. Say some company sent out some promotion to various houses and recorded a few facts about each house and also whether the people responded or not:

District House Type Income Previous
Customer
Outcome
Suburban Detached High No Nothing
Suburban Detached High Yes Nothing
Rural Detached High No Responded
Urban Semi-detached High No Responded
Urban Semi-detached Low No Responded
Urban Semi-detached Low Yes Nothing
Rural Semi-detached Low Yes Responded
Suburban Terrace High No Nothing
Suburban Semi-detached Low No Responded
Urban Terrace Low No Responded
Suburban Terrace Low Yes Responded
Rural Terrace High Yes Responded
Rural Detached Low No Responded
Urban Terrace High Yes Nothing

Imagine that we had thousands and thousands of instances (records) of this stuff. Here we have only 14, but if we had a lot, then it would be reasonable to assume that there would be some patterns in it. What sort of patterns? What could we find out? Well, we might discover some underlying relationships between some of the attributes, in particular it would be good to know which factors influence whether someone responds or not. That is, which factors most strongly affect a household's response to the promotion. In the data above for example, we can see that all rural households responded. This would be useful to know, as next time we might only have so many promotional brochures and so we would like to be selective as to where we send them in order to get the most responses. The example above is pretty trivial and we could probable analyse it manually just by looking, but the general idea if we had more instances would be to build some sort of classifier which could be used to examine the underlying relationships and make future predictions about the target concept (in this case the outcome of a mailed promotion). This is where automated building of decision trees comes in - a technique that can be used to generate some rules about data and then perform generalisation and prediction tasks.

We'll stick with the example data above in this tutorial and use it to show how we could build a decision tree to analyse it. Of course we're not going to 'build' any trees ourselves-we'd get some software to do it, or some seeds and a pot- but it's important to examine the techniques involved.

The Decision Tree

How can a tree help here? Well, in order to generate a set of rules we can construct a decision tree. This is done top-down from a root node and involves partitioning the data into subsets that contain instances that have similar values. Doing this for the dataset above can result in such a tree:

District
|
|
Suburban
3/5
|
House Type
|
|
Detached
2/2
|
Nothing
|
Semi-detached
1/1
|
Responded
|
Terrace
1/2
|
Income
|
|
High
1/1
|
Nothing
|
Low
1/1
|
Responded
|
Rural
4/4
|
Responded
|
Urban
3/5
|
Previous Customer
|
|
No
3/3
|
Responded
|
Yes
2/2
|
Nothing

N.B. If you click on the green boxes such as "Responded" that are the leaf nodes of the tree you will see an option to highlight the matches in the data. Selecting this will highlight rows in the data that are covered by the rules of the tree ending in this leaf node. Try it!

Explanation

Ok, so the nodes in brown in the tree correspond to attributes. At each node the dataset is split into subsets based on the value of the attribute at the node. For instance, at the root node, we split the entire dataset into three subsets. One that contains only instances (rows, tuples, whatever) that have the value 'Suburban' for the 'District' attribute, one that that contains only instances where the District attribute is 'Urban', and one where the all the instances are 'Rural' for that attribute. The numbers on the branches are important here: They correspond to the number of instances in each subset that have one and only one value for the target attribute ('Outcome'). This basically says how well the given value of the attribute we split on relates to the target attribute. What? Look at the tree - the middle branch of the first node. '4/4' below 'Rural' indicates that all four of the instances with District=Rural have the same value for 'Outcome' (in this case 'Responded'). This is good, because we have split the data using this attribute=value pairing to perfectly classify all instances that have this pairing. In other cases the value of the District attribute does not lead to a perfect, or pure subset. These ideas are related to entropy which we shall examine later. Anyway, continuing with the above tree - look at the first branch of the first node. This tells us that when District=Suburban, only 3 of 5 instances have the same value of the target attribute. In this case, it is necessary to continue splitting up this subset using other attribute tests until we have only pure subsets. The 5 instances which have District=Suburban on the left-most branch are then tested with 'House-Type' and are split into further subsets. The tree construction continues until purity, or until all the subsets are pure (with respect to the target attribute). When this occurs the branch terminates in a green leaf node that specifies what value the target attribute takes for all instances that have been filtered down this branch.

Rules From The Tree

Ok, so we can represent the data with a tree? So what? Well we can extract rules from the tree quite easily. Just read off the paths of all the leaf nodes. This gives us (from left to right in the tree):

A disjunction of conjunctions. This is useful for summarising the data and extracting the underlying relationships.

How can this be used for classification and prediction?

Well, say for example that we wanted to predict the outcome of mailing to a certain house. We could just of course do a look-up on our dataset to see if the characteristics of this new house matched any we had mailed to before with the assumption that the new house will respond in the same way. This won't always be possible though, as our dataset here doesn't represent all the possible combinations. Instead we use the decision tree to generalise.

E.g. If we know the District we're going to mail to is Urban and the person was a previous customer, then the tree predicts that the person will not respond (Follow the attributes and values down the tree).

Practically?

Ok, so this illustrates the basic idea how we can use Decision Tree Learning. You might be thinking that this is a very small, contrived example and that really it's all fairly random what happened. Well, yes, yes and yes, but the same basic idea is used in practical situations. Imagine if we had thousands of records of data for a concept like the one above and maybe lots more attributes, perhaps some of them with numeric attributes. We wouldn't be able to analyse such data by just looking at it and so constructing a decision tree would help. Furthermore, the more data we have, then the more chance that we can get a real insight into the any underlying function or relationship between the attributes. This is because the tree generalises when used for predictions and we would be more confident about it's accuracy if it had been constructed from many examples instead of just a few. There are many other complications and details to worry about, but all that can be looked at later. Right now, we are going to look at the tree building process a bit more.

One Dataset: Many Trees

For any given dataset there are a lot of possible trees that we could construct. Instead of having the root node as 'District', it could have been 'Income' for example. Likewise, the second child node could have been 'House-Type' instead of 'Previous-Customer'. Have a go building a tree from this dataset on the next page. You will see that there are many possible trees that can be built to model this data. They are all perfectly legitimate decision trees. Try to find the shortest (least number of nodes).

Tutorial (2): Exercise 1

Constructing a Simple Tree

The dataset:

District House Type Income Previous Customer Outcome
Suburban Detached High No Nothing
Suburban Detached High Yes Nothing
Rural Detached High No Responded
Urban Semi-detached High No Responded
Urban Semi-detached Low No Responded
Urban Semi-detached Low Yes Nothing
Rural Semi-detached Low Yes Responded
Suburban Terrace High No Nothing
Suburban Semi-detached Low No Responded
Urban Terrace Low No Responded
Suburban Terrace Low Yes Responded
Rural Terrace High Yes Responded
Rural Detached Low No Responded
Urban Terrace High Yes Nothing

The Decision Tree: Interactively build it

Click here to begin

Tutorial (3): Occam's Razor

The Smallest Tree

If you did the practical exercise on the previous page, you will notice that the smallest complete tree that it is possible to build from the is this:

District
|
|
Suburban
3/5
|
Income
|
|
High
3/3
|
Nothing
|
Low
2/2
|
Responded
|
Rural
4/4
|
Responded
|
Urban
3/5
|
Previous Customer
|
|
No
3/3
|
Responded
|
Yes
2/2
|
Nothing

There were other trees that you could build, some of them much larger and more complex than this one. The interesting thing is though, is that each of the trees describes the data perfectly - no instances are left unaccounted for. Each tree has no exceptions and is a perfectly valid hypothesis over the data. In that case, the shorter hypothesis is preferred. Why?

Occam's Razor

This principle thought up a long time ago by William Occam while shaving(!) states that the shortest hypothesis, or solution to a problem, should be the one we should prefer (over longer more complicated ones). This is one of the fundamental tenet's of the way western science works and has received much debate and controversy. Let's just say that most of the stuff concerning decision trees follows the idea of Occam's razor and looks to always contstruct small, simple, easily-understandable tree structures.

Why not just construct the shortest tree each time each time then for every dataset? As simple as it was for our dataset, in the general case this isn't so simple and can take a very long time on some datasets. Formally, this has shown to be an NP-complete problem which means that no computer algorithm can solve this in a sensible amount of time (in the general case). Therefore algorithms have been developed which find a good tree, usually very close to the best. It's now necessary to start examining some of these decision tree algorithms in more detail.

Tutorial (4): ID3

Decision Trees tutorial > ID3 & Entropy

ID3

An early technique by the influential Ross Quinlan that influenced a large part of the research on Decision Trees is useful to look at in order to understand basic decision tree construction.

Splitting Criteria

A fundamental part of any algorithm that constructs a decision tree from a dataset is the method in which it selects attributes at each node of the tree. In the previous exercise, you may have found it was better to place certain attributes (such as 'District') higher up the tree in order to produce a short tree. Q. Why is this?

A. Some attributes split the data up more purely than others. That means that their values correspond more consistently with instances that have particular values of the target attribute (the one we want to predict) than those of another attribute. Therefore, we might say that such attributes have some underlying relationship with the target attribute . But how can this be quantified in some way? Essentially, we would like some sort of measure that enables us to compare attributes with each other and then be able to decide to put ones that split the data more purely higher up the tree.

Entropy

A measure used from Information Theory in the ID3 algorithm and many others used in decision tree construction is that of Entropy. Informally, the entropy of a dataset can be considered to be how disordered it is. It has been shown that entropy is related to information, in the sense that the higher the entropy, or uncertainty, of some data, then the more information is required in order to completely describe that data. In building a decision tree, we aim to decrease the entropy of the dataset until we reach leaf nodes at which point the subset that we are left with is pure, or has zero entropy and represents instances all of one class (all instances have the same value for the target attribute).

We measure the entropy of a dataset,S, with respect to one attribute, in this case the target attribute, with the following calculation:

Entropy Calculation
where Pi is the proportion of instances in the dataset that take the ith value of the target attribute

This probability measures give us an indication of how uncertain we are about the data. And we use a log2 measure as this represents how many bits we would need to use in order to specify what the class (value of the target attribute) is of a random instance.

Using the example of the marketing data, we know that there are two classes in the data and so we use the fractions that each class represents in an entropy calculation:

Entropy S = -([9/14 responses, 5/14 no responses])

= -9/14 log2 9/14 - 5/14 log2 5/14 = 0.947 bits

Ok now we want a quantitative way of seeing the effect of splitting the dataset by using a particular attribute (which is part of the tree building process). We can use a measure called Information Gain, which calculates the reduction in entropy (Gain in information) that would result on splitting the data on an attribute, A.

Gain Calulation
where v is a value of A
, |Sv| is the subset of instances of S where A takes the value v,
and |S| is the number of instances

Continuing with our example dataset, let's name it S just for convenience, let's work out the Information Gain that splitting on the attribute District would result in over the entire dataset:

Gain Example

So by calculating this value for each attribute that remains, we can see which attribute splits the data more purely. Let's imagine we want to select an attribute for the root node, then performing the above calcualtion for all attributes gives:

We can clearly see that District results in the highest reduction in entropy or the highest information gain. We would therefore choose this at the root node splitting the data up into subsets corresponding to all the different values for the District attribute.

With this node evaluation technique we can procede recursively through the subsets we create until leaf nodes have been reached throughout and all subsets are pure with zero entropy. This is exactly how ID3 and other variants work.

Decision Tree Construction Algorithm

We can now present the basic ID3 algorithm in pseudo-code:

Input: A data set, S
Output: A decision tree

  • If all the instances have the same value for the target attribute then return a decision tree that is simply this value (not really a tree - more of a stump).
  • Else
    1. Compute Gain values (see above) for all attributes and select an attribute with the highest value and create a node for that attribute.
    2. Make a branch from this node for every value of the attribute
    3. Assign all possibe values of the attribute to branches.
    4. Follow each branch by partitioning the dataset to be only instances whereby the value of the branch is present and then go back to 1.

Ok, so go and play about with this method on the next page interactively. You can see how easy it now is to construct a small decision tree with this entropy measure as a guide.

Tutorial (5): Exercise 2