An example dataset
Imagine we have the following data about a fictitious marketing campaign. Say a company sent out a promotion to various houses and recorded a few facts about each house, along with whether or not the residents responded:
Imagine that we had thousands and thousands of instances (records) like this. Here we have only 14, but if we had a lot, then it would be reasonable to assume that there would be some patterns in the data. What sort of patterns? What could we find out? Well, we might discover some underlying relationships between the attributes; in particular, it would be good to know which factors most strongly influence whether a household responds to the promotion. In the data above, for example, we can see that all rural households responded. This would be useful to know, as next time we might have only a limited number of promotional brochures, so we would like to be selective about where we send them in order to get the most responses.

The example above is fairly trivial and we could probably analyse it manually just by looking, but with more instances the general idea would be to build some sort of classifier which could be used to examine the underlying relationships and make future predictions about the target concept (in this case, the outcome of a mailed promotion). This is where automated building of decision trees comes in: a technique that can be used to generate rules about data and then perform generalisation and prediction tasks.
We’ll stick with the example data above in this tutorial and use it to show how we could build a decision tree to analyse it. Of course we’re not going to ‘build’ any trees ourselves (we’d get some software to do it, or some seeds and a pot), but it’s important to examine the techniques involved.
The Decision Tree
How can a tree help here? Well, in order to generate a set of rules we can construct a decision tree. This is done top-down from a root node and involves partitioning the data into subsets that contain instances with similar values. Doing this for the dataset above can result in the following tree:
Ok, so the nodes in brown in the tree correspond to attributes. At each node the dataset is split into subsets based on the value of the attribute at that node. For instance, at the root node we split the entire dataset into three subsets: one that contains only the instances (rows, tuples, whatever) that have the value ‘Suburban’ for the ‘District’ attribute, one that contains only instances where the District attribute is ‘Urban’, and one where all the instances are ‘Rural’ for that attribute.

The numbers on the branches are important here: they give the number of instances in each subset that share one and only one value for the target attribute (‘Outcome’). This basically says how well the given value of the attribute we split on relates to the target attribute. What? Look at the tree, specifically the middle branch of the first node. ‘4/4’ below ‘Rural’ indicates that all four of the instances with District=Rural have the same value for ‘Outcome’ (in this case ‘Responded’). This is good, because splitting the data on this attribute=value pairing perfectly classifies all instances that have this pairing. In other cases the value of the District attribute does not lead to a perfect, or pure, subset. These ideas are related to entropy, which we shall examine later.

Anyway, continuing with the above tree, look at the first branch of the first node. This tells us that when District=Suburban, only 3 of the 5 instances have the same value of the target attribute. In this case, it is necessary to continue splitting up this subset using other attribute tests until only pure subsets remain. The 5 instances which have District=Suburban on the left-most branch are therefore tested with ‘House-Type’ and split into further subsets. Tree construction continues until all the subsets are pure (with respect to the target attribute).
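To make the branch numbers concrete, here is a minimal sketch of how the ‘n/m’ figures could be computed. The District/Outcome pattern (Rural 4/4, Suburban 3/5) is taken from the text; the individual rows below are hypothetical stand-ins, since the full table isn’t reproduced here.

```python
from collections import Counter, defaultdict

# Hypothetical rows consistent with the counts described in the article:
# Rural is 4/4 'Responded', Suburban is 3/5, Urban is 3/5 (assumed split).
rows = (
    [{"District": "Suburban", "Outcome": "Responded"}] * 3
    + [{"District": "Suburban", "Outcome": "Nothing"}] * 2
    + [{"District": "Rural", "Outcome": "Responded"}] * 4
    + [{"District": "Urban", "Outcome": "Nothing"}] * 3
    + [{"District": "Urban", "Outcome": "Responded"}] * 2
)

def branch_counts(rows, attribute, target="Outcome"):
    """For each value of `attribute`, return (majority count, subset size):
    the 'n/m' figures written on the tree's branches."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    return {value: (Counter(outcomes).most_common(1)[0][1], len(outcomes))
            for value, outcomes in subsets.items()}

print(branch_counts(rows, "District"))
# A (majority, size) pair like (4, 4) marks a pure subset: every instance agrees.
```

A value where the two numbers are equal needs no further splitting; the others must be partitioned again on another attribute.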
When this occurs the branch terminates in a green leaf node that specifies what value the target attribute takes for all instances that have been filtered down this branch.
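The whole split-until-pure procedure can be sketched recursively. This is only an illustration, not the exact algorithm the tutorial builds up to: the attribute choice here simply minimises the number of impure subsets, a crude stand-in for the entropy-based measure examined later, and the four-row dataset is hypothetical.

```python
from collections import Counter

def build_tree(rows, attributes, target="Outcome"):
    """Recursively partition rows until every subset is pure in the target."""
    outcomes = [r[target] for r in rows]
    if len(set(outcomes)) == 1:      # pure subset: terminate in a leaf
        return outcomes[0]
    if not attributes:               # no tests left: fall back to majority
        return Counter(outcomes).most_common(1)[0][0]

    def impure_subsets(attr):
        # Count how many value-subsets of `attr` still mix target values.
        values = {}
        for r in rows:
            values.setdefault(r[attr], set()).add(r[target])
        return sum(1 for targets in values.values() if len(targets) > 1)

    best = min(attributes, key=impure_subsets)
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, [a for a in attributes if a != best])
    return {best: branches}

# Tiny hypothetical dataset, just to exercise the recursion.
rows = [
    {"District": "Rural", "Previous-Customer": "No",  "Outcome": "Responded"},
    {"District": "Rural", "Previous-Customer": "Yes", "Outcome": "Responded"},
    {"District": "Urban", "Previous-Customer": "Yes", "Outcome": "Nothing"},
    {"District": "Urban", "Previous-Customer": "No",  "Outcome": "Responded"},
]
tree = build_tree(rows, ["District", "Previous-Customer"])
print(tree)
```

Leaves are plain strings (the target value); internal nodes are single-key dicts mapping an attribute to its branches.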
Rules From The Tree
Ok, so we can represent the data with a tree? So what? Well, we can extract rules from the tree quite easily: just read off the paths to all the leaf nodes. This gives us (from left to right in the tree):
A disjunction of conjunctions. This is useful for summarising the data and extracting the underlying relationships.
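Reading off the paths can itself be automated. The sketch below hand-codes a tree as nested dicts; only the Rural leaf and the Urban/Previous-Customer branch come from the text, while the remaining leaf values are assumptions for illustration.

```python
# Tree structure: internal nodes are {attribute: {value: child}}, leaves are strings.
tree = {
    "District": {
        "Rural": "Responded",                     # from the article: 4/4 pure
        "Urban": {"Previous-Customer": {
            "Yes": "Nothing",                     # from the article's example
            "No": "Responded"}},                  # assumed leaf value
        "Suburban": {"House-Type": {
            "Detached": "Nothing",                # assumed leaf value
            "Semi-detached": "Responded"}},       # assumed leaf value
    }
}

def extract_rules(node, conditions=()):
    """Read off one rule per leaf: a conjunction of attribute=value tests."""
    if isinstance(node, str):                     # leaf: emit the finished rule
        lhs = " AND ".join(f"{a}={v}" for a, v in conditions)
        return [f"IF {lhs} THEN Outcome={node}"]
    (attribute, branches), = node.items()
    rules = []
    for value, child in branches.items():
        rules.extend(extract_rules(child, conditions + ((attribute, value),)))
    return rules

for rule in extract_rules(tree):
    print(rule)
```

Each printed line is one conjunction; together they form the disjunction that summarises the tree.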
How can this be used for classification and prediction?
Well, say for example that we wanted to predict the outcome of mailing to a certain house. We could, of course, just do a look-up on our dataset to see whether the characteristics of this new house match any we had mailed to before, on the assumption that the new house will respond in the same way. This won’t always be possible, though, as our dataset here doesn’t represent all the possible combinations. Instead we use the decision tree to generalise.
E.g. If we know the District we’re going to mail to is Urban and the person was a previous customer, then the tree predicts that the person will not respond (Follow the attributes and values down the tree).
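Following the attributes down the tree is a simple walk from the root. In this sketch the Urban/Previous-Customer=Yes → Nothing prediction is the one given in the text; the other branches and leaf values are assumptions.

```python
tree = {
    "District": {
        "Rural": "Responded",                     # from the article: 4/4 pure
        "Urban": {"Previous-Customer": {
            "Yes": "Nothing",                     # the article's worked example
            "No": "Responded"}},                  # assumed leaf value
    }
}

def classify(tree, instance):
    """Walk from the root, taking the branch matching the instance's
    attribute value, until a leaf (a plain string) is reached."""
    node = tree
    while not isinstance(node, str):
        (attribute, branches), = node.items()
        node = branches[instance[attribute]]
    return node

print(classify(tree, {"District": "Urban", "Previous-Customer": "Yes"}))
# prints "Nothing"
```

Note that the instance need not match any row we have seen before; the tree generalises from the attribute tests alone.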
Ok, so this illustrates the basic idea of how we can use decision tree learning. You might be thinking that this is a very small, contrived example and that really it’s all fairly random what happened. Well, yes, yes and yes, but the same basic idea is used in practical situations. Imagine if we had thousands of records of data for a concept like the one above, and maybe lots more attributes, perhaps some of them numeric. We wouldn’t be able to analyse such data just by looking at it, and so constructing a decision tree would help. Furthermore, the more data we have, the better the chance that we can get a real insight into any underlying function or relationship between the attributes. This is because the tree generalises when used for predictions, and we would be more confident about its accuracy if it had been constructed from many examples instead of just a few. There are many other complications and details to worry about, but all that can be looked at later. Right now, we are going to look at the tree building process a bit more.
One Dataset: Many Trees
For any given dataset there are many possible trees that we could construct. Instead of having ‘District’ as the root node, it could have been ‘Income’, for example. Likewise, the second child node could have been ‘House-Type’ instead of ‘Previous-Customer’. Have a go at building a tree from this dataset on the next page. You will see that there are many possible trees that can be built to model this data, and they are all perfectly legitimate decision trees. Try to find the shortest one (the one with the fewest nodes).
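Comparing candidate trees by size is easy to mechanise: count one for every node, internal or leaf. The tree below is a hypothetical example using the same nested-dict representation as before.

```python
def count_nodes(node):
    """Count every node in the tree: leaves (strings) and internal nodes."""
    if isinstance(node, str):
        return 1
    (attribute, branches), = node.items()
    return 1 + sum(count_nodes(child) for child in branches.values())

# Hypothetical candidate tree: 1 root + 1 leaf + 1 internal node + 2 leaves.
tree = {
    "District": {
        "Rural": "Responded",
        "Urban": {"Previous-Customer": {"Yes": "Nothing", "No": "Responded"}},
    }
}
print(count_nodes(tree))
# prints 5
```

The shortest legitimate tree is simply the candidate for which this count is smallest.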