Decision Trees tutorial > ID3 & Entropy
ID3, an early algorithm by the influential Ross Quinlan that shaped much of the subsequent research on decision trees, is a useful starting point for understanding basic decision tree construction.
A fundamental part of any algorithm that constructs a decision tree from a dataset is the method by which it selects attributes at each node of the tree. In the previous exercise, you may have found that placing certain attributes (such as ‘District’) higher up the tree produced a shorter tree. Q. Why is this?
A. Some attributes split the data up more purely than others: their values correspond more consistently with instances that have particular values of the target attribute (the one we want to predict). We might therefore say that such attributes have some underlying relationship with the target attribute. But how can this be quantified? Essentially, we would like some measure that lets us compare attributes with each other, so that we can decide to put the ones that split the data more purely higher up the tree.
A measure from Information Theory used in the ID3 algorithm, and in many other decision tree construction algorithms, is that of Entropy. Informally, the entropy of a dataset can be thought of as how disordered it is. Entropy is related to information in the sense that the higher the entropy, or uncertainty, of some data, the more information is required to describe that data completely. In building a decision tree, we aim to decrease the entropy of the dataset until we reach leaf nodes, at which point the subset we are left with is pure, having zero entropy: all of its instances belong to one class (all have the same value for the target attribute).
We measure the entropy of a dataset, S, with respect to one attribute, in this case the target attribute, with the following calculation:

Entropy(S) = - Σ_{i=1}^{C} p_i log2 p_i

where p_i is the proportion of instances in the dataset that take the i-th value of the target attribute, which has C different values.
This probability measure gives us an indication of how uncertain we are about the data. We use a log2 measure because it represents the number of bits we would need in order to specify the class (the value of the target attribute) of a random instance.
Using the example of the marketing data, we know that there are two classes in the data and so we use the fractions that each class represents in an entropy calculation:
Entropy(S) = Entropy([9/14 responses, 5/14 no responses])
= -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94 bits
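As a quick check, this entropy calculation is easy to reproduce in a few lines of Python (a sketch, not part of the original tutorial):

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    # Skip zero counts: the limit of p * log2(p) as p -> 0 is 0.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# The marketing example: 9 responses and 5 no-responses out of 14 instances.
print(round(entropy([9, 5]), 2))  # 0.94
```

Note that a perfectly even split such as `entropy([7, 7])` gives 1 bit, the maximum for two classes, while a pure subset such as `entropy([14])` gives 0.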
OK, now we want a quantitative way of seeing the effect of splitting the dataset on a particular attribute (which is part of the tree-building process). We can use a measure called Information Gain, which calculates the reduction in entropy (gain in information) that would result from splitting the data on an attribute, A:
Gain(S, A) = Entropy(S) - Σ_{v in Values(A)} (|S_v| / |S|) Entropy(S_v)

where v is a value of A, S_v is the subset of instances of S in which A takes the value v, |S_v| is the number of instances in that subset, and |S| is the total number of instances.
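This definition translates directly into code. The sketch below assumes a simple representation of the dataset as a list of dictionaries (one per instance) plus a parallel list of target-attribute values; the tutorial itself does not prescribe any particular format:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(instances, labels, attribute):
    """Gain(S, A): the reduction in entropy from splitting on `attribute`.

    `instances` is a list of dicts mapping attribute names to values;
    `labels` holds the corresponding target-attribute values.
    """
    total = len(instances)
    remainder = 0.0
    for v in {inst[attribute] for inst in instances}:
        # S_v: labels of the instances where the attribute takes value v.
        subset = [lab for inst, lab in zip(instances, labels) if inst[attribute] == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder
```

An attribute whose values perfectly separate the classes gains the full entropy of the dataset; an attribute that leaves every subset with the same class mix as the whole gains nothing.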
Continuing with our example dataset, which we will call S for convenience, let’s work out the Information Gain that splitting on the attribute District would give over the entire dataset:
So by calculating this value for each remaining attribute, we can see which attribute splits the data most purely. Suppose we want to select an attribute for the root node; performing the above calculation for all attributes gives:
- Gain(S, House Type) = 0.049 bits
- Gain(S, Income) = 0.151 bits
- Gain(S, Previous Customer) = 0.048 bits
We can clearly see that District gives the largest reduction in entropy, i.e. the highest information gain. We would therefore choose it for the root node, splitting the data into subsets corresponding to the different values of the District attribute.
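Selecting the root is then just an argmax over the computed gains. The District figure below is an illustrative placeholder, since the worked example's exact value is not reproduced here; the text tells us only that it is the highest:

```python
gains = {
    "District": 0.25,  # placeholder: the text states only that this is highest
    "House Type": 0.049,
    "Income": 0.151,
    "Previous Customer": 0.048,
}

# The attribute with the highest information gain becomes the root node.
root = max(gains, key=gains.get)
print(root)  # District
```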
With this node evaluation technique we can proceed recursively through the subsets we create, until leaf nodes have been reached throughout and all subsets are pure, with zero entropy. This is exactly how ID3 and its variants work.
Decision Tree Construction Algorithm
We can now present the basic ID3 algorithm in pseudo-code:
Input: A data set, S
Output: A decision tree
1. If all the instances have the same value for the target attribute, return a decision tree that is simply this value (not really a tree, more of a stump).
2. Otherwise, compute the Gain values (see above) for all remaining attributes, select the attribute with the highest value, and create a node for that attribute.
3. Make a branch from this node for every possible value of the attribute.
4. Follow each branch by partitioning the dataset to only those instances where the attribute takes the branch's value, and then go back to step 1.
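The pseudo-code above can be sketched as a short recursive Python function. The dataset representation (a list of dicts plus a parallel list of labels) and the tuple-based tree encoding are assumptions for illustration, not the tutorial's own code; a majority-class fallback is also added for the case where attributes run out before a subset is pure:

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    total = len(rows)
    remainder = 0.0
    for v in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    """Return a tree: either a class label (leaf) or a pair
    (attribute, {value: subtree}), following the pseudo-code above."""
    # Step 1: all instances share one class -> return that value (a stump).
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left: fall back to the majority class (an added detail).
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    # Steps 3-4: one branch per value of `best`; partition and recurse.
    branches = {}
    for v in {row[best] for row in rows}:
        sub_rows = [row for row in rows if row[best] == v]
        sub_labels = [lab for row, lab in zip(rows, labels) if row[best] == v]
        branches[v] = id3(sub_rows, sub_labels, [a for a in attributes if a != best])
    return (best, branches)
```

On a toy dataset where one attribute perfectly predicts the class, the function returns a single split on that attribute with pure leaves beneath it.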
OK, so go and play about with this method interactively on the next page. You will see how easy it is to construct a small decision tree with this entropy measure as a guide.