Tutorial (4): ID3
Decision Trees tutorial > ID3 & Entropy
ID3
An early technique by the influential Ross Quinlan that influenced a large part of the research on Decision Trees is useful to look at in order to understand basic decision tree construction.
Splitting Criteria
A fundamental part of any algorithm that constructs a decision tree from a dataset is the method in which it selects attributes at each node of the tree. In the previous exercise, you may have found it was better to place certain attributes (such as 'District') higher up the tree in order to produce a short tree. Q. Why is this?
A. Some attributes split the data up more purely than others. That means that their values correspond more consistently with instances that have particular values of the target attribute (the one we want to predict) than those of another attribute. Therefore, we might say that such attributes have some underlying relationship with the target attribute . But how can this be quantified in some way? Essentially, we would like some sort of measure that enables us to compare attributes with each other and then be able to decide to put ones that split the data more purely higher up the tree.
Entropy
A measure used from Information Theory in the ID3 algorithm and many others used in decision tree construction is that of Entropy. Informally, the entropy of a dataset can be considered to be how disordered it is. It has been shown that entropy is related to information, in the sense that the higher the entropy, or uncertainty, of some data, then the more information is required in order to completely describe that data. In building a decision tree, we aim to decrease the entropy of the dataset until we reach leaf nodes at which point the subset that we are left with is pure, or has zero entropy and represents instances all of one class (all instances have the same value for the target attribute).
We measure the entropy of a dataset,S, with respect to one attribute, in this case the target attribute, with the following calculation:

where Pi is the proportion of instances in the dataset that take the ith value of the target attribute, which has C different values
This probability measures give us an indication of how uncertain we are about the data. And we use a log2 measure as this represents how many bits we would need to use in order to specify what the class (value of the target attribute) is of a random instance.
Using the example of the marketing data, we know that there are two classes in the data and so we use the fractions that each class represents in an entropy calculation:
Entropy S = -([9/14 responses, 5/14 no responses])
= -9/14 log2 9/14 - 5/14 log2 5/14 = 0.94 bits
Ok now we want a quantitative way of seeing the effect of splitting the dataset by using a particular attribute (which is part of the tree building process). We can use a measure called Information Gain, which calculates the reduction in entropy (Gain in information) that would result on splitting the data on an attribute, A.

where v is a value of A
, |Sv| is the subset of instances of S where A takes the value v,
and |S| is the number of instances
Continuing with our example dataset, let's name it S just for convenience, let's work out the Information Gain that splitting on the attribute District would result in over the entire dataset:

So by calculating this value for each attribute that remains, we can see which attribute splits the data more purely. Let's imagine we want to select an attribute for the root node, then performing the above calcualtion for all attributes gives:
- Gain(S,House Type) = 0.049 bits
- Gain(S,Income) =0.151 bits
- Gain(S,Previous Customer) = 0.048 bits
We can clearly see that District results in the highest reduction in entropy or the highest information gain. We would therefore choose this at the root node splitting the data up into subsets corresponding to all the different values for the District attribute.
With this node evaluation technique we can procede recursively through the subsets we create until leaf nodes have been reached throughout and all subsets are pure with zero entropy. This is exactly how ID3 and other variants work.
Decision Tree Construction Algorithm
We can now present the basic ID3 algorithm in pseudo-code:
Input: A data set, S
Output: A decision tree
- If all the instances have the same value for the target attribute then return a decision tree that is simply this value (not really a tree - more of a stump).
- Else
- Compute Gain values (see above) for all attributes and select an attribute with the highest value and create a node for that attribute.
- Make a branch from this node for every value of the attribute
- Assign all possibe values of the attribute to branches.
- Follow each branch by partitioning the dataset to be only instances whereby the value of the branch is present and then go back to 1.
Ok, so go and play about with this method on the next page interactively. You can see how easy it now is to construct a small decision tree with this entropy measure as a guide.
It's not clear why there are
It's not clear why there are minus signs in the following line:
= -9/14 log2 9/14 - 5/14 log2 5/14 = 0.947 bits
because, minus of
because, minus of summation of pi log base 2 to pi where i=1 to c
Since the minus is before summation that will be -( Summation result) or -p1 log base 2 to p1 -p2 log base 2 to p2( where i=1,2)
Hope that makes sense.
Look here
Look here http://en.wikipedia.org/wiki/Information_entropyfor understanding the minus sign (in the definition part)
> It's not clear why there
> It's not clear why there are minus signs in the following line:
> = -9/14 log2 9/14 - 5/14 log2 5/14 = 0.947 bits
Because the log of numbers less than one is negative. The - makes the result positive. Technically, the definition of entropy includes the minus:
Sum ( - p log p)
Does anyone have an
Does anyone have an algorythm for this? Or even better a working PHP code?
Errors in equations
Thanks for pointing out the errors in the equations. These are now fixed and I believe them to be correct. Please let me know of anymore bugs that you find in the site.
does anyone know how you
does anyone know how you calculate:
-9/14 log2 9/14 - 5/14 log2 5/14 = 0.947 bits
in excel or on a calculator please.
many thanks
addition of values via Excel or Calculator
Neither Excel nor your Calculator likely have a binary logarithmic base. Thus you need to use the "change of base" formula for logaritms:
the log base A of value B is equal to the log base C of value B divided by log base C of value A.
In other words:
log2 3 = log10 3 / log10 2
where 10 was chosen because they can be done on the calculator.
Thus log2 9/14 = log(9/14)/log(2)
and you should be able to get a value fro the whole equation.
Although -- I've only ever gotten 0.9403 so I'm not certain how the author got 0.947
Log base 2 in Excel
In Excel, to get the log base 2 of 16, enter the following formula (as an example):
=LOG(16,2)
CALCULATOR USAGE FOR ENTROPY
-(9/14)*log(9/14)/log(2)-(5/14)*log(5/14)/log(2)=
or use casio fx-991ES calculator we can use direct any base values if we need.
I think there is a
I think there is a mistake:
Entropy S = -([9/14 responses, 5/14 no responses])
= -9/14 log2 9/14 - 5/14 log2 5/14 = 0.947 bits
It shoudn't be 0.947 bits but 0.940 bits...
And i just know how to calculate it in open office..just type
=-9/14 log(9/14;2) - 5/14 log(5/14;2) = 0.940
please, i need help
hi..can anyone explain to me how to solve is
-9/14 Log2 9/14 + (-5/14 Log2 5/14) =
i dont know how to solve it.
Log -> Log2
Most computer languages don't have a log2 function, but you can convert all logs doing the following:
log( x, base ) = log(x) / log(base)
It works regardless of what the base of the log function you're using. Whether it's natural log or log10.
asnwe
calculate like this way
-(9/14)*log(2)/log(9/14)-(5/14)*log(2)/log(5/14)=
or use CASIO CALCULATOR
fx-991ES
sorry this is correct
--(9/14)*log(9/14)/log(2)-(5/14)*log(5/14)/log(2)=
OR
USE CASIO CALCULATOR
fx-991ES
answer
log2 9/14 means log(9/14)/log(2)
Entropy(S) = 0.940
I agree with Matze. My calculations come up with
9/14 * log( 9/14, 2) - 5/14 * log( 5/14, 2 ) = 0.940
Not 0.947.
Entropy(S)=0.940
Speedcrunch on Linux:
-(9/14)*log((9/14))/log(2)-(5/14)*log((5/14))/log(2)
0.94028596
This probability measures
This probability measures give us an indication of how uncertain we are about the data. And we use a log2 measure as this represents how many bits we would need to use in order to specify what the class (value of the target attribute) is of a random instance.
Does this mean that if there are more outcomes, for example Nothing, Responded and Sale, then we have to use log3?
I haven't really found a good clear answer to this via Google
Thanks for an informative article.
re: This probability measures
Basically if there are more values for the target attribute, then "C" reflects this in the entropy equation ( 1st equation on this page ). I have changed it to show this. The number of bits to encode increases yes, but it's still log to base 2.
Works fine
This is the nice article to share thank for the post.
Great work
learn vedic math
Nice
nice post