# Decision Trees tutorial > ID3 & Entropy

## ID3

An early technique by the influential Ross Quinlan that influenced a large part of the research on Decision Trees is useful to look at in order to understand basic decision tree construction.

### Splitting Criteria

A fundamental part of any algorithm that constructs a decision tree from a dataset is the method in which it selects attributes at each node of the tree. In the previous exercise, you may have found it was better to place certain attributes (such as ‘District’) higher up the tree in order to produce a short tree. Q. Why is this?

A. Some attributes split the data up more purely than others. That means that their values correspond more consistently with instances that have particular values of the target attribute (the one we want to predict) than those of another attribute. Therefore, we might say that such attributes have some underlying relationship with the target attribute . But how can this be quantified in some way? Essentially, we would like some sort of measure that enables us to compare attributes with each other and then be able to decide to put ones that split the data more purely higher up the tree.

### Entropy

A measure used from Information Theory in the ID3 algorithm and many others used in decision tree construction is that of Entropy. Informally, the entropy of a dataset can be considered to be how disordered it is. It has been shown that entropy is related to information, in the sense that the higher the entropy, or uncertainty, of some data, then the more information is required in order to completely describe that data. In building a decision tree, we aim to decrease the entropy of the dataset until we reach leaf nodes at which point the subset that we are left with is pure, or has zero entropy and represents instances all of one class (all instances have the same value for the target attribute).

We measure the entropy of a dataset,S, with respect to one attribute, in this case the target attribute, with the following calculation:

where Pi is the proportion of instances in the dataset that take the ith value of the target attribute, which has C different values

This probability measures give us an indication of how uncertain we are about the data. And we use a log2 measure as this represents how many bits we would need to use in order to specify what the class (value of the target attribute) is of a random instance.

Using the example of the marketing data, we know that there are two classes in the data and so we use the fractions that each class represents in an entropy calculation:

Entropy S = -([9/14 responses, 5/14 no responses])

= -9/14 log2 9/14 – 5/14 log2 5/14 = 0.94 bits

Ok now we want a quantitative way of seeing the effect of splitting the dataset by using a particular attribute (which is part of the tree building process). We can use a measure called Information Gain, which calculates the reduction in entropy (*Gain in information*) that would result on splitting the data on an attribute, A.

where v is a value of A

, |Sv| is the subset of instances of S where A takes the value v,

and |S| is the number of instances

Continuing with our example dataset, let’s name it S just for convenience, let’s work out the Information Gain that splitting on the attribute District would result in over the entire dataset:

So by calculating this value for each attribute that remains, we can see which attribute splits the data more purely. Let’s imagine we want to select an attribute for the root node, then performing the above calcualtion for all attributes gives:

- Gain(S,House Type) = 0.049 bits
- Gain(S,Income) =0.151 bits
- Gain(S,Previous Customer) = 0.048 bits

We can clearly see that District results in the highest reduction in entropy or the highest information gain. We would therefore choose this at the root node splitting the data up into subsets corresponding to all the different values for the District attribute.

With this node evaluation technique we can procede recursively through the subsets we create until leaf nodes have been reached throughout and all subsets are pure with zero entropy. This is exactly how ID3 and other variants work.

### Decision Tree Construction Algorithm

We can now present the basic ID3 algorithm in pseudo-code:

**Input: ** A data set, S

**Output: ** A decision tree

- If all the instances have the same value for the target attribute then return a decision tree that is simply this value (not really a tree – more of a stump).
- Else
- Compute Gain values (see above) for all attributes and select an attribute with the highest value and create a node for that attribute.
- Make a branch from this node for every value of the attribute
- Assign all possibe values of the attribute to branches.
- Follow each branch by partitioning the dataset to be only instances whereby the value of the branch is present and then go back to 1.

Ok, so go and play about with this method on the next page interactively. You can see how easy it now is to construct a small decision tree with this entropy measure as a guide.

I still don’t get it =(

It's not clear why there are minus signs in the following line:

= -9/14 log2 9/14 – 5/14 log2 5/14 = 0.947 bits

because, minus ofÂ summation of pi log base 2 to pi where i=1 to c

Since the minus is before summation that will be -( Summation result) or -p1 log base 2 to p1 -p2 log base 2 to p2( where i=1,2)

Hope that makes sense.

Look here http://en.wikipedia.org/wiki/Information_entropy

for understanding the minus sign (in the definition part)

Thanks for pointing out the errors in the equations. These are now fixed and I believe them to be correct. Please let me know of anymore bugs that you find in the site.

Does anyone have an algorythm for this? Or even better a working PHP code?

does anyone know how you calculate:

-9/14 log2 9/14 – 5/14 log2 5/14 = 0.947 bits

in excel or on a calculator please.

many thanks

I think there is a mistake:

Entropy S = -([9/14 responses, 5/14 no responses])

= -9/14 log2 9/14 – 5/14 log2 5/14 = 0.947 bits

It shoudn’t be 0.947 bits but 0.940 bits…

And i just know how to calculate it in open office..just type

=-9/14 log(9/14;2) – 5/14 log(5/14;2) = 0.940

Neither Excel nor your Calculator likely have a binary logarithmic base. Thus you need to use the “change of base” formula for logaritms:

the log base A of value B is equal to the log base C of value B divided by log base C of value A.

In other words:

log2 3 = log10 3 / log10 2

where 10 was chosen because they can be done on the calculator.

Thus log2 9/14 = log(9/14)/log(2)

and you should be able to get a value fro the whole equation.

Although — I’ve only ever gotten 0.9403 so I’m not certain how the author got 0.947

I agree with Matze. My calculations come up with

9/14 * log( 9/14, 2) – 5/14 * log( 5/14, 2 ) = 0.940

Not 0.947.

Most computer languages don’t have a log2 function, but you can convert all logs doing the following:

log( x, base ) = log(x) / log(base)

It works regardless of what the base of the log function you’re using. Whether it’s natural log or log10.

In Excel, to get the log base 2 of 16, enter the following formula (as an example):

=LOG(16,2)

hi..can anyone explain to me how to solve is

-9/14 Log2 9/14 + (-5/14 Log2 5/14) =

i dont know how to solve it.

Speedcrunch on Linux:

-(9/14)*log((9/14))/log(2)-(5/14)*log((5/14))/log(2)

0.94028596

calculate like this way

-(9/14)*log(2)/log(9/14)-(5/14)*log(2)/log(5/14)=

or use CASIO CALCULATOR

fx-991ES

–(9/14)*log(9/14)/log(2)-(5/14)*log(5/14)/log(2)=

OR

USE CASIO CALCULATOR

fx-991ES

-(9/14)*log(9/14)/log(2)-(5/14)*log(5/14)/log(2)=

or use casio fx-991ES calculator we can use direct any base values if we need.

log2 9/14 means log(9/14)/log(2)

This probability measures give us an indication of how uncertain we are about the data. And we use a log2 measure as this represents how many bits we would need to use in order to specify what the class (value of the target attribute) is of a random instance.

Does this mean that if there are more outcomes, for example

Nothing,RespondedandSale, then we have to use log3?I haven’t really found a good clear answer to this via Google

Thanks for an informative article.

Basically if there are more values for the target attribute, then “C” reflects this in the entropy equation ( 1st equation on this page ). I have changed it to show this. The number of bits to encode increases yes, but it’s still log to base 2.

i dont understand how to calculate the

Entropy of District=Suburban/Urban/Rural:’(can someone please explain a step by step ? please please please ((

Does this mean the total entropy can be > 1 in the case of C = 3?

I.e.:

-3 * (1/3 * log2(1/3)) = 1.58

No, you never get greater than 1 for entropy. In your example, you would not multiply by -3.

The algorithm is based on Occam’s razor: it prefers smaller decision trees (simpler theories) over larger ones. However, it does not always produce the smallest tree, and is therefore a heuristic. Joe Zanotti

It’s funny cuz I can’t understand the minus sign (in the definition part).

Pretty complex stuff , starting to get the hang of it though. Henrik Stigell