Tutorial (1): A simple decision tree

An example dataset

Imagine we have the following data about a fictitious marketing strategy. Say some company sent out some promotion to various houses and recorded a few facts about each house and also whether the people responded or not:

District   House Type     Income  Previous Customer  Outcome
Suburban   Detached       High    No                 Nothing
Suburban   Detached       High    Yes                Nothing
Rural      Detached       High    No                 Responded
Urban      Semi-detached  High    No                 Responded
Urban      Semi-detached  Low     No                 Responded
Urban      Semi-detached  Low     Yes                Nothing
Rural      Semi-detached  Low     Yes                Responded
Suburban   Terrace        High    No                 Nothing
Suburban   Semi-detached  Low     No                 Responded
Urban      Terrace        Low     No                 Responded
Suburban   Terrace        Low     Yes                Responded
Rural      Terrace        High    Yes                Responded
Rural      Detached       Low     No                 Responded
Urban      Terrace        High    Yes                Nothing

Imagine that we had thousands and thousands of instances (records) of this kind. Here we have only 14, but with many more it would be reasonable to expect some patterns in the data. What sort of patterns? What could we find out? Well, we might discover underlying relationships between some of the attributes. In particular, it would be good to know which factors most strongly affect a household's response to the promotion. In the data above, for example, we can see that all rural households responded. This would be useful to know: next time we might have only a limited number of promotional brochures, so we would like to be selective about where we send them in order to get the most responses.

The example above is pretty trivial and we could probably analyse it manually just by looking. With more instances, though, the general idea would be to build some sort of classifier which could be used to examine the underlying relationships and make future predictions about the target concept (in this case the outcome of a mailed promotion). This is where automated building of decision trees comes in: a technique that can be used to generate rules about data and then perform generalisation and prediction tasks.
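The observation that every rural household responded can be checked with a quick tally. The snippet below is only a sketch in Python: the names DATA and outcome_counts_by_district are made up for illustration, and the records are copied from the table above.

```python
from collections import Counter, defaultdict

# The 14 records from the table above, as
# (District, House Type, Income, Previous Customer, Outcome).
DATA = [
    ("Suburban", "Detached", "High", "No", "Nothing"),
    ("Suburban", "Detached", "High", "Yes", "Nothing"),
    ("Rural", "Detached", "High", "No", "Responded"),
    ("Urban", "Semi-detached", "High", "No", "Responded"),
    ("Urban", "Semi-detached", "Low", "No", "Responded"),
    ("Urban", "Semi-detached", "Low", "Yes", "Nothing"),
    ("Rural", "Semi-detached", "Low", "Yes", "Responded"),
    ("Suburban", "Terrace", "High", "No", "Nothing"),
    ("Suburban", "Semi-detached", "Low", "No", "Responded"),
    ("Urban", "Terrace", "Low", "No", "Responded"),
    ("Suburban", "Terrace", "Low", "Yes", "Responded"),
    ("Rural", "Terrace", "High", "Yes", "Responded"),
    ("Rural", "Detached", "Low", "No", "Responded"),
    ("Urban", "Terrace", "High", "Yes", "Nothing"),
]


def outcome_counts_by_district(rows):
    """Tally the Outcome values separately for each District (hypothetical helper)."""
    counts = defaultdict(Counter)
    for district, _house_type, _income, _previous, outcome in rows:
        counts[district][outcome] += 1
    return counts


counts = outcome_counts_by_district(DATA)
for district, tally in counts.items():
    print(district, dict(tally))
```

Rural is the only district whose tally contains a single outcome, which is exactly the kind of pattern we would like a program to find for us on a larger dataset.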

We'll stick with the example data above in this tutorial and use it to show how we could build a decision tree to analyse it. Of course we're not going to 'build' any trees ourselves (we'd get some software to do it, or some seeds and a pot), but it's important to examine the techniques involved.

The Decision Tree

How can a tree help here? Well, in order to generate a set of rules we can construct a decision tree. This is done top-down from a root node and involves partitioning the data into subsets that contain instances that have similar values. Doing this for the dataset above can result in such a tree:

District
|
+-- Suburban (3/5)
|   |
|   House Type
|   |
|   +-- Detached (2/2): Nothing
|   +-- Semi-detached (1/1): Responded
|   +-- Terrace (1/2)
|       |
|       Income
|       |
|       +-- High (1/1): Nothing
|       +-- Low (1/1): Responded
|
+-- Rural (4/4): Responded
|
+-- Urban (3/5)
    |
    Previous Customer
    |
    +-- No (3/3): Responded
    +-- Yes (2/2): Nothing

Explanation

Ok, so the internal nodes of the tree ('District', 'House Type', 'Income' and 'Previous Customer') correspond to attributes. At each node the dataset is split into subsets based on the value of the attribute at that node. For instance, at the root node we split the entire dataset into three subsets: one that contains only instances (rows, tuples, whatever) that have the value 'Suburban' for the 'District' attribute, one that contains only instances where District is 'Rural', and one where it is 'Urban'.

The numbers on the branches are important here. Each pair gives, for the subset on that branch, the number of instances that share the subset's most common value of the target attribute ('Outcome') out of the total number of instances in the subset. This basically says how well the given value of the attribute we split on relates to the target attribute. What? Look at the tree, at the Rural branch of the root node. '4/4' next to 'Rural' indicates that all four of the instances with District=Rural have the same value for 'Outcome' (in this case 'Responded'). This is good, because splitting the data on this attribute=value pairing perfectly classifies all instances that have this pairing. In other cases the value of the District attribute does not lead to a perfect, or pure, subset. These ideas are related to entropy, which we shall examine later.

Anyway, continuing with the above tree, look at the Suburban branch of the root node. This tells us that when District=Suburban, only 3 of 5 instances have the same value of the target attribute. In this case it is necessary to continue splitting up this subset using other attribute tests until we have only pure subsets. The 5 instances which have District=Suburban are then tested with 'House Type' and are split into further subsets. The tree construction continues until all the subsets are pure with respect to the target attribute. When this occurs the branch terminates in a leaf node that specifies what value the target attribute takes for all instances that have been filtered down this branch.
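The branch numbers can be reproduced with a few lines of code. This is a rough sketch only: split_purity is a hypothetical helper, not part of any library, and it assumes each pair means "size of the largest single-outcome group / size of the subset", which matches every number in the tree above.

```python
from collections import Counter, defaultdict

COLUMNS = ("District", "House Type", "Income", "Previous Customer", "Outcome")
RAW = [
    ("Suburban", "Detached", "High", "No", "Nothing"),
    ("Suburban", "Detached", "High", "Yes", "Nothing"),
    ("Rural", "Detached", "High", "No", "Responded"),
    ("Urban", "Semi-detached", "High", "No", "Responded"),
    ("Urban", "Semi-detached", "Low", "No", "Responded"),
    ("Urban", "Semi-detached", "Low", "Yes", "Nothing"),
    ("Rural", "Semi-detached", "Low", "Yes", "Responded"),
    ("Suburban", "Terrace", "High", "No", "Nothing"),
    ("Suburban", "Semi-detached", "Low", "No", "Responded"),
    ("Urban", "Terrace", "Low", "No", "Responded"),
    ("Suburban", "Terrace", "Low", "Yes", "Responded"),
    ("Rural", "Terrace", "High", "Yes", "Responded"),
    ("Rural", "Detached", "Low", "No", "Responded"),
    ("Urban", "Terrace", "High", "Yes", "Nothing"),
]
# One dict per instance, keyed by column name.
ROWS = [dict(zip(COLUMNS, record)) for record in RAW]


def split_purity(rows, attribute, target="Outcome"):
    """Map each value of `attribute` to (majority_count, subset_size)."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[attribute]][row[target]] += 1
    return {
        value: (max(tally.values()), sum(tally.values()))
        for value, tally in groups.items()
    }


# Split the whole dataset at the root, then split the Suburban subset again.
root_split = split_purity(ROWS, "District")
suburban = [row for row in ROWS if row["District"] == "Suburban"]
suburban_split = split_purity(suburban, "House Type")

for value, (majority, size) in root_split.items():
    status = "pure" if majority == size else "needs another split"
    print(f"District={value}: {majority}/{size} ({status})")
```

A subset is pure exactly when the two numbers are equal, which is the stopping condition described above.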

Rules From The Tree

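Ok, so we can represent the data with a tree. So what? Well, we can extract rules from the tree quite easily: just read off the path to each leaf node. This gives us (taking the branches in the order Suburban, Rural, Urban):

(District=Suburban) AND (House Type=Detached) => (Outcome = Nothing)
(District=Suburban) AND (House Type=Semi-detached) => (Outcome = Responded)
(District=Suburban) AND (House Type=Terrace) AND (Income=High) => (Outcome = Nothing)
(District=Suburban) AND (House Type=Terrace) AND (Income=Low) => (Outcome = Responded)
(District=Rural) => (Outcome = Responded)
(District=Urban) AND (Previous Customer=No) => (Outcome = Responded)
(District=Urban) AND (Previous Customer=Yes) => (Outcome = Nothing)

Taken together, these rules form a disjunction of conjunctions. This is useful for summarising the data and extracting the underlying relationships.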
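Reading rules off the tree is mechanical enough to sketch in code. The representation below is an assumption made for illustration: a leaf is a plain string, and an internal node is a pair (attribute, branches). TREE is the tree from this tutorial written in that form, and extract_rules is a hypothetical helper.

```python
# A leaf is an Outcome string; an internal node is (attribute, {value: subtree}).
TREE = ("District", {
    "Suburban": ("House Type", {
        "Detached": "Nothing",
        "Semi-detached": "Responded",
        "Terrace": ("Income", {"High": "Nothing", "Low": "Responded"}),
    }),
    "Rural": "Responded",
    "Urban": ("Previous Customer", {"No": "Responded", "Yes": "Nothing"}),
})


def extract_rules(node, conditions=()):
    """Return one rule string per leaf, following each root-to-leaf path."""
    if isinstance(node, str):
        # Reached a leaf: the tests collected on the way down form the rule.
        tests = " AND ".join(f"({attr}={value})" for attr, value in conditions)
        return [f"{tests} => (Outcome = {node})"]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules


rules = extract_rules(TREE)
for rule in rules:
    print(rule)
```

Each rule is a conjunction of the tests along one path, and the full list is the disjunction of those conjunctions.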

How can this be used for classification and prediction?

Well, say for example that we wanted to predict the outcome of mailing to a certain house. We could, of course, just look the house up in our dataset to see whether its characteristics match any house we had mailed to before, on the assumption that the new house will respond in the same way. This won't always be possible though, as our dataset here doesn't contain all the possible combinations of attribute values. Instead we use the decision tree to generalise.

E.g. if we know that the District we're going to mail to is Urban and the person was a previous customer, then the tree predicts that the person will not respond (follow the attributes and values down the tree).
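Following the attributes and values down the tree can be sketched as a short loop. As before, this is only an illustration: the tuple-and-string encoding of TREE and the predict function are assumptions, not the output of any particular tool.

```python
# A leaf is an Outcome string; an internal node is (attribute, {value: subtree}).
TREE = ("District", {
    "Suburban": ("House Type", {
        "Detached": "Nothing",
        "Semi-detached": "Responded",
        "Terrace": ("Income", {"High": "Nothing", "Low": "Responded"}),
    }),
    "Rural": "Responded",
    "Urban": ("Previous Customer", {"No": "Responded", "Yes": "Nothing"}),
})


def predict(node, house):
    """Walk from the root to a leaf using the attribute values in `house`.

    Returns the predicted Outcome, or None when the house has a missing or
    unseen value for an attribute tested on its path.
    """
    while not isinstance(node, str):
        attribute, branches = node
        value = house.get(attribute)
        if value not in branches:
            return None
        node = branches[value]
    return node


# The example from the text: Urban district, previous customer.
print(predict(TREE, {"District": "Urban", "Previous Customer": "Yes"}))

# A combination that never occurs in the 14 records (there is no Urban
# Detached house), so a plain look-up would fail but the tree still answers.
new_house = {
    "District": "Urban",
    "House Type": "Detached",
    "Income": "High",
    "Previous Customer": "No",
}
print(predict(TREE, new_house))
```

Note that only the attributes tested along the path matter: for an Urban house the tree never looks at House Type or Income.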

Practically?

Ok, so this illustrates the basic idea of how we can use Decision Tree Learning. You might be thinking that this is a very small, contrived example and that really it's all fairly random what happened. Well, yes, yes and yes, but the same basic idea is used in practical situations. Imagine if we had thousands of records of data for a concept like the one above and maybe lots more attributes, perhaps some of them with numeric values. We wouldn't be able to analyse such data by just looking at it, and so constructing a decision tree would help. Furthermore, the more data we have, the better our chance of getting a real insight into any underlying function or relationship between the attributes. This is because the tree generalises when used for predictions, and we would be more confident about its accuracy if it had been constructed from many examples instead of just a few. There are many other complications and details to worry about, but all that can be looked at later. Right now, we are going to look at the tree building process a bit more.

One Dataset: Many Trees

For any given dataset there are a lot of possible trees that we could construct. Instead of having 'District' as the root node, we could have had 'Income', for example. Likewise, the second child node could have been 'House Type' instead of 'Previous Customer'. Have a go at building a tree from this dataset on the next page. You will see that there are many possible trees that can be built to model this data. They are all perfectly legitimate decision trees. Try to find the smallest one (the one with the fewest nodes).
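To get a feel for how much the choice of attributes matters, here is a rough sketch that builds one tree for every ordering of the four attributes and counts the nodes in each. It makes a simplifying assumption: every branch tests the attributes in the same fixed order, so it explores only 24 of the possible trees. The tree shown earlier is not among them, because it tests 'House Type' under Suburban but 'Previous Customer' under Urban. The names build, count_nodes and classify are hypothetical helpers.

```python
from collections import Counter
from itertools import permutations

COLUMNS = ("District", "House Type", "Income", "Previous Customer", "Outcome")
RAW = [
    ("Suburban", "Detached", "High", "No", "Nothing"),
    ("Suburban", "Detached", "High", "Yes", "Nothing"),
    ("Rural", "Detached", "High", "No", "Responded"),
    ("Urban", "Semi-detached", "High", "No", "Responded"),
    ("Urban", "Semi-detached", "Low", "No", "Responded"),
    ("Urban", "Semi-detached", "Low", "Yes", "Nothing"),
    ("Rural", "Semi-detached", "Low", "Yes", "Responded"),
    ("Suburban", "Terrace", "High", "No", "Nothing"),
    ("Suburban", "Semi-detached", "Low", "No", "Responded"),
    ("Urban", "Terrace", "Low", "No", "Responded"),
    ("Suburban", "Terrace", "Low", "Yes", "Responded"),
    ("Rural", "Terrace", "High", "Yes", "Responded"),
    ("Rural", "Detached", "Low", "No", "Responded"),
    ("Urban", "Terrace", "High", "Yes", "Nothing"),
]
ROWS = [dict(zip(COLUMNS, record)) for record in RAW]
ATTRIBUTES = COLUMNS[:-1]


def build(rows, order):
    """Grow a tree top-down, testing attributes in the fixed `order`.

    A leaf is an Outcome string; an internal node is (attribute, branches).
    """
    outcomes = Counter(row["Outcome"] for row in rows)
    if len(outcomes) == 1 or not order:
        # Pure subset, or nothing left to test: label with the majority value.
        return outcomes.most_common(1)[0][0]
    attribute, rest = order[0], order[1:]
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return (attribute, {value: build(subset, rest) for value, subset in groups.items()})


def count_nodes(node):
    """Count internal nodes plus leaves."""
    if isinstance(node, str):
        return 1
    return 1 + sum(count_nodes(child) for child in node[1].values())


def classify(node, row):
    """Follow a training row down the tree to its leaf."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[row[attribute]]
    return node


sizes = {order: count_nodes(build(ROWS, order)) for order in permutations(ATTRIBUTES)}
for order, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(size, " > ".join(order))
```

Every one of these trees classifies the 14 records correctly, yet their sizes differ, which is why the choice of attribute at each node deserves some thought.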

Really good example.

Great tutorial, but there's a mistake

First I'd like to thank you for your effort in making this tutorial. The thing is that I found a mistake (which I didn't know where to report) in the "Rules From The Tree" section, where the second, third, and fourth rules are wrong. They should be as follows:

# (District=Suburban) AND (House Type=Semi-Detached) => (Outcome = Responded)

# (District=Suburban) AND (House Type=Terrace) AND (Income=High) => (Outcome = Nothing)

# (District=Suburban) AND (House Type=Terrace) AND (Income=Low) => (Outcome = Responded)

Thanks

Thanks, I corrected this. Please let me know if anyone finds any more mistakes or bugs.

looks good...
