Decision Trees (for Classification)¶

In a decision tree the data is split at each node according to a decision rule. This corresponds to nested if-then-else-rules. In the if-part of such a rule are decison is made based on a feature of the data record.

We will use the scikit learn implementation. For this implementation the features must be binary or have (at least) ordinal characteristic. If a feature is e.g. nominal with many values, it must be converted to a set of binary (one-hot-coded) features.

The splitting rules in the scikit learn implementation are binary and are based on a threshold, e.g.

_if $x_6 <= 2.14$ then_ left subbranch, else right subbranch.
binary features must be coded as 0/1, so the rule becomes: if $x_2 <= 0.5$ then left subbranch, else right subbranch.

In the leafs of the tree the (class) predictions are made. There are two possibilities for such an inference:

hard assignment: Predict for the data records which end up on a leaf by the majority class of the training data that end up on that leaf.
soft assignment: Assign probabilities according to the distribution of the classes in respect to the training data which end up on that leaf.

As an example of a decision tree we will learn the following tree from the titanic data set:

A full explanation of the tree will be given later. Here just look at the desion rules (first line of the inner nodes) and at the last line of the leafs. In each leaf you see a array with counts of the different targets (train data): [number_died number_survivors] .

Learning¶

Finding a tree that splits the traing data optimal is np-hard. Therefore often a greedy-strategy is used:

To build up a decision tree the algorithm starts at the root of the tree. The feature and the threshold that splits the training data best (with respect to the classes) are choosen. In an iterative way the whole tree is build up by such splitting rules.

There are different criteria for measuring the "separation (split) quality". The most important ones are:

Gini Impurity
Information Gain

In this tutorial we use the information gain.

Information Gain as splitting criterion¶

The entropy with respect to the target class variable $y$ of a training data set $\mathcal D$ is defined as:

$$ H(y, \mathcal D) = - \sum_{y \in \mathcal Y} p(y|\mathcal D) \log_2 p(y|\mathcal D) $$ with the domain of the target values $\mathcal Y = \{t_1, t_2,... \}$.

The probabilities are estimated by $$ p(y=t_i, \mathcal D) = |\mathcal D^{(y=t_i)}| /|\mathcal D| $$

with the number of training data $|\mathcal D|$ and the number of training data $|\mathcal D^{(y=t_i)}|$ with target label $t_i$:

On a node a (binary) split on a feature $x_k$ is made by the split rule $x_k \leq v$. As result there are two data sets $\mathcal D_0$ and $\mathcal D_1$ for the left resp. the right branch.

The feature $x_k$ and the split value $v$ are choosen that they maximize the 'reduction of the entropy' measured by the information gain $I$: $$ I(y; x_k) = H(y, \mathcal D) - H(y|x_k) = H(y, \mathcal D) - \sum_{j=0}^1 p_jH(y, \mathcal D_j) = H(y, \mathcal D) + \sum_{j=0}^1 \sum_{y \in \mathcal Y} \frac{|\mathcal D_j|}{|\mathcal D|} p(y|\mathcal D_j) \log_2 p(y|\mathcal D_j) $$ Note that $p_{j=0}$ is the estimated probability that a random data record of $\mathcal D$ has feature value $x_k \leq v$ which can be estimated by ${|\mathcal D_0|}/{|\mathcal D|}$ (analog for $j=1$).

$p(y=t_i|\mathcal D_0)$ can also be estimated by the fraction of the counts ${|\mathcal D_0^{(y=t_i)}|}/{|\mathcal D_0|}$. So the information gain can be computed just with counts:

$$ I(y; x_k) = \sum_{y \in \mathcal Y} \frac{|\mathcal D^{(y=t_i)}|}{|\mathcal D|} \log_2 \frac{|\mathcal D^{(y=t_i)}|}{|\mathcal D|} + \sum_{j=0}^1 \sum_{y \in \mathcal Y} \frac{|\mathcal D_j^{(y=t_i)}|}{|\mathcal D|} \log_2 \frac{|\mathcal D_j^{(y=t_i)}|}{|\mathcal D_j|} $$

Overfitting¶

Deep decision trees generalize often poorly. The following remedies reduce overfitting:

Limitation of the maximal depth of the tree.
Pruning with an validation set either during training (pre-pruning) or after training (post-pruning).
Dimensionality reduction (reducing the number of features before training)

Also often combining decision trees to an ensemble (decision forests) is used against overfitting.

Example: Survival of the Titanic¶

First you have read in the titanic data with pandas:

In [1]:

import pandas as pd
train_df = None # Exercise

train_df is a pandas data frame. Let's view the data.

In [3]:

train_df

Out[3]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
12	13	0	3	Saundercock, Mr. William Henry	male	20.0	0	0	A/5. 2151	8.0500	NaN	S
13	14	0	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S
14	15	0	3	Vestrom, Miss. Hulda Amanda Adolfina	female	14.0	0	0	350406	7.8542	NaN	S
15	16	1	2	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	0	0	248706	16.0000	NaN	S
16	17	0	3	Rice, Master. Eugene	male	2.0	4	1	382652	29.1250	NaN	Q
17	18	1	2	Williams, Mr. Charles Eugene	male	NaN	0	0	244373	13.0000	NaN	S
18	19	0	3	Vander Planke, Mrs. Julius (Emelia Maria Vande...	female	31.0	1	0	345763	18.0000	NaN	S
19	20	1	3	Masselmani, Mrs. Fatima	female	NaN	0	0	2649	7.2250	NaN	C
20	21	0	2	Fynney, Mr. Joseph J	male	35.0	0	0	239865	26.0000	NaN	S
21	22	1	2	Beesley, Mr. Lawrence	male	34.0	0	0	248698	13.0000	D56	S
22	23	1	3	McGowan, Miss. Anna "Annie"	female	15.0	0	0	330923	8.0292	NaN	Q
23	24	1	1	Sloper, Mr. William Thompson	male	28.0	0	0	113788	35.5000	A6	S
24	25	0	3	Palsson, Miss. Torborg Danira	female	8.0	3	1	349909	21.0750	NaN	S
25	26	1	3	Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...	female	38.0	1	5	347077	31.3875	NaN	S
26	27	0	3	Emir, Mr. Farred Chehab	male	NaN	0	0	2631	7.2250	NaN	C
27	28	0	1	Fortune, Mr. Charles Alexander	male	19.0	3	2	19950	263.0000	C23 C25 C27	S
28	29	1	3	O'Dwyer, Miss. Ellen "Nellie"	female	NaN	0	0	330959	7.8792	NaN	Q
29	30	0	3	Todoroff, Mr. Lalio	male	NaN	0	0	349216	7.8958	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
861	862	0	2	Giles, Mr. Frederick Edward	male	21.0	1	0	28134	11.5000	NaN	S
862	863	1	1	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	0	0	17466	25.9292	D17	S
863	864	0	3	Sage, Miss. Dorothy Edith "Dolly"	female	NaN	8	2	CA. 2343	69.5500	NaN	S
864	865	0	2	Gill, Mr. John William	male	24.0	0	0	233866	13.0000	NaN	S
865	866	1	2	Bystrom, Mrs. (Karolina)	female	42.0	0	0	236852	13.0000	NaN	S
866	867	1	2	Duran y More, Miss. Asuncion	female	27.0	1	0	SC/PARIS 2149	13.8583	NaN	C
867	868	0	1	Roebling, Mr. Washington Augustus II	male	31.0	0	0	PC 17590	50.4958	A24	S
868	869	0	3	van Melkebeke, Mr. Philemon	male	NaN	0	0	345777	9.5000	NaN	S
869	870	1	3	Johnson, Master. Harold Theodor	male	4.0	1	1	347742	11.1333	NaN	S
870	871	0	3	Balkic, Mr. Cerin	male	26.0	0	0	349248	7.8958	NaN	S
871	872	1	1	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	1	1	11751	52.5542	D35	S
872	873	0	1	Carlsson, Mr. Frans Olof	male	33.0	0	0	695	5.0000	B51 B53 B55	S
873	874	0	3	Vander Cruyssen, Mr. Victor	male	47.0	0	0	345765	9.0000	NaN	S
874	875	1	2	Abelson, Mrs. Samuel (Hannah Wizosky)	female	28.0	1	0	P/PP 3381	24.0000	NaN	C
875	876	1	3	Najib, Miss. Adele Kiamie "Jane"	female	15.0	0	0	2667	7.2250	NaN	C
876	877	0	3	Gustafsson, Mr. Alfred Ossian	male	20.0	0	0	7534	9.8458	NaN	S
877	878	0	3	Petroff, Mr. Nedelio	male	19.0	0	0	349212	7.8958	NaN	S
878	879	0	3	Laleff, Mr. Kristo	male	NaN	0	0	349217	7.8958	NaN	S
879	880	1	1	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	0	1	11767	83.1583	C50	C
880	881	1	2	Shelley, Mrs. William (Imanita Parrish Hall)	female	25.0	0	1	230433	26.0000	NaN	S
881	882	0	3	Markun, Mr. Johann	male	33.0	0	0	349257	7.8958	NaN	S
882	883	0	3	Dahlberg, Miss. Gerda Ulrika	female	22.0	0	0	7552	10.5167	NaN	S
883	884	0	2	Banfield, Mr. Frederick James	male	28.0	0	0	C.A./SOTON 34068	10.5000	NaN	S
884	885	0	3	Sutehall, Mr. Henry Jr	male	25.0	0	0	SOTON/OQ 392076	7.0500	NaN	S
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	0	5	382652	29.1250	NaN	Q
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

Scikit's learn decision trees can handle only numeric data. So we must convert the nominal Sex feature.

In [19]:

#train_df["Sex"] = # Exercise

Survived is the target, that we want to predict from the values of the other columns.
But not all of the other columns are helpful for classification. So we choose a feature set by hand and convert the features into a numpy array for scikit learn.

In [6]:

y = targets = labels # =  TODO as Exercise

columns = ["Fare", "Pclass", "Sex", "Age", "SibSp"]
features # = TODO as Exercise

There are missing values (nan). We use the scikit learn Imputer to replace them by the mean of the columns.

In [8]:

from sklearn.preprocessing import Imputer
# TODO as Exercise

Now we are ready to learn a decision tree by the criterion 'Information Gain' and we restrict the depth of the tree to 3. We use the scikit learn decison tree module.

We use a splitting criterion the 'Information Gain'!

In [21]:

from sklearn import tree
# clf = tree.DecisionTreeClassifier(criterion="entropy", ... # TODO as Exercise

clf is an instance of a trained decision tree classifier.

The decision tree can be visualized. For this we must write an graphviz dot-File

In [25]:

from sklearn.externals.six import StringIO
with open("data/titanic.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f, feature_names=columns)

The dot file can be converted with the graphiz dot- renderer to an image.

dot -Tpng titanic.dot -o titanic.png

Here is the graph:

According to the decision tree the main criterion (root node) for survival is the sex of the passenger. In the left subtree are the male passengers (sex = 0), in the right subtree the female (sex=1).

In the leafs the class information is given by a value array. Here the second value is the number of survivers in the leafs.

For example the leftmost leaf represents passengers that are male (sex=0) with fare<=26.2687 and age<=13.5. 13 of such boys survived and 2 of them died.

The entropy $- \sum p_i \log_2 (p_i)$ is displayed also at each node (splitting criterion).

Exercises: Splitting criterion entropy / information gain¶

Exercise 1:¶

Compute the root node entropy (with numpy).

Exercise 2¶

Compute the information gain of the first split node (root node). Use the entropy values and the number of data records (samples) from the decision tree image.

Exercise 3¶

Compute the information gain of the following split table:

	class 0	class 1
feature <= v	2	13
feature > v	359	41

The numbers are the corresponding data records, e.g. there are 13 data records with target class 1 and feature <= v.

Write a python function that computes the information gain.The data is given by a python array:

In [16]:

data = np.array([[2.,13.],[359., 41.]])

Prediction¶

To make predictions with scikit learn, we must convert the data in the same way as we did it with the training data. Then we could use the method:

clf.predict(testdata)

Note: The depth of the decision tree is a hyperparameter. So the depth should be determined with the help of a validation set or by cross validation.

Literature¶

[Cri] A. Criminisi, J. Shotton, and E. Konukoglu, Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi- Supervised Learning, no. MSR-TR-2011-114, 28 October 2011.