Decision Trees (for Classification)

In a decision tree the data is split at each node according to a decision rule. This corresponds to nested if-then-else rules. In the if-part of such a rule a decision is made based on a feature of the data record.

We will use the scikit-learn implementation. For this implementation the features must be binary or have (at least) an ordinal characteristic. If a feature is nominal with many values, it must be converted into a set of binary (one-hot-encoded) features.
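For example, the nominal feature Embarked (port of embarkation) from the Titanic data can be one-hot-encoded with pandas. A small sketch on a toy column (the `dtype=int` argument assumes pandas >= 0.23):

```python
import pandas as pd

# Toy version of the nominal "Embarked" feature (values S, C, Q)
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
one_hot = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)
print(one_hot)  # three binary 0/1 columns: Embarked_C, Embarked_Q, Embarked_S
```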

The splitting rules in the scikit-learn implementation are binary and are based on a threshold, e.g.

  • if $x_6 \leq 2.14$ then left sub-branch, else right sub-branch.
  • binary features must be coded as 0/1, so the rule becomes: if $x_2 \leq 0.5$ then left sub-branch, else right sub-branch.

In the leaves of the tree the (class) predictions are made. There are two possibilities for such an inference:

  • hard assignment: A data record that ends up at a leaf is predicted to belong to the majority class of the training data at that leaf.
  • soft assignment: Class probabilities are assigned according to the distribution of the classes of the training data that end up at that leaf.

As an example of a decision tree we will learn the following tree from the Titanic data set:

A full explanation of the tree will be given later. For now, just look at the decision rules (first line of the inner nodes) and at the last line of the leaves. In each leaf you see an array with the counts of the different targets (train data): [number_died number_survivors] .

Learning

Finding a tree that splits the training data optimally is NP-hard. Therefore a greedy strategy is often used:

To build up a decision tree the algorithm starts at the root of the tree. The feature and the threshold that split the training data best (with respect to the classes) are chosen. The whole tree is built up iteratively by such splitting rules.
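The greedy split search for a single feature can be sketched in a few lines of numpy: every midpoint between adjacent feature values is a candidate threshold, and the one with the largest entropy reduction is kept. This is a simplified illustration, not scikit-learn's actual implementation:

```python
import numpy as np

def entropy_of_labels(y):
    """Entropy of a label array, estimated from the class counts."""
    counts = np.bincount(y)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, y):
    """Greedy search over one feature: test every midpoint between
    adjacent unique feature values, keep the threshold with the
    largest information gain."""
    h_before = entropy_of_labels(y)
    best_gain, best_v = 0.0, None
    xs = np.unique(x)
    for v in (xs[:-1] + xs[1:]) / 2.0:          # candidate thresholds
        left, right = y[x <= v], y[x > v]
        h_after = (len(left) * entropy_of_labels(left) +
                   len(right) * entropy_of_labels(right)) / len(y)
        if h_before - h_after > best_gain:
            best_gain, best_v = h_before - h_after, v
    return best_v, best_gain

x = np.array([1., 2., 3., 4.])
y = np.array([0, 0, 1, 1])
print(best_split(x, y))  # the split between 2 and 3 separates the classes perfectly
```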

There are different criteria for measuring the "separation (split) quality". The most important ones are:

  • Gini Impurity
  • Information Gain

In this tutorial we use the information gain.

Information Gain as splitting criterion

The entropy with respect to the target class variable $y$ of a training data set $\mathcal D$ is defined as:

$$ H(y, \mathcal D) = - \sum_{y \in \mathcal Y} p(y|\mathcal D) \log_2 p(y|\mathcal D) $$ with the domain of the target values $\mathcal Y = \{t_1, t_2,... \}$.

The probabilities are estimated by $$ p(y=t_i|\mathcal D) = |\mathcal D^{(y=t_i)}| /|\mathcal D| $$

with the total number of training data $|\mathcal D|$ and the number of training data $|\mathcal D^{(y=t_i)}|$ with target label $t_i$.

At a node a (binary) split on a feature $x_k$ is made by the split rule $x_k \leq v$. As a result there are two data sets $\mathcal D_0$ and $\mathcal D_1$ for the left and the right branch, respectively.

The feature $x_k$ and the split value $v$ are chosen such that they maximize the 'reduction of the entropy', measured by the information gain $I$: $$ I(y; x_k) = H(y, \mathcal D) - H(y|x_k) = H(y, \mathcal D) - \sum_{j=0}^1 p_jH(y, \mathcal D_j) = H(y, \mathcal D) + \sum_{j=0}^1 \sum_{y \in \mathcal Y} \frac{|\mathcal D_j|}{|\mathcal D|} p(y|\mathcal D_j) \log_2 p(y|\mathcal D_j) $$ Note that $p_{j=0}$ is the estimated probability that a random data record of $\mathcal D$ has feature value $x_k \leq v$; it can be estimated by ${|\mathcal D_0|}/{|\mathcal D|}$ (analogously for $j=1$).

$p(y=t_i|\mathcal D_0)$ can also be estimated by the fraction of the counts ${|\mathcal D_0^{(y=t_i)}|}/{|\mathcal D_0|}$. So the information gain can be computed just with counts:

$$ I(y; x_k) = - \sum_{t_i \in \mathcal Y} \frac{|\mathcal D^{(y=t_i)}|}{|\mathcal D|} \log_2 \frac{|\mathcal D^{(y=t_i)}|}{|\mathcal D|} + \sum_{j=0}^1 \sum_{t_i \in \mathcal Y} \frac{|\mathcal D_j^{(y=t_i)}|}{|\mathcal D|} \log_2 \frac{|\mathcal D_j^{(y=t_i)}|}{|\mathcal D_j|} $$

Overfitting

Deep decision trees often generalize poorly. The following remedies reduce overfitting:

  • Limitation of the maximal depth of the tree.
  • Pruning with a validation set, either during training (pre-pruning) or after training (post-pruning).
  • Dimensionality reduction (reducing the number of features before training)

Combining decision trees into an ensemble (decision forests) is also often used against overfitting.

Example: Survival of the Titanic

First we read in the titanic data with pandas:

In [2]:
import pandas as pd
train_df = pd.read_csv("data/titanic-train.csv")

train_df is a pandas data frame. Let's view the data.

In [3]:
train_df
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward male 21 1 0 28134 11.5000 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48 0 0 17466 25.9292 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
864 865 0 2 Gill, Mr. John William male 24 0 0 233866 13.0000 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42 0 0 236852 13.0000 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27 1 0 SC/PARIS 2149 13.8583 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31 0 0 PC 17590 50.4958 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4 1 1 347742 11.1333 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26 0 0 349248 7.8958 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33 0 0 695 5.0000 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47 0 0 345765 9.0000 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28 1 0 P/PP 3381 24.0000 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15 0 0 2667 7.2250 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

Scikit-learn's decision trees can handle only numeric data, so we must convert the nominal feature Sex.

In [4]:
train_df["Sex"] = train_df["Sex"].apply(lambda sex: 0 if sex == 'male' else 1)
# or
#df.Sex[df["Sex"]=='female'] = 1
#df.Sex[df["Sex"]=='male'] = 0

Survived is the target that we want to predict from the values of the other columns.
But not all of the other columns are helpful for classification. So we choose a feature set by hand and convert the features into a numpy array for scikit-learn.

In [5]:
y = targets = labels = train_df["Survived"].values

columns = ["Fare", "Pclass", "Sex", "Age", "SibSp"]
features = train_df[list(columns)].values
features
Out[5]:
array([[  7.25  ,   3.    ,   0.    ,  22.    ,   1.    ],
       [ 71.2833,   1.    ,   1.    ,  38.    ,   1.    ],
       [  7.925 ,   3.    ,   1.    ,  26.    ,   0.    ],
       ..., 
       [ 23.45  ,   3.    ,   1.    ,      nan,   1.    ],
       [ 30.    ,   1.    ,   0.    ,  26.    ,   0.    ],
       [  7.75  ,   3.    ,   0.    ,  32.    ,   0.    ]])

There are missing values (nan). We use the scikit-learn Imputer to replace them by the mean of the corresponding column.

In [6]:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
X = imp.fit_transform(features)
X
Out[6]:
array([[  7.25      ,   3.        ,   0.        ,  22.        ,   1.        ],
       [ 71.2833    ,   1.        ,   1.        ,  38.        ,   1.        ],
       [  7.925     ,   3.        ,   1.        ,  26.        ,   0.        ],
       ..., 
       [ 23.45      ,   3.        ,   1.        ,  29.69911765,   1.        ],
       [ 30.        ,   1.        ,   0.        ,  26.        ,   0.        ],
       [  7.75      ,   3.        ,   0.        ,  32.        ,   0.        ]])

Now we are ready to learn a decision tree with the criterion 'Information Gain', restricting the depth of the tree to 3. We use the scikit-learn decision tree module.

In [14]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf = clf.fit(X, y)

clf is an instance of a trained decision tree classifier.

The decision tree can be visualized. For this we must write a graphviz dot file:

In [8]:
with open("data/titanic.dot", 'w') as f:
  tree.export_graphviz(clf, out_file=f, feature_names=columns)

The dot file can be converted with the graphviz dot renderer to an image:

dot -Tpng titanic.dot -o titanic.png

Here is the graph:

According to the decision tree the main criterion (root node) for survival is the sex of the passenger. In the left subtree are the male passengers (sex = 0), in the right subtree the female passengers (sex = 1).

In the leaves the class information is given by a value array. Here the second value is the number of survivors in the leaf.

For example, the leftmost leaf represents male passengers (sex = 0) with fare <= 26.2687 and age <= 13.5. Of these boys, 13 survived and 2 died.

The entropy $- \sum p_i \log_2 (p_i)$ is displayed also at each node (splitting criterion).

Exercises: Splitting criterion entropy / information gain

Exercise 1:

Recompute the root node entropy.

Solution

In [9]:
import numpy as np
nb_survivors = y.sum()
p_survived = float(nb_survivors)/len(y)
p_died = 1. - p_survived
entropy_root = - p_survived * np.log2(p_survived) - p_died * np.log2(p_died)
entropy_root
Out[9]:
0.96070790187564692

Exercise 2

Compute the information gain of the first split node (root node). Use the shown entropy values and number of data records (samples).

Solution

In [10]:
entropy_before_split = 0.9607
weighted_entropy_after_split = 577./(577+314) * 0.699182 + 314./(577+314) * 0.823655
information_gain_root_node = entropy_before_split - weighted_entropy_after_split 
print "Information gain of the root node split:", information_gain_root_node
Information gain of the root node split: 0.217652094276

Exercise 3

Compute the information gain of the following split table:

              class 0   class 1
feature <= v        2        13
feature > v       359        41

The numbers are the counts of the corresponding data records, e.g. there are 13 data records with target class 1 and feature <= v.

Write a python function that computes the information gain. The data is given by a python array:

In [11]:
data = np.array([[2.,13.],[359., 41.]])

Solution

A very explicit, special-purpose solution:

In [12]:
def entropy(p_):
  p = p_.copy()
  p[p != 0] = - p[p != 0] * np.log2(p[p != 0] )
  return p.sum()

def information_gain(data):
    entropy_feature_smaller_v = entropy(data[0]/data[0].sum())
    entropy_feature_bigger_v = entropy(data[1]/data[1].sum())
    entropy_before_split = entropy(data.sum(axis=0)/data.sum(axis=0).sum())
    weights = data.sum(axis=1)/data.sum()
    weighted_entropy_after_split = weights[0] * entropy_feature_smaller_v + weights[1] * entropy_feature_bigger_v
    return entropy_before_split - weighted_entropy_after_split

print "Information Gain:", information_gain(data)
Information Gain: 0.0776578780428

A more general and reusable solution based on a probability table:

In [13]:
def p_log_p(p_):
  p = p_.copy()
  p[p != 0] = - p[p != 0] * np.log2(p[p != 0] )
  return p

def entropy(p):
  assert (p>=0.).all()
  assert np.allclose(1., p.sum(), atol=1e-08)
  return p_log_p(p).sum()
  
def conditional_entropy(p, axis=0):
  if axis == 1:
    p = p.T
  p_y_given_x = p / p.sum(axis=0)
  c = p_log_p(p_y_given_x) 
  p_x = p.sum(axis=0)/p.sum()
  s = c * p_x
  return s.sum()
 
def information_gain(p0, axis = 0):
  p = p0.copy()
  if axis == 1:
    p = p.T
  p_ = p.sum(axis=1)
  return entropy(p_) - conditional_entropy(p)

p = data / data.sum()
p = p.T
p_y = p.sum(axis=1)

print "Entropy H(Y) = ", entropy(p_y)  
print "Conditional-Entropy H(Y|X) = ",  conditional_entropy(p)
print "Information Gain I(Y; X) = H(Y) - H(Y|X) =", information_gain(p)
Entropy H(Y) =  0.557768514612
Conditional-Entropy H(Y|X) =  0.48011063657
Information Gain I(Y; X) = H(Y) - H(Y|X) = 0.0776578780428

Prediction

To make predictions with scikit-learn, we must convert the test data in the same way as we did with the training data. Then we can use the method:

clf.predict(testdata)

Note: The depth of the decision tree is a hyperparameter, so the depth should be determined with the help of a validation set or by cross-validation.
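Such a search for a good depth could look as follows. The sketch uses the Iris data as a stand-in (the Titanic test data would first need the same preprocessing as above) and the modern sklearn.model_selection API; in old scikit-learn versions the module was called sklearn.cross_validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for depth in range(1, 6):
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=depth)
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross validation
    print(depth, scores.mean())                # pick the depth with the best mean score
```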