In a decision tree the data is split at each node according to a decision rule. This corresponds to nested *if-then-else* rules. In the *if*-part of such a rule a decision is made based on a feature of the data record.

We will use the scikit learn implementation. For this implementation the features must be binary or (at least) ordinal. If a feature is e.g. nominal with many values, it must be converted into a set of binary (one-hot-encoded) features.

The splitting rules in the scikit learn implementation are binary and are based on a threshold, e.g.

*if* $x_6 \leq 2.14$ *then* left subbranch, *else* right subbranch.

Binary features must be coded as 0/1, so for a binary feature the rule becomes: *if* $x_2 \leq 0.5$ *then* left subbranch, *else* right subbranch.

In the leaves of the tree the (class) predictions are made. There are two possibilities for such an inference:

- hard assignment: The prediction for a data record that ends up at a leaf is the majority class of the training data that end up at that leaf.
- soft assignment: Assign class probabilities according to the distribution of the classes of the training data that end up at that leaf.
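These two kinds of inference can be sketched from the class counts of a leaf; the counts here are made up purely for illustration:

```python
import numpy as np

# hypothetical counts of the two classes at one leaf (illustration only)
leaf_counts = np.array([2., 13.])

# hard assignment: predict the majority class of the leaf
hard_prediction = leaf_counts.argmax()

# soft assignment: predict the class distribution of the leaf
soft_prediction = leaf_counts / leaf_counts.sum()
```
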

As an example of a decision tree we will learn the following tree from the titanic data set:

A full explanation of the tree will be given later. For now, just look at the decision rules (first line of the inner nodes) and at the last line of the leaves. In each leaf you see an array with the counts of the different targets (train data): [number_died number_survivors].

Finding a tree that splits the training data optimally is NP-hard. Therefore a *greedy* strategy is often used:

To build up a decision tree the algorithm starts at the root of the tree. The feature and the threshold that split the training data best (with respect to the classes) are chosen. The whole tree is built up iteratively by such splitting rules.
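The greedy search for a single split could be sketched as follows. This is an illustrative toy implementation with the entropy criterion, not scikit learn's actual (optimized) code:

```python
import numpy as np

def entropy_from_labels(y):
    # empirical entropy of a label vector
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(X, y):
    # exhaustively try every feature k and threshold v and keep the
    # split x_k <= v with maximal information gain (one node only)
    best = (None, None, -np.inf)
    for k in range(X.shape[1]):
        for v in np.unique(X[:, k]):
            left = y[X[:, k] <= v]
            right = y[X[:, k] > v]
            if len(left) == 0 or len(right) == 0:
                continue  # not a real split
            gain = entropy_from_labels(y) - (
                len(left) / len(y) * entropy_from_labels(left)
                + len(right) / len(y) * entropy_from_labels(right))
            if gain > best[2]:
                best = (k, v, gain)
    return best
```

On a toy data set where the first feature separates the classes perfectly, `best_split` returns that feature with an information gain of 1 bit.
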

There are different criteria for measuring the "separation (split) quality". The most important ones are:

- Gini Impurity
- Information Gain

In this tutorial we use the information gain.

The entropy with respect to the target class variable $y$ of a training data set $\mathcal D$ is defined as:

$$ H(y, \mathcal D) = - \sum_{y \in \mathcal Y} p(y|\mathcal D) \log_2 p(y|\mathcal D) $$ with the domain of the target values $\mathcal Y = \{t_1, t_2,... \}$.

The probabilities are estimated by $$ p(y=t_i \mid \mathcal D) = |\mathcal D^{(y=t_i)}| / |\mathcal D| $$

with the number of training data records $|\mathcal D|$ and the number of training data records $|\mathcal D^{(y=t_i)}|$ with target label $t_i$.
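As a quick check of the definition, the entropy of a hypothetical data set with 9 records of class $t_1$ and 5 records of class $t_2$ can be computed directly from the counts (the numbers are made up for illustration):

```python
import numpy as np

# hypothetical class counts: 9 records of class t1, 5 of class t2
counts = np.array([9., 5.])
p = counts / counts.sum()       # estimated p(y=t_i | D)
H = -(p * np.log2(p)).sum()     # entropy H(y, D), approx. 0.940 bits
```
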

On a node a (binary) split on a feature $x_k$ is made by the split rule $x_k \leq v$. As result there are two data sets $\mathcal D_0$ and $\mathcal D_1$ for the left resp. the right branch.

The feature $x_k$ and the split value $v$ are chosen such that they maximize the reduction of entropy, measured by the information gain $I$: $$ I(y; x_k) = H(y, \mathcal D) - H(y|x_k) = H(y, \mathcal D) - \sum_{j=0}^1 p_jH(y, \mathcal D_j) = H(y, \mathcal D) + \sum_{j=0}^1 \sum_{y \in \mathcal Y} \frac{|\mathcal D_j|}{|\mathcal D|} p(y|\mathcal D_j) \log_2 p(y|\mathcal D_j) $$ Note that $p_{j=0}$ is the estimated probability that a random data record of $\mathcal D$ has feature value $x_k \leq v$; it can be estimated by ${|\mathcal D_0|}/{|\mathcal D|}$ (analogously for $j=1$).

$p(y=t_i|\mathcal D_0)$ can also be estimated by the fraction of the counts ${|\mathcal D_0^{(y=t_i)}|}/{|\mathcal D_0|}$. So the information gain can be computed just with counts:

$$ I(y; x_k) = - \sum_{t_i \in \mathcal Y} \frac{|\mathcal D^{(y=t_i)}|}{|\mathcal D|} \log_2 \frac{|\mathcal D^{(y=t_i)}|}{|\mathcal D|} + \sum_{j=0}^1 \sum_{t_i \in \mathcal Y} \frac{|\mathcal D_j^{(y=t_i)}|}{|\mathcal D|} \log_2 \frac{|\mathcal D_j^{(y=t_i)}|}{|\mathcal D_j|} $$

Deep decision trees often generalize poorly. The following remedies reduce overfitting:

- Limitation of the maximal depth of the tree.
- Pruning with a validation set, either during training (pre-pruning) or after training (post-pruning).
- Dimensionality reduction (reducing the number of features before training)

Combining decision trees into an ensemble (decision forest) is also often used against overfitting.
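With a recent scikit learn such an ensemble is available as `RandomForestClassifier`. The following sketch uses synthetic stand-in data, since the titanic data is only loaded below:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data (illustration only): class depends on feature 0
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 5)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

# an ensemble of 50 depth-limited decision trees
forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)
```
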

First we read in the titanic data with pandas:

In [2]:

```
import pandas as pd
train_df = pd.read_csv("data/titanic-train.csv")
```

`train_df` is a *pandas* data frame. Let's view the data.

In [3]:

```
train_df
```

Out[3]:

First we convert the nominal `Sex` feature to a numeric 0/1 feature (male: 0, female: 1).

In [4]:

```
train_df["Sex"] = train_df["Sex"].apply(lambda sex: 0 if sex == 'male' else 1)
# or
#df.Sex[df["Sex"]=='female'] = 1
#df.Sex[df["Sex"]=='male'] = 0
```

`Survived` is the target that we want to predict from the values of the other columns.

But not all of the other columns are helpful for classification. So we choose a feature set by hand and convert the features into a numpy array for scikit learn.

In [5]:

```
y = targets = labels = train_df["Survived"].values
columns = ["Fare", "Pclass", "Sex", "Age", "SibSp"]
features = train_df[list(columns)].values
features
```

Out[5]:

The data contains missing values (`nan`). We use the scikit learn `Imputer` to replace them by the mean of the columns.

In [6]:

```
from sklearn.preprocessing import Imputer  # newer scikit learn versions: sklearn.impute.SimpleImputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
X = imp.fit_transform(features)
X
```

Out[6]:

In [14]:

```
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf = clf.fit(X, y)
```

`clf`

is an instance of a trained decision tree classifier.

The decision tree can be visualized. For this we must write a graphviz dot file.

In [8]:

```
with open("data/titanic.dot", 'w') as f:
    tree.export_graphviz(clf, out_file=f, feature_names=columns)
```

The dot file can be converted to an image with the graphviz `dot` renderer:

```
dot -Tpng titanic.dot -o titanic.png
```

Here is the graph:

According to the decision tree the main criterion (root node) for survival is the sex of the passenger. In the left subtree are the male passengers (sex=0), in the right subtree the female passengers (sex=1).

In the leaves the class information is given by a `value` array. Here the second value is the number of survivors in the leaf.

For example, the leftmost leaf represents passengers that are male (sex=0) with fare <= 26.2687 and age <= 13.5: 13 such boys survived and 2 died.

The entropy $- \sum p_i \log_2 (p_i)$ is displayed also at each node (splitting criterion).

Let's recompute the root node entropy.

In [9]:

```
import numpy as np
nb_survivors = y.sum()
p_survived = float(nb_survivors) / len(y)
p_died = 1. - p_survived
entropy_root = - p_survived * np.log2(p_survived) - p_died * np.log2(p_died)
entropy_root
```

Out[9]:

Compute the information gain of the first split node (root node). Use the shown entropy values and number of data records (samples).

In [10]:

```
entropy_before_split = 0.9607
weighted_entropy_after_split = 577./(577+314) * 0.699182 + 314./(577+314) * 0.823655
information_gain_root_node = entropy_before_split - weighted_entropy_after_split
print("Information gain of the root node split:", information_gain_root_node)
```

Compute the information gain of the following split table:

| | class 0 | class 1 |
|---|---|---|
| feature <= v | 2 | 13 |
| feature > v | 359 | 41 |

The numbers are the counts of the corresponding data records, e.g. there are 13 data records with target class 1 and feature <= v.

Write a python function that computes the information gain. The data is given as a numpy array:

In [11]:

```
data = np.array([[2.,13.],[359., 41.]])
```

Very explicit and special:

In [12]:

```
def entropy(p_):
    p = p_.copy()
    p[p != 0] = - p[p != 0] * np.log2(p[p != 0])
    return p.sum()

def information_gain(data):
    entropy_feature_smaller_v = entropy(data[0] / data[0].sum())
    entropy_feature_bigger_v = entropy(data[1] / data[1].sum())
    entropy_before_split = entropy(data.sum(axis=0) / data.sum())
    weights = data.sum(axis=1) / data.sum()
    weighted_entropy_after_split = (weights[0] * entropy_feature_smaller_v
                                    + weights[1] * entropy_feature_bigger_v)
    return entropy_before_split - weighted_entropy_after_split

print("Information Gain:", information_gain(data))
```

More general and reusable solution based on a probability table:

In [13]:

```
def p_log_p(p_):
    p = p_.copy()
    p[p != 0] = - p[p != 0] * np.log2(p[p != 0])
    return p

def entropy(p):
    assert (p >= 0.).all()
    assert np.allclose(1., p.sum(), atol=1e-08)
    return p_log_p(p).sum()

def conditional_entropy(p, axis=0):
    if axis == 1:
        p = p.T
    p_y_given_x = p / p.sum(axis=0)
    c = p_log_p(p_y_given_x)
    p_x = p.sum(axis=0) / p.sum()
    s = c * p_x
    return s.sum()

def information_gain(p0, axis=0):
    p = p0.copy()
    if axis == 1:
        p = p.T
    p_ = p.sum(axis=1)
    return entropy(p_) - conditional_entropy(p)

p = data / data.sum()
p = p.T
p_y = p.sum(axis=1)
print("Entropy H(Y) =", entropy(p_y))
print("Conditional entropy H(Y|X) =", conditional_entropy(p))
print("Information gain I(Y; X) = H(Y) - H(Y|X) =", information_gain(p))
```

To make predictions with scikit learn, we must convert the test data in the same way as we did it with the training data. Then we can use the method:

```
clf.predict(testdata)
```
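For illustration, hypothetical test records could be prepared like this. The values are made up, and the final `clf.predict` call requires the trained classifier from above:

```python
import numpy as np
import pandas as pd

# hypothetical test records with the same raw columns as the training data
test_df = pd.DataFrame({
    "Fare": [7.25, 71.28], "Pclass": [3, 1], "Sex": ["male", "female"],
    "Age": [22.0, 38.0], "SibSp": [1, 1],
})

# apply the same conversion as for the training data
test_df["Sex"] = test_df["Sex"].apply(lambda sex: 0 if sex == 'male' else 1)
testdata = test_df[["Fare", "Pclass", "Sex", "Age", "SibSp"]].values

# predictions = clf.predict(testdata)   # requires the trained clf from above
```
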

Note: The depth of the decision tree is a hyperparameter. So the depth should be determined with the help of a validation set or by cross validation.
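A possible sketch of such a depth search with cross validation, using synthetic stand-in data and the `model_selection` module of a recent scikit learn:

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score

# synthetic stand-in data (illustration only)
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

# evaluate each candidate depth by 5-fold cross validation
for depth in [1, 2, 3, 5, 10]:
    clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=depth)
    scores = cross_val_score(clf, X_demo, y_demo, cv=5)
    print("max_depth =", depth, "mean accuracy =", scores.mean())
```
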