%load_ext lab_black
8.2. Decision Trees#
In this notebook, we will implement a decision tree in Python. We’ll start with a single decision tree and a simple problem, and then work our way up to a random forest. Once we understand how a single decision tree works, we can transfer this knowledge to an entire forest of trees.
8.2.1. What is a decision tree?#

A simple linear classifier will not be able to draw a single boundary that separates the classes. A single decision tree, however, can separate the points completely, because it repeatedly draws simple linear boundaries between the points. A decision tree is a non-parametric model: the number of parameters grows with the size of the data.
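To illustrate this claim (the original figure is not reproduced here), below is a minimal sketch on XOR-like toy data, which is an assumption made purely for this example: a logistic regression stays near chance accuracy, while an unconstrained decision tree reaches a training accuracy of 1.0.

# A minimal sketch (assumed toy data, not the notebook's original figure):
# a linear classifier cannot separate an XOR-like pattern, a decision tree can.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(200, 2))
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)  # not linearly separable

print("linear model:", LogisticRegression().fit(X_toy, y_toy).score(X_toy, y_toy))
print("decision tree:", DecisionTreeClassifier().fit(X_toy, y_toy).score(X_toy, y_toy))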
Decision tree pseudocode (a minimal Python sketch follows the list):

1. Calculate the total Gini impurity for all possible nodes or subdivisions.
2. Select the node with the smallest total Gini impurity.
3. Go to the next decision level of the (sub)tree.
4. Repeat as required, or until only ‘pure’ nodes are created.
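Here is a minimal, illustrative sketch of this greedy procedure for one categorical feature (it is not scikit-learn’s implementation; the helper names are made up for this example, and the toy data anticipates the table in the next subsection):

import numpy as np

def gini_impurity(labels):
    # Gini impurity of one node: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p**2)

def total_gini_for_feature(feature_values, labels):
    # Weighted ("total") Gini impurity after splitting on every value of one feature
    n = len(labels)
    total = 0.0
    for value in np.unique(feature_values):
        mask = feature_values == value
        total += mask.sum() / n * gini_impurity(labels[mask])
    return total

# Toy usage with the clouds/rain data from the next section
clouds = np.array(["yes", "no", "yes", "no", "no", "no", "yes"])
rain = np.array(["yes", "no", "no", "no", "no", "no", "yes"])
total_gini_for_feature(clouds, rain)  # ~0.19; the feature with the smallest value wins the split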
NOTE: Decision trees are easy to build, use and interpret, BUT they are limited!
Their main disadvantage is INACCURACY: a single tree tends to overfit its training data and is not flexible with new samples.
8.2.1.1. Example: Building a Decision Tree from scratch#
First, we will build a decision tree by hand from the following, admittedly rather artificial, data set.
| Clouds | Temperature | Rain |
|---|---|---|
| yes | mild | yes |
| no | mild | no |
| yes | cold | no |
| no | cold | no |
| no | hot | no |
| no | mild | no |
| yes | hot | yes |
The tree should tell us whether it is raining or not, so “rain” is the target variable that forms the leaf nodes of the tree. The first step is to determine whether “clouds” or “temperature” is the better choice for the first decision level, i.e. the root node of the tree.
Gini impurity is used for this decision:
\[\begin{eqnarray*} Gini(D) = 1 - \sum_{i=1}^k p_i^2. \end{eqnarray*}\]
The running index \(i \in \{1, \dots, k\}\) labels the target classes of the leaf nodes to be distinguished; here \(k = 2\) with the classes [rain-yes, rain-no]. The probability \(p_1\) or \(p_2\) is therefore the probability of rain or no rain in data set \(D\).
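For example, the full data set above contains 2 rainy and 5 dry days, so \(Gini(D) = 1 - (2/7)^2 - (5/7)^2 = 20/49 \approx 0.41\).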
The total Gini impurity of a node that splits data set \(D\) with \(n\) entries into, e.g., 2 sub-data sets \(D_a\) and \(D_b\) with \(n_a\) and \(n_b\) entries respectively, is the weighted average of the individual Gini impurities:
\[\begin{eqnarray*} Gini_{total}(D) = \frac{n_a}{n} Gini(D_a) + \frac{n_b}{n} Gini(D_b), \end{eqnarray*}\]
where the indices \(a\) and \(b\) could stand for clouds=yes and clouds=no, for example.
Exercise A:
We now calculate the total Gini impurity of each of the following possible root nodes:

- Clouds \(=\) yes or no
- Temperature \(=\) cold, mild or hot
Exercise B:
The node with the lowest total Gini impurity is the best candidate for the first decision node, i.e. the root node of the tree. Determine the total Gini impurity for the first internal nodes of the remaining discriminator, i.e. for each of the two branches of the root node.
### Your code here ####
8.2.1.2. Solution#
def gini(p1, p2):
    # Gini impurity of a node with two classes occurring with probabilities p1 and p2
    GI = 1 - (p1**2 + p2**2)
    return GI
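A quick sanity check of the helper (the expected values follow directly from the formula):

print(gini(0.5, 0.5))  # maximum impurity for two equally likely classes: 0.5
print(gini(1.0, 0.0))  # a pure node has impurity 0.0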
## 1. clouds no
## results in: rain no --> 4 times, rain yes --> 0 times
p1 = 0/4 # rain yes
p2 = 4/4 # rain no
gini(p1,p2)
0.0
## 2. clouds yes
## results in: rain no --> 1 time, rain yes --> 2 times
p1 = 2/3 # rain yes
p2 = 1/3 # rain no
gini(p1,p2)
0.4444444444444444
## Calculating total GINI Impurity for the node "clouds"
na = 4 # total entries of clouds no
nb = 3 # total entries of clouds yes
p1a = 0/4 # clouds no, rain yes
p2a = 4/4 # clouds no, rain no
p1b = 2/3 # clouds yes, rain yes
p2b = 1/3 # clouds yes, rain no
def total_gini(p1a, p2a, p1b, p2b, na, nb):
    # weighted average of the Gini impurities of the two branches
    GI = na/(na+nb) * gini(p1a, p2a) + nb/(na+nb) * gini(p1b, p2b)
    return GI
total_gini(p1a, p2a, p1b, p2b, na, nb)
0.19047619047619047
Calculating total GINI Impurity for the node “temperature”
## 1. temperature cold
## results in: rain no --> 2 times, rain yes --> 0 times
p1 = 0/2 # rain yes
p2 = 2/2 # rain no
gini(p1,p2)
0.0
## 2. temperature mild
## results in: rain no --> 2 times, rain yes --> 1 time
p1 = 1/3 # rain yes
p2 = 2/3 # rain no
gini(p1,p2)
0.4444444444444444
## 3. temperature hot
## results in: rain no --> 1 time, rain yes --> 1 time
p1 = 1/2 # rain yes
p2 = 1/2 # rain no
gini(p1,p2)
0.5
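To complete Exercise A, the weighted total for the three-way “temperature” split can be computed from the branch impurities above (a short sketch reusing the gini helper defined earlier):

n_cold, n_mild, n_hot = 2, 3, 2  # entries per temperature value
n = n_cold + n_mild + n_hot
n_cold/n * gini(0/2, 2/2) + n_mild/n * gini(1/3, 2/3) + n_hot/n * gini(1/2, 1/2)

This evaluates to about 0.33, which is larger than the roughly 0.19 obtained for “clouds”; the “clouds” split therefore has the lower total Gini impurity and becomes the root node.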
8.2.2. Decision trees in Scikit-learn#
Next, we build and train a single decision tree on the data using scikit-learn. The tree will learn how to separate the points, building a flowchart of questions based on the feature values and the labels. At each stage, the decision tree makes splits by maximizing the reduction in Gini impurity. We’ll use the default hyperparameters for the decision tree, which means it can grow as deep as necessary to completely separate the classes. This leads to overfitting because the model memorizes the training data; in practice, we usually want to limit the depth of the tree so that it generalizes to test data.
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import (
    metrics,
)  # Import scikit-learn metrics module for accuracy calculation
from sklearn import tree
# load data
data = {
    "clouds": [1, 0, 1, 0, 0, 0, 1],
    "temp": [10, 20, -3, -5, 25, 10, 30],
    "rain": [1, 0, 0, 0, 0, 0, 1],
}
df = pd.DataFrame.from_dict(data)
df
|  | clouds | temp | rain |
|---|---|---|---|
| 0 | 1 | 10 | 1 |
| 1 | 0 | 20 | 0 |
| 2 | 1 | -3 | 0 |
| 3 | 0 | -5 | 0 |
| 4 | 0 | 25 | 0 |
| 5 | 0 | 10 | 0 |
| 6 | 1 | 30 | 1 |
# split dataset into features and target variable
feature_cols = ["clouds", "temp"]
X = df[feature_cols] # Features
y = df.rain # Target variable
# X = df.iloc[:, :-1].values
# y = df.iloc[:, -1].values
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)  # 70% training and 30% test
X_train
|  | clouds | temp |
|---|---|---|
| 0 | 1 | 10 |
| 4 | 0 | 25 |
| 3 | 0 | -5 |
| 5 | 0 | 10 |
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="gini")
# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)
y_train_pred = clf.predict(X_train)
# Model Accuracy based on training data, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_train, y_train_pred))
Accuracy: 1.0
tree.plot_tree(clf, feature_names=feature_cols)
[Text(0.5, 0.75, 'clouds <= 0.5\ngini = 0.375\nsamples = 4\nvalue = [3, 1]'),
Text(0.25, 0.25, 'gini = 0.0\nsamples = 3\nvalue = [3, 0]'),
Text(0.75, 0.25, 'gini = 0.0\nsamples = 1\nvalue = [0, 1]')]

# Predict the response for test dataset
y_pred = clf.predict(X_test)
## Model Accuracy, how often is the classifier correct based on the test data?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.6666666666666666
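The gap between the training accuracy of 1.0 and the test accuracy of about 0.67 illustrates the overfitting discussed above. A minimal sketch of how the tree’s growth could be limited (the specific values of max_depth and min_samples_leaf are illustrative, not tuned):

# Restrict tree growth so it cannot simply memorize the training data
clf_pruned = DecisionTreeClassifier(criterion="gini", max_depth=2, min_samples_leaf=2)
clf_pruned = clf_pruned.fit(X_train, y_train)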
print(clf.predict([[0, -5]])) ##clouds, temp
[0]
/opt/anaconda3/envs/jupyter_book_ml/lib/python3.9/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
warnings.warn(
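The warning appears because predict receives a plain Python list while the classifier was fitted on a DataFrame with column names; passing a DataFrame with the same columns avoids it:

# Same prediction, but with valid feature names
print(clf.predict(pd.DataFrame([[0, -5]], columns=feature_cols)))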
# Print the mean of squared residuals and the explained variance
print(f"Mean of squared residuals: {model.score(X, y)}")
print(f"% Var explained: {model.score(X, y) * 100}")
print(f"% OOB score: {1 - model.oob_score_}")
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[21], line 2
      1 # Print the mean of squared residuals and the explained variance
----> 2 print(f"Mean of squared residuals: {model.score(X, y)}")
3 print(f"% Var explained: {model.score(X, y) * 100}")
6 print(f"% OOB score: {1 - model.oob_score_}")
NameError: name 'model' is not defined
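The NameError shows that model is not defined in this part of the notebook; it presumably refers to a random forest fitted earlier on the isotope data introduced below. A minimal, hypothetical sketch of what such a model could look like (the estimator, its settings, and the fit call are assumptions, not the notebook’s original code):

from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in for the missing `model`: a random forest with out-of-bag
# scoring enabled, to be fitted on the features X and target y of that analysis
model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
# model.fit(X, y)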
# Print model statistics and extract the 10 most important variables
stats = model.get_params()
feature_importance = model.feature_importances_
features = X.columns
var_imp_df = pd.DataFrame({'Feature': features, 'Importance': feature_importance})
var_imp_df = var_imp_df.sort_values(by='Importance', ascending=False).head(54)
var_imp_10 = var_imp_df.head(10)
# Plot the most important variables (MDI = mean decrease in impurity)
plt.figure(figsize=(8, 3))
sns.barplot(x='Importance', y='Feature', data=var_imp_10, color='gray')
plt.title("MDI")
plt.xlabel("Importance score")
plt.ylabel("Feature")
plt.show()


Sample locations
The data set and further information about the sampling process can be found here.
Let us take a closer look at the data:
# Import the pandas and requests libraries and the StringIO module
import pandas as pd
import requests
from io import StringIO
# Read the data from the URL
url = "https://doi.pangaea.de/10.1594/PANGAEA.944811?format=textfile"
response = requests.get(url)
IsoW06 = pd.read_csv(StringIO(response.text), sep='\t', skiprows=267, header=1, encoding="UTF-8",
                     engine='python', on_bad_lines='skip')
# Display the first 6 data entries
IsoW06.head(6)
|  | Event | Sample ID | Latitude | Longitude | Date/Time | Samp type | Sample comment | δ18O H2O [‰ SMOW] | δD H2O [‰ SMOW] | δ18O H2O std dev [±] | δD H2O std dev [±] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WaterSA_SLW1 | SLW1 | -33.88917 | 18.96917 | 2016-08-29 | River | River at Pniel | -3.54 | -14.50 | 0.09 | 0.64 |
| 1 | WaterSA_SLW2 | SLW2 | -33.87800 | 19.03517 | 2016-08-29 | River | River Berg; abundant with insect larvae; dam u... | -3.33 | -13.62 | 0.09 | 0.45 |
| 2 | WaterSA_SLW3 | SLW3 | -33.93667 | 19.17000 | 2016-08-29 | River | Minor waterfall; iron rich | -4.44 | -22.33 | 0.04 | 0.59 |
| 3 | WaterSA_SLW4 | SLW4 | -33.69350 | 19.32483 | 2016-08-29 | River | River; abundant with insect larvae | -4.28 | -22.70 | 0.07 | 0.30 |
| 4 | WaterSA_SLW5 | SLW5 | -33.54333 | 19.20733 | 2016-08-29 | River | River Bree | -4.09 | -18.99 | 0.04 | 0.34 |
| 5 | WaterSA_SLW6 | SLW6 | -33.33367 | 19.87767 | 2016-08-30 | Lake | Reservoir lake; under almost natural condition... | -2.59 | -18.59 | 0.10 | 0.29 |
The data set contains 188 samples and the following 11 variables: Event, Sample ID, Latitude, Longitude, Date/Time, Samp type, Sample comment, δ18O H2O [‰ SMOW], δD H2O [‰ SMOW], δ18O H2O std dev [±], δD H2O std dev [±]. The isotope ratios are expressed in the conventional delta notation (δ18O, δ2H) in per mil (‰) relative to VSMOW (Vienna Standard Mean Ocean Water).
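A quick check of the dimensions stated above (188 samples, 11 variables):

IsoW06.shape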
8.2.2.1. Resources for this script:#
from IPython.display import IFrame
IFrame(
    src="../../citations/citation_Marie.html",
    width=900,
    height=200,
)