8.4. Random Forest Regression#


8.4.1. Random Forest Tutorial Aims:#

  • Random Forests with scikit-learn

  • How random forests work

  • How to use them for regression

  • How to evaluate their performance (and tune hyper-parameters)


8.4.2. Water level prediction in Baden-Württemberg#

A typical machine learning approach consists of several steps that can be arranged into a fairly standard workflow. Training the model, which is what we did in the previous examples for the different algorithms, is only a small part of this workflow. Data preprocessing, model evaluation and model selection are by far the more extensive tasks. In the following, we will walk through the entire workflow using a simple but more realistic example than the toy data used so far: we will build a Random Forest regression as an example of a typical ML workflow.

Our goal is to use climate data (temperature and precipitation) and predict the daily mean water level at a water level monitoring station in Baden-Württemberg.

8.4.3. Data Collection#

For this example, we will use a daily time series from the water level (Pegel) stations in Baden-Württemberg (BW), downloadable from UDO. The data used here was downloaded on 2024-07-22. A detailed description of the variables is available here. For the purpose of this tutorial, the dataset is provided here.

Figure: Water level (Pegel) stations in Baden-Württemberg and the selected stations.

For the corresponding climate data we use the HYRAS dataset (Hydrometeorologische Rasterdatensätze) provided by the Deutscher Wetterdienst (DWD). The data is open access and can easily be downloaded here. The climate data at the water level monitoring station is already appended to the provided csv file.

8.4.4. Read Data Set#

%load_ext lab_black
# First, let's import all the needed libraries.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import os
# Move one directory up so that the relative data path below resolves
os.chdir("..")
data_raw = pd.read_csv(
    "data/Jagstzell_Jagst_dailyData_HYRAS_Pegelstand.csv",
    # sep=";",
    # na_values=["-999"],
    skipinitialspace=True,
)

data_raw = data_raw.drop(columns=["Unnamed: 0"])
data_raw.head(10)

The data set contains 18628 samples. Besides the Date column, it holds the three data variables pr (precipitation), tas (mean temperature) and Wert (water level), plus two metadata columns, Einheit and Produkt.

data_raw.dtypes
Date        object
pr         float64
tas        float64
Wert       float64
Einheit     object
Produkt     object
dtype: object

8.4.4.1. Remove unnecessary columns#

By looking at the data and consulting the variable description, we can immediately find variables that are not useful for our purpose and drop them.

  • Einheit

  • Produkt

data = data_raw.drop(columns=["Einheit", "Produkt"])
data
Date pr tas Wert
0 1970-01-01 12:00:00 0.3 -5.6 NaN
1 1970-01-02 12:00:00 0.6 -3.7 NaN
2 1970-01-03 12:00:00 3.6 -0.5 NaN
3 1970-01-04 12:00:00 11.2 -0.4 NaN
4 1970-01-05 12:00:00 1.5 -2.5 NaN
... ... ... ... ...
18623 2020-12-27 12:00:00 23.6 1.7 36.0
18624 2020-12-28 12:00:00 6.5 2.5 32.0
18625 2020-12-29 12:00:00 1.9 1.7 32.0
18626 2020-12-30 12:00:00 0.2 2.4 28.0
18627 2020-12-31 12:00:00 2.5 1.1 28.0

18628 rows × 4 columns

8.4.4.2. Rename columns for convenience#

To improve the readability, we rename the remaining columns.

data = data.rename(
    columns={
        "pr": "prec_sum",
        "tas": "temp_mean",
        "Wert": "water_level",
    }
)
data
Date prec_sum temp_mean water_level
0 1970-01-01 12:00:00 0.3 -5.6 NaN
1 1970-01-02 12:00:00 0.6 -3.7 NaN
2 1970-01-03 12:00:00 3.6 -0.5 NaN
3 1970-01-04 12:00:00 11.2 -0.4 NaN
4 1970-01-05 12:00:00 1.5 -2.5 NaN
... ... ... ... ...
18623 2020-12-27 12:00:00 23.6 1.7 36.0
18624 2020-12-28 12:00:00 6.5 2.5 32.0
18625 2020-12-29 12:00:00 1.9 1.7 32.0
18626 2020-12-30 12:00:00 0.2 2.4 28.0
18627 2020-12-31 12:00:00 2.5 1.1 28.0

18628 rows × 4 columns

8.4.4.3. Drop missing data#

As a next step, we check the data for missing values.

data.isnull().sum()
Date              0
prec_sum          0
temp_mean         0
water_level    4292
dtype: int64

As we can see, the share of missing values in our data is quite high (4292 out of 18628 rows). Since we only have missing values in the water_level column, we can conclude that our water level time series is considerably shorter than the climate time series. Therefore, the simplest solution is to drop all rows with missing data.

If our modelling goal were different, other solutions might be more useful, such as imputing the missing values (we could even use a Random Forest for that!).
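As an aside, here is a minimal, hypothetical sketch of what such a Random-Forest-based imputation could look like with scikit-learn's experimental IterativeImputer. It is not used in this workflow, and the parameter values are illustrative assumptions only.

# Hedged sketch only: Random-Forest-based imputation with scikit-learn's
# experimental IterativeImputer. This is NOT part of the workflow below.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
# Each missing entry is predicted from the other numeric columns
data_imputed = pd.DataFrame(
    rf_imputer.fit_transform(data[["prec_sum", "temp_mean", "water_level"]]),
    columns=["prec_sum", "temp_mean", "water_level"],
    index=data.index,
)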

# Find the first row with an observed water level and drop all leading rows
# for which the water level has not yet been recorded
first_row_not_na = data.water_level.notna().idxmax()
data = data.iloc[first_row_not_na:]

8.4.4.4. Convert Index to DatetimeIndex#

# Use the Date column as a DatetimeIndex and drop the original column
data = data.set_index(pd.DatetimeIndex(data["Date"]))
del data["Date"]

8.4.4.5. Append information about Month#

We know that the infiltration behaviour of soils changes with the season. That is why we keep this information, using the month of the year as a simple proxy for the season.

data["month"] = data.index.month

8.4.4.6. Append Rolling Mean#

The discharge behaviour of a river does, of course, not only depend on the precipitation and temperature of the respective day; the preceding days also have an influence. To cover this, we add rolling 5- and 10-day precipitation sums and rolling 5- and 10-day temperature means.

data["prec_sum_rolling_5"] = data["prec_sum"].rolling(5).sum()
data["prec_sum_rolling_10"] = data["prec_sum"].rolling(10).sum()


data["temp_mean_rolling_5"] = data["temp_mean"].rolling(5).mean()
data["temp_mean_rolling_10"] = data["temp_mean"].rolling(10).mean()

8.4.4.7. Train-Test split#

from sklearn.model_selection import train_test_split

# split features and target first
X, y = data.drop(columns=["water_level"]), data["water_level"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test), len(y_train), len(y_test))
8035 2009 8035 2009

Note: The records are randomly shuffled before splitting to avoid any effects from the original order of the records. See the aside below for a chronological alternative.
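As a hedged aside (an alternative, not what is done in this tutorial): because the data is a daily time series, one may instead split chronologically, so that the test period lies strictly after the training period and no "future" information leaks into training.

# Hedged aside, not used in this tutorial: chronological 80/20 split,
# training on the earlier part of the record and testing on the most recent part
split_idx = int(len(X) * 0.8)
X_train_chrono, X_test_chrono = X.iloc[:split_idx], X.iloc[split_idx:]
y_train_chrono, y_test_chrono = y.iloc[:split_idx], y.iloc[split_idx:]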

8.4.5. Initial Model Set Up#

All the necessary pre-processing is now done and we can finally get to the actual model. The Random Forest (RF) algorithm is applied to assess the relative importance of the meteorological variables for the water level. Our goal is to predict the daily mean water level (cm).

We start with a performance test of the RF algorithm on our data by creating an initial RF model with the RandomForestRegressor() class. It predicts the water level from all other variables using 1000 trees (note that, for this initial check, the model is fitted on the full data set rather than only the training split).

# Import modules
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Set the random seed and the number of trees for the Random Forest
seed = 196
n_estimators = 1000

# Create the Random Forest and train the model on X and y
model = RandomForestRegressor(
    n_estimators=n_estimators,
    random_state=seed,
    max_features=1.0,
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=None,
)
model.fit(X, y)
RandomForestRegressor(n_estimators=1000, random_state=196)
# Print R² (coefficient of determination) and the explained variance on the training data
print(f"R² on the training data: {model.score(X, y)}")
print(f"% Var explained: {model.score(X, y) * 100}")
R² on the training data: 0.9251145889770064
% Var explained: 92.51145889770063

With max_features=1.0, all features are considered as split candidates at each node. On the training data, about 92.5% of the variability in water_level can be explained by the model; keep in mind that this is an optimistic estimate, because the model is evaluated on the same data it was trained on.

By visualizing the model performance, we can observe how error decreases with the number of trees.

# NOTE: this cell takes 2-3 minutes to run!
# Fit the model with an increasing number of trees and record the out-of-bag (OOB) error
n_estimators_range = list(range(1, 1000, 50))
train_scores = []
oob_error = []

for n_estimators in n_estimators_range:
    model = RandomForestRegressor(
        n_estimators=n_estimators, random_state=seed, oob_score=True
    )
    model.fit(X, y)
    train_scores.append(model.score(X, y))  # training R², not used in the plot below
    oob_error.append(1 - model.oob_score_)  # OOB error = 1 - OOB R²
/opt/anaconda3/envs/jupyter_book_ml/lib/python3.9/site-packages/sklearn/ensemble/_forest.py:615: UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable OOB estimates.
  warn(
# Plot the OOB error as a function of the number of trees to choose a suitable n_estimators value
plt.plot(n_estimators_range, oob_error, marker="o")
plt.xlabel("Number of trees")
plt.ylabel("OOB error")
plt.title("Random Forest OOB error vs. number of trees")
plt.show()
Figure: OOB error as a function of the number of trees.

As shown in the figure, the OOB error stabilizes at around 500 trees. Thus, in the following step we fix n_estimators=500 and tune max_features.

8.4.6. Hyper-parameter Tuning#

To see which hyper-parameters are available, we can inspect the parameter values of our (most recently fitted) random forest model.

model.get_params()
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 951,
 'n_jobs': None,
 'oob_score': True,
 'random_state': 196,
 'verbose': 0,
 'warm_start': False}

That is quite a long list! How do we know where to start? A good place is the documentation on the random forest in Scikit-Learn. This tells us the most important settings are the number of trees in the forest (n_estimators) and the number of features considered for splitting at each node (max_features).

  • n_estimators = number of trees in the forest

  • max_features = max number of features considered for splitting a node

  • max_depth = max number of levels in each decision tree

  • min_samples_split = min number of data points placed in a node before the node is split

  • min_samples_leaf = min number of data points allowed in a leaf node

  • bootstrap = method for sampling data points (with or without replacement)

We could go read the research papers on the random forest and try to theorize the best hyperparameters, but a more efficient use of our time is just to try out a wide range of values and see what works!
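As a hedged aside: scikit-learn also offers GridSearchCV for trying out many parameter combinations with cross-validation in one go. The sketch below is illustrative only (the grid values are assumptions, not tuned results); in this tutorial we instead loop over the values manually.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative (assumed) parameter grid, not the grid used in this tutorial
param_grid = {
    "n_estimators": [200, 500],
    "max_features": [1, 2, 3, 4],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=196),
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
# search.fit(X_train, y_train)                     # fits one model per combination
# print(search.best_params_, -search.best_score_)  # best parameters and their RMSE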

8.4.6.1. max_features#

Now we will tune the parameter that determines the number of variables that are randomly sampled as candidates at each split (max_features). The number of trees and the number of variables per split are crucial for the model performance. We apply a range of max_features values from 1 to 4, while n_estimators is fixed at 500. This grid represents the different combinations of hyper-parameters to be tested.

# Set the random seed and the number of trees, and seed the random number generator
seed = 196
n_estimators = 500
np.random.seed(seed)

# Hyper-parameter tuning: fit one model per max_features value and record the
# out-of-bag R² (model.oob_score_); higher values indicate a better model
max_features_range = list(range(1, 5, 1))
hyper_grid = {
    "max_features": max_features_range,
    "n_estimators": [n_estimators] * len(max_features_range),
    "OOB_R2": [0] * len(max_features_range),
}

for i, max_features in enumerate(max_features_range):
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_features=max_features,
        oob_score=True,
        random_state=seed,
    )
    model.fit(X, y)
    hyper_grid["OOB_R2"][i] = model.oob_score_

hyper_grid_df = pd.DataFrame(hyper_grid)
hyper_grid_df = hyper_grid_df.sort_values(by="OOB_R2", ascending=False)
hyper_grid_df
hyper_grid_df
max_features n_estimators OOB_R2
2 3 500 0.455568
3 4 500 0.455360
1 2 500 0.450079
0 1 500 0.432123

For each parameter combination, the out-of-bag (OOB) R² is calculated. With 500 trees, the best model uses max_features = 3 (OOB R²: 0.4556). Note that these OOB scores are considerably lower than the R² of roughly 0.93 computed on the training data above; this is expected, since OOB scores are estimated from samples that were not used to build the respective trees.
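If an actual OOB RMSE is preferred as the tuning metric, it can be computed from the out-of-bag predictions stored on the fitted model; a hedged sketch for the last fitted model:

# Hedged sketch: compute an OOB RMSE (lower is better) from the out-of-bag
# predictions of the last fitted model; with enough trees, every sample
# receives an OOB prediction
oob_rmse = np.sqrt(mean_squared_error(y, model.oob_prediction_))
print(f"OOB RMSE of the last fitted model: {oob_rmse:.2f}")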

8.4.7. Final Model#

Now the final RF model with the optimized parameters can be created. The feature_importances_ attribute of the fitted model is used to calculate the importance of each variable. This tells us which variables are most predictive of the water level.

# Import modules (some already imported above)
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Set the random seed and the number of trees, and seed the random number generator
seed = 196
n_estimators = 500
np.random.seed(seed)
# Pick the max_features value with the best (highest) OOB R²
best_idx = np.argmax(hyper_grid["OOB_R2"])
max_features = hyper_grid["max_features"][best_idx]
# Create and train the final Random Forest
model = RandomForestRegressor(
    n_estimators=n_estimators, max_features=max_features, random_state=seed
)
model.fit(X, y)
RandomForestRegressor(max_features=3, n_estimators=500, random_state=196)
# Extract the feature importances and keep the ten most important variables
feature_importance = model.feature_importances_
features = X.columns
var_imp_df = pd.DataFrame({"Feature": features, "Importance": feature_importance})
var_imp_df = var_imp_df.sort_values(by="Importance", ascending=False)
var_imp_10 = var_imp_df.head(10)
# Compute RMSE (root mean square error) and R² on the training data to evaluate the model
y_pred = model.predict(X)
rmse = np.sqrt(mean_squared_error(y, y_pred))
r2 = r2_score(y, y_pred)
# Plot the most important variables
plt.figure(figsize=(8, 3))
sns.barplot(x="Importance", y="Feature", data=var_imp_10, color="gray")
plt.title("Water level")
plt.xlabel("Importance score")
plt.ylabel("Feature")
plt.text(
    0,
    -1.5,
    f"n_estimators = {n_estimators}, max_features = {max_features}\nRMSE = {rmse:.2f}, R² = {r2:.2f}",
    fontsize=10,
    ha="left",
    va="center",
)
plt.show()
Figure: Importance scores of the ten most important features for predicting the water level.

For predicting the water level, the precipitation sum over the past 10 days appears to be the most important variable, followed by the 5-day precipitation sum and the temperature-based features.
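Note that the final model above was fitted on all of the data; the train/test split created earlier was not used. As a hedged sketch (an aside, not part of the original notebook), the tuned model could additionally be evaluated on the held-out test set:

# Hedged sketch, not part of the original notebook: refit on the training split
# and evaluate on the held-out test set for a less optimistic performance estimate
model_holdout = RandomForestRegressor(
    n_estimators=n_estimators, max_features=max_features, random_state=seed
)
model_holdout.fit(X_train, y_train)

y_test_pred = model_holdout.predict(X_test)
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_test_pred)):.2f}")
print(f"Test R²:   {r2_score(y_test, y_test_pred):.2f}")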


8.4.7.1. Resources for this script:#

from IPython.display import IFrame

IFrame(
    src="../../citations/citation_marie.html",
    width=900,
    height=200,
)