8.4. Random Forest Regression#


8.4.1. Random Forest Tutorial Aims:#

  • Random Forests with scikit-learn

  • How random forests work

  • How to use them for regression

  • How to evaluate their performance (and tune hyper-parameters)


8.4.2. Water level prediction in Baden-Württemberg#

A typical machine learning approach consists of several steps that can be arranged into a fairly standard workflow. Training the model, which is what we did in the previous examples for the different algorithms, is only a small part of this workflow. Data preprocessing, model evaluation and model selection are by far the more extensive tasks. In the following, we will walk through the entire workflow using a simple but more realistic example than the toy data used so far: we will build a Random Forest regression as an example of a typical ML workflow.

Our goal is to use climate data (temperature and precipitation) and predict the daily mean water level at a water level monitoring station in Baden-Württemberg.

8.4.3. Data Collection#

For this example, we will use a daily time series from the water level (Pegel) stations in Baden-Württemberg (BW), downloadable from UDO. The data used here was downloaded on 2024-07-22. A detailed description of the variables is available here. For the purpose of this tutorial, the dataset is provided here.

Figure: Water level (Pegel) stations in Baden-Württemberg and the selected stations.

For the corresponding climate data we use the HYRAS dataset (Hydrometeorologische Rasterdatensätze) provided by the Deutscher Wetterdienst (DWD). The data is open access and can easily be downloaded here. The climate data at the water level monitoring station is already appended to the provided csv file.

8.4.4. Read Data Set#

%load_ext lab_black
# First, let's import all the needed libraries.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import os
# Move one directory up so that the relative data path below resolves
os.chdir("..")
data_raw = pd.read_csv(
    "data/Jagstzell_Jagst_dailyData_HYRAS_Pegelstand.csv",
    # sep=";",
    # na_values=["-999"],
    skipinitialspace=True,
)

data_raw = data_raw.drop(columns=["Unnamed: 0"])
data_raw.head(10)

The data set contains 18628 samples. Besides the Date column, it holds the three data variables pr (precipitation), tas (mean temperature) and Wert (water level), plus two metadata columns, Einheit and Produkt.

data_raw.dtypes
Date        object
pr         float64
tas        float64
Wert       float64
Einheit     object
Produkt     object
dtype: object

8.4.4.1. Remove unnecessary columns#

By looking at the data and consulting the variable description, we can immediately find variables that are not useful for our purpose and drop them.

  • Einheit

  • Produkt

data = data_raw.drop(columns=["Einheit", "Produkt"])
data
Date pr tas Wert
0 1970-01-01 12:00:00 0.3 -5.6 NaN
1 1970-01-02 12:00:00 0.6 -3.7 NaN
2 1970-01-03 12:00:00 3.6 -0.5 NaN
3 1970-01-04 12:00:00 11.2 -0.4 NaN
4 1970-01-05 12:00:00 1.5 -2.5 NaN
... ... ... ... ...
18623 2020-12-27 12:00:00 23.6 1.7 36.0
18624 2020-12-28 12:00:00 6.5 2.5 32.0
18625 2020-12-29 12:00:00 1.9 1.7 32.0
18626 2020-12-30 12:00:00 0.2 2.4 28.0
18627 2020-12-31 12:00:00 2.5 1.1 28.0

18628 rows × 4 columns

8.4.4.2. Rename columns for convenience#

To improve the readability, we rename the remaining columns.

data = data.rename(
    columns={
        "pr": "prec_sum",
        "tas": "temp_mean",
        "Wert": "water_level",
    }
)
data
Date prec_sum temp_mean water_level
0 1970-01-01 12:00:00 0.3 -5.6 NaN
1 1970-01-02 12:00:00 0.6 -3.7 NaN
2 1970-01-03 12:00:00 3.6 -0.5 NaN
3 1970-01-04 12:00:00 11.2 -0.4 NaN
4 1970-01-05 12:00:00 1.5 -2.5 NaN
... ... ... ... ...
18623 2020-12-27 12:00:00 23.6 1.7 36.0
18624 2020-12-28 12:00:00 6.5 2.5 32.0
18625 2020-12-29 12:00:00 1.9 1.7 32.0
18626 2020-12-30 12:00:00 0.2 2.4 28.0
18627 2020-12-31 12:00:00 2.5 1.1 28.0

18628 rows × 4 columns

8.4.4.3. Drop missing data#

As a next step, we check the data for missing values.

data.isnull().sum()
Date              0
prec_sum          0
temp_mean         0
water_level    4292
dtype: int64

As we can see, the share of missing values in our data is quite high (4292 out of 18628 rows). Since we only have missing values in the water_level column, we can conclude that our water level time series is considerably shorter than the climate time series. Therefore, the simplest solution is to drop all rows with missing data.

If our modelling goal were different, other solutions might be more useful, such as imputing the missing values (we could even use a Random Forest for that!).
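As an aside, here is a minimal, hypothetical sketch of what such a Random-Forest-based imputation could look like with scikit-learn's experimental IterativeImputer. It is not used in this workflow, and the parameter values are illustrative assumptions only.

# Hedged sketch only: Random-Forest-based imputation with scikit-learn's
# experimental IterativeImputer. This is NOT part of the workflow below.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
# Each missing entry is predicted from the other numeric columns
data_imputed = pd.DataFrame(
    rf_imputer.fit_transform(data[["prec_sum", "temp_mean", "water_level"]]),
    columns=["prec_sum", "temp_mean", "water_level"],
    index=data.index,
)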

# Find the first row with an observed water level and drop all leading rows
# for which the water level has not yet been recorded
first_row_not_na = data.water_level.notna().idxmax()
data = data.iloc[first_row_not_na:]

8.4.4.4. Convert Index to DatetimeIndex#

# Use the Date column as a DatetimeIndex and drop the original column
data = data.set_index(pd.DatetimeIndex(data["Date"]))
del data["Date"]

8.4.4.5. Append information about Month#

We know that the infiltration behaviour of soils changes with the season. That is why we keep this information, using the month of the year as a simple proxy for the season.

data["month"] = data.index.month

8.4.4.6. Append Rolling Mean#

The discharge behaviour of a river does, of course, not only depend on the precipitation and temperature of the respective day; the preceding days also have an influence. To cover this, we add rolling 5- and 10-day precipitation sums and rolling 5- and 10-day temperature means.

data["prec_sum_rolling_5"] = data["prec_sum"].rolling(5).sum()
data["prec_sum_rolling_10"] = data["prec_sum"].rolling(10).sum()


data["temp_mean_rolling_5"] = data["temp_mean"].rolling(5).mean()
data["temp_mean_rolling_10"] = data["temp_mean"].rolling(10).mean()

8.4.4.7. Train-Test split#

from sklearn.model_selection import train_test_split

# split features and target first
X, y = data.drop(columns=["water_level"]), data["water_level"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test), len(y_train), len(y_test))
8035 2009 8035 2009

Note: The records are randomly shuffled before splitting to avoid any effects from the original order of the records. See the aside below for a chronological alternative.
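As a hedged aside (an alternative, not what is done in this tutorial): because the data is a daily time series, one may instead split chronologically, so that the test period lies strictly after the training period and no "future" information leaks into training.

# Hedged aside, not used in this tutorial: chronological 80/20 split,
# training on the earlier part of the record and testing on the most recent part
split_idx = int(len(X) * 0.8)
X_train_chrono, X_test_chrono = X.iloc[:split_idx], X.iloc[split_idx:]
y_train_chrono, y_test_chrono = y.iloc[:split_idx], y.iloc[split_idx:]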

8.4.5. Initial Model Set Up#

All the necessary pre-processing is now done and we can finally get to the actual model. The Random Forest (RF) algorithm is applied to assess the relative importance of the meteorological variables for the water level. Our goal is to predict the daily mean water level (cm).

We start with a performance test of the RF algorithm on our data by creating an initial RF model with the RandomForestRegressor() class. It predicts the water level from all other variables using 1000 trees (note that, for this initial check, the model is fitted on the full data set rather than only the training split).

# Import modules
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Set the random seed and the number of trees for the Random Forest
seed = 196
n_estimators = 1000

# Create the Random Forest and train the model on X and y
model = RandomForestRegressor(
    n_estimators=n_estimators,
    random_state=seed,
    max_features=1.0,
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=None,
)
model.fit(X, y)
RandomForestRegressor(n_estimators=1000, random_state=196)
# Print R² (coefficient of determination) and the explained variance on the training data
print(f"R² on the training data: {model.score(X, y)}")
print(f"% Var explained: {model.score(X, y) * 100}")
R² on the training data: 0.9251145889770064
% Var explained: 92.51145889770063

With max_features=1.0, all features are considered as split candidates at each node. On the training data, about 92.5% of the variability in water_level can be explained by the model; keep in mind that this is an optimistic estimate, because the model is evaluated on the same data it was trained on.

By visualizing the model performance, we can observe how error decreases with the number of trees.

# NOTE: this cell takes 2-3 minutes to run!
# Fit the model with an increasing number of trees and record the out-of-bag (OOB) error
n_estimators_range = list(range(1, 1000, 50))
train_scores = []
oob_error = []

for n_estimators in n_estimators_range:
    model = RandomForestRegressor(
        n_estimators=n_estimators, random_state=seed, oob_score=True
    )
    model.fit(X, y)
    train_scores.append(model.score(X, y))  # training R², not used in the plot below
    oob_error.append(1 - model.oob_score_)  # OOB error = 1 - OOB R²
/opt/anaconda3/envs/jupyter_book_ml/lib/python3.9/site-packages/sklearn/ensemble/_forest.py:615: UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable OOB estimates.
  warn(
# Plot the OOB error as a function of the number of trees to choose a suitable n_estimators value
plt.plot(n_estimators_range, oob_error, marker="o")
plt.xlabel("Number of trees")
plt.ylabel("OOB error")
plt.title("Random Forest OOB error vs. number of trees")
plt.show()
Figure: OOB error as a function of the number of trees.

As shown in the figure, the OOB error stabilizes at around 500 trees. Thus, in the following step we fix n_estimators=500 and tune max_features.

8.4.6. Hyper-parameter Tuning#

To see which hyper-parameters are available, we can inspect the parameter values of our (most recently fitted) random forest model.

model.get_params()
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 951,
 'n_jobs': None,
 'oob_score': True,
 'random_state': 196,
 'verbose': 0,
 'warm_start': False}

That is quite a long list! How do we know where to start? A good place is the documentation on the random forest in Scikit-Learn. This tells us the most important settings are the number of trees in the forest (n_estimators) and the number of features considered for splitting at each node (max_features).

  • n_estimators = number of trees in the forest

  • max_features = max number of features considered for splitting a node

  • max_depth = max number of levels in each decision tree

  • min_samples_split = min number of data points placed in a node before the node is split

  • min_samples_leaf = min number of data points allowed in a leaf node

  • bootstrap = method for sampling data points (with or without replacement)

We could go read the research papers on the random forest and try to theorize the best hyperparameters, but a more efficient use of our time is just to try out a wide range of values and see what works!
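As a hedged aside: scikit-learn also offers GridSearchCV for trying out many parameter combinations with cross-validation in one go. The sketch below is illustrative only (the grid values are assumptions, not tuned results); in this tutorial we instead loop over the values manually.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative (assumed) parameter grid, not the grid used in this tutorial
param_grid = {
    "n_estimators": [200, 500],
    "max_features": [1, 2, 3, 4],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=196),
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
# search.fit(X_train, y_train)                     # fits one model per combination
# print(search.best_params_, -search.best_score_)  # best parameters and their RMSE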

8.4.6.1. max_features#

Now we will tune the parameter that determines the number of variables that are randomly sampled as candidates at each split (max_features). The number of trees and the number of variables per split are crucial for the model performance. We apply a range of max_features values from 1 to 4, while n_estimators is fixed at 500. This grid represents the different combinations of hyper-parameters to be tested.

# Set the random seed and the number of trees, and seed the random number generator
seed = 196
n_estimators = 500
np.random.seed(seed)

# Hyper-parameter tuning: fit one model per max_features value and record the
# out-of-bag R² (model.oob_score_); higher values indicate a better model
max_features_range = list(range(1, 5, 1))
hyper_grid = {
    "max_features": max_features_range,
    "n_estimators": [n_estimators] * len(max_features_range),
    "OOB_R2": [0] * len(max_features_range),
}

for i, max_features in enumerate(max_features_range):
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_features=max_features,
        oob_score=True,
        random_state=seed,
    )
    model.fit(X, y)
    hyper_grid["OOB_R2"][i] = model.oob_score_

hyper_grid_df = pd.DataFrame(hyper_grid)
hyper_grid_df = hyper_grid_df.sort_values(by="OOB_R2", ascending=False)
hyper_grid_df
hyper_grid_df
max_features n_estimators OOB_R2
2 3 500 0.455568
3 4 500 0.455360
1 2 500 0.450079
0 1 500 0.432123

For each parameter combination, the out-of-bag (OOB) R² is calculated. With 500 trees, the best model uses max_features = 3 (OOB R²: 0.4556). Note that these OOB scores are considerably lower than the R² of roughly 0.93 computed on the training data above; this is expected, since OOB scores are estimated from samples that were not used to build the respective trees.
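If an actual OOB RMSE is preferred as the tuning metric, it can be computed from the out-of-bag predictions stored on the fitted model; a hedged sketch for the last fitted model:

# Hedged sketch: compute an OOB RMSE (lower is better) from the out-of-bag
# predictions of the last fitted model; with enough trees, every sample
# receives an OOB prediction
oob_rmse = np.sqrt(mean_squared_error(y, model.oob_prediction_))
print(f"OOB RMSE of the last fitted model: {oob_rmse:.2f}")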

8.4.7. Final Model#

Now the final RF model with the optimized parameters can be created. The feature_importances_ attribute of the fitted model is used to calculate the importance of each variable. This tells us which variables are most predictive of the water level.

# Import modules (some already imported above)
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Set the random seed and the number of trees, and seed the random number generator
seed = 196
n_estimators = 500
np.random.seed(seed)
# Pick the max_features value with the best (highest) OOB R²
best_idx = np.argmax(hyper_grid["OOB_R2"])
max_features = hyper_grid["max_features"][best_idx]
# Create and train the final Random Forest
model = RandomForestRegressor(
    n_estimators=n_estimators, max_features=max_features, random_state=seed
)
model.fit(X, y)
RandomForestRegressor(max_features=3, n_estimators=500, random_state=196)
# Extract the feature importances and keep the ten most important variables
feature_importance = model.feature_importances_
features = X.columns
var_imp_df = pd.DataFrame({"Feature": features, "Importance": feature_importance})
var_imp_df = var_imp_df.sort_values(by="Importance", ascending=False)
var_imp_10 = var_imp_df.head(10)
# Compute RMSE (root mean square error) and R² on the training data to evaluate the model
y_pred = model.predict(X)
rmse = np.sqrt(mean_squared_error(y, y_pred))
r2 = r2_score(y, y_pred)
# Plot the most important variables
plt.figure(figsize=(8, 3))
sns.barplot(x="Importance", y="Feature", data=var_imp_10, color="gray")
plt.title("Water level")
plt.xlabel("Importance score")
plt.ylabel("Feature")
plt.text(
    0,
    -1.5,
    f"n_estimators = {n_estimators}, max_features = {max_features}\nRMSE = {rmse:.2f}, R² = {r2:.2f}",
    fontsize=10,
    ha="left",
    va="center",
)
plt.show()
Figure: Importance scores of the ten most important features for predicting the water level.

For predicting the water level, the precipitation sum over the past 10 days appears to be the most important variable, followed by the 5-day precipitation sum and the temperature-based features.
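Note that the final model above was fitted on all of the data; the train/test split created earlier was not used. As a hedged sketch (an aside, not part of the original notebook), the tuned model could additionally be evaluated on the held-out test set:

# Hedged sketch, not part of the original notebook: refit on the training split
# and evaluate on the held-out test set for a less optimistic performance estimate
model_holdout = RandomForestRegressor(
    n_estimators=n_estimators, max_features=max_features, random_state=seed
)
model_holdout.fit(X_train, y_train)

y_test_pred = model_holdout.predict(X_test)
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_test_pred)):.2f}")
print(f"Test R²:   {r2_score(y_test, y_test_pred):.2f}")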


8.4.7.1. Resources for this script:#

from IPython.display import IFrame

IFrame(
    src="../../citations/citation_marie.html",
    width=900,
    height=200,
)