6.1. Time Series Preprocessing#
We begin this chapter by loading the meteorological data set from the DWD (German Weather Service).
# First, let's import all the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Read the data set:
df = pd.read_csv("data/Tagesmittel_Dahlem.csv", sep=";")
# Name the columns:
df.columns = [
"STAT",
"Date",
"QN",
"TG",
"TN",
"TM",
"TX",
"RFM",
"FM",
"FX",
"SO",
"NM",
"RR",
"PM",
]
# Drop columns we want to ignore:
df = df.drop(columns=["STAT", "QN", "PM"])
df
Date | TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20200325 | -7.3 | -2.6 | 3.9 | 11.0 | 41.0 | 3.0 | 11.5 | 11.8 | 0.5 | 0.0 |
1 | 20200324 | -8.3 | -4.8 | 2.1 | 8.7 | 46.5 | 3.0 | 10.3 | 11.9 | 0.7 | 0.0 |
2 | 20200323 | -8.2 | -4.4 | 0.8 | 6.7 | 52.5 | 2.0 | 8.1 | 11.8 | 0.2 | 0.0 |
3 | 20200322 | -7.1 | -2.5 | 1.8 | 6.5 | 55.3 | 3.0 | 11.0 | 11.8 | 0.4 | 0.0 |
4 | 20200321 | -4.5 | 0.2 | 4.1 | 7.4 | 64.8 | 4.0 | 13.6 | 8.0 | 2.8 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
494 | 20181117 | -5.5 | -2.5 | 1.5 | 5.5 | 84.7 | 3.0 | 9.4 | 7.8 | 1.3 | 0.0 |
495 | 20181116 | -1.6 | 1.6 | 5.0 | 9.3 | 79.9 | 3.0 | 11.5 | 8.2 | 2.3 | 0.0 |
496 | 20181115 | 1.1 | 4.2 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 |
497 | 20181114 | 1.2 | 4.7 | 7.2 | 11.0 | 96.1 | 2.0 | 8.8 | 6.4 | 5.0 | 0.0 |
498 | 20181113 | 4.4 | 6.8 | 9.7 | 12.3 | 97.4 | 2.0 | 11.2 | 1.1 | 6.0 | 8.9 |
499 rows × 11 columns
6.1.1. Data information: Descriptions of the German Weather Service (Deutscher Wetterdienst; DWD):#
Abbreviation | Parameter description
---|---
QN | Quality level of the data
TG | Min. temp. at 5 cm above the ground; °C; TIII of the previous day to TIII; 23:51 previous day - 23:50 UTC
TN | Min. temp. at 2 m above the ground; °C; TIII of the previous day to TIII; 23:51 previous day - 23:50 UTC
TM | Mean temp. at 2 m above the ground; °C; (TI+TII+2*TIII)/4 or mean of 24 hourly values
TX | Max. temp. at 2 m above the ground; °C; TIII of the previous day to TIII; 23:51 previous day - 23:50 UTC
RFM | Mean relative humidity; %; (TI+TII+TIII)/3 or mean of 24 hourly values
FM | Mean wind force; Bft; (TI+TII+TIII)/3 or mean of 24 hourly values
FX | Maximum wind speed (peak gust); m/sec; 00 - 23:59 UTC or 23:51 previous day - 23:50 UTC
SO | Sum of sunshine hours; 00 - 23:59 UTC or 23:51 previous day - 23:50 UTC
NM | Mean cloud cover in eighths; (TI+TII+TIII)/3 or mean of 24 hourly values
RR | Precipitation height; mm; TI to TI of the following day; 05:51 - 05:50 UTC of the following day
PM | Mean air pressure at station altitude; hPa; (TI+TII+TIII)/3 or mean of 24 hourly values
In general, the following applies to the calculation of daily averages:
As of 01.04.2001, the standard was changed as follows:
Daily averages are calculated from 24 hourly values. If more than 3 hourly values are missing, they are calculated from the 4 main observation times (00, 06, 12, 18 UTC). The reference period for a day is usually 23:51 UTC of the previous day to 23:50 UTC; only the precipitation of the previous day is measured at 05:50 UTC in the morning.
The observation times are based on Greenwich time (GMT or UTC). The observation time is always 10 minutes before the reference time (hence the odd times). This change became necessary after the station network was largely automated.
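As a small illustration of the (TI+TII+2*TIII)/4 rule from the table above, with made-up values for the three main observation times:
# Hypothetical temperatures at the three main observation times (°C)
t_1, t_2, t_3 = 5.0, 12.0, 8.0
# Daily mean according to the (TI + TII + 2*TIII) / 4 rule
tm = (t_1 + t_2 + 2 * t_3) / 4
tm  # 8.25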
6.1.1.1. Sort the data set by date: 2018…2020#
df = df.sort_values(by=["Date"], ascending=True)
df
Date | TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | |
---|---|---|---|---|---|---|---|---|---|---|---|
498 | 20181113 | 4.4 | 6.8 | 9.7 | 12.3 | 97.4 | 2.0 | 11.2 | 1.1 | 6.0 | 8.9 |
497 | 20181114 | 1.2 | 4.7 | 7.2 | 11.0 | 96.1 | 2.0 | 8.8 | 6.4 | 5.0 | 0.0 |
496 | 20181115 | 1.1 | 4.2 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 |
495 | 20181116 | -1.6 | 1.6 | 5.0 | 9.3 | 79.9 | 3.0 | 11.5 | 8.2 | 2.3 | 0.0 |
494 | 20181117 | -5.5 | -2.5 | 1.5 | 5.5 | 84.7 | 3.0 | 9.4 | 7.8 | 1.3 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4 | 20200321 | -4.5 | 0.2 | 4.1 | 7.4 | 64.8 | 4.0 | 13.6 | 8.0 | 2.8 | 0.0 |
3 | 20200322 | -7.1 | -2.5 | 1.8 | 6.5 | 55.3 | 3.0 | 11.0 | 11.8 | 0.4 | 0.0 |
2 | 20200323 | -8.2 | -4.4 | 0.8 | 6.7 | 52.5 | 2.0 | 8.1 | 11.8 | 0.2 | 0.0 |
1 | 20200324 | -8.3 | -4.8 | 2.1 | 8.7 | 46.5 | 3.0 | 10.3 | 11.9 | 0.7 | 0.0 |
0 | 20200325 | -7.3 | -2.6 | 3.9 | 11.0 | 41.0 | 3.0 | 11.5 | 11.8 | 0.5 | 0.0 |
499 rows × 11 columns
Exercise:
What data type do the values in the Date column have?
### Your code here ####
6.1.1.2. Solution#
df["Date"]
498 20181113
497 20181114
496 20181115
495 20181116
494 20181117
...
4 20200321
3 20200322
2 20200323
1 20200324
0 20200325
Name: Date, Length: 499, dtype: int64
We found that we are dealing with int64 (integers) here. But we would like to transform these numbers into dates!
6.1.2. Python’s Datetime#
Transform integers to dates:
from datetime import datetime
df["Datum"] = pd.to_datetime(df["Date"], format="%Y%m%d")
df.head()
Date | TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | Datum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
498 | 20181113 | 4.4 | 6.8 | 9.7 | 12.3 | 97.4 | 2.0 | 11.2 | 1.1 | 6.0 | 8.9 | 2018-11-13 |
497 | 20181114 | 1.2 | 4.7 | 7.2 | 11.0 | 96.1 | 2.0 | 8.8 | 6.4 | 5.0 | 0.0 | 2018-11-14 |
496 | 20181115 | 1.1 | 4.2 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 | 2018-11-15 |
495 | 20181116 | -1.6 | 1.6 | 5.0 | 9.3 | 79.9 | 3.0 | 11.5 | 8.2 | 2.3 | 0.0 | 2018-11-16 |
494 | 20181117 | -5.5 | -2.5 | 1.5 | 5.5 | 84.7 | 3.0 | 9.4 | 7.8 | 1.3 | 0.0 | 2018-11-17 |
df["Datum"]
498 2018-11-13
497 2018-11-14
496 2018-11-15
495 2018-11-16
494 2018-11-17
...
4 2020-03-21
3 2020-03-22
2 2020-03-23
1 2020-03-24
0 2020-03-25
Name: Datum, Length: 499, dtype: datetime64[ns]
Well done! The type is now datetime! Now we would like the dates to be our index instead of the integer row labels.
6.1.2.1. Set the date as Index:#
First, set the ‘Datum’ column as the index. Second, drop the ‘Date’ column:
# set the 'Datum' columns as index:
df = df.set_index("Datum")
# Drop columns we want to ignore:
df = df.drop(columns=["Date"])
df
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | |
---|---|---|---|---|---|---|---|---|---|---|
Datum | ||||||||||
2018-11-13 | 4.4 | 6.8 | 9.7 | 12.3 | 97.4 | 2.0 | 11.2 | 1.1 | 6.0 | 8.9 |
2018-11-14 | 1.2 | 4.7 | 7.2 | 11.0 | 96.1 | 2.0 | 8.8 | 6.4 | 5.0 | 0.0 |
2018-11-15 | 1.1 | 4.2 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 |
2018-11-16 | -1.6 | 1.6 | 5.0 | 9.3 | 79.9 | 3.0 | 11.5 | 8.2 | 2.3 | 0.0 |
2018-11-17 | -5.5 | -2.5 | 1.5 | 5.5 | 84.7 | 3.0 | 9.4 | 7.8 | 1.3 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-03-21 | -4.5 | 0.2 | 4.1 | 7.4 | 64.8 | 4.0 | 13.6 | 8.0 | 2.8 | 0.0 |
2020-03-22 | -7.1 | -2.5 | 1.8 | 6.5 | 55.3 | 3.0 | 11.0 | 11.8 | 0.4 | 0.0 |
2020-03-23 | -8.2 | -4.4 | 0.8 | 6.7 | 52.5 | 2.0 | 8.1 | 11.8 | 0.2 | 0.0 |
2020-03-24 | -8.3 | -4.8 | 2.1 | 8.7 | 46.5 | 3.0 | 10.3 | 11.9 | 0.7 | 0.0 |
2020-03-25 | -7.3 | -2.6 | 3.9 | 11.0 | 41.0 | 3.0 | 11.5 | 11.8 | 0.5 | 0.0 |
499 rows × 10 columns
6.1.3. Benefits of the DatetimeIndex#
6.1.3.1. Subsetting and slicing#
A very common task in time series analysis is the subsetting of a time series.
Maybe you are interested in the oldest or newest observations, or you would like to extract a certain range of data.
Since we have set the date as the index, we can easily look at the data for a certain day:
df.loc["2018-11-13"]
TG 4.4
TN 6.8
TM 9.7
TX 12.3
RFM 97.4
FM 2.0
FX 11.2
SO 1.1
NM 6.0
RR 8.9
Name: 2018-11-13 00:00:00, dtype: float64
Or we can have a look at the data of a certain time interval:
How was the weather last June?
df_June = df["2019-06-1":"2019-06-30"]
df_June
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | |
---|---|---|---|---|---|---|---|---|---|---|
Datum | ||||||||||
2019-06-01 | 10.3 | 14.0 | 20.3 | 27.1 | 68.0 | 2.0 | 10.9 | 12.7 | 4.0 | 0.0 |
2019-06-02 | 10.1 | 14.6 | 22.7 | 31.3 | 60.3 | 2.0 | 7.4 | 13.4 | 3.1 | 0.0 |
2019-06-03 | 11.4 | 15.6 | 24.8 | 33.2 | 52.8 | 3.0 | 17.1 | 11.9 | 2.9 | 0.0 |
2019-06-04 | 13.3 | 17.2 | 23.6 | 30.6 | 61.0 | 2.0 | 8.8 | 11.6 | 2.9 | 0.0 |
2019-06-05 | 12.2 | 16.0 | 25.7 | 33.6 | 56.1 | 3.0 | 10.7 | 14.2 | 1.3 | 0.0 |
2019-06-06 | 14.1 | 14.5 | 21.0 | 31.7 | 78.3 | 3.0 | 13.4 | 8.4 | 4.8 | 5.4 |
2019-06-07 | 10.1 | 12.8 | 19.4 | 25.7 | 68.9 | 2.0 | 9.8 | 14.1 | 2.5 | 0.1 |
2019-06-08 | 5.9 | 10.4 | 17.9 | 22.6 | 59.3 | 3.0 | 16.8 | 8.6 | 3.8 | 0.0 |
2019-06-09 | 4.8 | 8.7 | 18.8 | 25.9 | 53.9 | 2.0 | 9.5 | 12.5 | 4.7 | 3.4 |
2019-06-10 | 14.0 | 14.5 | 21.1 | 27.2 | 72.0 | 2.0 | 10.2 | 5.8 | 6.8 | 1.8 |
2019-06-11 | 14.6 | 16.8 | 23.0 | 29.1 | 74.8 | 2.0 | 16.0 | 11.3 | 6.3 | 48.0 |
2019-06-12 | 15.5 | 16.6 | 22.0 | 33.0 | 79.0 | 3.0 | 24.0 | 9.7 | 5.9 | 13.6 |
2019-06-13 | 12.5 | 14.7 | 18.9 | 24.6 | 72.3 | 2.0 | 7.5 | 8.9 | 3.6 | 0.0 |
2019-06-14 | 10.1 | 12.8 | 22.8 | 29.5 | 56.3 | 3.0 | 11.0 | 12.1 | 4.1 | 0.0 |
2019-06-15 | 15.7 | 17.4 | 24.9 | 31.5 | 62.9 | 4.0 | 16.0 | 10.8 | 5.4 | 0.0 |
2019-06-16 | 11.3 | 14.0 | 17.9 | 21.5 | 73.0 | 2.0 | 8.4 | 5.8 | 6.8 | 0.0 |
2019-06-17 | 12.0 | 14.4 | 20.4 | 27.1 | 66.1 | 2.0 | 7.2 | 10.3 | 3.2 | 0.0 |
2019-06-18 | 11.1 | 14.6 | 22.8 | 29.4 | 57.1 | 3.0 | 10.6 | 15.3 | 2.6 | 0.0 |
2019-06-19 | 12.6 | 16.6 | 24.3 | 32.1 | 60.9 | 2.0 | 11.6 | 12.8 | 5.3 | 0.9 |
2019-06-20 | 14.2 | 17.1 | 21.0 | 25.7 | 78.2 | 2.0 | 14.5 | 7.0 | 6.0 | 1.9 |
2019-06-21 | 10.7 | 14.9 | 19.5 | 24.7 | 70.8 | 2.0 | 9.2 | 10.0 | 4.8 | 0.0 |
2019-06-22 | 9.9 | 14.2 | 20.1 | 25.7 | 63.4 | 3.0 | 11.8 | 12.4 | 2.8 | 0.0 |
2019-06-23 | 10.0 | 14.0 | 21.2 | 27.8 | 53.6 | 3.0 | 11.7 | 14.6 | 1.4 | 0.0 |
2019-06-24 | 10.1 | 14.8 | 22.8 | 29.7 | 47.8 | 3.0 | 9.8 | 15.9 | 3.7 | 0.0 |
2019-06-25 | 12.5 | 17.5 | 25.5 | 33.5 | 51.2 | 3.0 | 11.3 | 15.4 | 4.5 | 0.0 |
2019-06-26 | 15.1 | 18.7 | 27.9 | 36.1 | 53.7 | 3.0 | 14.1 | 15.1 | 2.0 | 0.0 |
2019-06-27 | 12.8 | 14.6 | 19.3 | 25.3 | 57.8 | 3.0 | 11.7 | 14.4 | 1.7 | 0.0 |
2019-06-28 | 8.4 | 11.7 | 17.9 | 23.9 | 65.0 | 2.0 | 8.8 | 10.6 | 3.3 | 0.0 |
2019-06-29 | 8.4 | 12.2 | 21.7 | 30.3 | 57.0 | 2.0 | 7.3 | 15.7 | 0.6 | 0.0 |
2019-06-30 | 10.1 | 14.6 | 27.6 | 37.6 | 47.6 | 3.0 | 12.4 | 15.0 | 1.3 | 0.0 |
Another nice feature of pandas and Python’s datetime is generating a new time series.
We can define a start date start="6/1/2019" and an end date end="6/1/2020" for the series.
Moreover, we can define a frequency, for example hourly (‘H’), calendar daily (‘D’), business daily (‘B’), weekly (‘W’), monthly (‘M’), quarterly (‘Q’), annual (‘A’), and many others. Frequencies can also be specified as multiples of any of the base frequencies, for example ‘5D’ for every five days. There are many more…
For example:
New_series = pd.date_range(start="6/1/2019", end="6/1/2020", freq="B")
New_series
DatetimeIndex(['2019-06-03', '2019-06-04', '2019-06-05', '2019-06-06',
'2019-06-07', '2019-06-10', '2019-06-11', '2019-06-12',
'2019-06-13', '2019-06-14',
...
'2020-05-19', '2020-05-20', '2020-05-21', '2020-05-22',
'2020-05-25', '2020-05-26', '2020-05-27', '2020-05-28',
'2020-05-29', '2020-06-01'],
dtype='datetime64[ns]', length=261, freq='B')
or:
daily_index = pd.date_range(start="4/1/2018", end="4/30/2018", freq="D")
daily_index
DatetimeIndex(['2018-04-01', '2018-04-02', '2018-04-03', '2018-04-04',
'2018-04-05', '2018-04-06', '2018-04-07', '2018-04-08',
'2018-04-09', '2018-04-10', '2018-04-11', '2018-04-12',
'2018-04-13', '2018-04-14', '2018-04-15', '2018-04-16',
'2018-04-17', '2018-04-18', '2018-04-19', '2018-04-20',
'2018-04-21', '2018-04-22', '2018-04-23', '2018-04-24',
'2018-04-25', '2018-04-26', '2018-04-27', '2018-04-28',
'2018-04-29', '2018-04-30'],
dtype='datetime64[ns]', freq='D')
6.1.3.2. Easy plotting#
Let’s take another look at the DWD data to see one major benefit of the DatetimeIndex.
Plot the whole data set
import matplotlib.pyplot as plt
df.plot()
<Axes: xlabel='Datum'>

Or plot a variable that you would like to analyze, for example the mean temperature:
df["TM"].plot()
<Axes: xlabel='Datum'>

Exercise:
Was the sun shining in summer 2019? Plot the sunshine duration ‘SO’ (or another variable) from the DWD data set for the time slice May, June, July, and August 2019.
6.1.4. Summary statistics#
Recall the basic functions for statistics using pandas (a short example follows after this list):
Quantiles:
DataFrame.quantile(self, q=0.5, axis=0, numeric_only=True, interpolation='linear')
Median:
DataFrame.median(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Mean:
DataFrame.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Variance:
DataFrame.var(self, axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Standard deviation:
DataFrame.std(self, axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
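For example, applied to the mean temperature column TM of our DataFrame (just a quick sketch; the exact numbers depend on the data loaded above):
# Basic statistics for the mean temperature TM
df["TM"].mean()          # arithmetic mean
df["TM"].median()        # median
df["TM"].std()           # standard deviation
df["TM"].var()           # variance
df["TM"].quantile(0.9)   # 90 % quantile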
To get a quick overview of the statistical characteristics of a time series we can also use the describe() function. It returns basic statistics for the whole data set.
df.describe()
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | |
---|---|---|---|---|---|---|---|---|---|---|
count | 499.000000 | 499.000000 | 499.000000 | 499.000000 | 499.000000 | 499.000000 | 498.000000 | 499.000000 | 499.000000 | 499.000000 |
mean | 2.829259 | 5.308016 | 9.466934 | 13.663327 | 77.739880 | 2.803607 | 11.627309 | 4.785571 | 5.353507 | 1.424048 |
std | 5.781158 | 5.769047 | 6.934181 | 8.509267 | 13.515453 | 0.689499 | 3.813549 | 4.560062 | 2.019166 | 3.571932 |
min | -8.400000 | -6.800000 | -5.500000 | -3.500000 | 41.000000 | 1.000000 | 3.500000 | 0.000000 | 0.000000 | 0.000000 |
25% | -1.600000 | 0.950000 | 4.350000 | 7.000000 | 67.600000 | 2.000000 | 8.925000 | 0.300000 | 4.100000 | 0.000000 |
50% | 1.700000 | 4.100000 | 7.500000 | 11.400000 | 78.800000 | 3.000000 | 11.000000 | 3.500000 | 5.800000 | 0.000000 |
75% | 6.700000 | 9.400000 | 14.400000 | 20.150000 | 88.450000 | 3.000000 | 13.600000 | 8.350000 | 7.000000 | 1.200000 |
max | 19.200000 | 20.300000 | 27.900000 | 37.600000 | 100.000000 | 5.000000 | 24.300000 | 15.900000 | 8.000000 | 48.000000 |
df.describe().round(1)
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | |
---|---|---|---|---|---|---|---|---|---|---|
count | 499.0 | 499.0 | 499.0 | 499.0 | 499.0 | 499.0 | 498.0 | 499.0 | 499.0 | 499.0 |
mean | 2.8 | 5.3 | 9.5 | 13.7 | 77.7 | 2.8 | 11.6 | 4.8 | 5.4 | 1.4 |
std | 5.8 | 5.8 | 6.9 | 8.5 | 13.5 | 0.7 | 3.8 | 4.6 | 2.0 | 3.6 |
min | -8.4 | -6.8 | -5.5 | -3.5 | 41.0 | 1.0 | 3.5 | 0.0 | 0.0 | 0.0 |
25% | -1.6 | 1.0 | 4.4 | 7.0 | 67.6 | 2.0 | 8.9 | 0.3 | 4.1 | 0.0 |
50% | 1.7 | 4.1 | 7.5 | 11.4 | 78.8 | 3.0 | 11.0 | 3.5 | 5.8 | 0.0 |
75% | 6.7 | 9.4 | 14.4 | 20.2 | 88.4 | 3.0 | 13.6 | 8.4 | 7.0 | 1.2 |
max | 19.2 | 20.3 | 27.9 | 37.6 | 100.0 | 5.0 | 24.3 | 15.9 | 8.0 | 48.0 |
The describe function can be applied to time slices, too!
df["2019-01-01":"2020-01-1"].describe().round(1)
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | |
---|---|---|---|---|---|---|---|---|---|---|
count | 366.0 | 366.0 | 366.0 | 366.0 | 366.0 | 366.0 | 366.0 | 366.0 | 366.0 | 366.0 |
mean | 3.9 | 6.5 | 11.2 | 15.8 | 76.0 | 2.7 | 11.4 | 5.6 | 5.1 | 1.4 |
std | 6.1 | 6.1 | 7.2 | 8.8 | 13.6 | 0.7 | 3.6 | 4.7 | 2.0 | 3.8 |
min | -8.4 | -6.8 | -5.5 | -3.5 | 42.3 | 1.0 | 4.4 | 0.0 | 0.0 | 0.0 |
25% | -1.0 | 1.7 | 5.5 | 8.4 | 65.8 | 2.0 | 8.9 | 1.2 | 3.9 | 0.0 |
50% | 3.2 | 6.0 | 10.3 | 15.6 | 76.0 | 3.0 | 10.8 | 4.9 | 5.4 | 0.0 |
75% | 8.9 | 11.4 | 17.0 | 22.7 | 87.2 | 3.0 | 13.1 | 9.7 | 6.7 | 1.1 |
max | 19.2 | 20.3 | 27.9 | 37.6 | 100.0 | 5.0 | 24.3 | 15.9 | 8.0 | 48.0 |
… you can also ask for the mean (standard deviation, and so on) separately. What are the mean values in January 2020?
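One possible way to answer this, sketched with a time slice and mean():
# Mean values of all variables in January 2020
df["2020-01-01":"2020-01-31"].mean().round(1)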
6.1.5. Generating DatetimeIndex with periods and frequency arguments#
If we need to create an index with the dates of our data, we can do it with pandas as follows:
pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
Return a fixed frequency DatetimeIndex
Parameters:
start: str or datetime-like, optional, Left bound for generating dates.
end: str or datetime-like, optional, Right bound for generating dates.
periods: int, optional, Number of periods to generate.
freq: str or DateOffset, default ‘D’
Frequency strings can have multiples, e.g. ‘5H’. See here for a list of frequency aliases.
tz: str or tzinfo, optional, Time zone name for returning localized DatetimeIndex, for example ‘Asia/Hong_Kong’. By default, the resulting DatetimeIndex is timezone-naive.
normalize: bool, default False
Normalize start/end dates to midnight before generating date range.
name: str, default None, Name of the resulting DatetimeIndex.
closed: {None, ‘left’, ‘right’}, optional, Make the interval closed with respect to the given frequency to the ‘left’, ‘right’, or both sides (None, the default).
The default input for the frequency is ‘D’ which stands for daily. But there are many more specifications for the frequency:
B business day frequency
C custom business day frequency
D calendar day frequency
W weekly frequency
M month end frequency
Q quarter end frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
There are more… see: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases
Let us create a date range starting from the first of January 2019 with 72 entries on a daily basis:
rng = pd.date_range("1/1/2019", periods=72, freq="D")
rng
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-06', '2019-01-07', '2019-01-08',
'2019-01-09', '2019-01-10', '2019-01-11', '2019-01-12',
'2019-01-13', '2019-01-14', '2019-01-15', '2019-01-16',
'2019-01-17', '2019-01-18', '2019-01-19', '2019-01-20',
'2019-01-21', '2019-01-22', '2019-01-23', '2019-01-24',
'2019-01-25', '2019-01-26', '2019-01-27', '2019-01-28',
'2019-01-29', '2019-01-30', '2019-01-31', '2019-02-01',
'2019-02-02', '2019-02-03', '2019-02-04', '2019-02-05',
'2019-02-06', '2019-02-07', '2019-02-08', '2019-02-09',
'2019-02-10', '2019-02-11', '2019-02-12', '2019-02-13',
'2019-02-14', '2019-02-15', '2019-02-16', '2019-02-17',
'2019-02-18', '2019-02-19', '2019-02-20', '2019-02-21',
'2019-02-22', '2019-02-23', '2019-02-24', '2019-02-25',
'2019-02-26', '2019-02-27', '2019-02-28', '2019-03-01',
'2019-03-02', '2019-03-03', '2019-03-04', '2019-03-05',
'2019-03-06', '2019-03-07', '2019-03-08', '2019-03-09',
'2019-03-10', '2019-03-11', '2019-03-12', '2019-03-13'],
dtype='datetime64[ns]', freq='D')
We have created an index and now fill it with random integers between 0 and 9.
import numpy as np
ts = pd.Series(np.random.randint(0, 10, len(rng)), index=rng)
ts
2019-01-01 9
2019-01-02 3
2019-01-03 2
2019-01-04 2
2019-01-05 8
..
2019-03-09 9
2019-03-10 4
2019-03-11 6
2019-03-12 8
2019-03-13 6
Freq: D, Length: 72, dtype: int64
As another example, let’s create a date range at hourly frequency, specifying the start date and number of periods, instead of the start date and end date.
pd.date_range("2004-09-20", periods=8, freq="H")
/var/folders/vs/k3tcf9r11xq5pqzsg6kq3sx00000gn/T/ipykernel_9784/360346212.py:1: FutureWarning: 'H' is deprecated and will be removed in a future version, please use 'h' instead.
pd.date_range("2004-09-20", periods=8, freq="H")
DatetimeIndex(['2004-09-20 00:00:00', '2004-09-20 01:00:00',
'2004-09-20 02:00:00', '2004-09-20 03:00:00',
'2004-09-20 04:00:00', '2004-09-20 05:00:00',
'2004-09-20 06:00:00', '2004-09-20 07:00:00'],
dtype='datetime64[ns]', freq='h')
# To select an arbitrary sequence of date/time values from a pandas time series,
# we need to use a DatetimeIndex, rather than simply a list of date/time strings
times_sample = pd.to_datetime(["2018-11-13", "2018-11-14", "2018-11-15"])
# Select the specified dates and just the TM and RR columns
TM_sample = df.loc[times_sample, ["TM", "RR"]].copy()
TM_sample
TM | RR | |
---|---|---|
2018-11-13 | 9.7 | 8.9 |
2018-11-14 | 7.2 | 0.0 |
2018-11-15 | 6.3 | 0.0 |
df
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | |
---|---|---|---|---|---|---|---|---|---|---|
Datum | ||||||||||
2018-11-13 | 4.4 | 6.8 | 9.7 | 12.3 | 97.4 | 2.0 | 11.2 | 1.1 | 6.0 | 8.9 |
2018-11-14 | 1.2 | 4.7 | 7.2 | 11.0 | 96.1 | 2.0 | 8.8 | 6.4 | 5.0 | 0.0 |
2018-11-15 | 1.1 | 4.2 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 |
2018-11-16 | -1.6 | 1.6 | 5.0 | 9.3 | 79.9 | 3.0 | 11.5 | 8.2 | 2.3 | 0.0 |
2018-11-17 | -5.5 | -2.5 | 1.5 | 5.5 | 84.7 | 3.0 | 9.4 | 7.8 | 1.3 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-03-21 | -4.5 | 0.2 | 4.1 | 7.4 | 64.8 | 4.0 | 13.6 | 8.0 | 2.8 | 0.0 |
2020-03-22 | -7.1 | -2.5 | 1.8 | 6.5 | 55.3 | 3.0 | 11.0 | 11.8 | 0.4 | 0.0 |
2020-03-23 | -8.2 | -4.4 | 0.8 | 6.7 | 52.5 | 2.0 | 8.1 | 11.8 | 0.2 | 0.0 |
2020-03-24 | -8.3 | -4.8 | 2.1 | 8.7 | 46.5 | 3.0 | 10.3 | 11.9 | 0.7 | 0.0 |
2020-03-25 | -7.3 | -2.6 | 3.9 | 11.0 | 41.0 | 3.0 | 11.5 | 11.8 | 0.5 | 0.0 |
499 rows × 10 columns
Creating a time range with datetime, for example for the year 2011:
from datetime import datetime
start = datetime(2011, 1, 1)
end = datetime(2012, 1, 1)
index = pd.date_range(start, end)
pd.date_range(start, periods=10, freq="M")
/var/folders/vs/k3tcf9r11xq5pqzsg6kq3sx00000gn/T/ipykernel_9784/453833206.py:1: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
pd.date_range(start, periods=10, freq="M")
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-30',
'2011-05-31', '2011-06-30', '2011-07-31', '2011-08-31',
'2011-09-30', '2011-10-31'],
dtype='datetime64[ns]', freq='ME')
pd.date_range(start, periods=4, freq="QS")
DatetimeIndex(['2011-01-01', '2011-04-01', '2011-07-01', '2011-10-01'], dtype='datetime64[ns]', freq='QS-JAN')
6.1.6. Pandas series vs. dataframe#
Recall: While a DataFrame can be used to analyse many variables, a pandas Series is a one-dimensional ndarray with axis labels (including time series). Like this example:
index = pd.DatetimeIndex(["2014-07-04", "2014-08-04", "2015-07-04", "2015-08-04"])
data = pd.Series([0, 1, 2, 3], index=index)
data
2014-07-04 0
2014-08-04 1
2015-07-04 2
2015-08-04 3
dtype: int64
6.1.7. Dealing with missing values (NaNs)?#
Missing values are a well-known problem in data sets: nearly everywhere data is measured and recorded, gaps occur. Missing values can lead to problems, because further data processing and analysis steps often rely on complete data sets. Therefore, missing values need to be replaced with reasonable values. In statistics, this process is called imputation.
6.1.8. Imputation of univariate time series#
Techniques capable of doing imputation for univariate time series can be roughly divided into three categories (Moritz et al. 2015):
Univariate algorithms These algorithms work with univariate inputs, but typically do not employ the time series character of the data set. Examples are: mean, mode, median, random sample.
Univariate time series algorithms These algorithms are also able to work with univariate inputs, but make use of the time series characteristics. Examples of simple algorithms in this category are last observation carried forward (locf), next observation carried backward (nocb), arithmetic smoothing, and linear interpolation (see the sketch after this list). The more advanced algorithms are based on structural time series models and can handle seasonality.
Multivariate algorithms on lagged data Usually, multivariate algorithms cannot be applied to univariate data. But since time is an implicit variable for time series, it is possible to add time information as covariates in order to make it possible to apply multivariate imputation algorithms.
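Pandas offers simple building blocks for several of these univariate (time series) strategies; a minimal sketch on a small artificial series (not part of the DWD data):
import numpy as np
import pandas as pd
# Small artificial daily series with gaps
idx = pd.date_range("2019-01-01", periods=6, freq="D")
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0], index=idx)
s.ffill()           # last observation carried forward (locf)
s.bfill()           # next observation carried backward (nocb)
s.interpolate()     # linear interpolation between neighbouring values
s.fillna(s.mean())  # simple mean imputation (ignores the time structure)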
Let us test our options for handling missing data. We will continue with a version of the data set that contains missing values, Tagesmittel_Dahlem_MD.csv.
import pandas as pd
df = pd.read_csv(
"data/Tagesmittel_Dahlem_MD.csv", sep=";"
) # ,parse_dates=["Date"], index_col="Date")
df.columns = [
"STAT",
"Date",
"QN",
"TG",
"TN",
"TM",
"TX",
"RFM",
"FM",
"FX",
"SO",
"NM",
"RR",
"PM",
]
df = df.drop(columns=["STAT", "QN"])
df = df.sort_values(by=["Date"], ascending=True)
df["Datum"] = pd.to_datetime(df["Date"], format="%Y%m%d")
df = df.set_index("Datum")
df = df.drop(["Date"], axis=1)
df
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | PM | |
---|---|---|---|---|---|---|---|---|---|---|---|
Datum | |||||||||||
2018-11-13 | 4.4 | 6.8 | 9.7 | 12.3 | 97.4 | 2.0 | 11.2 | 1.1 | 6.0 | 8.9 | 1012.3 |
2018-11-14 | 1.2 | 4.7 | 7.2 | 11.0 | 96.1 | 2.0 | 8.8 | 6.4 | 5.0 | 0.0 | 1019.5 |
2018-11-15 | 1.1 | NaN | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 | 1020.6 |
2018-11-16 | -1.6 | 1.6 | 5.0 | 9.3 | 79.9 | 3.0 | 11.5 | 8.2 | 2.3 | NaN | 1025.1 |
2018-11-18 | -6.4 | -3.8 | 1.6 | 5.8 | 87.8 | 2.0 | 9.8 | 1.6 | 5.0 | 0.0 | 1020.9 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-03-20 | 4.7 | 5.2 | 7.7 | 9.5 | 73.0 | 2.0 | 11.2 | 0.0 | 7.5 | 0.0 | 1016.1 |
2020-03-21 | -4.5 | 0.2 | 4.1 | 7.4 | NaN | 4.0 | 13.6 | 8.0 | 2.8 | 0.0 | 1021.0 |
2020-03-23 | -8.2 | -4.4 | 0.8 | 6.7 | 52.5 | 2.0 | 8.1 | NaN | 0.2 | 0.0 | 1029.5 |
2020-03-24 | -8.3 | -4.8 | NaN | 8.7 | 46.5 | 3.0 | 10.3 | 11.9 | 0.7 | 0.0 | 1027.2 |
2020-03-25 | -7.3 | -2.6 | 3.9 | 11.0 | NaN | 3.0 | 11.5 | 11.8 | 0.5 | 0.0 | 1022.7 |
492 rows × 11 columns
6.1.8.1. Count the number of NaNs in each column#
df.isnull().sum()
TG 4
TN 8
TM 8
TX 3
RFM 5
FM 1
FX 4
SO 2
NM 1
RR 3
PM 3
dtype: int64
Furthermore, we calculate the fraction of missing values per column.
df.isnull().sum() / df.shape[0]
TG 0.008130
TN 0.016260
TM 0.016260
TX 0.006098
RFM 0.010163
FM 0.002033
FX 0.008130
SO 0.004065
NM 0.002033
RR 0.006098
PM 0.006098
dtype: float64
df.shape[0]
492
6.1.8.2. Fill the NaNs#
Fill the NaNs with values using the fillna()
function:
df["RR"] = df["RR"].fillna(0)
df["RR"]
Datum
2018-11-13 8.9
2018-11-14 0.0
2018-11-15 0.0
2018-11-16 0.0
2018-11-18 0.0
...
2020-03-20 0.0
2020-03-21 0.0
2020-03-23 0.0
2020-03-24 0.0
2020-03-25 0.0
Name: RR, Length: 492, dtype: float64
df["TN"] = df["TN"].fillna(df["TN"].mean())
df
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | PM | |
---|---|---|---|---|---|---|---|---|---|---|---|
Datum | |||||||||||
2018-11-13 | 4.4 | 6.800000 | 9.7 | 12.3 | 97.4 | 2.0 | 11.2 | 1.1 | 6.0 | 8.9 | 1012.3 |
2018-11-14 | 1.2 | 4.700000 | 7.2 | 11.0 | 96.1 | 2.0 | 8.8 | 6.4 | 5.0 | 0.0 | 1019.5 |
2018-11-15 | 1.1 | 5.382438 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 | 1020.6 |
2018-11-16 | -1.6 | 1.600000 | 5.0 | 9.3 | 79.9 | 3.0 | 11.5 | 8.2 | 2.3 | 0.0 | 1025.1 |
2018-11-18 | -6.4 | -3.800000 | 1.6 | 5.8 | 87.8 | 2.0 | 9.8 | 1.6 | 5.0 | 0.0 | 1020.9 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-03-20 | 4.7 | 5.200000 | 7.7 | 9.5 | 73.0 | 2.0 | 11.2 | 0.0 | 7.5 | 0.0 | 1016.1 |
2020-03-21 | -4.5 | 0.200000 | 4.1 | 7.4 | NaN | 4.0 | 13.6 | 8.0 | 2.8 | 0.0 | 1021.0 |
2020-03-23 | -8.2 | -4.400000 | 0.8 | 6.7 | 52.5 | 2.0 | 8.1 | NaN | 0.2 | 0.0 | 1029.5 |
2020-03-24 | -8.3 | -4.800000 | NaN | 8.7 | 46.5 | 3.0 | 10.3 | 11.9 | 0.7 | 0.0 | 1027.2 |
2020-03-25 | -7.3 | -2.600000 | 3.9 | 11.0 | NaN | 3.0 | 11.5 | 11.8 | 0.5 | 0.0 | 1022.7 |
492 rows × 11 columns
df["TN"].mean()
5.382438016528926
df["TN"] = df["TN"].round(1)
df
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | PM | |
---|---|---|---|---|---|---|---|---|---|---|---|
Datum | |||||||||||
2018-11-13 | 4.4 | 6.8 | 9.7 | 12.3 | 97.4 | 2.0 | 11.2 | 1.1 | 6.0 | 8.9 | 1012.3 |
2018-11-14 | 1.2 | 4.7 | 7.2 | 11.0 | 96.1 | 2.0 | 8.8 | 6.4 | 5.0 | 0.0 | 1019.5 |
2018-11-15 | 1.1 | 5.4 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 | 1020.6 |
2018-11-16 | -1.6 | 1.6 | 5.0 | 9.3 | 79.9 | 3.0 | 11.5 | 8.2 | 2.3 | 0.0 | 1025.1 |
2018-11-18 | -6.4 | -3.8 | 1.6 | 5.8 | 87.8 | 2.0 | 9.8 | 1.6 | 5.0 | 0.0 | 1020.9 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-03-20 | 4.7 | 5.2 | 7.7 | 9.5 | 73.0 | 2.0 | 11.2 | 0.0 | 7.5 | 0.0 | 1016.1 |
2020-03-21 | -4.5 | 0.2 | 4.1 | 7.4 | NaN | 4.0 | 13.6 | 8.0 | 2.8 | 0.0 | 1021.0 |
2020-03-23 | -8.2 | -4.4 | 0.8 | 6.7 | 52.5 | 2.0 | 8.1 | NaN | 0.2 | 0.0 | 1029.5 |
2020-03-24 | -8.3 | -4.8 | NaN | 8.7 | 46.5 | 3.0 | 10.3 | 11.9 | 0.7 | 0.0 | 1027.2 |
2020-03-25 | -7.3 | -2.6 | 3.9 | 11.0 | NaN | 3.0 | 11.5 | 11.8 | 0.5 | 0.0 | 1022.7 |
492 rows × 11 columns
Instead of filling NaNs with a single value, we can also fill them with, e.g., monthly or weekly means.
Add columns with year and month:
df["Year"] = df.index.year
df["Month"] = df.index.month
df.head(5)
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | PM | Year | Month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Datum | |||||||||||||
2018-11-13 | 4.4 | 6.8 | 9.7 | 12.3 | 97.4 | 2.0 | 11.2 | 1.1 | 6.0 | 8.9 | 1012.3 | 2018 | 11 |
2018-11-14 | 1.2 | 4.7 | 7.2 | 11.0 | 96.1 | 2.0 | 8.8 | 6.4 | 5.0 | 0.0 | 1019.5 | 2018 | 11 |
2018-11-15 | 1.1 | 5.4 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 | 1020.6 | 2018 | 11 |
2018-11-16 | -1.6 | 1.6 | 5.0 | 9.3 | 79.9 | 3.0 | 11.5 | 8.2 | 2.3 | 0.0 | 1025.1 | 2018 | 11 |
2018-11-18 | -6.4 | -3.8 | 1.6 | 5.8 | 87.8 | 2.0 | 9.8 | 1.6 | 5.0 | 0.0 | 1020.9 | 2018 | 11 |
Group the data by e.g. month, year, weekday etc. Then fill the NaNs with the corresponding mean.
df.fillna(df.groupby(df.Month).transform("mean"), inplace=True)
df
df.round(1)
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | PM | Year | Month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Datum | |||||||||||||
2018-11-13 | 4.4 | 6.8 | 9.7 | 12.3 | 97.4 | 2.0 | 11.2 | 1.1 | 6.0 | 8.9 | 1012.3 | 2018 | 11 |
2018-11-14 | 1.2 | 4.7 | 7.2 | 11.0 | 96.1 | 2.0 | 8.8 | 6.4 | 5.0 | 0.0 | 1019.5 | 2018 | 11 |
2018-11-15 | 1.1 | 5.4 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 | 1020.6 | 2018 | 11 |
2018-11-16 | -1.6 | 1.6 | 5.0 | 9.3 | 79.9 | 3.0 | 11.5 | 8.2 | 2.3 | 0.0 | 1025.1 | 2018 | 11 |
2018-11-18 | -6.4 | -3.8 | 1.6 | 5.8 | 87.8 | 2.0 | 9.8 | 1.6 | 5.0 | 0.0 | 1020.9 | 2018 | 11 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-03-20 | 4.7 | 5.2 | 7.7 | 9.5 | 73.0 | 2.0 | 11.2 | 0.0 | 7.5 | 0.0 | 1016.1 | 2020 | 3 |
2020-03-21 | -4.5 | 0.2 | 4.1 | 7.4 | 76.5 | 4.0 | 13.6 | 8.0 | 2.8 | 0.0 | 1021.0 | 2020 | 3 |
2020-03-23 | -8.2 | -4.4 | 0.8 | 6.7 | 52.5 | 2.0 | 8.1 | 4.0 | 0.2 | 0.0 | 1029.5 | 2020 | 3 |
2020-03-24 | -8.3 | -4.8 | 6.5 | 8.7 | 46.5 | 3.0 | 10.3 | 11.9 | 0.7 | 0.0 | 1027.2 | 2020 | 3 |
2020-03-25 | -7.3 | -2.6 | 3.9 | 11.0 | 76.5 | 3.0 | 11.5 | 11.8 | 0.5 | 0.0 | 1022.7 | 2020 | 3 |
492 rows × 13 columns
Check whether all NaNs are filled and calculate the number of NaNs again:
df.isnull().sum()
TG 0
TN 0
TM 0
TX 0
RFM 0
FM 0
FX 0
SO 0
NM 0
RR 0
PM 0
Year 0
Month 0
dtype: int64
Exercise:
Load the data set Tagesmittel_Dahlem_MD.csv
Fill the missing values of the maximal temperature variable TX with the monthly mean
Calculate and plot the weekly mean of the maximal temperature variable TX
6.1.9. Missing data (rows)#
Next, we use the drop function to remove some rows from our data set, creating gaps in the time series:
df = df.drop(df.index[[3, 7]])
df = df.round(1)
df.head(10)
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | PM | Year | Month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Datum | |||||||||||||
2018-11-13 | 4.4 | 6.8 | 9.7 | 12.3 | 97.4 | 2.0 | 11.2 | 1.1 | 6.0 | 8.9 | 1012.3 | 2018 | 11 |
2018-11-14 | 1.2 | 4.7 | 7.2 | 11.0 | 96.1 | 2.0 | 8.8 | 6.4 | 5.0 | 0.0 | 1019.5 | 2018 | 11 |
2018-11-15 | 1.1 | 5.4 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 | 1020.6 | 2018 | 11 |
2018-11-18 | -6.4 | -3.8 | 1.6 | 5.8 | 87.8 | 2.0 | 9.8 | 1.6 | 5.0 | 0.0 | 1020.9 | 2018 | 11 |
2018-11-19 | -0.1 | 2.2 | 3.9 | 5.6 | 88.3 | 4.0 | 15.4 | 0.5 | 6.6 | 0.7 | 1013.2 | 2018 | 11 |
2018-11-20 | 1.2 | 1.7 | 2.3 | 3.3 | 77.3 | 4.0 | 16.2 | 0.0 | 7.9 | 0.0 | 1009.3 | 2018 | 11 |
2018-11-22 | 0.5 | 0.5 | 1.1 | 1.9 | 86.3 | 3.0 | 9.5 | 0.0 | 8.0 | 0.0 | 1014.8 | 2018 | 11 |
2018-11-23 | -1.1 | 0.5 | 1.9 | 3.2 | 96.8 | 2.0 | 6.1 | 0.0 | 7.8 | 0.0 | 1012.1 | 2018 | 11 |
2018-11-24 | 0.5 | 0.2 | 2.2 | 3.5 | 98.0 | 2.0 | 10.1 | 0.0 | 7.9 | 0.0 | 1004.9 | 2018 | 11 |
2018-11-25 | 0.9 | 2.6 | 4.3 | 5.7 | 97.4 | 2.0 | 10.1 | 0.0 | 7.8 | 0.0 | 1003.0 | 2018 | 11 |
Missing rows can also be filled when we convert the time series to a specified frequency with the function asfreq():
DataFrame.asfreq(self: ~ FrameOrSeries, freq, method=None, how: Union[str, NoneType] = None, normalize: bool = False, fill_value=None)
There are two methods to fill the holes in the reindexed Series (note this does not fill NaNs that were already present), either with the last valid observation or with the next valid observation:
method = ‘pad’ / ‘ffill’: propagate the last valid observation forward to the next valid one
method = ‘backfill’ / ‘bfill’: use the next valid observation to fill the gap
Or you can specify a fill value, for example
fill_value = -9999
df = df.asfreq("D", method="ffill")
# df = df.asfreq('D',fill_value=-9999)
df.head(10)
TG | TN | TM | TX | RFM | FM | FX | SO | NM | RR | PM | Year | Month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Datum | |||||||||||||
2018-11-13 | 4.4 | 6.8 | 9.7 | 12.3 | 97.4 | 2.0 | 11.2 | 1.1 | 6.0 | 8.9 | 1012.3 | 2018 | 11 |
2018-11-14 | 1.2 | 4.7 | 7.2 | 11.0 | 96.1 | 2.0 | 8.8 | 6.4 | 5.0 | 0.0 | 1019.5 | 2018 | 11 |
2018-11-15 | 1.1 | 5.4 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 | 1020.6 | 2018 | 11 |
2018-11-16 | 1.1 | 5.4 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 | 1020.6 | 2018 | 11 |
2018-11-17 | 1.1 | 5.4 | 6.3 | 10.4 | 93.0 | 3.0 | 8.5 | 0.5 | 4.6 | 0.0 | 1020.6 | 2018 | 11 |
2018-11-18 | -6.4 | -3.8 | 1.6 | 5.8 | 87.8 | 2.0 | 9.8 | 1.6 | 5.0 | 0.0 | 1020.9 | 2018 | 11 |
2018-11-19 | -0.1 | 2.2 | 3.9 | 5.6 | 88.3 | 4.0 | 15.4 | 0.5 | 6.6 | 0.7 | 1013.2 | 2018 | 11 |
2018-11-20 | 1.2 | 1.7 | 2.3 | 3.3 | 77.3 | 4.0 | 16.2 | 0.0 | 7.9 | 0.0 | 1009.3 | 2018 | 11 |
2018-11-21 | 1.2 | 1.7 | 2.3 | 3.3 | 77.3 | 4.0 | 16.2 | 0.0 | 7.9 | 0.0 | 1009.3 | 2018 | 11 |
2018-11-22 | 0.5 | 0.5 | 1.1 | 1.9 | 86.3 | 3.0 | 9.5 | 0.0 | 8.0 | 0.0 | 1014.8 | 2018 | 11 |
6.1.10. Resampling#
It is often useful to resample our time series data to a lower or higher frequency. Resampling to a lower frequency (downsampling) usually involves an aggregation operation, for example computing monthly sales totals from daily data. The daily DWD data we are working with here is itself the result of aggregating higher-frequency observations, as described above. Resampling to a higher frequency (upsampling) is less common and often involves interpolation or another data filling method, for example interpolating hourly weather data to 10 minute intervals for input to a scientific model.
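Just as a sketch of what upsampling could look like with our daily data (interpolating the daily mean temperature to 6-hourly values; whether this makes sense depends on the application):
# Upsample the daily mean temperature TM to 6-hourly values by linear interpolation
df["TM"].resample("6h").interpolate(method="linear").head()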
We will focus here on downsampling, exploring how it can help us analyze our DWD data on various time scales. We use the DataFrame’s resample() method, which splits the DatetimeIndex into time bins and groups the data by time bin. The resample() method returns a Resampler object, similar to a pandas GroupBy object. We can then apply an aggregation method such as mean(), median(), sum(), etc., to the data group for each time bin. For example, let’s resample the data to a weekly mean time series.
df["TN"].resample("W").mean().head()
Datum
2018-11-18 3.983333
2018-11-25 1.342857
2018-12-02 -0.242857
2018-12-09 4.271429
2018-12-16 0.757143
Freq: W-SUN, Name: TN, dtype: float64
df["TN"].resample("M").mean().head()
/var/folders/vs/k3tcf9r11xq5pqzsg6kq3sx00000gn/T/ipykernel_9784/1959211299.py:1: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
df["TN"].resample("M").mean().head()
Datum
2018-11-30 1.416667
2018-12-31 2.777419
2019-01-31 -0.870968
2019-02-28 0.317857
2019-03-31 3.129032
Freq: ME, Name: TN, dtype: float64
df["TN"].resample("QS").mean().head()
Datum
2018-10-01 2.277551
2019-01-01 0.876667
2019-04-01 8.919780
2019-07-01 12.444565
2019-10-01 3.864130
Freq: QS-JAN, Name: TN, dtype: float64
df["TN"].resample("M").mean().plot(kind="bar")
/var/folders/vs/k3tcf9r11xq5pqzsg6kq3sx00000gn/T/ipykernel_9784/2194007315.py:1: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
df["TN"].resample("M").mean().plot(kind="bar")
<Axes: xlabel='Datum'>

Let’s plot the daily and weekly sunshine duration and temperature together over a winter.
data_columns = ["SO", "TN"]
# Resample to weekly frequency, aggregating with mean
df_weekly_mean = df[data_columns].resample("W").mean()
df_weekly_mean.head(10)
SO | TN | |
---|---|---|
Datum | ||
2018-11-18 | 1.766667 | 3.983333 |
2018-11-25 | 0.071429 | 1.342857 |
2018-12-02 | 1.942857 | -0.242857 |
2018-12-09 | 1.700000 | 4.271429 |
2018-12-16 | 0.214286 | 0.757143 |
2018-12-23 | 0.157143 | 1.742857 |
2018-12-30 | 0.000000 | 4.228571 |
2019-01-06 | 0.428571 | 0.500000 |
2019-01-13 | 0.071429 | 1.657143 |
2019-01-20 | 3.857143 | -0.828571 |
What do the different resolutions (daily vs. weekly) look like?
# Start and end of the date range to extract
start, end = "2018-12", "2019-03"
# Plot daily and weekly resampled time series together
fig, ax = plt.subplots()
ax.plot(
df.loc[start:end, "SO"], marker=".", linestyle="-", linewidth=0.5, label="Daily"
)
ax.plot(
df_weekly_mean.loc[start:end, "SO"],
marker="o",
markersize=8,
linestyle="-",
label="Weekly Mean Resample",
)
ax.set_ylabel("S0")
ax.legend()
<matplotlib.legend.Legend at 0x1315e8a90>

# Compute the monthly sums, setting the value to NaN for any month which has
# fewer than 28 days of data
opsd_monthly = df["SO"].resample("M").sum(min_count=28)
opsd_monthly.head(3)
/var/folders/vs/k3tcf9r11xq5pqzsg6kq3sx00000gn/T/ipykernel_9784/2537655571.py:4: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
opsd_monthly = df["SO"].resample("M").sum(min_count=28)
Datum
2018-11-30 NaN
2018-12-31 15.8
2019-01-31 57.0
Freq: ME, Name: SO, dtype: float64
Exercise:
Compute the weekly and monthly means of the maximal temperature TX of the DWD data set in the time range 1.1.2019 – 31.12.2019. Plot the results.
6.1.11. Rolling windows#
Rolling window operations are another important transformation for time series data. Similar to downsampling, rolling windows split the data into time windows, and the data in each window is aggregated with a function such as mean(), median(), sum(), etc. However, unlike downsampling, where the time bins do not overlap and the output is at a lower frequency than the input, rolling windows overlap and “roll” along at the same frequency as the data, so the transformed time series is at the same frequency as the original time series.
By default, all data points within a window are equally weighted in the aggregation, but this can be changed by specifying window types such as Gaussian, triangular, and others. We’ll stick with the standard equally weighted window here.
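Just to illustrate what such a weighted window would look like before we move on, here is a small sketch of a Gaussian-weighted 7-day rolling mean (pandas requires SciPy for weighted windows; the std value is the width of the Gaussian in observations and is chosen here purely for illustration):
# 7-day rolling mean with Gaussian weights instead of equal weights
df["TM"].rolling(7, center=True, win_type="gaussian").mean(std=2).head(10)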
Let’s use the rolling
method to compute the 7-day rolling mean of our daily data. We use the center=True argument to label each window at its midpoint:
# Compute the centered 7-day rolling mean
df_7d = df["SO"].rolling(7, center=True).mean()
df_7d
Datum
2018-11-13 NaN
2018-11-14 NaN
2018-11-15 NaN
2018-11-16 1.585714
2018-11-17 1.428571
...
2020-03-21 6.257143
2020-03-22 6.528571
2020-03-23 NaN
2020-03-24 NaN
2020-03-25 NaN
Freq: D, Name: SO, Length: 499, dtype: float64
We can see that the first non-missing rolling mean value is on 2018-11-16, because this is the midpoint of the first rolling window.
To visualize the differences between rolling mean and resampling, let’s plot the December 2019 to March 2020 sunshine duration, including the 7-day rolling mean along with the weekly mean resampled time series and the original daily data:
# Start and end of the date range to extract
start, end = "2019-12", "2020-03"
# Plot daily and weekly resampled time series together
fig, ax = plt.subplots()
ax.plot(
df.loc[start:end, "SO"], marker=".", linestyle="-", linewidth=0.5, label="Daily"
)
ax.plot(
df_weekly_mean.loc[start:end, "SO"],
marker="o",
markersize=8,
linestyle="-",
label="Weekly Mean Resample",
)
ax.plot(df_7d.loc[start:end], marker=".", linestyle="-", label="7-d Rolling Mean")
ax.set_ylabel("SO")
ax.legend()
<matplotlib.legend.Legend at 0x131694850>

We can see that data points in the rolling mean time series have the same spacing as the daily data, but the curve is smoother because higher frequency variability has been averaged out. In the rolling mean time series, the peaks and troughs tend to align closely with the peaks and troughs of the daily time series. In contrast, the peaks and troughs in the weekly resampled time series are less closely aligned with the daily time series, since the resampled time series is at a coarser granularity.
Exercise:
Compute the 3-day, 7-day, and 14-day rolling means of the maximal temperature TX of the DWD data set in the time range 01.01.2019 - 31.12.2019. Plot the results.
6.1.12. Trends#
Time series data often exhibit some slow, gradual variability in addition to higher frequency variability such as seasonality and noise. An easy way to visualize these trends is with rolling means at different time scales. A rolling mean tends to smooth a time series by averaging out variations at frequencies much higher than the window size and averaging out any seasonality on a time scale equal to the window size. This allows lower-frequency variations in the data to be explored. Since our temperature time series has a pronounced yearly seasonality, let’s look at rolling means on longer time scales.
Computing the rolling mean over longer time periods, for example a 365-day rolling mean, is especially interesting for climatological data sets!
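A minimal sketch of such trend curves for the mean temperature TM; since our series covers only about 1.5 years, the 365-day window is relaxed here with an arbitrarily chosen min_periods so that it produces values at all:
# 7-day and 365-day centred rolling means of the mean temperature
tm_7d = df["TM"].rolling(7, center=True).mean()
tm_365d = df["TM"].rolling(365, center=True, min_periods=300).mean()
fig, ax = plt.subplots()
ax.plot(df["TM"], linewidth=0.5, label="Daily")
ax.plot(tm_7d, label="7-d Rolling Mean")
ax.plot(tm_365d, linewidth=2, label="365-d Rolling Mean (trend)")
ax.set_ylabel("TM")
ax.legend()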
6.1.12.1. A short remark on Plotting:#
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics of time series data, and it offers many options for nice plots.
See https://seaborn.pydata.org for more.
We have a look at some examples:
import seaborn as sns
# Use seaborn style defaults and set the default figure size
sns.set(rc={"figure.figsize": (11, 4)})
df["TM"].plot(linewidth=0.5)
<Axes: xlabel='Datum'>

cols_plot = ["TM", "TN"]
axes = df[cols_plot].plot(
marker=".", alpha=0.5, linestyle="None", figsize=(11, 9), subplots=True
)
for ax in axes:
    ax.set_ylabel("Daily Data")

6.1.13. Zoom in:#
ax = df.loc["2019-01":"2019-02", "TM"].plot(marker="o", linestyle="-")
ax.set_ylabel("TM")
Text(0, 0.5, 'TM')

Next, let’s further explore the seasonality of our data with box plots, using seaborn’s boxplot() function to group the data by different time periods and display the distributions for each group. We’ll first group the data by month, to visualize yearly seasonality.
fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)
for name, ax in zip(["TM", "TN", "RR"], axes):
    sns.boxplot(data=df, x="Month", y=name, ax=ax)
    ax.set_ylabel(name)
    ax.set_title(name)
    # Remove the automatic x-axis label from all but the bottom subplot
    if ax != axes[-1]:
        ax.set_xlabel("")

These box plots confirm the yearly seasonality that we saw in earlier plots!
6.1.14. Recall the Exercises:#
Exercise:
Was the sun shining in summer 2019? Plot the sunshine duration ‘SO’ (or another variable) from the DWD data set for the time slice May, June, July, and August 2019.
Exercise:
Load the data set Tagesmittel_Dahlem_MD.csv
Fill the missing values of the maximal temperature TX variable with the monthly mean
Calculate and plot the weekly mean of the maximal temperature variable
Exercise:
Compute the weekly and monthly means of the maximal temperature TX of the DWD data set in the time range 01.01.2019 - 31.12.2019. Plot the results.
Exercise:
Compute the 3-day, 7-day, and 14-day rolling means of the maximal temperature TX of the DWD data set in the time range 01.01.2019 - 31.12.2019. Plot the results.