Additional Exercises III - The Normal distribution

2.3. Additional Exercises III - The Normal distribution

2.3.1. Exercise 1: Recall Statistics

To apply AI algorithms, write code, and evaluate large data sets, we need some statistics.

Recall the following topics in groups:

What is a normal distribution? What are the standard deviation, the variance, and confidence intervals?
What kinds of means are there?
What is the central limit theorem?
What is the law of large numbers?
What is linear regression?
What is logistic regression?
What is a general linear model (GLM)?
What is a Poisson distribution?

2.3.2. Exercise 2: Checking for Normal Distribution

2.3.2.1. DWD data

We begin by loading the meteorological data set from the Open Climate Data Center of the Deutscher Wetterdienst (DWD, German Weather Service). Documentation of this data is available in German and English.

# First, let's import all the needed libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dahlem_clim = pd.read_csv(
    "https://userpage.fu-berlin.de/soga/data/raw-data/NS_TS_Dahlem.txt", sep=";"
)
dahlem_clim.head(10)

	STATIONS_ID	MESS_DATUM_BEGINN	MESS_DATUM_ENDE	QN_4	MO_N	MO_TT	MO_TX	MO_TN	MO_FK	MX_TX	MX_FX	MX_TN	MO_SD_S	QN_6	MO_RR	MX_RS	eor
0	403	17190101	17190131	5	-999.0	2.8	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999	-999.0	-999.0	eor
1	403	17190201	17190228	5	-999.0	1.1	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999	-999.0	-999.0	eor
2	403	17190301	17190331	5	-999.0	5.2	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999	-999.0	-999.0	eor
3	403	17190401	17190430	5	-999.0	9.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999	-999.0	-999.0	eor
4	403	17190501	17190531	5	-999.0	15.1	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999	-999.0	-999.0	eor
5	403	17190601	17190630	5	-999.0	19.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999	-999.0	-999.0	eor
6	403	17190701	17190731	5	-999.0	21.4	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999	-999.0	-999.0	eor
7	403	17190801	17190831	5	-999.0	18.8	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999	-999.0	-999.0	eor
8	403	17190901	17190930	5	-999.0	13.9	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999	-999.0	-999.0	eor
9	403	17191001	17191031	5	-999.0	9.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999.0	-999	-999.0	-999.0	eor
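The preview above contains many entries equal to -999. The following is a minimal sketch, assuming that -999 is the placeholder for missing observations (worth confirming in the DWD documentation), of how such entries could be converted to NaN so that they cannot distort means or standard deviations. The small `sample` frame is a hypothetical miniature built for illustration, not the downloaded data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the DWD table; -999 is ASSUMED to mark missing values.
sample = pd.DataFrame(
    {
        "MESS_DATUM_BEGINN": [17190101, 17190201, 19610101],
        "MO_TT": [2.8, 1.1, -1.02],
        "MO_TX": [-999.0, -999.0, 1.31],
    }
)

# Replace the assumed placeholder with NaN; pandas skips NaN in mean() and std().
cleaned = sample.replace(-999, np.nan)

# Count the missing entries per column after the replacement.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

On the real data the same call would read `dahlem_clim.replace(-999, np.nan)`, provided the assumption about the placeholder holds.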

2.3.3. Exercise 2A: Subsetting the data frame

First, we want to subset the entire dataset.

We only want to keep data from 1961 to today. Subset the entire dataset (not only one column, as we saw in the seminar)! How many rows does the data frame have now?

### your code here ###

2.3.3.1. Solution

## For example, you could do:
dahlem_clim = dahlem_clim.loc[dahlem_clim["MESS_DATUM_BEGINN"] >= 19610101]
dahlem_clim.head(10)

	STATIONS_ID	MESS_DATUM_BEGINN	MESS_DATUM_ENDE	QN_4	MO_N	MO_TT	MO_TX	MO_TN	MO_FK	MX_TX	MX_FX	MX_TN	MO_SD_S	QN_6	MO_RR	MX_RS	eor
2778	403	19610101	19610131	5	5.01	-1.02	1.31	-3.88	2.63	7.8	-999.0	-16.2	71.5	5	50.4	7.4	eor
2779	403	19610201	19610228	5	6.12	4.66	7.98	1.56	2.56	15.5	-999.0	-4.7	61.4	5	41.9	7.6	eor
2780	403	19610301	19610331	5	5.58	6.79	10.79	3.15	3.02	19.4	-999.0	-3.7	124.1	5	52.9	18.6	eor
2781	403	19610401	19610430	5	4.88	11.38	16.61	6.24	2.40	26.1	-999.0	-0.9	192.7	5	55.7	13.3	eor
2782	403	19610501	19610531	5	6.11	11.23	15.54	7.18	2.29	23.8	-999.0	3.0	132.8	5	120.1	26.7	eor
2783	403	19610601	19610630	5	4.39	17.73	23.18	11.47	2.12	30.4	-999.0	5.5	273.4	5	48.3	13.4	eor
2784	403	19610701	19610731	5	5.79	16.16	20.88	12.14	2.49	33.0	-999.0	8.8	143.2	5	72.6	13.2	eor
2785	403	19610801	19610831	5	5.34	15.97	21.20	11.40	2.40	28.5	-999.0	8.0	169.7	5	43.0	10.3	eor
2786	403	19610901	19610930	5	3.48	15.95	22.00	11.06	2.11	30.0	-999.0	4.4	201.7	5	35.5	12.1	eor
2787	403	19611001	19611031	5	4.77	11.10	15.54	7.26	2.39	22.3	-999.0	3.3	136.8	5	35.7	16.4	eor

dahlem_clim.shape[0]  # number of rows
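The solution compares the dates as plain integers in YYYYMMDD form. As an alternative sketch, the integers could first be parsed into timestamps, which makes it possible to filter by year directly. The `sample` frame and the extra `date` column are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical miniature of the DWD table with integer dates in YYYYMMDD form.
sample = pd.DataFrame(
    {
        "MESS_DATUM_BEGINN": [19601201, 19610101, 19901201, 19910101],
        "MO_TT": [1.5, -1.02, 1.13, 2.30],  # illustrative values
    }
)

# Parse the integer dates into real timestamps (new helper column "date").
sample["date"] = pd.to_datetime(
    sample["MESS_DATUM_BEGINN"].astype(str), format="%Y%m%d"
)

# Keep all columns of every row from 1961 onwards.
since_1961 = sample.loc[sample["date"].dt.year >= 1961]
print(since_1961.shape[0])  # number of rows
```

Both approaches select the same rows; the timestamp version is mainly convenient when later steps need the year or month as separate values.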

2.3.4. Exercise 2B: Checking the mean and standard deviation for this subset

Calculate the mean and standard deviation of the monthly mean temperature! Do they violate the three-sigma rule? Plot the distribution as a histogram and as a density plot.

### your code here ###

2.3.4.1. Solution

# Mean and population standard deviation (ddof=0) of the monthly mean temperature
np.mean(dahlem_clim.MO_TT), np.std(
    dahlem_clim.MO_TT
)  ## Yes, it violates the three-sigma rule!

(9.404439890710382, 6.846380527375167)
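The mean and standard deviation alone do not show whether the three-sigma rule holds. One way to check it, sketched here under the assumption that the rule is read as "about 68 %, 95 % and 99.7 % of the values lie within 1, 2 and 3 standard deviations of the mean", is to count the observed shares. The helper `sigma_coverage` and both samples are hypothetical; the samples are synthetic stand-ins, not the DWD data.

```python
import numpy as np


def sigma_coverage(values):
    """Return the share of values within 1, 2 and 3 standard deviations of the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0), like np.std
    return {k: float(np.mean(np.abs(values - mean) <= k * std)) for k in (1, 2, 3)}


rng = np.random.default_rng(42)

# Synthetic stand-ins: one normal sample and one sample with two peaks.
normal_sample = rng.normal(loc=9.4, scale=6.8, size=100_000)
two_peaks = np.concatenate(
    [rng.normal(2.0, 2.0, size=50_000), rng.normal(17.0, 2.0, size=50_000)]
)

print(sigma_coverage(normal_sample))  # close to 0.68, 0.95, 0.997
print(sigma_coverage(two_peaks))      # shares that deviate from the rule
```

Applied to the exercise, the call would be `sigma_coverage(dahlem_clim["MO_TT"])`; shares that differ clearly from 68 %, 95 % and 99.7 % speak against a normal distribution.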

plt.figure(figsize=(10, 5))
plt.hist(dahlem_clim.MO_TT, bins="sturges", color="lightblue", edgecolor="grey")
plt.xlabel("Temp in °C")
plt.ylabel("absolute frequency")
plt.show()

[Figure: histogram of the monthly mean temperature (x-axis: Temp in °C, y-axis: absolute frequency)]

OK, the distribution looks bimodal, so it does not seem to be a normal distribution.

2.3.5. Exercise 2C: Checking the mean and standard deviation for this subset

Check the monthly mean temperature for normality with the help of a Q-Q plot!

### your code here ###

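2.3.5.1. Solution

import statsmodels.api as sm

# Create the axes first and pass them to qqplot; otherwise qqplot opens its
# own figure and the figure created here stays empty.
fig, ax = plt.subplots(figsize=(12, 5))
sm.qqplot(dahlem_clim.MO_TT, line="r", ax=ax)

plt.show()

[Figure: Q-Q plot of the monthly mean temperature against the normal distribution, with regression line]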
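To make clear what a Q-Q plot draws, the points can also be computed by hand: the sorted sample is paired with the quantiles of a standard normal distribution. The sketch below does this with the standard library's `NormalDist` and summarises the straightness of the point cloud with a correlation coefficient. The helpers `qq_points` and `qq_correlation`, the plotting positions `(i - 0.5) / n` (one of several conventions), and both synthetic samples are assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist


def qq_points(values):
    """Return (theoretical, sample) quantiles as drawn in a normal Q-Q plot."""
    sample = np.sort(np.asarray(values, dtype=float))
    n = sample.size
    # Plotting positions (i - 0.5) / n; other conventions exist.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, sample


def qq_correlation(values):
    """Correlation of theoretical and sample quantiles (1.0 = perfectly straight)."""
    theoretical, sample = qq_points(values)
    return float(np.corrcoef(theoretical, sample)[0, 1])


rng = np.random.default_rng(1)

# Synthetic stand-ins, not the DWD data.
normal_sample = rng.normal(9.4, 6.8, size=2_000)
two_peaks = np.concatenate(
    [rng.normal(2.0, 2.0, size=1_000), rng.normal(17.0, 2.0, size=1_000)]
)

print(qq_correlation(normal_sample))
print(qq_correlation(two_peaks))
```

A normal sample gives points close to a straight line and a correlation near 1, while a two-peaked sample bends away from the line and yields a lower value.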

Well, we already suspected something like this. Let's have a closer look:

2.3.6. Exercise 2D: Slicing the data frame

Now we want to slice the dataset into two parts: one from 1961 to 1990 and one from 1991 to today. Subset the entire dataset (not only one column, as we saw in the seminar)!

### your code here ###

2.3.6.1. Solution

## combined conditions:

dahlem_1961_1990 = dahlem_clim.loc[
    (dahlem_clim["MESS_DATUM_BEGINN"] >= 19610101)
    & (dahlem_clim["MESS_DATUM_BEGINN"] < 19910101)
]
dahlem_1961_1990

	STATIONS_ID	MESS_DATUM_BEGINN	MESS_DATUM_ENDE	QN_4	MO_N	MO_TT	MO_TX	MO_TN	MO_FK	MX_TX	MX_FX	MX_TN	MO_SD_S	QN_6	MO_RR	MX_RS	eor
2778	403	19610101	19610131	5	5.01	-1.02	1.31	-3.88	2.63	7.8	-999.0	-16.2	71.5	5	50.4	7.4	eor
2779	403	19610201	19610228	5	6.12	4.66	7.98	1.56	2.56	15.5	-999.0	-4.7	61.4	5	41.9	7.6	eor
2780	403	19610301	19610331	5	5.58	6.79	10.79	3.15	3.02	19.4	-999.0	-3.7	124.1	5	52.9	18.6	eor
2781	403	19610401	19610430	5	4.88	11.38	16.61	6.24	2.40	26.1	-999.0	-0.9	192.7	5	55.7	13.3	eor
2782	403	19610501	19610531	5	6.11	11.23	15.54	7.18	2.29	23.8	-999.0	3.0	132.8	5	120.1	26.7	eor
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3133	403	19900801	19900831	10	4.05	18.75	24.83	13.61	1.80	32.0	17.4	7.5	253.7	10	74.7	25.2	eor
3134	403	19900901	19900930	10	5.79	12.27	16.55	9.25	2.13	21.7	23.1	4.8	114.0	10	52.4	8.6	eor
3135	403	19901001	19901031	10	4.10	10.52	15.35	6.74	2.18	23.1	16.9	0.3	157.7	10	9.8	4.3	eor
3136	403	19901101	19901130	10	6.55	5.26	7.40	3.16	2.01	12.1	18.0	-2.4	47.3	10	56.6	16.4	eor
3137	403	19901201	19901231	10	6.04	1.13	2.84	-1.07	2.19	9.0	21.0	-5.8	41.2	10	73.0	23.3	eor

360 rows × 17 columns

## single condition:

dahlem_1990_today = dahlem_clim.loc[dahlem_clim["MESS_DATUM_BEGINN"] >= 19910101]
dahlem_1990_today

	STATIONS_ID	MESS_DATUM_BEGINN	MESS_DATUM_ENDE	QN_4	MO_N	MO_TT	MO_TX	MO_TN	MO_FK	MX_TX	MX_FX	MX_TN	MO_SD_S	QN_6	MO_RR	MX_RS	eor
3138	403	19910101	19910131	10	4.97	2.30	4.57	0.07	2.16	15.2	18.5	-8.7	76.20	9	27.6	13.7	eor
3139	403	19910201	19910228	10	5.27	-2.34	0.85	-5.32	1.84	12.9	14.9	-12.9	72.40	9	27.1	7.9	eor
3140	403	19910301	19910331	10	4.79	6.77	11.30	3.25	2.15	18.1	14.4	-1.2	128.30	9	40.8	19.0	eor
3141	403	19910401	19910430	10	4.80	8.23	13.28	3.32	2.02	20.5	20.0	-2.8	174.80	9	40.4	7.8	eor
3142	403	19910501	19910531	10	4.97	10.52	14.90	5.90	2.19	24.2	24.1	1.2	179.60	9	38.6	15.5	eor
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3505	403	20210801	20210831	3	5.67	17.43	22.34	12.97	2.42	29.4	16.9	8.0	188.23	3	92.4	22.6	eor
3506	403	20210901	20210930	3	4.90	15.55	20.28	11.35	2.33	27.3	18.6	6.8	154.48	3	35.1	15.9	eor
3507	403	20211001	20211031	3	4.85	10.49	15.28	6.29	2.81	24.2	25.6	0.5	157.65	3	21.8	5.7	eor
3508	403	20211101	20211130	3	6.65	6.28	8.68	3.87	2.60	12.7	18.6	-0.9	41.23	3	65.2	40.4	eor
3509	403	20211201	20211231	1	6.57	2.19	4.30	-0.11	2.74	14.0	21.5	-10.2	38.18	3	39.2	8.8	eor

372 rows × 17 columns

2.3.7. Exercise 2E: Checking for Normality

Calculate the mean and standard deviation of the monthly mean temperature for both data frames! Do they violate the three-sigma rule? Plot the distribution as a histogram and as a density plot. Check the monthly mean temperature for normality with the help of a Q-Q plot! (The solution is the same as above.)

### your code here ###
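Since the same statistics are needed for two data frames, one option is to wrap them in a small helper and apply it to both periods. The following is only a sketch: the function `summarize` is hypothetical, and the two short frames contain just the first few `MO_TT` values shown in the tables above, as stand-ins for `dahlem_1961_1990` and `dahlem_1990_today`.

```python
import numpy as np
import pandas as pd


def summarize(series):
    """Return mean, standard deviation and the mean ± 3·std limits of a series."""
    values = series.to_numpy(dtype=float)
    mean = float(values.mean())
    std = float(values.std())  # population standard deviation (ddof=0)
    return {
        "mean": mean,
        "std": std,
        "lower_3s": mean - 3 * std,
        "upper_3s": mean + 3 * std,
    }


# Short excerpts used as stand-ins for the two full data frames.
periods = {
    "1961-1990": pd.DataFrame({"MO_TT": [-1.02, 4.66, 6.79, 11.38, 11.23, 17.73]}),
    "1991-today": pd.DataFrame({"MO_TT": [2.30, -2.34, 6.77, 8.23, 10.52]}),
}

# One row per period, one column per statistic.
summary = pd.DataFrame(
    {name: summarize(df["MO_TT"]) for name, df in periods.items()}
).T
print(summary)
```

With the real data, the dictionary would hold `dahlem_1961_1990` and `dahlem_1990_today` instead of the excerpts.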

2.3.8. Exercise 2F: Boxplots

Let's have a closer look at both data frames with the help of boxplots! Plot both datasets next to each other as two boxplots in one figure.

### your code here ###

2.3.8.1. Solution

fig = plt.figure(figsize=(10, 5))

sns.boxplot(data=[dahlem_1961_1990["MO_TT"], dahlem_1990_today["MO_TT"]])

plt.ylabel("Temp °C")
plt.show()

[Figure: boxplots of the monthly mean temperature for 1961 to 1990 and for 1991 to today (y-axis: Temp °C)]

2.3.9. Exercise 2G: Notches

Which argument do we have to add if we want to find significant differences between the groups? Is there a significant difference in the monthly mean temperature between the periods 1961 to 1990 and 1991 to today?

### your code here ###

2.3.9.1. Solution

fig = plt.figure(figsize=(10, 5))

sns.boxplot(data=[dahlem_1961_1990["MO_TT"], dahlem_1990_today["MO_TT"]], notch=True)

plt.ylabel("Temp °C")
plt.show()

[Figure: notched boxplots of the monthly mean temperature for 1961 to 1990 and for 1991 to today (y-axis: Temp °C)]
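The notches mark an approximate confidence interval around each median; if the notches of two boxes do not overlap, the medians are usually regarded as significantly different. The sketch below computes such intervals by hand, assuming the common rule median ± 1.57 · IQR / √n (understood to be matplotlib's default when no bootstrap is requested; treat this as an assumption). The helpers `notch_interval` and `notches_overlap` and all three samples are hypothetical and synthetic.

```python
import numpy as np


def notch_interval(values):
    """Approximate notch limits: median ± 1.57 · IQR / sqrt(n) (assumed rule)."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(values.size)
    return median - half_width, median + half_width


def notches_overlap(a, b):
    """Return True if the two notch intervals share at least one point."""
    lo_a, hi_a = notch_interval(a)
    lo_b, hi_b = notch_interval(b)
    return bool(lo_a <= hi_b and lo_b <= hi_a)


rng = np.random.default_rng(7)

# Synthetic stand-ins, not the DWD data.
group_a = rng.normal(9.0, 7.0, size=360)
group_b = rng.normal(9.0, 7.0, size=372)   # same distribution as group_a
group_c = rng.normal(14.0, 7.0, size=372)  # clearly shifted median

print(notch_interval(group_a))
print(notches_overlap(group_a, group_b))
print(notches_overlap(group_a, group_c))
```

For the exercise, the two inputs would be `dahlem_1961_1990["MO_TT"]` and `dahlem_1990_today["MO_TT"]`; overlapping intervals would suggest that the plot alone gives no evidence of a difference in medians.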

from IPython.display import IFrame

IFrame(
    src="../../citations/citation_Marie.html",
    width=900,
    height=200,
)