Python,Pandas, Statsmodels – Multipredictor Linear Regression (code included)

The Swedish Meteorological & Hydrological Institute, SMHI, provides (some of) its data to the public, as “Open Data“, accessable either by various API’s or by download from their website.

I downloaded several of their CSV-files, containing historical weather-data for a lighthouse in Stockholm archipelago, covering multiple daily measurements on e.g. temperature, wind speed, wind direction and atmospheric pressure, for the period of 1949 to 2009, in total more than 160.000 individual measurements.

The basic structure of these csv-files is as below:

1949-01-04;00:00:00;3.4;Y
1949-01-04;06:00:00;3.6;Y
1949-01-04;12:00:00;1.2;Y
1949-01-04;18:00:00;1.7;Y
1949-01-05;00:00:00;2.3;Y
1949-01-05;06:00:00;3.0;Y
1949-01-05;12:00:00;1.2;Y
1949-01-05;18:00:00;1.2;Y
1949-01-06;00:00:00;1.8;Y
1949-01-06;06:00:00;2.6;Y
1949-01-06;12:00:00;2.8;Y
1949-01-06;18:00:00;3.6;Y
1949-01-07;00:00:00;3.7;Y
1949-01-07;06:00:00;3.5;Y
1949-01-07;12:00:00;3.6;Y
1949-01-07;18:00:00;2.0;Y
1949-01-08;00:00:00;0.2;Y
1949-01-08;06:00:00;2.2;Y
1949-01-08;12:00:00;3.4;Y

Here, the columns are “date”, “time”, “temp[C]” and a measurement status that I don’t care about.

For the other parameters, such a wind speed, direction, pressure etc, the corresponding files have similar structure.

Anyways, what I wanted to do is to see which of these factors might have the greatest impact on some other factor, e.g. does pressure or month impact e.g. windspeed, and if so, to what degree ?

To perform that analysis, I used linear regression, as provided by Python’s Statsmodels library. First I ran the analysis as a single variable linear regression, for each of the parameters contained in the model, to finish with a multi-predictor regression, containing all of the exploratory parameters in a single  run. Below the summary obtained from statsmodels:

                           OLS Regression Results
==============================================================================
Dep. Variable:                    TWS   R-squared:                       0.080
Model:                            OLS   Adj. R-squared:                  0.080
Method:                 Least Squares   F-statistic:                 1.393e+04
Date:                Tue, 20 Mar 2018   Prob (F-statistic):               0.00
Time:                        16:26:54   Log-Likelihood:            -4.3801e+05
No. Observations:              161226   AIC:                         8.760e+05
Df Residuals:                  161224   BIC:                         8.760e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         98.5279      0.772    127.636      0.000      97.015     100.041
Lufttryck     -0.0900      0.001   -118.041      0.000      -0.092      -0.089
==============================================================================
Omnibus:                     7704.728   Durbin-Watson:                   0.334
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             8944.165
Skew:                           0.545   Prob(JB):                         0.00
Kurtosis:                       3.378   Cond. No.                     8.57e+04
==============================================================================


                            OLS Regression Results
==============================================================================
Dep. Variable:                    TWS   R-squared:                       0.017
Model:                            OLS   Adj. R-squared:                  0.017
Method:                 Least Squares   F-statistic:                     2806.
Date:                tis, 20 mar 2018   Prob (F-statistic):               0.00
Time:                        16:27:03   Log-Likelihood:            -4.4330e+05
No. Observations:              161226   AIC:                         8.866e+05
Df Residuals:                  161224   BIC:                         8.866e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.3686      0.022    291.514      0.000       6.326       6.411
TWD            0.0050   9.53e-05     52.973      0.000       0.005       0.005
==============================================================================
Omnibus:                    11236.488   Durbin-Watson:                   0.322
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            13891.396
Skew:                           0.674   Prob(JB):                         0.00
Kurtosis:                       3.501   Cond. No.                         532.
==============================================================================


                            OLS Regression Results
==============================================================================
Dep. Variable:                    TWS   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.038
Method:                 Least Squares   F-statistic:                     6396.
Date:                tis, 20 mar 2018   Prob (F-statistic):               0.00
Time:                        16:27:13   Log-Likelihood:            -4.4156e+05
No. Observations:              161226   AIC:                         8.831e+05
Df Residuals:                  161224   BIC:                         8.831e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.0820      0.013    645.222      0.000       8.057       8.107
Temp          -0.1065      0.001    -79.978      0.000      -0.109      -0.104
==============================================================================
Omnibus:                     9083.230   Durbin-Watson:                   0.327
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            10864.235
Skew:                           0.592   Prob(JB):                         0.00
Kurtosis:                       3.463   Cond. No.                         12.7
==============================================================================


                            OLS Regression Results
==============================================================================
Dep. Variable:                    TWS   R-squared:                       0.018
Model:                            OLS   Adj. R-squared:                  0.018
Method:                 Least Squares   F-statistic:                     3025.
Date:                tis, 20 mar 2018   Prob (F-statistic):               0.00
Time:                        16:27:23   Log-Likelihood:            -4.4320e+05
No. Observations:              161226   AIC:                         8.864e+05
Df Residuals:                  161224   BIC:                         8.864e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -54.6076      1.128    -48.423      0.000     -56.818     -52.397
year           0.0313      0.001     54.998      0.000       0.030       0.032
==============================================================================
Omnibus:                    10070.976   Durbin-Watson:                   0.319
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            12287.743
Skew:                           0.626   Prob(JB):                         0.00
Kurtosis:                       3.513   Cond. No.                     2.37e+05
==============================================================================


                            OLS Regression Results
==============================================================================
Dep. Variable:                    TWS   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     415.0
Date:                tis, 20 mar 2018   Prob (F-statistic):           3.89e-92
Time:                        16:27:35   Log-Likelihood:            -4.4449e+05
No. Observations:              161226   AIC:                         8.890e+05
Df Residuals:                  161224   BIC:                         8.890e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.9803      0.023    300.159      0.000       6.935       7.026
quarter        0.1729      0.008     20.372      0.000       0.156       0.190
==============================================================================
Omnibus:                    10713.102   Durbin-Watson:                   0.314
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            13140.688
Skew:                           0.654   Prob(JB):                         0.00
Kurtosis:                       3.494   Cond. No.                         7.47
==============================================================================


                            OLS Regression Results
==============================================================================
Dep. Variable:                    TWS   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     386.3
Date:                tis, 20 mar 2018   Prob (F-statistic):           6.62e-86
Time:                        16:27:50   Log-Likelihood:            -4.4450e+05
No. Observations:              161226   AIC:                         8.890e+05
Df Residuals:                  161224   BIC:                         8.890e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.0614      0.020    348.887      0.000       7.022       7.101
month          0.0541      0.003     19.655      0.000       0.049       0.059
==============================================================================
Omnibus:                    10750.443   Durbin-Watson:                   0.314
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            13196.852
Skew:                           0.655   Prob(JB):                         0.00
Kurtosis:                       3.496   Cond. No.                         15.9
==============================================================================


                            OLS Regression Results
==============================================================================
Dep. Variable:                    TWS   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.9790
Date:                tis, 20 mar 2018   Prob (F-statistic):              0.322
Time:                        16:28:08   Log-Likelihood:            -4.4469e+05
No. Observations:              161226   AIC:                         8.894e+05
Df Residuals:                  161224   BIC:                         8.894e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.4296      0.019    381.722      0.000       7.391       7.468
day           -0.0011      0.001     -0.989      0.322      -0.003       0.001
==============================================================================
Omnibus:                    10776.935   Durbin-Watson:                   0.313
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            13228.257
Skew:                           0.657   Prob(JB):                         0.00
Kurtosis:                       3.493   Cond. No.                         37.0
==============================================================================


                            OLS Regression Results
==============================================================================
Dep. Variable:                    TWS   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     36.89
Date:                tis, 20 mar 2018   Prob (F-statistic):           1.25e-09
Time:                        16:28:25   Log-Likelihood:            -4.4468e+05
No. Observations:              161226   AIC:                         8.894e+05
Df Residuals:                  161224   BIC:                         8.894e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.3258      0.017    426.251      0.000       7.292       7.359
hour           0.0084      0.001      6.074      0.000       0.006       0.011
==============================================================================
Omnibus:                    10773.915   Durbin-Watson:                   0.313
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            13225.120
Skew:                           0.657   Prob(JB):                         0.00
Kurtosis:                       3.494   Cond. No.                         22.6
==============================================================================


                            OLS Regression Results
==============================================================================
Dep. Variable:                    TWS   R-squared:                       0.168
Model:                            OLS   Adj. R-squared:                  0.168
Method:                 Least Squares   F-statistic:                     4064.
Date:                tis, 20 mar 2018   Prob (F-statistic):               0.00
Time:                        16:28:40   Log-Likelihood:            -4.2988e+05
No. Observations:              161226   AIC:                         8.598e+05
Df Residuals:                  161217   BIC:                         8.599e+05
Df Model:                           8
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         25.5731      1.284     19.923      0.000      23.057      28.089
Lufttryck     -0.0859      0.001   -117.796      0.000      -0.087      -0.084
TWD            0.0040    8.8e-05     45.433      0.000       0.004       0.004
Temp          -0.1428      0.001   -106.160      0.000      -0.145      -0.140
year           0.0341      0.001     64.808      0.000       0.033       0.035
quarter        0.5652      0.033     17.182      0.000       0.501       0.630
month         -0.0350      0.011     -3.302      0.001      -0.056      -0.014
day           -0.0018      0.001     -1.837      0.066      -0.004       0.000
hour           0.0124      0.001      9.821      0.000       0.010       0.015
==============================================================================
Omnibus:                     4612.974   Durbin-Watson:                   0.374
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5183.933
Skew:                           0.396   Prob(JB):                         0.00
Kurtosis:                       3.379   Cond. No.                     3.31e+05
==============================================================================




As can be seen, both from the single variable as well as from the multivariable runs, all parameters except day and month are statistically significant at 95% CI level. However, we can also see from the coefficients that the impact on TWS of each parameter is very small – for instance atmospheric pressure (“Lufttryck”), which at least I a priori thought would have significant impact on wind speed (TWS) turns out to be while statistically significant, not very important in terms of its effect on wind speed: a change of atmospheric pressure by one unit does only change wind speed by 0.09 m/s, a barely noticable change.

So, which of the parameters impacts wind speed the most…..? Perhaps it comes as a surprise, but the most important factor for determining wind speed turns out the be the quarter of the year, where the fall clearly is the most windy period.

Below some graphs resulting from the various linear regression runs, and below that the Python code created for the analysis.

The first 8 graphs are for runs with a single predictor variable, the last 8 are for the multipredictor run with all predictors included.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

# create unique timestamp by merging date & time to serve as index
def create_timestamp_index(df):
    df['Timestamp'] = pd.to_datetime(df['Datum'] + ' ' + df['Tid'],
                                     format = '%Y-%m-%d %H:%M:%S')

    df.set_index('Timestamp',drop=True,inplace=True)
    df.drop(['Datum','Tid'],axis=1,inplace=True)
    return df

# read csv-files
tryck = pd.read_csv('lufttryck-hogarna.csv',skiprows=9,sep=';',usecols=[0,1,2])
tryck.columns = ['Datum','Tid','Lufttryck']
tryck = create_timestamp_index(tryck)

vind = pd.read_csv('smhi-hogarna-vind.csv',sep=';',
                   skiprows=10,usecols=[0,1,2,4])

vind.columns = ['Datum','Tid','TWD','TWS']
vind = create_timestamp_index(vind)

temp = pd.read_csv('smhi-hogarna-temp.csv',sep=';',skiprows=9,usecols=[0,1,2])
temp.columns = ['Datum','Tid','Temp']
temp = create_timestamp_index(temp)

# join the various dataframes by index
df = tryck.join(vind,how='left')
df = df.join(temp,how='left')
df.dropna(inplace=True)
df['year'] = df.index.year
df['quarter'] = df.index.quarter
df['month'] = df.index.month
df['day'] = df.index.day
df['hour'] = df.index.hour

# linear regression
explanatory_vars = list(df.columns)
del explanatory_vars[2] # remove dependent variable from exp_vars
# append all expvar combo for a multipredictor linreg
explanatory_vars.append(list(explanatory_vars))

# for each explanatory_variable, print stats and plot
for exp_var in explanatory_vars:
    X = df[exp_var]
    X = sm.add_constant(X)
    y = df['TWS']

    model = sm.OLS(y,X)
    fitted = model.fit()
    y_pred = fitted.predict(X)
    params = fitted.params
    print (fitted.summary())
    
    #print (dir(fitted)) #all available attributes
    
    nr_predictors = len(X.columns[1:]) # excl. constant

    for pred in range(nr_predictors):
        sm.graphics.plot_fit(fitted,pred+1)
        plt.savefig(exp_var +'_predictor_' + str(pred) + '.png')
# example of selecting subsets based on criteria
midsummer_noon = df.loc[((
    df['month'] == 6) & (df['day'] == 21) & (df['hour'] == 12))]

midsummer_noon_pre_75 = midsummer_noon.loc[midsummer_noon['year'] <= 1975] midsummer_noon_post_75 = midsummer_noon.loc[midsummer_noon['year'] > 1975]
print ('BEFORE 1976:\n',midsummer_noon_pre_75.describe())
print ('\nAFTER 1976:\n',midsummer_noon_post_75.describe())

plt.show()

About swdevperestroika

High tech industry veteran, avid hacker reluctantly transformed to mgmt consultant.
This entry was posted in Big Data, Data Analytics, Data Driven Management, Numpy, Probability, Python, Statistics and tagged , , , , , , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s