Tour de Ski Final climb – does age matter for performance ?

In an earlier post, I analyzed data from the Marcialonga Ski race. Marcialonga is one of the classic long distance ski races, where both elite’ as well as amateurs compete together. In fact, the vast majority of the competitors in these classic long distance ski races are in fact amateurs, not elite’ skiers.  One of the findings of that analysis was that “Age Matters”, that is, the analysis revealed that the older a participant is, the worse his/her performance.

Earlier today, I did a quick & dirty basic analysis of today’s Tour de Ski race, which asop to Marcialonga, only invites the real elite’, i.e. the world top cross country skiers.  So, does age matter for this exclusive group of elite’ skiers too…?

Let’s first look at the age spans for the two races, Marcialonga vs Tour de Ski: for Marcialonga, the ages span from 18 and up, with most skiers being in the 40+ and 50+ groups. For Tour de Ski, the ages span a much narrower range, from 20 to 38.

Let’s do a Linear Regression to see if age matters. First, here’s the graph using a “traditional” (i.e. “Frequentist”) statistical analysis method:


With the traditional, Frequentist statistical method, it indeed looks like age matters: both men and women have a regression line, sloping slightly upwards, which would indicate that there is a dependency between age and performance, in this case, each additional year of age would add about 6-7 seconds to the race time.

However, just by looking at the scattered dots (red for women, blue for men) it is really hard to see that there in fact exists any strong relationship between age and race time.

So, let’s see what can be done using Bayesian Linear Regression instead, where the uncertainty of any analysis is preserved, and can be illustrated very explicitly:


So here, instead of a single regression line for women, and one for men, as in the previous example, we have thousands of them for each gender: orange lines for women, green for men. From these regression lines, we can clearly see that there’s a whole lot of uncertainty about where the “true” regression resides, in fact, there is so much uncertainty that I’m willing to state that for this elite’ group of skiers, in this particular race, age does *not* impact performance, i.e there is no causal relationship between age and performance!

The graph also illustrates the underlying uncertainty of the data by the jagged colored (cyan, yellow) areas surrounding the regression lines: both the male as well as the female areas, “credible intervals”, are very wide. Take a look at e.g. this regression, for a comparison.

So. the bottom line is: age does not determine result in elite races.

 import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import seaborn as sns
from matplotlib import rcParams
import pymc as pm


def read_data(f):

    df = pd.read_csv(f,sep=';',header=None)
    df.columns = ['Pos','Name','Born','Nat','Time']

    def secs(x):
        s = datetime.strptime(x,'%M:%S.%f')
        return s.time().minute * 60 + s.time().second + (
            s.time().microsecond / 1e9)

    df['Secs'] = df['Time'].apply(secs)

    df['Diff'] = df['Secs'] - df['Secs'].shift()

    def diff2win(x):
        return x - df['Secs'].iloc[0]

    df['Diff2Win'] = df['Secs'].apply(diff2win)
    df['Age'] = 2018 - df['Born']
    df.set_index(np.arange(1,len(df) + 1) ,inplace=True)
    return df

df_F = read_data('final_climb_F.txt')
print (df_F.to_string())

df_M = read_data('final_climb_M.txt')
print (df_M.to_string())

#df_F = df_F.iloc[:28]
#df_M = df_M.iloc[:28]

m_f_timedelta = df_F.Secs / df_M.Secs
m_f_timedelta *= 100

### PYMC

def regression(df,alpha,beta,sigma):

    x = df['Age']
    def time_means(x=x,alpha=alpha,beta=beta):
        return x * beta + alpha

    likelihood = pm.Normal('likelihood',
                           tau = 1 / sigma ** 2,

    model = pm.Model([alpha,beta,sigma,time_means,likelihood])
    map_ = pm.MAP(model)

    mcmc = pm.MCMC(model)
    mcmc.sample(100000 ,50000 ,2)

    alpha_posterior = mcmc.trace('alpha')[:]
    beta_posterior = mcmc.trace('beta')[:]
    sigma_posterior = mcmc.trace('sigma')[:]
    time_means_posterior = mcmc.trace('time_means')[:,0]

    results = pd.DataFrame({'alpha_posterior':alpha_posterior,

    return results

m_alpha = pm.Uniform('alpha',1000,3000)
m_beta = pm.Normal('beta',100,1 / 50 ** 2)
m_sigma = pm.Normal('sigma',500,100)

m_results = regression(df_M,m_alpha,m_beta,m_sigma)
m_mean_alpha = m_results.alpha_posterior.mean()
m_mean_beta = m_results.beta_posterior.mean()

f_alpha = pm.Uniform('alpha',1500,3500)
f_beta = pm.Normal('beta',100,1 / 50 **2)
f_sigma = pm.Normal('sigma',500,100)

f_results = regression(df_F,f_alpha,f_beta,f_sigma)
f_mean_alpha = f_results.alpha_posterior.mean()
f_mean_beta = f_results.beta_posterior.mean()

nr_samples = 5000
sample_idx = np.random.choice(range(len(m_results)),

m_mean_time_samples = np.array([df_M.Age * m_results.beta_posterior.iloc[i] \
                     + m_results.alpha_posterior.iloc[i] \
                     for i in sample_idx])

f_mean_time_samples = np.array([df_F.Age * f_results.beta_posterior.iloc[i] \
                     + f_results.alpha_posterior.iloc[i] \
                     for i in sample_idx])

m_sample_times = np.array(
    for i in range(len(sample_idx))])

m_ci = np.percentile(m_sample_times,[5.5,94.5],axis=1)

f_sample_times = np.array(
    for i in range(len(sample_idx))])

f_ci = np.percentile(f_sample_times,[5.5,94.5],axis=1)

ax = plt.gca()

ax.set_ylabel('Race Time [seconds]')
ax.legend(loc='upper left')

plt.title('Tour de Ski Final Climb 2019 - Time diff top 28 Women vs Men')
plt.xlabel('Finishing Position in Race')

ax2 = plt.twinx()

ax2.set_ylabel("Female times relative to Men's times [%]")
ax2.plot(m_f_timedelta,'x-',color='orange',label='Rel time diff women vs men')
ax2.legend(loc='upper right')

plt.title('Tour de Ski Final Climb 2019 - Female Climb Times')
xticks = [df_F.Name.iloc[i] for i in range(len(df_F))]
plt.xlabel('Race Results (left to right)')
plt.ylabel('Race Time [seconds]')

plt.title('Tour de Ski Final Climb 2019 - Male Climb Times')
xticks = [df_M.Name.iloc[i] for i in range(len(df_M))]
plt.xlabel('Race Results (left to right)')
plt.ylabel('Race Time [seconds]')

plt.title('Tour de Ski 2019 Final Climb, time diff to winner')
plt.xlabel('Race Position')
plt.ylabel('Diff to winner [seconds]')
plt.legend(loc='upper left')

plt.title('Race Time distribution')
plt.ylabel('Number of racers')
plt.xlabel('Race Time [seconds]')
plt.legend(loc='upper right')


plt.plot(df_F.Age,df_F.Age * f_mean_beta + f_mean_alpha,ls='dashed',

plt.plot(df_M.Age,df_M.Age * m_mean_beta + m_mean_alpha,ls='dashed',

title_string = r'Men $\alpha$: {:.2f} $\beta$: {:.2f} Women $\alpha$: {:.2f} $\beta$: {:.2f}'.format(m_mean_alpha,m_mean_beta,f_mean_alpha,f_mean_beta)

plt.title ('Tour de Ski 2019 Final climb: Regression age->Time ' + title_string)
plt.ylabel('Race TIme')

for mts in m_mean_time_samples:

for mts in f_mean_time_samples:

xvals = np.linspace(df_M.Age.min(),df_M.Age.max() +1, nr_samples)

xvals = np.linspace(df_F.Age.min(),df_F.Age.max() +1, nr_samples)


About swdevperestroika

High tech industry veteran, avid hacker reluctantly transformed to mgmt consultant.
This entry was posted in Bayes, Data Analytics, Math, Numpy, Pandas, Probability, PYMC, Python, SNA, sports, Statistics and tagged , , , , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s