In an earlier post, I analyzed data from the Marcialonga Ski race. Marcialonga is one of the classic long distance ski races, where both elite’ as well as amateurs compete together. In fact, the vast majority of the competitors in these classic long distance ski races are in fact amateurs, not elite’ skiers. One of the findings of that analysis was that “Age Matters”, that is, the analysis revealed that the older a participant is, the worse his/her performance.

Earlier today, I did a quick & dirty basic analysis of today’s Tour de Ski race, which asop to Marcialonga, only invites the real elite’, i.e. the world top cross country skiers. So, does age matter for this exclusive group of elite’ skiers too…?

Let’s first look at the age spans for the two races, Marcialonga vs Tour de Ski: for Marcialonga, the ages span from 18 and up, with most skiers being in the 40+ and 50+ groups. For Tour de Ski, the ages span a much narrower range, from 20 to 38.

Let’s do a Linear Regression to see if age matters. First, here’s the graph using a “traditional” (i.e. “Frequentist”) statistical analysis method:

With the traditional, Frequentist statistical method, it indeed looks like age matters: both men and women have a regression line, sloping slightly upwards, which would indicate that there is a dependency between age and performance, in this case, each additional year of age would add about 6-7 seconds to the race time.

However, just by looking at the scattered dots (red for women, blue for men) it is really hard to see that there in fact exists any strong relationship between age and race time.

So, let’s see what can be done using Bayesian Linear Regression instead, where the uncertainty of any analysis is preserved, and can be illustrated very explicitly:

So here, instead of a single regression line for women, and one for men, as in the previous example, we have thousands of them for each gender: orange lines for women, green for men. From these regression lines, we can clearly see that there’s a whole lot of uncertainty about where the “true” regression resides, in fact, there is so much uncertainty that I’m willing to state that for this elite’ group of skiers, in this particular race, age does *not* impact performance, i.e there is no causal relationship between age and performance!

The graph also illustrates the underlying uncertainty of the data by the jagged colored (cyan, yellow) areas surrounding the regression lines: both the male as well as the female areas, “credible intervals”, are very wide. Take a look at e.g. this regression, for a comparison.

So. the bottom line is: age does not determine result in elite races.

import pandas as pd import numpy as np import matplotlib.pyplot as plt from datetime import datetime import seaborn as sns from matplotlib import rcParams import pymc as pm sns.set() rcParams.update({'figure.autolayout':True}) def read_data(f): df = pd.read_csv(f,sep=';',header=None) df.columns = ['Pos','Name','Born','Nat','Time'] def secs(x): s = datetime.strptime(x,'%M:%S.%f') return s.time().minute * 60 + s.time().second + ( s.time().microsecond / 1e9) df['Secs'] = df['Time'].apply(secs) df.sort_values('Secs',inplace=True) df['Diff'] = df['Secs'] - df['Secs'].shift() def diff2win(x): return x - df['Secs'].iloc[0] df['Diff2Win'] = df['Secs'].apply(diff2win) df['Age'] = 2018 - df['Born'] df.set_index(np.arange(1,len(df) + 1) ,inplace=True) return df df_F = read_data('final_climb_F.txt') print (df_F.to_string()) df_M = read_data('final_climb_M.txt') print (df_M.to_string()) #df_F = df_F.iloc[:28] #df_M = df_M.iloc[:28] m_f_timedelta = df_F.Secs / df_M.Secs m_f_timedelta *= 100 ### PYMC def regression(df,alpha,beta,sigma): x = df['Age'] @pm.deterministic() def time_means(x=x,alpha=alpha,beta=beta): return x * beta + alpha likelihood = pm.Normal('likelihood', mu=time_means, tau = 1 / sigma ** 2, observed=True, value=df['Secs']) model = pm.Model([alpha,beta,sigma,time_means,likelihood]) map_ = pm.MAP(model) map_.fit() mcmc = pm.MCMC(model) mcmc.sample(100000 ,50000 ,2) alpha_posterior = mcmc.trace('alpha')[:] beta_posterior = mcmc.trace('beta')[:] sigma_posterior = mcmc.trace('sigma')[:] time_means_posterior = mcmc.trace('time_means')[:,0] results = pd.DataFrame({'alpha_posterior':alpha_posterior, 'beta_posterior':beta_posterior, 'sigma_posterior':sigma_posterior, 'time_means_posterior':time_means_posterior}) return results m_alpha = pm.Uniform('alpha',1000,3000) m_beta = pm.Normal('beta',100,1 / 50 ** 2) m_sigma = pm.Normal('sigma',500,100) m_results = regression(df_M,m_alpha,m_beta,m_sigma) m_mean_alpha = m_results.alpha_posterior.mean() m_mean_beta = m_results.beta_posterior.mean() f_alpha = pm.Uniform('alpha',1500,3500) f_beta = pm.Normal('beta',100,1 / 50 **2) f_sigma = pm.Normal('sigma',500,100) f_results = regression(df_F,f_alpha,f_beta,f_sigma) f_mean_alpha = f_results.alpha_posterior.mean() f_mean_beta = f_results.beta_posterior.mean() nr_samples = 5000 sample_idx = np.random.choice(range(len(m_results)), replace=True, size=nr_samples) m_mean_time_samples = np.array([df_M.Age * m_results.beta_posterior.iloc[i] \ + m_results.alpha_posterior.iloc[i] \ for i in sample_idx]) f_mean_time_samples = np.array([df_F.Age * f_results.beta_posterior.iloc[i] \ + f_results.alpha_posterior.iloc[i] \ for i in sample_idx]) m_sample_times = np.array( [np.random.normal(m_mean_time_samples[i], m_results.sigma_posterior.iloc[sample_idx[i]])\ for i in range(len(sample_idx))]) m_ci = np.percentile(m_sample_times,[5.5,94.5],axis=1) f_sample_times = np.array( [np.random.normal(f_mean_time_samples[i], f_results.sigma_posterior.iloc[sample_idx[i]])\ for i in range(len(sample_idx))]) f_ci = np.percentile(f_sample_times,[5.5,94.5],axis=1) ### PLOTTING plt.figure(figsize=(18,12)) ax = plt.gca() ax.plot(df_M.Secs,'x-',color='b',label='Men') ax.plot(df_F.Secs,'x-',color='r',label='Women') ax.set_ylabel('Race Time [seconds]') ax.legend(loc='upper left') plt.title('Tour de Ski Final Climb 2019 - Time diff top 28 Women vs Men') plt.xlabel('Finishing Position in Race') ax2 = plt.twinx() ax2.set_ylabel("Female times relative to Men's times [%]") ax2.plot(m_f_timedelta,'x-',color='orange',label='Rel time diff women vs men') ax2.legend(loc='upper right') plt.savefig('final_climb_diff.jpg',format='jpg') plt.figure(figsize=(18,12)) plt.title('Tour de Ski Final Climb 2019 - Female Climb Times') plt.plot(range(len(df_F)),df_F.Secs,'x-',color='r') xticks = [df_F.Name.iloc[i] for i in range(len(df_F))] plt.xticks(range(len(df_F)),xticks,rotation='vertical') plt.xlabel('Race Results (left to right)') plt.ylabel('Race Time [seconds]') plt.savefig('final_climb_times_women.jpg',format='jpg') plt.figure(figsize=(18,12)) plt.title('Tour de Ski Final Climb 2019 - Male Climb Times') plt.plot(range(len(df_M)),df_M.Secs,'x-',color='b') xticks = [df_M.Name.iloc[i] for i in range(len(df_M))] plt.xticks(range(len(df_M)),xticks,rotation='vertical') plt.xlabel('Race Results (left to right)') plt.ylabel('Race Time [seconds]') plt.savefig('final_climb_times_men.jpg',format='jpg') plt.figure(figsize=(18,12)) plt.title('Tour de Ski 2019 Final Climb, time diff to winner') plt.plot(df_F['Diff2Win'],'x-',color='r',label='Women') plt.plot(df_M['Diff2Win'],'x-',color='b',label='Men') plt.xlabel('Race Position') plt.ylabel('Diff to winner [seconds]') plt.savefig('final_climb_diff_to_winner.jpg',format='jpg') plt.legend(loc='upper left') plt.figure(figsize=(18,12)) plt.title('Race Time distribution') plt.hist(df_M.Secs,color='b',bins=range(1800,2500,30),label='Men') plt.hist(df_F.Secs,color='r',bins=range(1800,2500,30),label='Women') plt.ylabel('Number of racers') plt.xlabel('Race Time [seconds]') plt.legend(loc='upper right') plt.savefig('final_climb_dist.jpg',format='jpg') plt.figure(figsize=(18,12)) plt.scatter(df_F.Age,df_F.Secs,color='r') plt.scatter(df_M.Age,df_M.Secs,color='b') plt.plot(df_F.Age,df_F.Age * f_mean_beta + f_mean_alpha,ls='dashed', color='orange') plt.plot(df_M.Age,df_M.Age * m_mean_beta + m_mean_alpha,ls='dashed', color='green') title_string = r'Men $\alpha$: {:.2f} $\beta$: {:.2f} Women $\alpha$: {:.2f} $\beta$: {:.2f}'.format(m_mean_alpha,m_mean_beta,f_mean_alpha,f_mean_beta) plt.title ('Tour de Ski 2019 Final climb: Regression age->Time ' + title_string) plt.xlabel('Age') plt.ylabel('Race TIme') for mts in m_mean_time_samples: plt.plot(df_M.Age,mts,color='green',ls='dashed',alpha=0.01) for mts in f_mean_time_samples: plt.plot(df_F.Age,mts,color='orange',ls='dashed',alpha=0.01) xvals = np.linspace(df_M.Age.min(),df_M.Age.max() +1, nr_samples) plt.fill_between(xvals,m_ci[0,:],m_ci[1,:], color='cyan',alpha=0.2) xvals = np.linspace(df_F.Age.min(),df_F.Age.max() +1, nr_samples) plt.fill_between(xvals,f_ci[0,:],f_ci[1,:], color='yellow',alpha=0.2) plt.show()