                     Speed    Latitud   Longitud
Time
2019-01-07 06:15:27      0  59.649582  17.721365
2019-01-07 06:16:28      0  59.649583  17.721372
2019-01-07 06:17:28      0  59.649583  17.721370
2019-01-07 06:18:28      0  59.649583  17.721372
2019-01-07 06:19:29      0  59.649600  17.721372
2019-01-07 06:20:06      6  59.649605  17.721278
2019-01-07 06:20:07      7  59.649625  17.721258
2019-01-07 06:20:08      6  59.649642  17.721247
2019-01-07 06:20:10      7  59.649670  17.721228
2019-01-07 06:20:14      7  59.649760  17.721238
2019-01-07 06:20:25     13  59.650020  17.721323
2019-01-07 06:20:28     13  59.650122  17.721323
2019-01-07 06:20:30     13  59.650187  17.721292

And say you’d want to calculate the distance between each GPS fix (great-circle distance, via the Haversine formula), as well as the cumulative distance: the Python code below does the trick.

import numpy as np
import math
import pandas as pd

df = pd.read_csv('gps_log.csv', sep=';', header=0, parse_dates=True,
                 index_col=0, usecols=[0, 1, 2, 3])

R = 6378.1e3  # Earth radius in meters

def pos2d(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance, returned in kilometers
    la1 = np.radians(lat1)
    lo1 = np.radians(lon1)
    la2 = np.radians(lat2)
    lo2 = np.radians(lon2)
    dla = np.radians(lat2 - lat1)
    dlo = np.radians(lon2 - lon1)
    a = math.sin(dla/2) * math.sin(dla/2) + \
        math.cos(la1) * math.cos(la2) * math.sin(dlo/2) * math.sin(dlo/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c / 1000

# distance between each fix and the previous one
df_shifted = df.shift()
df['Dist'] = np.vectorize(pos2d)(
    df['Latitud'], df['Longitud'], df_shifted['Latitud'], df_shifted['Longitud'])
df['Dist'] = df['Dist'].fillna(0)   # first row has no predecessor
df['TotDist'] = df['Dist'].cumsum()
print(df.to_string())

Output:

                     Speed    Latitud   Longitud      Dist   TotDist
Time
2019-01-07 06:15:27      0  59.649582  17.721365  0.000000  0.000000
2019-01-07 06:16:28      0  59.649583  17.721372  0.000409  0.000409
2019-01-07 06:17:28      0  59.649583  17.721370  0.000112  0.000522
2019-01-07 06:18:28      0  59.649583  17.721372  0.000112  0.000634
2019-01-07 06:19:29      0  59.649600  17.721372  0.001892  0.002527
2019-01-07 06:20:06      6  59.649605  17.721278  0.005317  0.007843
2019-01-07 06:20:07      7  59.649625  17.721258  0.002494  0.010338
2019-01-07 06:20:08      6  59.649642  17.721247  0.001991  0.012329
2019-01-07 06:20:10      7  59.649670  17.721228  0.003295  0.015624
2019-01-07 06:20:14      7  59.649760  17.721238  0.010034  0.025658
2019-01-07 06:20:25     13  59.650020  17.721323  0.029335  0.054993
2019-01-07 06:20:28     13  59.650122  17.721323  0.011355  0.066348
2019-01-07 06:20:30     13  59.650187  17.721292  0.007443  0.073791
2019-01-07 06:20:44     15  59.650737  17.720778  0.067708  0.141499
2019-01-07 06:20:45     15  59.650773  17.720725  0.004995  0.146493
2019-01-07 06:20:46     16  59.650790  17.720668  0.003723  0.150216
2019-01-07 06:20:49     15  59.650817  17.720532  0.008219  0.158435
2019-01-07 06:20:51     19  59.650817  17.720373  0.008943  0.167378
2019-01-07 06:20:58     17  59.650777  17.719705  0.037835  0.205213
2019-01-07 06:21:01     16  59.650817  17.719472  0.013841  0.219054
2019-01-07 06:21:05     19  59.650855  17.719117  0.020410  0.239465
2019-01-07 06:21:08     19  59.650858  17.718827  0.016315  0.255779
2019-01-07 06:21:13      9  59.650822  17.718467  0.020641  0.276421
2019-01-07 06:22:13     51  59.648910  17.707070  0.675463  0.951884
2019-01-07 06:22:16     48  59.648868  17.706333  0.041718  0.993602
2019-01-07 06:22:58      7  59.648642  17.702463  0.219134  1.212736
2019-01-07 06:23:00     13  59.648592  17.702390  0.006917  1.219653
2019-01-07 06:23:01     14  59.648550  17.702375  0.004751  1.224404
2019-01-07 06:23:04     30  59.648340  17.702412  0.023469  1.247873
2019-01-07 06:24:04     57  59.640857  17.704502  0.841256  2.089129
2019-01-07 06:24:26     56  59.637767  17.706400  0.360171  2.449300
2019-01-07 06:24:29     59  59.637297  17.707105  0.065658  2.514959


In order to access these APIs, you need a personal authorization key, which you can create at the Google Developer Console, and which in most cases will *not* be free (google it!).

Anyhow, below is a short Python example of how to map addresses – some of them very vague! – to GPS coordinates. Considering that Google’s Geocoding service covers the entire world, it’s all the more impressive that it can figure out the exact position of an address as vague as, for instance, “Globen” below…!

import requests

key = 'your-google-api-key'

def map_addr_to_gps(address):
    payload = {'address': address, 'key': key}
    url = 'https://maps.googleapis.com/maps/api/geocode/json'
    r = requests.get(url, params=payload)
    result = r.json()['results']
    # return the (lat, lon) of the first matching result
    for res in result:
        lat = res['geometry']['location']['lat']
        lon = res['geometry']['location']['lng']
        return (lat, lon)

addr1 = 'arlanda t5'
addr2 = 'stadshuset stockholm'
addr3 = 'Globen'

addresses = [addr1, addr2, addr3]
for a in addresses:
    gps = map_addr_to_gps(a)
    print(gps[0], gps[1])


[Btw, the weird looking vertical orange/blue “spike” in the time plot below revealed a bug on the official Vasaloppet Results page – at men’s finishing position 1652… ]


Earlier today, I did a quick & dirty basic analysis of today’s Tour de Ski race, which, as opposed to Marcialonga, only invites the real elite, i.e. the world’s top cross-country skiers. So, does age matter for this exclusive group of elite skiers too…?

Let’s first look at the age spans for the two races, Marcialonga vs Tour de Ski: for Marcialonga, the ages span from 18 and up, with most skiers being in the 40+ and 50+ groups. For Tour de Ski, the ages span a much narrower range, from 20 to 38.

Let’s do a Linear Regression to see if age matters. First, here’s the graph using a “traditional” (i.e. “Frequentist”) statistical analysis method:

With the traditional, frequentist statistical method, it indeed looks like age matters: both men and women get a regression line sloping slightly upwards, which would indicate that there is a dependency between age and performance – in this case, each additional year of age would add about 6-7 seconds to the race time.
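For reference, a frequentist fit like the one above can be done in a couple of lines with an ordinary least-squares polynomial fit. The (age, seconds) pairs below are made up for illustration – they are not the actual Tour de Ski results:

```python
import numpy as np

# Hypothetical (age, race-time-in-seconds) pairs, just to show the method
ages = np.array([22, 24, 25, 27, 28, 30, 31, 33, 35, 38])
secs = np.array([1850., 1845., 1870., 1860., 1880., 1875.,
                 1900., 1895., 1920., 1930.])

# degree-1 polynomial fit = ordinary least-squares regression line
slope, intercept = np.polyfit(ages, secs, 1)
r = np.corrcoef(ages, secs)[0, 1]
print("slope: %.1f sec/year, r: %.2f" % (slope, r))
```

The slope is the frequentist point estimate of "seconds added per year of age" – a single number, with no picture of the uncertainty around it.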

However, just by looking at the scattered dots (red for women, blue for men) it is really hard to see that there in fact exists any strong relationship between age and race time.

So, let’s see what can be done using Bayesian Linear Regression instead, where the uncertainty of any analysis is preserved, and can be illustrated very explicitly:

So here, instead of a single regression line for women and one for men, as in the previous example, we have thousands of them for each gender: orange lines for women, green for men. From these regression lines, we can clearly see that there’s a whole lot of uncertainty about where the “true” regression resides. In fact, there is so much uncertainty that I’m willing to state that for this elite group of skiers, in this particular race, age does *not* impact performance, i.e. there is no causal relationship between age and performance!

The graph also illustrates the underlying uncertainty of the data by the jagged colored (cyan, yellow) areas surrounding the regression lines: both the male as well as the female areas, “credible intervals”, are very wide. Take a look at e.g. this regression, for a comparison.

So, the bottom line is: age does not determine result in elite races.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import seaborn as sns
from matplotlib import rcParams
import pymc as pm

sns.set()
rcParams.update({'figure.autolayout': True})

def read_data(f):
    df = pd.read_csv(f, sep=';', header=None)
    df.columns = ['Pos', 'Name', 'Born', 'Nat', 'Time']

    def secs(x):
        s = datetime.strptime(x, '%M:%S.%f')
        return s.time().minute * 60 + s.time().second + (
            s.time().microsecond / 1e6)   # microseconds -> seconds

    df['Secs'] = df['Time'].apply(secs)
    df.sort_values('Secs', inplace=True)
    df['Diff'] = df['Secs'] - df['Secs'].shift()

    def diff2win(x):
        return x - df['Secs'].iloc[0]

    df['Diff2Win'] = df['Secs'].apply(diff2win)
    df['Age'] = 2018 - df['Born']
    df.set_index(np.arange(1, len(df) + 1), inplace=True)
    return df

df_F = read_data('final_climb_F.txt')
print(df_F.to_string())
df_M = read_data('final_climb_M.txt')
print(df_M.to_string())

#df_F = df_F.iloc[:28]
#df_M = df_M.iloc[:28]

m_f_timedelta = df_F.Secs / df_M.Secs
m_f_timedelta *= 100

### PYMC
def regression(df, alpha, beta, sigma):
    x = df['Age']

    @pm.deterministic()
    def time_means(x=x, alpha=alpha, beta=beta):
        return x * beta + alpha

    likelihood = pm.Normal('likelihood', mu=time_means,
                           tau=1 / sigma ** 2,
                           observed=True, value=df['Secs'])

    model = pm.Model([alpha, beta, sigma, time_means, likelihood])
    map_ = pm.MAP(model)
    map_.fit()
    mcmc = pm.MCMC(model)
    mcmc.sample(100000, 50000, 2)

    alpha_posterior = mcmc.trace('alpha')[:]
    beta_posterior = mcmc.trace('beta')[:]
    sigma_posterior = mcmc.trace('sigma')[:]
    time_means_posterior = mcmc.trace('time_means')[:, 0]

    results = pd.DataFrame({'alpha_posterior': alpha_posterior,
                            'beta_posterior': beta_posterior,
                            'sigma_posterior': sigma_posterior,
                            'time_means_posterior': time_means_posterior})
    return results

m_alpha = pm.Uniform('alpha', 1000, 3000)
m_beta = pm.Normal('beta', 100, 1 / 50 ** 2)
m_sigma = pm.Normal('sigma', 500, 100)
m_results = regression(df_M, m_alpha, m_beta, m_sigma)
m_mean_alpha = m_results.alpha_posterior.mean()
m_mean_beta = m_results.beta_posterior.mean()

f_alpha = pm.Uniform('alpha', 1500, 3500)
f_beta = pm.Normal('beta', 100, 1 / 50 ** 2)
f_sigma = pm.Normal('sigma', 500, 100)
f_results = regression(df_F, f_alpha, f_beta, f_sigma)
f_mean_alpha = f_results.alpha_posterior.mean()
f_mean_beta = f_results.beta_posterior.mean()

# draw regression lines from the posterior
nr_samples = 5000
sample_idx = np.random.choice(range(len(m_results)),
                              replace=True, size=nr_samples)
m_mean_time_samples = np.array([df_M.Age * m_results.beta_posterior.iloc[i]
                                + m_results.alpha_posterior.iloc[i]
                                for i in sample_idx])
f_mean_time_samples = np.array([df_F.Age * f_results.beta_posterior.iloc[i]
                                + f_results.alpha_posterior.iloc[i]
                                for i in sample_idx])

# generate individual race times around the sampled means,
# and an 89% credible interval
m_sample_times = np.array(
    [np.random.normal(m_mean_time_samples[i],
                      m_results.sigma_posterior.iloc[sample_idx[i]])
     for i in range(len(sample_idx))])
m_ci = np.percentile(m_sample_times, [5.5, 94.5], axis=1)

f_sample_times = np.array(
    [np.random.normal(f_mean_time_samples[i],
                      f_results.sigma_posterior.iloc[sample_idx[i]])
     for i in range(len(sample_idx))])
f_ci = np.percentile(f_sample_times, [5.5, 94.5], axis=1)

### PLOTTING
plt.figure(figsize=(18, 12))
ax = plt.gca()
ax.plot(df_M.Secs, 'x-', color='b', label='Men')
ax.plot(df_F.Secs, 'x-', color='r', label='Women')
ax.set_ylabel('Race Time [seconds]')
ax.legend(loc='upper left')
plt.title('Tour de Ski Final Climb 2019 - Time diff top 28 Women vs Men')
plt.xlabel('Finishing Position in Race')
ax2 = plt.twinx()
ax2.set_ylabel("Female times relative to Men's times [%]")
ax2.plot(m_f_timedelta, 'x-', color='orange', label='Rel time diff women vs men')
ax2.legend(loc='upper right')
plt.savefig('final_climb_diff.jpg', format='jpg')

plt.figure(figsize=(18, 12))
plt.title('Tour de Ski Final Climb 2019 - Female Climb Times')
plt.plot(range(len(df_F)), df_F.Secs, 'x-', color='r')
xticks = [df_F.Name.iloc[i] for i in range(len(df_F))]
plt.xticks(range(len(df_F)), xticks, rotation='vertical')
plt.xlabel('Race Results (left to right)')
plt.ylabel('Race Time [seconds]')
plt.savefig('final_climb_times_women.jpg', format='jpg')

plt.figure(figsize=(18, 12))
plt.title('Tour de Ski Final Climb 2019 - Male Climb Times')
plt.plot(range(len(df_M)), df_M.Secs, 'x-', color='b')
xticks = [df_M.Name.iloc[i] for i in range(len(df_M))]
plt.xticks(range(len(df_M)), xticks, rotation='vertical')
plt.xlabel('Race Results (left to right)')
plt.ylabel('Race Time [seconds]')
plt.savefig('final_climb_times_men.jpg', format='jpg')

plt.figure(figsize=(18, 12))
plt.title('Tour de Ski 2019 Final Climb, time diff to winner')
plt.plot(df_F['Diff2Win'], 'x-', color='r', label='Women')
plt.plot(df_M['Diff2Win'], 'x-', color='b', label='Men')
plt.xlabel('Race Position')
plt.ylabel('Diff to winner [seconds]')
plt.legend(loc='upper left')     # legend before savefig, so it ends up in the file
plt.savefig('final_climb_diff_to_winner.jpg', format='jpg')

plt.figure(figsize=(18, 12))
plt.title('Race Time distribution')
plt.hist(df_M.Secs, color='b', bins=range(1800, 2500, 30), label='Men')
plt.hist(df_F.Secs, color='r', bins=range(1800, 2500, 30), label='Women')
plt.ylabel('Number of racers')
plt.xlabel('Race Time [seconds]')
plt.legend(loc='upper right')
plt.savefig('final_climb_dist.jpg', format='jpg')

plt.figure(figsize=(18, 12))
plt.scatter(df_F.Age, df_F.Secs, color='r')
plt.scatter(df_M.Age, df_M.Secs, color='b')
plt.plot(df_F.Age, df_F.Age * f_mean_beta + f_mean_alpha, ls='dashed',
         color='orange')
plt.plot(df_M.Age, df_M.Age * m_mean_beta + m_mean_alpha, ls='dashed',
         color='green')
title_string = r'Men $\alpha$: {:.2f} $\beta$: {:.2f} Women $\alpha$: {:.2f} $\beta$: {:.2f}'.format(
    m_mean_alpha, m_mean_beta, f_mean_alpha, f_mean_beta)
plt.title('Tour de Ski 2019 Final climb: Regression age->Time ' + title_string)
plt.xlabel('Age')
plt.ylabel('Race Time')
for mts in m_mean_time_samples:
    plt.plot(df_M.Age, mts, color='green', ls='dashed', alpha=0.01)
for mts in f_mean_time_samples:
    plt.plot(df_F.Age, mts, color='orange', ls='dashed', alpha=0.01)
xvals = np.linspace(df_M.Age.min(), df_M.Age.max() + 1, nr_samples)
plt.fill_between(xvals, m_ci[0, :], m_ci[1, :], color='cyan', alpha=0.2)
xvals = np.linspace(df_F.Age.min(), df_F.Age.max() + 1, nr_samples)
plt.fill_between(xvals, f_ci[0, :], f_ci[1, :], color='yellow', alpha=0.2)
plt.show()


First, some basic stats about the race:

Total number of participants: 5558, of which 4660 were men and 898 women. All these folks came from 34 different countries. Below is a graphic showing the distribution of nationalities, and we can see that Marcialonga is clearly dominated by the Vikings, at least in man power…
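A nationality breakdown like this is a one-liner in Pandas with `value_counts`. The tiny series below is a made-up stand-in for the scraped nationality field, just to show the mechanics:

```python
import pandas as pd

# Hypothetical nationality codes -- the real data has 5558 rows, 34 countries
nat = pd.Series(['SWE', 'NOR', 'SWE', 'ITA', 'NOR', 'SWE', 'GER', 'NOR'])

# counts per nationality, sorted most-frequent first
counts = nat.value_counts()
print(counts)
```

The same call on a surname column gives the last-name distribution mentioned below.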

Further evidence of Viking dominance can be obtained by looking at the distribution of last names of the participants: Lots of Johansson, Andersson etc…!

Furthermore, as the next graphic demonstrates, the event is not exactly dominated by the millennials – the dominant age group is the 50+ folks – perhaps ski racing is way too hard work for the young & beautiful of today…?

So, with the basic stats over and done with, let’s delve into some race performance matters. First, let’s look at the finishing time distribution, for men as well as women:

First, observe that the histogram clearly shows that there are way more men than women participating. Secondly, it’s interesting that the men have a long right-hand tail: there are, both in absolute and relative terms, many male participants at the higher end of race times, i.e. relatively speaking, poor performers. The female distribution is much more “normal”, i.e. the tails are more evenly balanced. One way to interpret this is that the participating men are more split: one fairly large part is very fit, another large part is less well trained, and quite a few are very poorly trained (relatively speaking – myself, I would probably die if I tried to ski that race…), while the women are more “equal” in their performance. The difference in tails between men & women can also be seen in the distances between the respective means and medians: the gap is quite a bit wider for the men.
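That mean-versus-median gap is a handy quick-and-dirty skewness indicator: a long right tail pulls the mean above the median. A minimal sketch on toy race times (not the actual Marcialonga data):

```python
import numpy as np

# Toy race-time samples in seconds: the "male" set has a long right tail,
# the "female" set is symmetric
male_secs = np.array([11000., 12000., 13000., 15000., 19000., 26000., 33000.])
female_secs = np.array([13000., 14000., 15000., 16000., 17000., 18000., 19000.])

# mean - median: large and positive => right-skewed distribution
male_gap = male_secs.mean() - np.median(male_secs)
female_gap = female_secs.mean() - np.median(female_secs)
print(male_gap, female_gap)
```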

Let’s next select the Swedish team – after all this is a Swedish blog – and look at their performance:

Here we have the overall race results, that is, men as well as women ordered by race time into a single results list, ranging from place 1 to place 5558. Again, we can see something interesting here: the men are much more evenly – “uniformly”, in stats speak – located across the results list, while the women are “tail heavy”, that is, more frequently residing at the lower spots of the list. Remember that the results list is a joint list, consisting of both genders, so this is not such a big surprise – except for those who believe that “gender is a purely social construct“.
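Building such a joint results list is just a sort over the combined times while keeping track of gender. A sketch on made-up times (the real list has 5558 entries):

```python
import numpy as np

# Toy finishing times in seconds -- illustrative only
male = np.array([10500., 10800., 11000., 11500., 12000.])
female = np.array([11800., 12500., 13000., 13500., 14000.])

times = np.concatenate([male, female])
genders = np.array(['M'] * len(male) + ['F'] * len(female))

order = np.argsort(times)   # joint results list, fastest first
joint = genders[order]      # gender at each overall finishing position
print(joint)
```

Plotting where each gender lands in `joint` gives exactly the kind of position distribution discussed above.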

Looking at the male distribution, it’s interesting to notice how uniform it is with a small peak just before place 1000 – my interpretation of that is that many of the participating Swedish men are fairly serious amateurs, putting quite a lot of effort into their skiing and general physical fitness.

Since we are at it, let’s analyze the general performance difference between men vs women:

The above graphic might look a bit busy for some, but rest assured, it’s really not that complicated: on the horizontal x-axis are the race result spots, ranging from 1 to 5558.

Most of the plots are plotted against the left-hand vertical y-axis, which shows race time in seconds. Against that axis, I’ve mapped three plots: male race positions (blue), female race positions (red), and the time difference between them (orange). These plots show some interesting things: look at the difference in slope between the female and the male plots. That difference tells us that for women, particularly for the top spots, the differences in race time are very large, while the men are much more even in their performance, all over the spectrum. We can also see that for both genders, the laggards are really, really far behind those in front – the slope at the end of the plots is almost vertical.

Plotted against the right-hand y-axis is the relative (think percent) difference between the first 898 men and the 898 women (the reason for that odd number is that it equals the number of female participants, since I compare men’s finishing spots with women’s finishing spots). Anyways, from that cyan plot we can see that for the top performers, e.g. the first 3-4 men vs the first 3-4 women, the time difference is about 15% (115 on the right-hand vertical scale). That is, the (presumably) elite men and women who finish in the first 3-4 spots of their respective group have a time difference of about 15%. That number is extremely consistent with what I observed in my analysis last weekend of Tour de Ski, where all participants are real elite skiers. Here, for Marcialonga, a race dominated by serious but amateur skiers, with only a handful of real elite skiers, the difference in race time between men and women grows very rapidly: already at the 100th place, the difference in finishing time is almost 60%, and at around the 800th spot, the women take about twice as long as the men to finish the race.
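The "relative difference" on that right-hand axis is simply the female time expressed as a percentage of the male time at the same finishing spot. A minimal sketch, with illustrative numbers rather than the real results:

```python
# Female time as a percentage of the male time at the same finishing spot:
# 100 means equal, 115 means the woman is 15% slower
def rel_diff_pct(t_female, t_male):
    return t_female / t_male * 100

# e.g. 11500 s vs 10000 s at the same spot
print(rel_diff_pct(11500, 10000))
```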

Now, let’s look at a slightly more advanced analysis of the race results: a Bayesian Linear Regression, where we look at if and how age impacts race time:

Here, on the horizontal x-axis, I’ve put the age groups of the race. Age groups are simply an aggregation of the ages of the competitors into discrete groups, e.g. 18-30, 30-40 etc. On the vertical axis, we again have race finishing time. The blue vertical “bars” are actually dots, each representing the time of one racer of that age; there are 5558 such blue dots on the graph.
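In Pandas, that aggregation into discrete age groups is what `pd.cut` does. The ages and bin boundaries below are assumptions for illustration, not the actual race grouping:

```python
import pandas as pd

# Hypothetical competitor ages
ages = pd.Series([19, 23, 31, 38, 45, 52, 58, 64, 71])

# bucket ages into discrete, labelled groups (boundaries are assumptions)
groups = pd.cut(ages, bins=[18, 30, 40, 50, 60, 80],
                labels=['18-30', '30-40', '40-50', '50-60', '60+'])
print(groups.value_counts().sort_index())
```

Note that `pd.cut` intervals are right-inclusive by default, so age 30 lands in "18-30" and 31 in "30-40".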

The objective here is to figure out if and how age impacts race results. Without going into all the nitty-gritty mathematical & computational details of Linear Regression, let me just point out the yellow lines: they are all sloping (slightly) upwards. That tells us that there is a dependency between age and race time, and that dependency is such that the older you are, the worse your performance. Not that surprising, right? At least not for us senior citizens… Furthermore, the graph can also quantify that relationship: according to this data and my analysis, each additional year of age slows you down by about 60 seconds on the Marcialonga track! So if you did your first race at 20 years of age, you can expect your finishing time at your 40th-anniversary race to be about 40 minutes longer than your first race.
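The arithmetic behind that last claim, spelled out:

```python
# ~60 seconds lost per year of age (the slope from the regression above),
# accumulated over the 40 years to the 40th-anniversary race
slope_sec_per_year = 60
years = 40
extra_minutes = slope_sec_per_year * years / 60
print(extra_minutes)   # -> 40.0
```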

Postscript: a couple of bonus graphics:

First, let’s define “Best Nation” as the nation with the best mean (average) race time:
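With that definition, "Best Nation" falls out of a groupby-mean. A sketch on toy data with assumed column names (`nat`, `secs`), not the real results:

```python
import pandas as pd

# Toy results: nationality and finishing time in seconds
df = pd.DataFrame({'nat':  ['SWE', 'SWE', 'NOR', 'NOR', 'ITA'],
                   'secs': [12000, 14000, 11000, 11500, 16000]})

# mean race time per nation; the "best" nation has the lowest mean
mean_times = df.groupby('nat')['secs'].mean().sort_values()
best_nation = mean_times.index[0]
print(best_nation, mean_times.iloc[0])
```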

Next, a Kernel Density Estimation of age distribution:
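A KDE like the one graphed can be produced with SciPy's `gaussian_kde`; the ages below are synthetic stand-ins for the real field:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical competitor ages
ages = np.array([25, 32, 38, 41, 44, 47, 51, 53, 55, 58, 61, 67], dtype=float)

# fit a Gaussian kernel density estimate and evaluate it on an age grid
kde = gaussian_kde(ages)
grid = np.linspace(18, 80, 200)
density = kde(grid)
peak_age = grid[density.argmax()]
print(peak_age)
```

`peak_age` is the mode of the smoothed distribution – for the real Marcialonga field it would land in the 50+ region discussed earlier.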

Next, best race time per nation:


Anyways, I wanted to scrape some data for further analysis down the line. Normally, I’d put Pandas to heavy use for a lot of the data munging tasks, but since my neighborhood is currently experiencing the Mother of all Power outages, after the storm of the century this past Tuesday, I don’t have power to run my computer – thus, this stuff is done with Pythonista (2!) on my old & tired iPad. Btw, Pythonista is a really great Python environment for i*, and I guess some day I should upgrade to Pythonista 3… But I sure miss Pandas; it makes structured data manipulation so very much more convenient than using lists or even numpy arrays…!

So, for this exercise I wanted to collect the results from a very famous long-distance cross-country ski race, the Marcialonga, which takes place each January in my favorite place on this earth, Val di Fiemme, in the Dolomites. I’d have preferred the organisers of the race to publish the results in a more easy-to-grab format than HTML, but I couldn’t find anything else but the web page – which, furthermore, splits the 5600 result entries over 56 pages, so it took a while to figure out how to scrape multiple linked pages.

Below are a couple of screenshots of an initial, very basic analysis. I’ll do quite a bit more statistical analysis if and when the power grid resumes operations…

# coding: utf-8
# Python 2 (Pythonista 2)
import requests
from requests.utils import quote
import re
from bs4 import BeautifulSoup as bs
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt

comp_list = []

# results are in 56 separate pages
for page in range(1, 57):
    print page
    url = 'https://www.marcialonga.it/marcialonga_ski/EN_results.php'
    payload = {'pagenum': page}
    r = requests.get(url, params=payload)
    print r.status_code
    c = r.content
    soup = bs(c, 'html.parser')
    main_table = soup.find('table')
    competitors = main_table.find_all(class_='SP')
    for comp in competitors:
        comp_list.append(comp.get_text())

comp_list = list(map(lambda x: x.encode('utf-8'), comp_list))
print len(comp_list)

def parse_item(i):
    # pull position, name, nationality, gender, age group and time
    # out of the flattened table-cell text
    res_pattern = r'[0-9]+'
    char_pattern = r'[A-Z]+'
    num_pattern = r'[0-9]+:[0-9]+:[0-9]+\.[0-9]'
    age_pattern = r'[0-9]+/'
    res = re.match(res_pattern, i).group()
    chars = re.findall(char_pattern, i, flags=re.IGNORECASE)
    nat = chars[-1][1:]
    nums = re.findall(num_pattern, i)
    age = re.findall(age_pattern, i)
    age_over = age[0][:-1]
    time_pattern = r'0[0-9]:[0-9]+:[0-9]+\.[0-9]$'
    time = re.findall(time_pattern, nums[0])
    t = datetime.strptime(time[0], '%H:%M:%S.%f')
    td = (t - datetime(1900, 1, 1)).total_seconds()
    name = chars[0] + ' ' + chars[1]
    gender = chars[-2]
    return (res, name, nat, gender, age_over, td)

results = []
for comp in comp_list:
    results.append(parse_item(comp))

gender = np.array([results[i][3] for i in range(len(results))])
pos = np.array([results[i][0] for i in range(len(results))])
ages = np.array([results[i][4] for i in range(len(results))]).astype(int)
secs = np.array([results[i][5] for i in range(len(results))])
print ages.size
print secs.size

friend_1 = 18844
friend_2 = 24446
male_mask = gender == 'M'
male_secs = secs[male_mask]
female_secs = secs[~male_mask]
male_mean = male_secs.mean()
female_mean = female_secs.mean()

bins = range(10000, 38000, 1000)
plt.subplot(211)
plt.hist(male_secs, color='b',
         weights=np.zeros_like(male_secs) + 1. / male_secs.size,
         alpha=0.5, label='Men', bins=bins)
plt.hist(female_secs, color='r',
         weights=np.zeros_like(female_secs) + 1. / female_secs.size,
         alpha=0.5, label='Women', bins=bins)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Relative Frequency')
plt.axvline(friend_1, ls='dashed', color='cyan', label='Friend_1', lw=5)
plt.axvline(friend_2, ls='dashed', color='magenta', label='Friend_2', lw=5)
plt.axvline(male_mean, ls='dashed', color='darkblue', label='Men mean', lw=5)
plt.axvline(female_mean, ls='dashed', color='darkred', label='Women mean', lw=5)
plt.legend(loc='upper right')

print secs.min(), secs.max()

plt.subplot(212)
plt.hist(male_secs, color='b', alpha=0.5, label='Men', bins=bins)
plt.hist(female_secs, color='r', alpha=0.5, label='Women', bins=bins)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Nr of Skiers')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()

the end

Data taken from today’s Tour de Ski sprint qualification times, for the top 30 women vs men, where both men & women used the same track and the same distance.

Turns out that there is a consistent time difference of about 15% between men and women, and that even the best female skier is far behind the worst male skier in performance: the difference between the best woman and the worst man is about 8% in time.
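Both comparisons are simple ratios of qualification times. A sketch on made-up times (seconds), not the actual Tour de Ski data:

```python
# Illustrative qualification times, sorted fastest first
men = [185.0, 190.0, 195.0, 200.0]
women = [212.8, 218.0, 224.0, 230.0]

# best woman vs best man, and best woman vs slowest man, in percent
top_diff_pct = (women[0] / men[0] - 1) * 100
gap_pct = (women[0] / men[-1] - 1) * 100
print(round(top_diff_pct, 1), round(gap_pct, 1))
```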

And if we include all qualification participants, 68 women and 80 men, the graphs look as follows:

And finally, a Bayesian Linear Regression demonstrating the variability, using a 89% credible interval, within the two datasets, where it can be visually observed that the variability in the female group is much larger than in the male group:

Data in a csv file describes various attributes such as weight, height, age and gender of an indigenous people, the !Kung; below an extract:

"height";"weight";"age";"male"
151.765;47.8256065;63;1
139.7;36.4858065;63;0
136.525;31.864838;65;0
156.845;53.0419145;41;1
145.415;41.276872;51;0
163.83;62.992589;35;1
149.225;38.2434755;32;0
168.91;55.4799715;27;1

Here, I’m using weight as a predictor variable, and the objective is to predict height based on weight.

First, a histogram on population heights:

Next, a histogram of regression priors:

Next, posterior distributions:

Finally, a regression plot including a sampling from the posterior.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc as pm
import seaborn as sns

sns.set()
np.random.seed(4711)
bins = 20

df = pd.read_csv('../Stat_Rethink/Howell1.csv', header=0, sep=';')
df = df.loc[df['age'] >= 18]
df['weight_c'] = df['weight'] - np.mean(df['weight'])
df['weight_s'] = df['weight_c'] / df['weight_c'].std()
print(df.describe())

df['height'].hist(bins=bins, figsize=(18, 12))
plt.title('Overall Height Distribution')
plt.xlabel('Height')
plt.ylabel('Frequency')
plt.savefig('pymc_basic_linear_0.jpg', format='jpg')

# regression coefficients - priors
alphas = pm.Normal('alphas', mu=178, tau=1 / (50 * 50))
betas = pm.Lognormal('betas', 0, 1 / (1 * 1))

alpha_prior_samples = np.array([alphas.random() for i in range(10000)])
beta_prior_samples = np.array([betas.random() for i in range(10000)])
priors = pd.DataFrame({'alpha_prior': alpha_prior_samples,
                       'beta_prior': beta_prior_samples})
priors.hist(bins=bins, figsize=(18, 12))
plt.savefig('pymc_linear_basic_1.jpg', format='jpg')

# regression function, to generate a distribution of mean heights per weight
@pm.deterministic()
def heights_mu(x=df.weight_s, alpha=alphas, beta=betas):
    return x * beta + alpha

sigma = pm.Uniform('sigma', 0, 100)

# likelihood, using the mean heights from above as mu
fit = pm.Normal('mean_heights', mu=heights_mu, tau=1 / sigma ** 2,
                value=df.height, observed=True)

model = pm.Model([alphas, betas, heights_mu, sigma, fit])
map_ = pm.MAP(model)
map_.fit()
mcmc = pm.MCMC(model)
mcmc.sample(10000, 2000, 2)

alpha_posterior = mcmc.trace('alphas')[:]
beta_posterior = mcmc.trace('betas')[:]
heights_mu_posterior = mcmc.trace('heights_mu')[:, 0]
sigma_posterior = mcmc.trace('sigma')[:]

results = pd.DataFrame({'alpha_posterior': alpha_posterior,
                        'beta_posterior': beta_posterior,
                        'heights_mu_posterior': heights_mu_posterior,
                        'sigma_posterior': sigma_posterior})
print(results.head())
print(results.describe())
results.hist(bins=20, figsize=(18, 12))
plt.savefig('pymc_basic_linear_2.jpg', format='jpg')

# sample posterior
nr_samples = 1000
sample_index = np.random.choice(range(len(results)),
                                replace=True, size=nr_samples)
sample_alphas = results.alpha_posterior.iloc[sample_index]
sample_betas = results.beta_posterior.iloc[sample_index]
sample_sigmas = results.sigma_posterior.iloc[sample_index]

# must sort for fill_between to work!
sample_weights = df.weight_s.sort_values()

# for each weight, generate mean heights
sample_mean_heights = np.array([sample_weights * sample_betas.iloc[i] +
                                sample_alphas.iloc[i]
                                for i in range(len(sample_index))])

# plot linear regression based on mean heights for each weight
plt.figure(figsize=(18, 12))
plt.title('Posterior Samples (green) and Data Points (red)')
for h in sample_mean_heights:
    plt.plot(sample_weights, h, color='yellow', alpha=0.05)

# for each weight, generate heights ~ N(mu,sigma)
sample_heights = np.array([np.random.normal(
    sample_mean_heights[:, i], sample_sigmas) for i in range(
        len(sample_weights))])

# credible interval 89%
ci = np.percentile(sample_heights, [5.5, 94.5], axis=1)

count = 0
for s in range(nr_samples):
    if count == 0:
        plt.scatter(sample_weights, sample_heights[:, s], color='g',
                    alpha=0.7, label='Generated Sample Point')
    else:
        plt.scatter(sample_weights, sample_heights[:, s], color='g',
                    alpha=0.05)
    count += 1

plt.scatter(df['weight_s'], df['height'], color='r')
plt.fill_between(df.weight_s.sort_values(), ci[0, :], ci[1, :],
                 color='orange', alpha=0.4, label='credible interval 89%')
plt.xlabel('Standardized weight')
plt.ylabel('Height')
plt.legend()
plt.savefig('pymc_basic_linear_3.jpg', format='jpg')
plt.show()
