Marcialonga Ski 2018 – some Analytics

Now, with the power grid finally – after 62 hours! – back in business, I’m able to continue my stats/analytics exploration of the past Marcialonga ski race.

First, some basic stats about the race:

Total number of participants: 5558, of which 4660 were men and 898 were women. These folks came from 34 different countries. Below is a graphic showing the distribution of nationalities, and we can see that Marcialonga is clearly dominated by the Vikings, at least in man power… 🙂

marcialonga_nation_counts
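For the curious, counts like these could be produced along the following lines. This is only a minimal sketch: the rows are made-up toy data (not the real results list), and the column names `gender` and `nation` are my assumptions about how such a table might look.

```python
import pandas as pd

# Hypothetical toy rows, NOT the real Marcialonga results.
# The column names "gender" and "nation" are assumptions.
results = pd.DataFrame({
    "gender": ["M", "M", "F", "M", "F", "M"],
    "nation": ["SWE", "NOR", "ITA", "SWE", "NOR", "ITA"],
})

# Number of participants per gender and per nation, largest first.
gender_counts = results["gender"].value_counts()
nation_counts = results["nation"].value_counts()

# Number of distinct nations represented.
n_nations = results["nation"].nunique()

print(gender_counts)
print(nation_counts)
print("nations:", n_nations)
```

A bar chart of `nation_counts` would then give a graphic similar in spirit to the one above.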

Further evidence of Viking dominance can be obtained by looking at the distribution of last names of the participants: Lots of Johansson, Andersson etc…! 😉

marcialonga_names_dist

Furthermore, as the next graphic demonstrates, the event is not exactly dominated by the millennials – the dominant age group is the 50+ folks – perhaps ski racing is way too hard work for the young & beautiful of today…?

figure_4

So, with the basic stats over and done with, let’s delve into some race performance matters. First, let’s look at the finishing time distribution, for men as well as women:

figure_5

First, observe that the histogram clearly shows that there are way more men than women participating. Secondly, it’s interesting that the men have a long right-hand tail, that is, there are, both in absolute and in relative terms, many male participants at the higher end of race times, i.e. relatively speaking, poor performers. The female distribution is much more “normal”, i.e. the tails are more evenly distributed. One way to interpret this is that the participating men are more split: one fairly large part is very fit, and another large part is less well trained, with quite a few very poorly trained (relatively speaking; myself, I would probably die if I tried to ski that race…). The women, by contrast, are more “equal” in their performance. The difference in tails between men & women can also be seen in the distance between the respective means and medians: the gap is quite a bit wider for the men.
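The mean-vs-median point can be illustrated with a small sketch. The finishing times below are invented toy numbers (in seconds), and the column names `gender` and `time_s` are assumptions; the only purpose is to show how a single straggler in the right tail pulls the mean away from the median.

```python
import pandas as pd

# Hypothetical toy finishing times in seconds, NOT real race data.
results = pd.DataFrame({
    "gender": ["M"] * 5 + ["F"] * 5,
    "time_s": [
        10000, 11000, 12000, 13000, 24000,  # men: one very slow finisher
        14000, 15000, 16000, 17000, 18000,  # women: symmetric spread
    ],
})

# Mean and median per gender, plus the gap between them.
summary = results.groupby("gender")["time_s"].agg(["mean", "median"])
summary["gap"] = summary["mean"] - summary["median"]

print(summary)
```

In this toy example the men's mean sits 2000 seconds above their median, while the women's mean and median coincide. A positive gap of that kind is what a long right-hand tail produces.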

Let’s next select the Swedish team – after all this is a Swedish blog – and look at their performance:

marcia_longa_swedish

Here we have the overall race results, that is, men as well as women ordered by race time into a single results list, ranging from place 1 to place 5558. Again, we can see something interesting here, e.g. that the men are much more evenly (“uniformly” in stats speak) located on the “podium list”, while the women are “tail heavy”, that is, more frequently residing at the lower spots of the results list. Remember that the results list is a joint list, consisting of both genders, so this is not such a big surprise, except for those that believe that “gender is a purely social construct”.

Looking at the male distribution, it’s interesting to notice how uniform it is with a small peak just before place 1000 – my interpretation of that is that many of the participating Swedish men are fairly serious amateurs, putting quite a lot of effort into their skiing and general physical fitness.

Since we are at it, let’s analyze the general performance difference between men vs women:

marcialonga_time_diff

The above graphic might look a bit busy for some, but rest assured, it’s really not that complicated: on the horizontal x-axis are the race result spots, ranging from 1 to 5558.

Most of the plots are plotted against the left-hand vertical y-axis, which shows race time in seconds. Against that axis, I’ve mapped three plots: male race positions (blue), female race positions (red), and the time difference between them (orange). These plots show some interesting things: look at the difference in slope between the female and the male plots. That difference tells us that for women, particularly for the top spots, the differences in race time are very large, while the men are much more even in their performance, all over the spectrum. We can also see that for both genders, the laggards are really far behind those in front: the slope at the end of the plots is almost vertical.

Plotted against the right-hand y-axis is the relative (think percent) difference between the first 898 men and the 898 women (the reason for that weird number is that it is the number of female participants, and since I compare men’s finishing spots with women’s finishing spots, only the first 898 spots have a counterpart). Anyway, from that cyan plot we can see that for the top performers, e.g. the first 3-4 men vs the first 3-4 women, the time difference is about 15% (115 on the right-hand vertical scale). That is, the (presumably) elite men and women who finish in the first 3-4 spots of their respective group have a time difference of about 15%. That number is extremely consistent with what I observed in my analysis last weekend of Tour de Ski, where all participants are real elite skiers. Here, for Marcialonga, a race dominated by serious but amateur skiers, with only a handful of real elite skiers, the difference in race time between men and women grows very rapidly: already at the 100th podium place, the difference in finishing time is almost 60%, and at the 800th or so finishing spots, the women take about 2x as long to finish the race.
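A rank-by-rank comparison of this kind can be sketched as below. The function name `relative_gap_percent` and the times are hypothetical; I am assuming the right-hand scale is simply the women's time as a percentage of the men's time at the same within-gender finishing spot, which is what the "115 means 15% slower" reading suggests.

```python
import numpy as np

def relative_gap_percent(men_times, women_times):
    """Women's time as a percentage of men's time, spot by spot.

    Both lists are sorted by time, then truncated to the size of the
    smaller group so that every spot has a counterpart.
    """
    men = np.sort(np.asarray(men_times, dtype=float))
    women = np.sort(np.asarray(women_times, dtype=float))
    n = min(len(men), len(women))
    return 100.0 * women[:n] / men[:n]

# Hypothetical toy times in seconds, NOT real race data.
men = [10000, 10500, 11000, 12000, 13000]
women = [11500, 12600, 16500]

ratio = relative_gap_percent(men, women)
print(ratio)
```

With these toy numbers the first spot comes out at 115, i.e. the fastest woman needs 15% more time than the fastest man, and the gap widens further down the list.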

Now, let’s look at a slightly more advanced analysis of the race results: a Bayesian Linear Regression, where we look at if and how age impacts race time:

marcialonga_linear_regression

Here, on the horizontal x-axis, I’ve put the age groups of the race. Age groups are simply an aggregation of the age of each competitor into discrete groups, e.g. 18-30, 30-40 etc. On the vertical axis, we again have race finishing time. The blue vertical “bars” are actually dots, each representing the time of a racer in that age group. There are 5558 such blue dots on the graph.
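Grouping ages into buckets like that can be sketched with `pandas.cut`. The bin edges and labels below are my own assumptions for illustration, not the official Marcialonga age classes, and the ages are made up.

```python
import pandas as pd

# Hypothetical ages, NOT real participants.
ages = pd.Series([18, 29, 30, 45, 72])

# Assumed bin edges; right=False makes each bin include its lower
# edge and exclude its upper edge, so 30 falls into "30-39".
edges = [18, 30, 40, 50, 60, 120]
labels = ["18-29", "30-39", "40-49", "50-59", "60+"]

age_group = pd.cut(ages, bins=edges, labels=labels, right=False)
print(list(age_group))
```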

The objective here is to figure out if and how age impacts race results. Not going into all the nitty-gritty mathematical & computational details of Linear Regression, let me just point out the yellow lines: they are all sloping (slightly) upwards. That tells us that there is a dependency between age and race time, and that dependency is such that the older you are, the worse your performance. Not that surprising, right? At least not for us senior citizens… Furthermore, the graph can also quantify that relationship: again, without going into all the nitty-gritty detail, let me just finish up this post by saying that according to this data and my analysis, each additional year of age slows you down by about 60 seconds on the Marcialonga track! 🙂 So if you did your first race at 20 years of age, you can expect your finishing time at your 40th anniversary race, 40 years later, to be about 40 minutes longer than in your first race (40 years × 60 seconds = 2400 seconds).
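To make the "seconds per year" reading of the slope concrete, here is a minimal sketch. Note that it is not the Bayesian regression used for the graphic: it is a plain least-squares line fit with `numpy.polyfit`, on synthetic, noise-free data that I constructed to have a slope of exactly 60 seconds per year. The base time of 15000 seconds is an arbitrary assumption.

```python
import numpy as np

# Synthetic data, NOT real race results: one racer per age 20..70,
# each year of age adding exactly 60 seconds to an assumed base time.
ages = np.arange(20, 71, dtype=float)
base_time_s = 15000.0  # hypothetical finishing time at age 20
times = base_time_s + 60.0 * (ages - 20.0)

# Ordinary least-squares straight line: time = slope * age + intercept.
slope, intercept = np.polyfit(ages, times, deg=1)

# Extra time after 40 more years, converted from seconds to minutes.
extra_minutes = slope * 40 / 60

print(f"slope: {slope:.1f} s/year, extra after 40 years: {extra_minutes:.1f} min")
```

On real data the points scatter widely around the line, which is where a Bayesian fit earns its keep: instead of one slope it gives a whole distribution of plausible slopes, hence the several yellow lines.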

Postscript: a couple of bonus graphics:

First, let’s define “Best Nation” as the nation with the best mean (average) race time:

marcialonga_best_nation
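A ranking by mean race time boils down to a group-by. As before, this is a sketch on invented toy data with assumed column names (`nation`, `time_s`), not the real results.

```python
import pandas as pd

# Hypothetical toy rows, NOT real race data; times in seconds.
results = pd.DataFrame({
    "nation": ["SWE", "SWE", "NOR", "NOR", "ITA", "ITA"],
    "time_s": [15000, 17000, 14000, 16000, 18000, 20000],
})

# Mean finishing time per nation, fastest first.
mean_by_nation = results.groupby("nation")["time_s"].mean().sort_values()
best_nation = mean_by_nation.index[0]

print(mean_by_nation)
print("best nation:", best_nation)
```

One caveat with this definition: a nation with only a handful of (fast) participants can easily top such a list, so it may be worth requiring a minimum number of racers per nation.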

Next, a Kernel Density Estimation of age distribution:

marcia_joint.jpg

Next, best race time per nation:

marcialonga_best_of_nation

marcialonga_race_time_dist

About swdevperestroika

High tech industry veteran, avid hacker reluctantly transformed to mgmt consultant.
This entry was posted in Bayes, Data Analytics, Numpy, Pandas, Probability, PYMC, Python, sports, Statistics.
