Ski Tour 2020 Östersund – Race Analysis

I took the results from last Sunday’s Women’s Pursuit, 10 km, and did a bit of analysis. (The official results can be found here.)

Since a pursuit means that each racer starts the current race with a delay based on her performance in the previous race, it’s impossible to see from the official results who skied fastest, i.e. who completed the current race in the shortest time. So, I subtracted each competitor’s start delay from her official result time, and it’s pretty interesting to see that the “impossible” Therese Johaug was not fastest. In fact, she was only third fastest in this race.
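The subtraction described above can be sketched in a few lines of pandas. This is only an illustration: the skier labels, column names and times are made up, and it assumes the official time is measured from the first starter's start.

```python
import pandas as pd

# Made-up pursuit results: finish time on the race clock and each skier's
# start delay, both in seconds. Names and numbers are purely illustrative.
results = pd.DataFrame({
    "skier": ["A", "B", "C"],
    "start_delay_s": [0.0, 35.0, 80.0],
    "finish_s": [1700.0, 1730.5, 1772.0],
})

# Net skiing time = official time minus the start delay.
results["net_s"] = results["finish_s"] - results["start_delay_s"]

# Seconds behind the fastest net time of the day.
results["behind_fastest_s"] = results["net_s"] - results["net_s"].min()

ranking = results.sort_values("net_s").reset_index(drop=True)
print(ranking[["skier", "net_s", "behind_fastest_s"]])
```

In this toy example the skier who started last (C) turns out to have the best net time, even though she crosses the finish line last.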

The best time in this race belonged to Ana Maria Lampic, the Slovenian sprint world cup leader, followed by Frida Karlsson, 3.1 sec behind Lampic. Therese Johaug was a further 1.2 sec behind Frida.

Below is a graph showing the number of seconds each competitor was behind Lampic:


Let’s zoom in on the Swedish skiers:


Next, let’s look at the trend over the 4 sections of the course (start to 1.6 km, 1.6 km to 4.9 km, 4.9 km to 8.2 km, and 8.2 km to the finish at 10 km) for the top-10 fastest skiers. In the graph below, each colored bar shows the racer’s performance, expressed in seconds behind the fastest racer of that section, with the blue bars corresponding to the first section (start to 1.6 km), and so on. In most cases, each competitor has 4 bars, one for each section of the track; where there are only 3 bars, that racer was fastest on the missing section:


Another perspective on the same data is given by the line plot below, which perhaps makes it easier to see the trend, and in which sector of the track each racer performed better or worse. For instance, it’s easy to see that Frida Karlsson had burned up most of her energy by the time she reached the final sector, starting at 8.2 km: from having been only a few seconds off the best sector time in sector 3, she lost almost 17 seconds against the best time of the last sector.


Finally, the same trend graph, focused on the Swedish participants, but this time, instead of showing seconds behind the fastest, we’ll show race positions per course sector. Here we can see that the 10+ seconds Frida lost in the last sector against the fastest racer of that sector correspond to dropping from 5th to 48th place in the race.


To wrap up, a couple of graphs, possibly meaningful only to stats nerds 🙂

This one shows the race time distribution by nation. Interesting to see that the Swedish team on average performed better than the Norwegians 🙂


The next graph shows a linear regression of race time on start delay, meaning that those below the line performed better than their start delay would predict, and those above performed worse.



Posted in Data Analytics, sports, Statistics

Corona outbreak – why isolation of infected people matters

Just hacked together a simple simulation of the spread and effects of a virus, like the Corona virus that is in vogue as we speak.

The model is very simple, with the following assumptions/constants:

  • The population size is constant, in the example, 160,000 people
  • The initial probability for anyone within the population to be infected is 0.001
  • The period during which a person carrying the virus is able to infect others is 2 weeks
  • The probability that a person carrying the virus will infect another person when they meet is 0.05
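For readers who want to experiment, here is a minimal sketch of a model built on the four assumptions above. It is not the code behind the graphs in this post: it is a deterministic, expected-value version that assumes everybody mixes uniformly, and the function and parameter names are made up for the illustration. Its numbers will therefore differ from those quoted in this post.

```python
def simulate_outbreak(population=160_000, p_initial=0.001, infectious_weeks=2,
                      p_transmit=0.05, contacts_per_week=40, weeks=30):
    """Return the expected number of infectious people at the end of each week.

    Assumptions (not taken from the original simulation): uniform mixing,
    no reinfection, and everyone stops being infectious after exactly
    `infectious_weeks` weeks.
    """
    susceptible = population * (1 - p_initial)
    # cohorts[i] = people who have been infectious for i weeks
    cohorts = [population * p_initial] + [0.0] * (infectious_weeks - 1)
    history = []
    for _ in range(weeks):
        infectious_share = sum(cohorts) / population
        # Probability that a susceptible person is infected in at least one
        # of this week's contacts.
        p_infected = 1 - (1 - p_transmit * infectious_share) ** contacts_per_week
        new_cases = susceptible * p_infected
        susceptible -= new_cases
        # Age every cohort by one week; the oldest cohort drops out.
        cohorts = [new_cases] + cohorts[:-1]
        history.append(sum(cohorts))
    return history


many_contacts = simulate_outbreak(contacts_per_week=40)
few_contacts = simulate_outbreak(contacts_per_week=5)
print(f"peak infectious, 40 contacts/week: {max(many_contacts):.0f}")
print(f"peak infectious,  5 contacts/week: {max(few_contacts):.0f}")
```

Changing `contacts_per_week` or `infectious_weeks` in the calls at the bottom is enough to explore the same kind of scenarios as the ones discussed in this post.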

Let’s start by looking at what happens when the average number of contacts per week per person is 40:

We can see that the process is very fast: already during the second week, the number of people infected is almost 140,000, out of a total population of 160,000.


Next, let’s look at mortality: with a mortality rate of 2%, the process results in about 2700 deceased, very early in the process.


Now let’s look at what happens if we limit the number of contacts to 5:


Mortality after limiting the number of contacts drops by two orders of magnitude, from about 2700 to about 10.


Finally, let’s look at what happens if we increase the period during which an infected person can contaminate others from 2 weeks to 5 weeks, while keeping the number of contacts constant at 5. Here, the max number of people infected is about 3000, that is, an order of magnitude greater than when the contamination period was 2 weeks, and the process peaks much later, after 17-18 weeks instead of during week 2.



Now, mortality also increases massively, from the previous 10, to almost 700.


Lesson learned – at least as given by the parameters and constraints of this simulation run: limiting the amount of physical contact between people, and trying to reduce the contamination period are both key tools for containing a virus outbreak.

[Addendum] Below is a graphical representation of the spread of the virus:


Further reading on spread of virus

Posted in Big Government, Epidemics, Probability, Simulation

Marcialonga 2020 – race statistics

Below are some stats from this year’s Marcialonga cross-country ski race.









Posted in Data Analytics, sports

Cold water skills

Here’s What You Need to Know: Cold Water Kills

Posted in development

Calculating distance from Lat & Lon coordinates using Python & Pandas

Given a GPS log file structured as follows (column separators omitted for clarity):

                     Speed      Latitud   Longitud      
2019-01-07 06:15:27          0  59.649582  17.721365  
2019-01-07 06:16:28          0  59.649583  17.721372  
2019-01-07 06:17:28          0  59.649583  17.721370  
2019-01-07 06:18:28          0  59.649583  17.721372  
2019-01-07 06:19:29          0  59.649600  17.721372  
2019-01-07 06:20:06          6  59.649605  17.721278  
2019-01-07 06:20:07          7  59.649625  17.721258  
2019-01-07 06:20:08          6  59.649642  17.721247  
2019-01-07 06:20:10          7  59.649670  17.721228  
2019-01-07 06:20:14          7  59.649760  17.721238  
2019-01-07 06:20:25         13  59.650020  17.721323  
2019-01-07 06:20:28         13  59.650122  17.721323  
2019-01-07 06:20:30         13  59.650187  17.721292  

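Say you want to calculate the distance between consecutive GPS fixes, and the cumulative distance: the Python code below does the trick. It uses the haversine (great-circle) formula, which over the short legs between fixes is practically indistinguishable from the rhumb-line distance.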

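import math

import numpy as np
import pandas as pd

# Read the log; the timestamp column becomes the index.
# NOTE: the original call was cut off after parse_dates=True; index_col=0 is
# an assumed completion, chosen to match the datetime index in the output.
df = pd.read_csv('gps_log.csv', sep=';', header=0, parse_dates=True,
                 index_col=0)

R = 6378.1e3  # Earth radius in metres (roughly the equatorial value)

def pos2d(lat1, lon1, lat2, lon2):
    """Haversine (great-circle) distance in km between two fixes in degrees."""
    la1 = np.radians(lat1)
    la2 = np.radians(lat2)
    dla = np.radians(lat2 - lat1)
    dlo = np.radians(lon2 - lon1)
    a = math.sin(dla / 2) ** 2 + \
        math.cos(la1) * math.cos(la2) * math.sin(dlo / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c / 1000  # metres -> km

# Shift by one row so that each fix is paired with the previous one.
df_shifted = df.shift()

# NOTE: the argument list was also cut off in the original; pairing each row
# with the previous row's coordinates is an assumed completion.
df['Dist'] = np.vectorize(pos2d)(
    df_shifted['Latitud'], df_shifted['Longitud'],
    df['Latitud'], df['Longitud'])

df['Dist'] = df['Dist'].fillna(0)    # the first fix has no predecessor
df['TotDist'] = df['Dist'].cumsum()  # cumulative distance in km

print(df.to_string())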


Output (distances in km):

                         Speed    Latitud   Longitud      Dist     TotDist
2019-01-07 06:15:27          0  59.649582  17.721365  0.000000    0.000000
2019-01-07 06:16:28          0  59.649583  17.721372  0.000409    0.000409
2019-01-07 06:17:28          0  59.649583  17.721370  0.000112    0.000522
2019-01-07 06:18:28          0  59.649583  17.721372  0.000112    0.000634
2019-01-07 06:19:29          0  59.649600  17.721372  0.001892    0.002527
2019-01-07 06:20:06          6  59.649605  17.721278  0.005317    0.007843
2019-01-07 06:20:07          7  59.649625  17.721258  0.002494    0.010338
2019-01-07 06:20:08          6  59.649642  17.721247  0.001991    0.012329
2019-01-07 06:20:10          7  59.649670  17.721228  0.003295    0.015624
2019-01-07 06:20:14          7  59.649760  17.721238  0.010034    0.025658
2019-01-07 06:20:25         13  59.650020  17.721323  0.029335    0.054993
2019-01-07 06:20:28         13  59.650122  17.721323  0.011355    0.066348
2019-01-07 06:20:30         13  59.650187  17.721292  0.007443    0.073791
2019-01-07 06:20:44         15  59.650737  17.720778  0.067708    0.141499
2019-01-07 06:20:45         15  59.650773  17.720725  0.004995    0.146493
2019-01-07 06:20:46         16  59.650790  17.720668  0.003723    0.150216
2019-01-07 06:20:49         15  59.650817  17.720532  0.008219    0.158435
2019-01-07 06:20:51         19  59.650817  17.720373  0.008943    0.167378
2019-01-07 06:20:58         17  59.650777  17.719705  0.037835    0.205213
2019-01-07 06:21:01         16  59.650817  17.719472  0.013841    0.219054
2019-01-07 06:21:05         19  59.650855  17.719117  0.020410    0.239465
2019-01-07 06:21:08         19  59.650858  17.718827  0.016315    0.255779
2019-01-07 06:21:13          9  59.650822  17.718467  0.020641    0.276421
2019-01-07 06:22:13         51  59.648910  17.707070  0.675463    0.951884
2019-01-07 06:22:16         48  59.648868  17.706333  0.041718    0.993602
2019-01-07 06:22:58          7  59.648642  17.702463  0.219134    1.212736
2019-01-07 06:23:00         13  59.648592  17.702390  0.006917    1.219653
2019-01-07 06:23:01         14  59.648550  17.702375  0.004751    1.224404
2019-01-07 06:23:04         30  59.648340  17.702412  0.023469    1.247873
2019-01-07 06:24:04         57  59.640857  17.704502  0.841256    2.089129
2019-01-07 06:24:26         56  59.637767  17.706400  0.360171    2.449300
2019-01-07 06:24:29         59  59.637297  17.707105  0.065658    2.514959
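For larger logs, the same haversine calculation can be written with NumPy array operations instead of np.vectorize, which avoids calling a Python function once per row. The sketch below is an alternative, not the code that produced the output above; the function name haversine_km and the tiny inline DataFrame (the first three fixes from the log) are only there to make the example self-contained.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6378.1  # same radius as R above, expressed in km


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, arrays or Series in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# First three fixes from the log above.
fixes = pd.DataFrame({
    "Latitud": [59.649582, 59.649583, 59.649583],
    "Longitud": [17.721365, 17.721372, 17.721370],
})

prev = fixes.shift()  # previous fix on each row, NaN on the first row
fixes["Dist"] = haversine_km(prev["Latitud"], prev["Longitud"],
                             fixes["Latitud"], fixes["Longitud"]).fillna(0)
fixes["TotDist"] = fixes["Dist"].cumsum()
print(fixes.to_string())
```

The arcsin form used here is mathematically equivalent to the atan2 form in pos2d, so the first legs should agree with the Dist column above to within rounding.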


Posted in Data Analytics, Math, Nautical Information Systems, Pandas, Python

Geolocation with Google APIs & Python – mapping addresses to GPS coordinates

Google does some pretty impressive things: not just all the web-based search stuff, but they also have lots and lots of really cool APIs for programmers to peruse.

In order to access these APIs, you need a personal authorization key, which you can create at the Google Developer Console, and which in most cases will not be free. (Google it! 🙂)

Anyhow, below is a short Python example of how to map addresses, some of them very vague, to GPS coordinates. Considering that Google’s Geolocation services cover the entire world, it’s even more impressive that they can figure out the exact position of an address as vague as, for instance, “Globen” below…!
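As a rough idea of what such a lookup involves, here is a minimal sketch against Google's Geocoding web service. It is not the code from the full post: the helper names are made up, YOUR_API_KEY is a placeholder, and the sample response is a hand-written stand-in with approximate coordinates (not real API output), so the example runs without a key or network access.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL for one address lookup."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def extract_lat_lng(response):
    """Return (lat, lng) of the first match in a decoded JSON response, or None."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    location = response["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


url = build_geocode_url("Globen", "YOUR_API_KEY")
print(url)

# Hand-written stand-in for a decoded response; the coordinates are approximate.
sample_response = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 59.2936, "lng": 18.0836}}}],
}
print(extract_lat_lng(sample_response))
```

To run it for real, fetch the URL (for instance with urllib.request.urlopen) and pass the decoded JSON to extract_lat_lng.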

Continue reading

Posted in API, Data Analytics, Python, Web

Vasaloppet 2018 – race time analysis

An analysis of race times for the ~11000 men and ~2000 women who participated in Vasaloppet 2018. For explanations of the graphs, see earlier posts on Marcialonga or Tour de Ski.

[Btw, the weird looking vertical orange/blue “spike” in the time plot below revealed a bug on the official Vasaloppet Results page – at men’s finishing position 1652… 🙂 ]



NOTE! Log-10 scale (otherwise you wouldn’t see more than the 3-4 first bars in each chart)



Posted in Data Analytics, Numpy, Pandas, Python, sports, Statistics, Web

Tour de Ski Final climb – does age matter for performance ?

In an earlier post, I analyzed data from the Marcialonga ski race. Marcialonga is one of the classic long-distance ski races, where elite skiers and amateurs compete together. In fact, the vast majority of the competitors in these classic long-distance races are amateurs, not elite skiers. One of the findings of that analysis was that “Age Matters”: the older a participant is, the worse his/her performance.

Earlier today, I did a quick & dirty analysis of today’s Tour de Ski race, which, as opposed to Marcialonga, only invites the real elite, i.e. the world’s top cross-country skiers. So, does age matter for this exclusive group of elite skiers too…?

Let’s first look at the age spans for the two races, Marcialonga vs Tour de Ski: for Marcialonga, the ages span from 18 and up, with most skiers being in the 40+ and 50+ groups. For Tour de Ski, the ages span a much narrower range, from 20 to 38.

Let’s do a Linear Regression to see if age matters. First, here’s the graph using a “traditional” (i.e. “Frequentist”) statistical analysis method:


With the traditional, Frequentist statistical method, it indeed looks like age matters: both men and women have a regression line, sloping slightly upwards, which would indicate that there is a dependency between age and performance, in this case, each additional year of age would add about 6-7 seconds to the race time.
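A least-squares fit of this kind takes only a few lines of NumPy. The sketch below uses made-up data (28 skiers aged 20 to 38, with an arbitrary built-in slope and plenty of noise), not the actual race results; it shows how the slope, in seconds per year of age, and the share of variance explained by age can be read off.

```python
import numpy as np

# Made-up data: ages 20-38 and climb times in seconds. The 6.5 s/year slope
# and the 60 s noise level are arbitrary choices for the illustration.
rng = np.random.default_rng(0)
ages = rng.integers(20, 39, size=28)
times = 1900 + 6.5 * (ages - 20) + rng.normal(0, 60, size=28)

# Ordinary least squares: times ~ intercept + slope * age
slope, intercept = np.polyfit(ages, times, deg=1)

# R^2: the share of the variance in race time that the fitted line explains.
predicted = slope * ages + intercept
r_squared = 1 - np.sum((times - predicted) ** 2) / np.sum((times - times.mean()) ** 2)

print(f"slope: {slope:.1f} s per year of age, R^2: {r_squared:.2f}")
```

A single fitted line always produces some slope; a low R² is one hint that the line says little about any individual skier.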

However, just by looking at the scattered dots (red for women, blue for men) it is really hard to see that there in fact exists any strong relationship between age and race time.

So, let’s see what can be done using Bayesian Linear Regression instead, where the uncertainty of any analysis is preserved, and can be illustrated very explicitly:


So here, instead of a single regression line for women and one for men, as in the previous example, we have thousands of them for each gender: orange lines for women, green for men. From these regression lines, we can clearly see that there’s a whole lot of uncertainty about where the “true” regression resides. In fact, there is so much uncertainty that I’m willing to state that for this elite group of skiers, in this particular race, age does *not* impact performance, i.e. there is no causal relationship between age and performance!

The graph also illustrates the underlying uncertainty of the data by the jagged colored (cyan, yellow) areas surrounding the regression lines: both the male as well as the female areas, “credible intervals”, are very wide. Take a look at e.g. this regression, for a comparison.
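To give a feel for where such a bundle of regression lines comes from, here is a simplified stand-in. It is not the model behind the graph above: instead of sampling with a probabilistic programming library, it uses the textbook closed-form posterior for a linear model with Gaussian noise of known standard deviation and a wide Gaussian prior. The data are made up, and the function name and prior settings are assumptions for the illustration.

```python
import numpy as np


def posterior_line_draws(x, y, noise_sd, prior_sd=1000.0, draws=2000, seed=0):
    """Draw (intercept, slope) pairs from the posterior of y = a + b * x.

    Assumes Gaussian noise with known standard deviation `noise_sd` and an
    independent zero-mean Gaussian prior (std `prior_sd`) on both
    coefficients. x and y are centred first, so the intercept is relative
    to the mean age and the mean time.
    """
    xc = np.asarray(x, dtype=float) - np.mean(x)
    yc = np.asarray(y, dtype=float) - np.mean(y)
    X = np.column_stack([np.ones_like(xc), xc])
    prior_precision = np.eye(2) / prior_sd ** 2
    post_cov = np.linalg.inv(prior_precision + X.T @ X / noise_sd ** 2)
    post_mean = post_cov @ (X.T @ yc) / noise_sd ** 2
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(post_mean, post_cov, size=draws)


# Made-up data: 28 skiers aged 20-38, times in seconds, no built-in age effect.
rng = np.random.default_rng(42)
ages = rng.integers(20, 39, size=28)
times = 2000 + rng.normal(0, 90, size=28)

lines = posterior_line_draws(ages, times, noise_sd=90.0)
slope_low, slope_high = np.percentile(lines[:, 1], [2.5, 97.5])
print(f"95% credible interval for the slope: {slope_low:.1f} to {slope_high:.1f} s/year")
```

Each row of `lines` is one plausible regression line; plotting them all gives the kind of fan shown above, and the width of the slope interval is a direct measure of how uncertain the age effect is.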

So, the bottom line is: age does not determine results in elite races.


Posted in Bayes, Data Analytics, Math, Numpy, Pandas, Probability, PYMC, Python, SNA, sports, Statistics

Tour de Ski 2019 Final Climb Analysis

Just a quick analysis of the just-finished race, comparing the climb times for women vs men, the top 28 of both genders. Once again, the results are consistent with my earlier findings on this topic: at elite level, the difference in performance between men and women in endurance sports such as cross-country skiing is about 15-20%.




Posted in Data Analytics, sports, Statistics

Marcialonga Ski 2019 – some Analytics

Now, with the power grid finally – after 62 hours! – back in business, I’m able to continue my stats/analytics exploration of the past Marcialonga ski race.

First, some basic stats about the race:

Total number of participants: 5558, of which 4660 were men, 898 women. All these folks came from 34 different countries.  Below a graphic showing the distribution of nationalities, and we can see that Marcialonga is clearly dominated by the Vikings, at least in man power… 🙂


Further evidence of Viking dominance can be obtained by looking at the distribution of last names of the participants: Lots of Johansson, Andersson etc…! 😉


Furthermore, as the next graphic demonstrates, the event is not exactly dominated by the millennials. The dominant age group is the 50+ folks. Perhaps ski racing is way too hard work for the young & beautiful of today…?


So, with the basic stats over and done with, let’s delve into some race performance matters. First, let’s look at the finishing time distribution, for men as well as women:


First, observe that the histogram clearly shows that there are way more men than women participating. Secondly, it’s interesting that the men have a long right-hand tail, that is, there are many male participants, both in absolute and in relative terms, at the higher end of race times, i.e. relatively speaking, poor performers. The female distribution is much more “normal”, i.e. the tails are more evenly distributed. One way to interpret this is that the participating men are more split: one fairly large part is very fit, and another large part is less well trained, with quite a few being very poorly trained (relatively speaking; myself, I would probably die if I tried to ski that race…), while the women are more “equal” in their performance. The difference in tails between men & women can also be seen in the distance between the respective means and medians: the gap is quite a bit wider for the men.
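The mean-versus-median argument is easy to check numerically. The sketch below uses two made-up samples, one roughly symmetric and one with a long right-hand tail, rather than the actual race times; the distribution parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up finishing times in seconds.
# Roughly symmetric: normal around 6 hours.
symmetric = rng.normal(6 * 3600, 1800, size=900)
# Long right-hand tail: 4 hours plus a log-normally distributed extra time.
right_skewed = 4 * 3600 + rng.lognormal(mean=np.log(3600), sigma=0.6, size=4600)


def mean_median_gap(times):
    """Mean minus median; clearly positive when the right-hand tail is long."""
    return float(np.mean(times) - np.median(times))


print(f"symmetric sample:    {mean_median_gap(symmetric):8.1f} s")
print(f"right-skewed sample: {mean_median_gap(right_skewed):8.1f} s")
```

A handful of very slow finishers pull the mean up while leaving the median almost untouched, which is why the gap between the two works as a quick skewness check.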

Let’s next select the Swedish team – after all this is a Swedish blog – and look at their performance:


Here we have the overall race results, that is, men as well as women ordered by race time into a single results list, ranging from place 1 to place 5558. Again, we can see something interesting here, e.g. that the men are much more evenly (“uniformly” in stats speak) located on the “podium list”, while the women are “tail heavy”, that is, more frequently residing at the lower spots of the results list. Remember that the results list is a joint list, consisting of both genders, so this is not such a big surprise, except for those that believe that “gender is a purely social construct“.

Looking at the male distribution, it’s interesting to notice how uniform it is with a small peak just before place 1000 – my interpretation of that is that many of the participating Swedish men are fairly serious amateurs, putting quite a lot of effort into their skiing and general physical fitness.

Since we are at it, let’s analyze the general performance difference between men vs women:


The above graphic might look a bit busy for some, but rest assured, it’s really not that complicated: on the horizontal x-axis are the race result spots, ranging from 1 to 5558.

Most of the plots are plotted against the left-hand vertical y-axis, which shows race time in seconds. Against that axis, I’ve mapped three plots: male race positions (blue), female race positions (red), and the time difference between them (orange). These plots show some interesting things. Look at the difference in slope between the female and the male plots: it tells us that for women, particularly for the top spots, the differences in race time are very large, while the men are much more even in their performance, all over the spectrum. We can also see that for both genders, the laggards are really, really far behind those in front: the slope at the end of the plots is almost vertical.

Plotted against the right-hand y-axis is the relative (think percent) difference between the first 898 men and women. (The reason for that weird number is that it is the number of female participants, and I compare men’s finishing spots with women’s finishing spots.) Anyway, from that cyan plot we can see that for the top performers, e.g. the first 3-4 men vs the first 3-4 women, the time difference is about 15% (115 on the right-hand vertical scale). That is, the (presumably) elite men and women who finish in the first 3-4 spots in their respective groups have a time difference of about 15%. That number is extremely consistent with what I observed in my analysis last weekend of Tour de Ski, where all participants are real elite skiers. Here, for Marcialonga, a race dominated by serious but amateur skiers, with only a handful of real elite skiers, the difference in race time between men and women grows very rapidly: already at the 100th finishing spot, the difference in finishing time is almost 60%, and at the 800th or so finishing spots, the women take about 2x as long to finish the race.
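The rank-by-rank comparison behind the cyan plot can be sketched as follows. The function name and the handful of times are made up; the point is only to show how the n-th fastest woman is matched with the n-th fastest man and expressed on the same "115 means 15% slower" scale.

```python
import numpy as np


def rank_matched_ratio(men_times, women_times):
    """Women's time as a percentage of the men's time at the same finishing rank.

    Only the first n ranks are compared, where n is the size of the smaller
    group (898 in the race discussed above).
    """
    men = np.sort(np.asarray(men_times, dtype=float))
    women = np.sort(np.asarray(women_times, dtype=float))
    n = min(len(men), len(women))
    return 100.0 * women[:n] / men[:n]


# Made-up finishing times in seconds.
men = [10000.0, 10100.0, 10200.0, 10400.0]
women = [11500.0, 11700.0]

ratio = rank_matched_ratio(men, women)
print(ratio)  # the first value is 115.0, i.e. 15% slower at rank 1
```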

Now, let’s look at a slightly more advanced analysis of the race results: a Bayesian Linear Regression, where we look at if and how age impacts race time:


Here, on the horizontal x-axis, I’ve put the age groups of the race. Age groups are simply an aggregation of the age of each competitor into discrete groups, e.g. 18-30, 30-40 etc. On the vertical axis, we again have race finishing time. The blue vertical “bars” are actually dots, each representing the time of a racer in that age group. There are 5558 such blue dots on the graph.

The objective here is to figure out if and how age impacts race results. Not going into all the nitty-gritty mathematical & computational details of Linear Regression, let me just point out the yellow lines: they are all sloping (slightly) upwards. That tells us that there is a dependency between age and race time, and that dependency is such that the older you are, the worse your performance. Not that surprising, right? At least not for us senior citizens… Furthermore, the graph can also quantify that relationship: again, without going into all the nitty-gritty detail, let me just finish up this post by saying that according to this data and my analysis, each additional year of age slows you down by about 60 seconds on the Marcialonga track! 🙂 So if you did your first race at 20 years of age, you can expect your finishing time at your 40th anniversary race to be about 40 minutes longer than in your first race.

Postscript: a couple of bonus graphics:

First, let’s define “Best Nation” as the nation with the best mean (average) race time:


Next, a Kernel Density Estimation of age distribution:


Next, best race time per nation:





Posted in Bayes, Data Analytics, Numpy, Pandas, Probability, PYMC, Python, sports, Statistics