Calculating distance from Lat & Lon coordinates using Python & Pandas

Given a GPS log file structured as follows (column separators omitted for clarity):

                     Speed      Latitud   Longitud      
2019-01-07 06:15:27          0  59.649582  17.721365  
2019-01-07 06:16:28          0  59.649583  17.721372  
2019-01-07 06:17:28          0  59.649583  17.721370  
2019-01-07 06:18:28          0  59.649583  17.721372  
2019-01-07 06:19:29          0  59.649600  17.721372  
2019-01-07 06:20:06          6  59.649605  17.721278  
2019-01-07 06:20:07          7  59.649625  17.721258  
2019-01-07 06:20:08          6  59.649642  17.721247  
2019-01-07 06:20:10          7  59.649670  17.721228  
2019-01-07 06:20:14          7  59.649760  17.721238  
2019-01-07 06:20:25         13  59.650020  17.721323  
2019-01-07 06:20:28         13  59.650122  17.721323  
2019-01-07 06:20:30         13  59.650187  17.721292  

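Suppose you want to calculate the distance between consecutive GPS fixes, along with the cumulative distance: the Python code below does the trick. Note that it uses the haversine formula, which gives the great-circle distance rather than a true Rhumb Line; over the few meters between consecutive fixes the two are practically identical. Distances are in kilometers.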

import numpy as np
import math
import pandas as pd

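df = pd.read_csv('gps_log.csv',sep=';',header=0,parse_dates=True,
                 index_col=0)  # assumption: the call is cut off in the post as published; index_col=0 makes the timestamps the index

R = 6378.1e3  # Earth's equatorial radius in meters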

def pos2d(lat1,lon1,lat2,lon2):
    # haversine great-circle distance between two fixes, in kilometers
    la1 = np.radians(lat1)
    la2 = np.radians(lat2)
    dla = np.radians(lat2-lat1)
    dlo = np.radians(lon2-lon1)
    # a = squared half-chord length between the two points
    a = math.sin(dla/2) * math.sin(dla/2) + \
        math.cos(la1) * math.cos(la2) * math.sin(dlo/2) * math.sin(dlo/2)
    # c = angular distance in radians
    c = 2 * math.atan2(math.sqrt(a),math.sqrt(1-a))
    return R * c / 1000  # R is in meters, so divide by 1000 for km

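df_shifted = df.shift()  # previous fix on each row; the first row becomes NaN

# assumption: the argument list is cut off in the post as published; these
# arguments follow from pos2d's signature and the column names above
df['Dist'] = np.vectorize(pos2d)(
    df_shifted['Latitud'], df_shifted['Longitud'],
    df['Latitud'], df['Longitud'])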

df['Dist'] = df['Dist'].fillna(0)
df['TotDist'] = df['Dist'].cumsum()

print (df.to_string())


2019-01-07 06:15:27          0  59.649582  17.721365  0.000000    0.000000
2019-01-07 06:16:28          0  59.649583  17.721372  0.000409    0.000409
2019-01-07 06:17:28          0  59.649583  17.721370  0.000112    0.000522
2019-01-07 06:18:28          0  59.649583  17.721372  0.000112    0.000634
2019-01-07 06:19:29          0  59.649600  17.721372  0.001892    0.002527
2019-01-07 06:20:06          6  59.649605  17.721278  0.005317    0.007843
2019-01-07 06:20:07          7  59.649625  17.721258  0.002494    0.010338
2019-01-07 06:20:08          6  59.649642  17.721247  0.001991    0.012329
2019-01-07 06:20:10          7  59.649670  17.721228  0.003295    0.015624
2019-01-07 06:20:14          7  59.649760  17.721238  0.010034    0.025658
2019-01-07 06:20:25         13  59.650020  17.721323  0.029335    0.054993
2019-01-07 06:20:28         13  59.650122  17.721323  0.011355    0.066348
2019-01-07 06:20:30         13  59.650187  17.721292  0.007443    0.073791
2019-01-07 06:20:44         15  59.650737  17.720778  0.067708    0.141499
2019-01-07 06:20:45         15  59.650773  17.720725  0.004995    0.146493
2019-01-07 06:20:46         16  59.650790  17.720668  0.003723    0.150216
2019-01-07 06:20:49         15  59.650817  17.720532  0.008219    0.158435
2019-01-07 06:20:51         19  59.650817  17.720373  0.008943    0.167378
2019-01-07 06:20:58         17  59.650777  17.719705  0.037835    0.205213
2019-01-07 06:21:01         16  59.650817  17.719472  0.013841    0.219054
2019-01-07 06:21:05         19  59.650855  17.719117  0.020410    0.239465
2019-01-07 06:21:08         19  59.650858  17.718827  0.016315    0.255779
2019-01-07 06:21:13          9  59.650822  17.718467  0.020641    0.276421
2019-01-07 06:22:13         51  59.648910  17.707070  0.675463    0.951884
2019-01-07 06:22:16         48  59.648868  17.706333  0.041718    0.993602
2019-01-07 06:22:58          7  59.648642  17.702463  0.219134    1.212736
2019-01-07 06:23:00         13  59.648592  17.702390  0.006917    1.219653
2019-01-07 06:23:01         14  59.648550  17.702375  0.004751    1.224404
2019-01-07 06:23:04         30  59.648340  17.702412  0.023469    1.247873
2019-01-07 06:24:04         57  59.640857  17.704502  0.841256    2.089129
2019-01-07 06:24:26         56  59.637767  17.706400  0.360171    2.449300
2019-01-07 06:24:29         59  59.637297  17.707105  0.065658    2.514959
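
For longer logs, the same calculation can be done without np.vectorize and the shifted DataFrame. Below is a minimal sketch of that idea, using only numpy array operations; the function name track_distances and the constant R_KM are my own choices for illustration, and the five fixes are the first rows of the log above.

```python
import numpy as np

R_KM = 6378.1  # same Earth radius as above, expressed in kilometers

def track_distances(lat, lon):
    """Return (per-step, cumulative) haversine distances in km for a track."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    dla = np.diff(lat)  # latitude change between consecutive fixes
    dlo = np.diff(lon)  # longitude change between consecutive fixes
    a = np.sin(dla / 2) ** 2 + np.cos(lat[:-1]) * np.cos(lat[1:]) * np.sin(dlo / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    dist = np.concatenate(([0.0], R_KM * c))  # the first fix has no predecessor
    return dist, np.cumsum(dist)

lat = [59.649582, 59.649583, 59.649583, 59.649583, 59.649600]
lon = [17.721365, 17.721372, 17.721370, 17.721372, 17.721372]
dist, tot = track_distances(lat, lon)
print(np.round(dist, 6))
print(np.round(tot, 6))
```

The two returned arrays can be assigned straight to the 'Dist' and 'TotDist' columns of a DataFrame holding the same fixes.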


Posted in Data Analytics, Math, Nautical Information Systems, Pandas, Python

Geolocation with Google APIs & Python – mapping addresses to GPS coordinates

Google does some pretty impressive things – not just all the web-based search stuff, but they also have lots and lots of really cool APIs for programmers to peruse.

To access these APIs, you need a personal API key, which you can create in the Google Developer Console; in most cases access will not be free. (Google it! 🙂)

Anyhow, below is a short Python example of how to map addresses – some of them very vague! – to GPS coordinates. Considering that Google’s geocoding services cover the entire world, it’s even more impressive that they can figure out the exact position of an address as vague as, for instance, “Globen” below…!
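
As a rough illustration of the idea, here is a minimal sketch that calls Google's Geocoding web service with the standard library only. The function names geocode and extract_lat_lng are my own, the API key is something you have to supply yourself, and the sample response is hand-written with placeholder coordinates, not real output from the service.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def extract_lat_lng(response):
    """Pull (lat, lng) of the first hit out of a decoded Geocoding API response."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    loc = response["results"][0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

def geocode(address, api_key):
    """Look up one address; needs network access and a valid API key."""
    query = urlencode({"address": address, "key": api_key})
    with urlopen(GEOCODE_URL + "?" + query) as resp:
        return extract_lat_lng(json.load(resp))

# Offline demo: a hand-written response with placeholder coordinates
sample = {"status": "OK",
          "results": [{"geometry": {"location": {"lat": 59.0, "lng": 18.0}}}]}
print(extract_lat_lng(sample))
```

With a real key, geocode("Globen", api_key) would return a (lat, lng) tuple, or None when the service finds nothing.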


Posted in API, Data Analytics, Python, Web

Vasaloppet 2018 – race time analysis

An analysis of race times for the ~11000 men and ~2000 women who participated in the 2018 Vasaloppet. For explanations of the graphs, see earlier posts on Marcialonga or Tour de Ski.

[Btw, the weird looking vertical orange/blue “spike” in the time plot below revealed a bug on the official Vasaloppet Results page – at men’s finishing position 1652… 🙂 ]



NOTE! Log-10 scale (otherwise you wouldn’t see more than the first 3-4 bars in each chart)



Posted in Data Analytics, Numpy, Pandas, Python, sports, Statistics, Web

Tour de Ski Final climb – does age matter for performance ?

In an earlier post, I analyzed data from the Marcialonga ski race. Marcialonga is one of the classic long-distance ski races, where elite skiers and amateurs compete together. In fact, the vast majority of the competitors in these classic long-distance races are amateurs, not elite skiers.  One of the findings of that analysis was that “Age Matters”: the older a participant is, the worse his/her performance.

Earlier today, I did a quick & dirty basic analysis of today’s Tour de Ski race, which, as opposed to Marcialonga, only invites the real elite, i.e. the world’s top cross-country skiers.  So, does age matter for this exclusive group of elite skiers too…?

Let’s first look at the age spans for the two races, Marcialonga vs Tour de Ski: for Marcialonga, the ages span from 18 and up, with most skiers being in the 40+ and 50+ groups. For Tour de Ski, the ages span a much narrower range, from 20 to 38.

Let’s do a Linear Regression to see if age matters. First, here’s the graph using a “traditional” (i.e. “Frequentist”) statistical analysis method:


With the traditional, Frequentist statistical method, it indeed looks like age matters: both men and women have a regression line sloping slightly upwards, which would indicate a dependency between age and performance. In this case, each additional year of age would add about 6-7 seconds to the race time.
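
To make the "traditional" approach concrete: a single regression line like this can be obtained with an ordinary least squares fit. The sketch below uses numpy's polyfit on synthetic data; the numbers are made up for illustration and are not the Tour de Ski results, and fit_line is just a name I picked.

```python
import numpy as np

def fit_line(age, secs):
    """Ordinary least squares fit of secs = intercept + slope * age."""
    slope, intercept = np.polyfit(age, secs, deg=1)
    return slope, intercept

# Synthetic data only: 6.5 seconds per year of age, plus noise
rng = np.random.default_rng(0)
age = rng.uniform(20, 38, size=60)
secs = 1800 + 6.5 * age + rng.normal(0, 60, size=60)

slope, intercept = fit_line(age, secs)
print("slope: %.1f seconds per year of age" % slope)
```

The fit returns exactly one slope and one intercept, with no indication of how much they would move if the race were run again.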

However, just by looking at the scattered dots (red for women, blue for men) it is really hard to see that there in fact exists any strong relationship between age and race time.

So, let’s see what can be done using Bayesian Linear Regression instead, where the uncertainty of any analysis is preserved, and can be illustrated very explicitly:


So here, instead of a single regression line for women and one for men, as in the previous example, we have thousands of them for each gender: orange lines for women, green for men. From these regression lines, we can clearly see that there’s a whole lot of uncertainty about where the “true” regression line lies. In fact, there is so much uncertainty that I’m willing to state that for this elite group of skiers, in this particular race, age does *not* impact performance, i.e. there is no causal relationship between age and performance!

The graph also illustrates the underlying uncertainty of the data through the jagged colored (cyan, yellow) areas surrounding the regression lines: both the male and the female areas, the “credible intervals”, are very wide. Take a look at e.g. this regression, for a comparison.
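
For readers who want to see where "thousands of regression lines" can come from, here is a minimal sketch. It is not the PYMC model behind the graph: to stay dependency-free it assumes Gaussian noise with a known standard deviation and a wide Gaussian prior on both coefficients, in which case the posterior has a closed form and can be sampled directly with numpy. The data are synthetic, and posterior_lines, sigma and tau are names and assumptions of mine.

```python
import numpy as np

def posterior_lines(x, y, sigma, tau=1e3, n_lines=1000, seed=0):
    """Sample (intercept, slope) pairs from the posterior of y = a + b*(x - mean(x)).

    Assumes Gaussian noise with known std `sigma` and a N(0, tau^2) prior
    on both coefficients, so the posterior is Gaussian in closed form.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x - x.mean()])  # centered design matrix
    precision = X.T @ X / sigma**2 + np.eye(2) / tau**2
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / sigma**2
    rng = np.random.default_rng(seed)
    return mean, rng.multivariate_normal(mean, cov, size=n_lines)

# Synthetic data only, not the race results
rng = np.random.default_rng(1)
age = rng.uniform(20, 38, size=30)
secs = 2000 + 6.5 * age + rng.normal(0, 60, size=30)

mean, lines = posterior_lines(age, secs, sigma=60.0)
lo, hi = np.percentile(lines[:, 1], [5.5, 94.5])  # 89% credible interval for the slope
print("slope: %.1f, 89%% interval: %.1f to %.1f" % (mean[1], lo, hi))
```

Each row of lines is one plausible regression line; plotting them all on top of each other gives the kind of "fan" shown in the graph, and a slope interval that comfortably includes zero is what "no visible effect of age" looks like in numbers.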

So, the bottom line is: age does not determine result in elite races.


Posted in Bayes, Data Analytics, Math, Numpy, Pandas, Probability, PYMC, Python, SNA, sports, Statistics

Tour de Ski 2019 Final Climb Analysis

Just a quick analysis of the just-finished race, comparing the climb times for women vs men, the top 28 of both genders. Once again, the results are consistent with my earlier findings on this topic: at elite level, the difference in performance between men and women in endurance sports such as cross-country skiing is about 15-20%.




Posted in Data Analytics, sports, Statistics

Marcialonga Ski 2018 – some Analytics

Now, with the power grid finally – after 62 hours! – back in business, I’m able to continue my stats/analytics exploration of the past Marcialonga ski race.

First, some basic stats about the race:

Total number of participants: 5558, of which 4660 were men and 898 women. All these folks came from 34 different countries.  Below is a graphic showing the distribution of nationalities, and we can see that Marcialonga is clearly dominated by the Vikings, at least in manpower… 🙂


Further evidence of Viking dominance can be obtained by looking at the distribution of last names of the participants: Lots of Johansson, Andersson etc…! 😉


Furthermore, as the next graphic demonstrates, the event is not exactly dominated by the millennials – the dominant age group is the 50+ folks – perhaps ski racing is way too hard work for the young & beautiful of today…?


So, with the basic stats over and done with, let’s delve into some race performance matters. First, let’s look at the finishing time distribution, for men as well as women:


First, observe that the histogram clearly shows that there are way more men than women participating. Secondly, it’s interesting that the men have a long right-hand tail: in both absolute and relative terms, there are many male participants at the higher end of race times, i.e. relatively speaking, poor performers. The female distribution is much more “normal”, i.e. the tails are more evenly distributed.  One way to interpret this is that the participating men are more split: one fairly large part is very fit, and another large part is less well trained, with quite a few very poorly trained (relatively speaking – myself, I would probably die if I tried to ski that race…). The women, on the other hand, are more “equal” in their performance.  The difference in tails between men & women can also be seen in the distance between the respective means and medians: the gap is quite a bit wider for the men.
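
The mean-versus-median argument is easy to check numerically. A minimal sketch with made-up finishing times (not the race data); mean_median_gap is a hypothetical helper name:

```python
import numpy as np

def mean_median_gap(times):
    """Mean minus median; positive values point to a long right-hand tail."""
    times = np.asarray(times, dtype=float)
    return times.mean() - np.median(times)

# Made-up finishing times in seconds, not race data
symmetric = [9000, 10000, 11000, 12000, 13000]
right_tailed = [9000, 10000, 11000, 12000, 30000]
print(mean_median_gap(symmetric), mean_median_gap(right_tailed))
```

A single very slow finisher drags the mean upwards while the median stays put, which is why a wide gap between the two signals a heavy right-hand tail.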

Let’s next select the Swedish team – after all this is a Swedish blog – and look at their performance:


Here we have the overall race results, that is, men as well as women ordered by race time into a single results list, ranging from place 1 to place 5558. Again, we can see something interesting here, e.g. that the men are much more evenly (“uniformly” in stats speak) spread across the results list, while the women are “tail heavy”, that is, more frequently residing at the lower spots of the results list. Remember that the results list is a joint list, consisting of both genders, so this is not such a big surprise – except for those that believe that “gender is a purely social construct“.

Looking at the male distribution, it’s interesting to notice how uniform it is with a small peak just before place 1000 – my interpretation of that is that many of the participating Swedish men are fairly serious amateurs, putting quite a lot of effort into their skiing and general physical fitness.

Since we are at it, let’s analyze the general performance difference between men vs women:


The above graphic might look a bit busy for some, but rest assured, it’s really not that complicated: on the horizontal x-axis are the race result spots, ranging from 1 to 5558.

Most of the plots are plotted against the left-hand vertical y-axis, which shows race time in seconds. Against that axis, I’ve mapped three plots: male race positions (blue), female race positions (red), and the time difference between them (orange). These plots show some interesting things: look at the difference in slope between the female and the male plots. That difference tells us that for women, particularly for the top spots, the differences in race time are very large, while the men are much more even in their performance, all over the spectrum. We can also see that for both genders, the laggards are really really far behind those in front – the slope at the end of the plots is almost vertical.

Plotted against the right-hand y-axis is the relative (think percent) difference between the first 898 men and the 898 women. (The reason for that weird number is that it is the number of female participants, and I compare men’s finishing spots with women’s finishing spots rank by rank.)  Anyway, from that cyan plot we can see that for the top performers, e.g. the first 3-4 men vs the first 3-4 women, the time difference is about 15% (115 on the right-hand vertical scale).  That is, the (presumably) elite men and women who finish in the first 3-4 spots of their respective groups have a time difference of about 15%. That number is extremely consistent with what I observed in my analysis last weekend of Tour de Ski, where all participants are real elite skiers. Here, for Marcialonga, a race dominated by serious but amateur skiers, with only a handful of real elite skiers, the difference in race time between men and women grows very rapidly: already at the 100th place, the difference in finishing time is almost 60%, and at the 800th or so finishing spots, the women take about 2x as long to finish the race.
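
The rank-by-rank comparison described above boils down to a few lines of numpy. A minimal sketch with made-up times; relative_gap is a hypothetical helper, and the scale matches the graph's right-hand axis, where 100 means equal times and 115 means the woman at a given rank took 15% longer than the man at the same rank.

```python
import numpy as np

def relative_gap(male_secs, female_secs):
    """Women's time as a percentage of men's time, rank by rank."""
    men = np.sort(np.asarray(male_secs, dtype=float))
    women = np.sort(np.asarray(female_secs, dtype=float))
    n = min(men.size, women.size)  # compare only the ranks both groups have
    return 100.0 * women[:n] / men[:n]

# Made-up finishing times in seconds, not race data
men = [10000, 10100, 10200, 10300]
women = [11500, 12000, 13000]
print(relative_gap(men, women))
```

Truncating to the smaller group is what produces the "weird number" 898 in the real analysis.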

Now, let’s look at a slightly more advanced analysis of the race results: a Bayesian Linear Regression, where we look at if and how age impacts race time:


Here, on the horizontal x-axis, I’ve put the age groups of the race. Age groups are simply an aggregation of the age of each competitor into discrete groups, e.g. 18-30, 30-40 etc.  On the vertical axis, we again have race finishing time. The blue vertical “bars” are actually dots, each representing the time of a racer in that age group. There are 5558 such blue dots on the graph.
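
Aggregating ages into groups is a one-liner with numpy's digitize. A minimal sketch; the group boundaries and labels here are hypothetical (the race's official age classes may differ), and I use half-open groups so that a 30-year-old lands in exactly one of them.

```python
import numpy as np

edges = np.array([18, 30, 40, 50, 60])  # hypothetical group boundaries
labels = ["18-29", "30-39", "40-49", "50-59", "60+"]

def age_group(ages):
    """Map each age to the label of the group it falls into."""
    idx = np.digitize(ages, edges) - 1
    idx = np.clip(idx, 0, len(labels) - 1)  # ages below 18 fall into the first group
    return [labels[i] for i in idx]

print(age_group([18, 29, 30, 45, 61]))
```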

The objective here is to figure out if and how age impacts race results. Not going into all the nitty gritty mathematical & computational details of Linear Regression, let me just point out the yellow lines: they are all sloping (slightly) upwards. That tells us that there is a dependency between age and race time, and that dependency is such that the older you are, the worse your performance. Not that surprising, right? At least not for us senior citizens…  Furthermore, the graph can also quantify that relationship: again, without going into all the nitty gritty detail, let me just finish up this post by saying that according to this data and my analysis, each additional year of age slows you down by about 60 seconds on the Marcialonga track! 🙂 So if you did your first race at 20 years of age, you can expect your finishing time at your 40th anniversary race, 40 years later, to be about 40 minutes (40 × 60 seconds) longer than in your first race.

Postscript: a couple of bonus graphics:

First, let’s define “Best Nation” as the nation with the best mean (average) race time:


Next, a Kernel Density Estimation of age distribution:


Next, best race time per nation:





Posted in Bayes, Data Analytics, Numpy, Pandas, Probability, PYMC, Python, sports, Statistics

Ugly soup with Python, requests & Beautiful Soup

Web scraping has never been a coveted nor favorite discipline of mine; in fact, for me web scraping is an unfortunate, but sometimes necessary evil. Scraping web-pages, at least for me, is a very unstructured process, basically pure Trial & Error. Perhaps, with more experience, more interest in html and the www in general, it might become more likable, akin to the other types of programming I really enjoy doing… Who knows….

Anyways, I wanted to scrape some data for further analysis down the line. Normally, I’d put Pandas to heavy use for a lot of the data munging tasks, but since my neighborhood is currently experiencing the Mother of all Power outages, after the storm of the century this past Tuesday, I don’t have power to run my computer – thus, this stuff is done with Pythonista (2!) on my old & tired iPad. Btw, Pythonista is a really great Python environment for i*, and I guess some day I should upgrade to Pythonista3…. But I sure miss Pandas; it makes structured data manipulation so very much more convenient than using lists or even numpy arrays….!

So, for this exercise: I wanted to collect the results from a very famous long-distance cross-country ski race, the Marcialonga, which takes place each January in my favorite place on this earth, Val di Fiemme, in the Dolomites. I’d have preferred the organisers of the race to publish the results in a more easy-to-grab format than html, but I couldn’t find anything other than the web page, which furthermore splits the 5600 result entries across 56 pages… and it took a while to figure out how to scrape multiple linked pages.

Below a couple of screen shots of an initial, very basic analysis. I’ll do quite a bit more statistical analysis if and when the power grid resumes operations…

# coding: utf-8
# NOTE: Python 2 syntax (written in Pythonista 2): print is a statement here
import requests
from requests.utils import quote
import re
from bs4 import BeautifulSoup as bs
import numpy as np
from datetime import datetime,timedelta
import matplotlib.pyplot as plt

comp_list = []
# results are in 56 separate pages

for page in range(1,57):
	print page
	url = ''
	payload = {'pagenum':page}
	print url
	print payload
	r = requests.get(url,params=payload)
	print r.status_code
	#print r.text
	c = r.content
	soup = bs(c,'html.parser')
	#print soup.prettify()

	tbl= soup.find('table')
	#print tbl

	main_table = tbl
	#print main_table
	#print main_table.get_text()

	competitors = main_table.find_all(class_='SP')

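	for comp in competitors:
		# loop body missing from the code as posted; presumably it collected each row's text
		comp_list.append(comp.get_text())
comp_list = list(map(lambda x : x.encode('utf-8'),comp_list))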
#print comp_list

print len(comp_list)

def parse_item(i):
	res_pattern = r'[0-9]+'
	char_pattern = r'[A-Z]+'
	num_pattern = r'[0-9]+:[0-9]+:[0-9]+\.[0-9]'
	age_pattern = r'[0-9]+/'
	res =re.match(res_pattern,i).group()
	chars = re.findall(char_pattern,i,flags=re.IGNORECASE)
	nat = chars[-1][1:]
	nums = re.findall(num_pattern,i)	
	age = re.findall(age_pattern,i)
	time_pattern = r'0[0-9]:[0-9]+:[0-9]+\.[0-9]$'
	time = re.findall(time_pattern,nums[0])
	t = datetime.strptime(time[0],'%H:%M:%S.%f')
	#t = datetime.strftime(t,'%H:%M:%S.%f')
	td = (t - datetime(1900,1,1)).total_seconds()
	name = chars[0] + ' ' + chars[1]
	gender = chars[-2]
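	# age_over is not defined in the code as posted; assumed here to be the digits of the first age match
	age_over = age[0][:-1]
	return (res, name,nat,gender,age_over,td)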
results = []
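for comp in comp_list:
	# loop body missing from the code as posted; presumably each entry was parsed and stored
	results.append(parse_item(comp))
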
results = np.array(results,dtype=[('res','i4'),('name','U100'),('nat','U3'),('gender','U1'),('age','i4'),('time','datetime64[us]')])
gender = np.array([results[i][3] for i in range(len(results))])
pos = np.array([results[i][0] for i in range(len(results))])
ages = np.array([results[i][4] for i in range(len(results))]).astype(int)
secs = np.array([results[i][5] for i in range(len(results))])
print ages.size
print secs.size
friend_1 = 18844
friend_2= 24446

male_mask = gender=='M'
male_secs = secs[male_mask]
female_secs = secs[~male_mask]

male_mean = male_secs.mean()
female_mean = female_secs.mean()

bins = 50  # assumption: the bin count is not defined in the code as posted
plt.hist(male_secs,color='b',weights=np.zeros_like(male_secs) + 1. / male_secs.size,alpha=0.5,label='Men',bins=bins)
plt.hist(female_secs,color='r',weights=np.zeros_like(female_secs) + 1. / female_secs.size,alpha=0.5,label='Women',bins=bins)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Relative Frequency')

plt.axvline(male_mean,ls='dashed',color='darkblue',label='Men mean',lw=5)
plt.axvline(female_mean,ls='dashed',color='darkred',label='Women mean',lw=5)
plt.legend(loc='upper right')

def colors(x):
	# blue for men, red for women
	if gender[x] == 'M':
		return 'b'
	else:
		return 'r'
print secs.min(),secs.max()	
colormap = list(map(colors,range(len(gender))))

plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Nr of Skiers')
plt.legend(loc='upper right')


the end 😉

Posted in Data Analytics, development, Python, Web

Gender is not a social construct, but a biological reality, clearly demonstrated in sports

Just a quick demo to debunk the contemporary notion that “Gender is a social construct”.

Data taken from today’s Tour de Ski sprint qualification times, for the top 30 women vs men, where both men & women used the same track and same distance.

Turns out that there is a consistent time difference of about 15% between men and women, and that even the best woman skier is far from the worst male skier in performance: the difference between the best woman and the worst man is about 8% in time.


And if we include all qualification participants, 68 women and 80 men, the graphs look as follows:


And finally, a Bayesian Linear Regression demonstrating the variability within the two datasets, using an 89% credible interval, where it can be visually observed that the variability in the female group is much larger than in the male group:


Posted in Bayes, Culture, Data Analytics, Politik, Society, Statistics

Bayesian Linear Regression with PYMC

Python, Pandas & PYMC example on Bayesian Linear Regression, adapted from Richard McElreath’s “Statistical Rethinking” class, where he uses R as the modeling language instead of PYMC.

The data, in a csv file, describe various attributes such as weight, height, age, gender etc. of an indigenous people, the !Kung. Below is an extract:


Here, I’m using weight as a predictor variable, and the objective is to predict height based on weight.

First, a histogram on population heights:


Next, a histogram of regression priors:


Next, posterior distributions:


Finally, a regression plot including a sampling from the posterior.



Posted in Bayes, Data Analytics, Numpy, Pandas, Probability, PYMC, Python, Statistics

Climate Change – Man made or…? Perhaps not…

Posted in Climate, Complex Systems