Geolocation with Google APIs & Python – mapping addresses to GPS coordinates

Google does some pretty impressive things – not just all the web-based search stuff, but they also have lots and lots of really cool APIs for programmers to peruse.

In order to access these APIs, you need a personal authorization key, which you can create in the Google Developer Console, and which in most cases will not be free (google it! 🙂).

Anyhow, below is a short Python example of how to map addresses – some of them very vague! – to GPS coordinates. Considering that Google’s geolocation service covers the entire world, it’s even more impressive that it can figure out the exact position of an address as vague as, for instance, “Globen” below…!
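A minimal sketch of what such a lookup can look like, assuming Google's Geocoding API web service (its JSON endpoint and the `address`/`key` parameters) is the service being used. The helper names and the `API_KEY` placeholder are my own and are not taken from the post. The network call itself is left out: the sketch only builds the request URL and reads latitude/longitude from a reply of the documented shape, using a canned reply with illustrative coordinates.

```python
import json
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
API_KEY = "YOUR_API_KEY"  # placeholder, replace with your own key


def build_geocode_url(address, key=API_KEY):
    """Return the request URL for one free-text address."""
    return GEOCODE_URL + "?" + urlencode({"address": address, "key": key})


def extract_lat_lng(response_text):
    """Return (lat, lng) of the first hit, or None if nothing was found."""
    data = json.loads(response_text)
    if data.get("status") != "OK" or not data.get("results"):
        return None
    location = data["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Canned reply with illustrative values (not an actual API reply),
# so the sketch runs without a key or a network connection.
sample = json.dumps({
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 59.2936, "lng": 18.0831}}}],
})

print(build_geocode_url("Globen"))
print(extract_lat_lng(sample))
```

In real use, the text returned by fetching the built URL (for instance with requests) would be passed to `extract_lat_lng`.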

Continue reading

Posted in API, Data Analytics, Python, Web

Vasaloppet 2018 – race time analysis

An analysis of race times for the ~11000 men and ~2000 women who participated in the 2018 Vasaloppet. For explanations of the graphs, see the earlier posts on Marcialonga or Tour de Ski.

[Btw, the weird-looking vertical orange/blue “spike” in the time plot below revealed a bug on the official Vasaloppet results page – at men’s finishing position 1652… 🙂 ]

vasaloppet_time_hist

vasaloppet_time_plot

vasaloppet_nat_histo

NOTE! Log-10 scale (otherwise you wouldn’t see more than the first 3-4 bars in each chart)

Posted in Data Analytics, Numpy, Pandas, Python, sports, Statistics, Web

Tour de Ski Final climb – does age matter for performance?

In an earlier post, I analyzed data from the Marcialonga ski race. Marcialonga is one of the classic long distance ski races, where elite skiers and amateurs compete together. In fact, the vast majority of the competitors in these classic long distance ski races are amateurs, not elite skiers.  One of the findings of that analysis was that “Age Matters”: the analysis revealed that the older a participant is, the worse his/her performance.

Earlier today, I did a quick & dirty basic analysis of today’s Tour de Ski race, which, as opposed to Marcialonga, only invites the real elite, i.e. the world’s top cross country skiers.  So, does age matter for this exclusive group of elite skiers too…?

Let’s first look at the age spans for the two races, Marcialonga vs Tour de Ski: for Marcialonga, the ages span from 18 and up, with most skiers being in the 40+ and 50+ groups. For Tour de Ski, the ages span a much narrower range, from 20 to 38.

Let’s do a Linear Regression to see if age matters. First, here’s the graph using a “traditional” (i.e. “Frequentist”) statistical analysis method:

figure_6

With the traditional, Frequentist statistical method, it indeed looks like age matters: both men and women have a regression line sloping slightly upwards, which would indicate a dependency between age and performance. In this case, each additional year of age would add about 6-7 seconds to the race time.
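To make the "seconds per year" reading concrete, here is a minimal sketch of an ordinary least squares fit, where the slope is exactly that number. The function name and the ages and times are made up for illustration; they are not the Tour de Ski data, and the post's own fit may have been computed differently.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up ages (years) and climb times (seconds), for illustration only
ages = [22, 25, 27, 30, 33, 36]
times = [1800.0, 1830.0, 1795.0, 1860.0, 1840.0, 1900.0]

intercept, slope = fit_line(ages, times)
print("seconds added per year of age: %.1f" % slope)
```

A single slope like this says nothing about how uncertain it is, which is the gap the Bayesian version addresses.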

However, just by looking at the scattered dots (red for women, blue for men) it is really hard to see that there in fact exists any strong relationship between age and race time.

So, let’s see what can be done using Bayesian Linear Regression instead, where the uncertainty of any analysis is preserved, and can be illustrated very explicitly:

figure_6

So here, instead of a single regression line for women and one for men, as in the previous example, we have thousands of them for each gender: orange lines for women, green for men. From these regression lines, we can clearly see that there’s a whole lot of uncertainty about where the “true” regression resides. In fact, there is so much uncertainty that I’m willing to state that for this elite group of skiers, in this particular race, age does *not* impact performance, i.e. the data show no clear relationship between age and performance!

The graph also illustrates the underlying uncertainty of the data by the jagged colored (cyan, yellow) areas surrounding the regression lines: both the male and the female areas, the “credible intervals”, are very wide. Take a look at e.g. this regression, for a comparison.
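The "thousands of regression lines" idea can be sketched without PYMC. The block below is not the model from this post: it swaps in the textbook closed-form (conjugate) update, assuming a known noise standard deviation and zero-mean Normal priors on intercept and slope, and then draws many (intercept, slope) pairs from the resulting posterior. All names and numbers are hypothetical, and the data are synthetic with no age effect built in.

```python
import numpy as np


def posterior_lines(x, y, noise_sd, prior_sd, n_draws, seed=0):
    """Draw (intercept, slope) pairs from the posterior of a linear model.

    Assumes y = a + b*x + noise with known noise_sd and independent
    Normal(0, prior_sd) priors on a and b (the conjugate Gaussian case).
    """
    X = np.column_stack([np.ones_like(x), x])
    prior_precision = np.eye(2) / prior_sd ** 2
    post_cov = np.linalg.inv(prior_precision + X.T @ X / noise_sd ** 2)
    post_mean = post_cov @ (X.T @ y) / noise_sd ** 2
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(post_mean, post_cov, size=n_draws)


# Synthetic data: centred age (years) and race time (s), no age effect
rng = np.random.default_rng(1)
age_c = rng.uniform(-9, 9, size=30)
time_s = 1850 + rng.normal(0, 60, size=30)

# Centre the times so that the zero-mean priors are reasonable
draws = posterior_lines(age_c, time_s - time_s.mean(),
                        noise_sd=60.0, prior_sd=100.0, n_draws=2000)
slopes = draws[:, 1]
low, high = np.percentile(slopes, [5.5, 94.5])
print("89%% interval for slope: %.1f to %.1f s/year" % (low, high))
```

Each row of `draws` is one candidate regression line. Plotting all of them on top of the scatter gives the kind of "bundle" shown in the graph, and an interval for the slope that comfortably straddles zero is what "no clear effect of age" looks like numerically.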

So, the bottom line is: age does not determine the result in elite races.

Continue reading

Posted in Bayes, Data Analytics, Math, Numpy, Pandas, Probability, PYMC, Python, SNA, sports, Statistics

Tour de Ski 2019 Final Climb Analysis

Just a quickie analysis of the just-finished race, comparing the climb times for women vs men, the top 28 of both genders. Once again, the results are consistent with my earlier findings on this topic: at elite level, the difference in performance between men and women in endurance sports such as cross country skiing is about 15-20%.

final_climb_dist

final_climb_diff

final_climb_times_men

final_climb_times_women

final_climb_diff_to_winner.jpg

Posted in Data Analytics, sports, Statistics

Marcialonga Ski 2018 – some Analytics

Now, with the power grid finally – after 62 hours! – back in business, I’m able to continue my stats/analytics exploration of the past Marcialonga ski race.

First, some basic stats about the race:

Total number of participants: 5558, of which 4660 were men and 898 women. All these folks came from 34 different countries.  Below is a graphic showing the distribution of nationalities, and we can see that Marcialonga is clearly dominated by the Vikings, at least in manpower… 🙂

marcialonga_nation_counts

Further evidence of Viking dominance can be obtained by looking at the distribution of last names of the participants: Lots of Johansson, Andersson etc…! 😉

marcialonga_names_dist

Furthermore, as the next graphic demonstrates, the event is not exactly dominated by the millennials – the dominant age group is the 50+ folks – perhaps ski racing is way too hard work for the young & beautiful of today…?

figure_4

So, with the basic stats over and done with, let’s delve into some race performance matters. First, let’s look at the finishing time distribution, for men as well as women:

figure_5

First, observe that the histogram clearly shows that there are way more men than women participating. Secondly, it’s interesting that the men have a long right-hand tail, that is, there are, both in absolute and in relative terms, many male participants at the higher end of race times, i.e., relatively speaking, poor performers. The female distribution is much more “normal”, i.e. the tails are more evenly distributed.  One way to interpret this is that the participating men are more split: one fairly large part are very fit, and another large part are less well trained, quite a few of them very poorly trained (relatively speaking – myself, I would probably die if I tried to ski that race…), while the women are more “equal” in their performance.  The difference in tails between men & women can also be seen in the distance between the respective means and medians: the gap is quite a bit wider for the men.
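The mean-versus-median remark can be illustrated with a tiny sketch: a long right-hand tail pulls the mean upwards while the median barely moves. The function name and the times are made up for illustration, not taken from the race data.

```python
from statistics import mean, median


def mean_median_gap(times):
    """Return mean minus median; positive values hint at a long right tail."""
    return mean(times) - median(times)


# Made-up race times in seconds, for illustration only
symmetric = [14000, 15000, 16000, 17000, 18000]
right_tailed = [14000, 15000, 16000, 17000, 30000]

print(mean_median_gap(symmetric))     # no gap for the symmetric sample
print(mean_median_gap(right_tailed))  # one very slow skier opens a gap
```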

Let’s next select the Swedish team – after all this is a Swedish blog – and look at their performance:

marcia_longa_swedish

Here we have the overall race results, that is, men as well as women ordered by race time into a single results list, ranging from place 1 to place 5558. Again, we can see something interesting here, e.g. that the men are much more evenly (“uniformly” in stats speak) spread over the “podium list”, while the women are “tail heavy”, that is, more frequently residing at the lower spots of the results list. Remember that the results list is a joint list, consisting of both genders, so this is not such a big surprise – except for those that believe that “gender is a purely social construct”.

Looking at the male distribution, it’s interesting to notice how uniform it is with a small peak just before place 1000 – my interpretation of that is that many of the participating Swedish men are fairly serious amateurs, putting quite a lot of effort into their skiing and general physical fitness.

Since we are at it, let’s analyze the general performance difference between men vs women:

marcialonga_time_diff

The above graphic might look a bit busy for some, but rest assured, it’s really not that complicated: on the horizontal x-axis are the race result spots, ranging from 1 to 5558.

Most of the plots are plotted against the left hand vertical y-axis, which shows race time in seconds. Against that axis, I’ve mapped three plots: male race positions (blue), female race positions (red), and the time difference between them (orange). These plots show some interesting things: look at the difference in slope between the female and the male plots. That difference tells us that for women, particularly for the top spots, the differences in race time are very large, while the men are much more even in their performance, all over the spectrum. We can also see that for both genders, the laggards are really, really far behind those in front – the slope at the end of the plots is almost vertical.

Plotted against the right hand side y-axis is the relative (think percent) difference between the first 898 men vs women (the reason for that weird number is that it is the number of female participants, and I compare men’s finishing spots with women’s finishing spots position by position).  Anyways, from that cyan plot we can see that for the top performers, e.g. the first 3-4 men vs the first 3-4 women, the time difference is about 15% (115 on the right hand vertical scale).  That is, the (presumably) elite men and women who finish in the first 3-4 spots in their respective group have a time difference of about 15%. That number is extremely consistent with what I observed in my analysis last weekend of Tour de Ski, where all participants are real elite skiers. Here, for Marcialonga, a race dominated by serious but amateur skiers, with only a handful of real elite skiers, the difference in race time between men and women grows very rapidly: already at the 100th podium place, the difference in finishing time is almost 60%, and at the 800th or so finishing spot, the women take about 2x as long to finish the race.
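One way to compute such a position-by-position comparison is sketched below, assuming the cyan curve is the k-th fastest woman's time expressed as a percentage of the k-th fastest man's time. That reading, the function name and the numbers are my assumptions, not the post's actual code or data.

```python
def relative_gap_by_position(male_secs, female_secs):
    """Compare the k-th fastest woman with the k-th fastest man.

    Returns one value per position, where 115.0 means the woman's time
    is 15% longer than the man's at the same within-gender position.
    """
    men = sorted(male_secs)
    women = sorted(female_secs)
    n = min(len(men), len(women))  # only as many positions as the smaller group
    return [100.0 * women[k] / men[k] for k in range(n)]


# Made-up finishing times in seconds, for illustration only
male_secs = [10000.0, 10400.0, 11000.0, 12500.0, 15000.0]
female_secs = [11500.0, 12480.0, 14300.0]

gaps = relative_gap_by_position(male_secs, female_secs)
print(["%.0f" % g for g in gaps])
```

With 4660 men and 898 women, the same truncation to the smaller group is what produces the "first 898" limit mentioned above.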

Now, let’s look at a slightly more advanced analysis of the race results: a Bayesian Linear Regression, where we look at if and how age impacts race time:

marcialonga_linear_regression

Here, on the horizontal x-axis, I’ve put the age groups of the race. Age groups are simply an aggregation of the age of each competitor into discrete groups, e.g. 18-30, 30-40 etc.  On the vertical axis, we again have race finishing time. The blue vertical “bars” are actually dots, each representing the time of a racer of that age. There are 5558 such blue dots on the graph.

The objective here is to figure out if and how age impacts race results. Not going into all the nitty gritty mathematical & computational details of Linear Regression, let me just point out the yellow lines: they are all sloping (slightly) upwards. That tells us that there is a dependency between age and race time, and that dependency is such that the older you are, the worse your performance. Not that surprising, right? At least not for us senior citizens…  Furthermore, the graph can also quantify that relationship: again, without going into all the nitty gritty detail, let me just finish up this post by saying that according to this data and my analysis, each additional year of age slows you down by about 60 seconds on the Marcialonga track! 🙂 So if you did your first race at 20 years of age, you can expect your finishing time at your 40th anniversary race to be about 40 minutes longer than in your first race.

Postscript: a couple of bonus graphics:

First, let’s define “Best Nation” as the nation with the best mean (average) race time:

marcialonga_best_nation

Next, a Kernel Density Estimation of age distribution:

marcia_joint.jpg

Next, best race time per nation:

marcialonga_best_of_nation

marcialonga_race_time_dist

Posted in Bayes, Data Analytics, Numpy, Pandas, Probability, PYMC, Python, sports, Statistics

Ugly soup with Python, requests & Beautiful Soup

Web scraping has never been a coveted nor favorite discipline of mine; in fact, for me web scraping is an unfortunate, but sometimes necessary evil. Scraping web-pages, at least for me, is a very unstructured process, basically pure Trial & Error. Perhaps, with more experience, more interest in html and the www in general, it might become more likable, akin to the other types of programming I really enjoy doing… Who knows….

Anyways, I wanted to scrape some data for further analysis down the line. Normally, I’d put Pandas to heavy use for a lot of the data munging tasks, but since my neighborhood is currently experiencing the Mother of all Power outages, after the storm of the century this past Tuesday, I don’t have power to run my computer – thus, this stuff is done with Pythonista (2!) on my old & tired iPad. Btw, Pythonista is a really great Python environment for i*, and I guess some day I should upgrade to Pythonista3…. But I sure miss Pandas; it makes structured data manipulation so very much more convenient than using lists or even numpy arrays….!

So, for this exercise: I wanted to collect the results from a very famous long distance cross country ski race, the Marcialonga, which takes place each January in my favorite place on this earth, Val di Fiemme, in the Dolomites. I’d have preferred for the organisers of the race to publish the results in a more easy-to-grab format than html, but I couldn’t find anything else but the web page, which furthermore splits the 5600 result entries over 56 pages… and it took a while to figure out how to scrape multiple linked pages.

Below a couple of screen shots of an initial, very basic analysis. I’ll do quite a bit more statistical analysis if and when the power grid resumes operations…

# coding: utf-8
import requests
from requests.utils import quote
import re
from bs4 import BeautifulSoup as bs
import numpy as np
from datetime import datetime,timedelta
import matplotlib.pyplot as plt

# NOTE: Python 2 code (print statements), written for Pythonista 2
comp_list = []
# the results are spread over 56 separate pages

for page in range(1,57):
	print page
	
	url = 'https://www.marcialonga.it/marcialonga_ski/EN_results.php'
	payload = {'pagenum':page}
	print url
	print payload
	
	r = requests.get(url,params=payload)
	print r.status_code
	#print r.text
	c = r.content
	soup = bs(c,'html.parser')
	#print soup.prettify()

	tbl= soup.find('table')
	#print tbl

	main_table = tbl
	#print main_table
	#print main_table.get_text()

	competitors = main_table.find_all(class_='SP')

	for comp in competitors:
		comp_list.append(comp.get_text())
		
comp_list = list(map(lambda x : x.encode('utf-8'),comp_list))
#print comp_list

print len(comp_list)

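def parse_item(i):
	# Split the text of one scraped result row into its fields.
	
	res_pattern = r'[0-9]+'		# leading digits: finishing position
	char_pattern = r'[A-Z]+'	# runs of letters (matched case-insensitively below)
	num_pattern = r'[0-9]+:[0-9]+:[0-9]+\.[0-9]'	# H:M:S.f time stamps
	age_pattern = r'[0-9]+/'	# digits followed by '/', used as the age value
	res =re.match(res_pattern,i).group()
	chars = re.findall(char_pattern,i,flags=re.IGNORECASE)
	# nationality: last letter run, with its first character dropped
	nat = chars[-1][1:]
	nums = re.findall(num_pattern,i)	
	age = re.findall(age_pattern,i)
	age_over=age[0][:-1]	# strip the trailing '/'
	
	# the first time stamp in the row is taken as the race time
	time_pattern = r'0[0-9]:[0-9]+:[0-9]+\.[0-9]$'
	time = re.findall(time_pattern,nums[0])
	t = datetime.strptime(time[0],'%H:%M:%S.%f')
	#t = datetime.strftime(t,'%H:%M:%S.%f')
	# strptime defaults the date to 1900-01-01, so this yields the race time in seconds
	td = (t - datetime(1900,1,1)).total_seconds()
	
	# name: first two letter runs; gender: second-to-last letter run
	name = chars[0] + ' ' + chars[1]
	gender = chars[-2]
	return (res, name,nat,gender,age_over,td)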
	
results = []
for comp in comp_list:
	results.append(parse_item(comp))

'''
results = np.array(results,dtype=[('res','i4'),('name','U100'),('nat','U3'),('gender','U1'),('age','i4'),('time','datetime64[us]')])
'''
gender = np.array([results[i][3] for i in range(len(results))])
pos = np.array([results[i][0] for i in range(len(results))])
ages = np.array([results[i][4] for i in range(len(results))]).astype(int)
secs = np.array([results[i][5] for i in range(len(results))])
print ages.size
print secs.size
friend_1 = 18844
friend_2= 24446

male_mask = gender=='M'
male_secs = secs[male_mask]
female_secs = secs[~male_mask]

male_mean = male_secs.mean()
female_mean = female_secs.mean()


bins=range(10000,38000,1000)
plt.subplot(211)
plt.hist(male_secs,color='b',weights=np.zeros_like(male_secs) + 1. / male_secs.size,alpha=0.5,label='Men',bins=bins)
plt.hist(female_secs,color='r',weights=np.zeros_like(female_secs) + 1. / female_secs.size,alpha=0.5,label='Women',bins=bins)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Relative Frequency')

plt.axvline(friend_1,ls='dashed',color='cyan',label='Friend_1',lw=5)
plt.axvline(friend_2,ls='dashed',color='magenta',label='Friend_2',lw=5)
plt.axvline(male_mean,ls='dashed',color='darkblue',label='Men mean',lw=5)
plt.axvline(female_mean,ls='dashed',color='darkred',label='Women mean',lw=5)
plt.legend(loc='upper right')

def colors(x):
	if gender[x] == 'M':
		return 'b'
	else:
		return 'r'
print secs.min(),secs.max()	
colormap = list(map(colors,range(len(gender))))

plt.subplot(212)
plt.hist(male_secs,color='b',alpha=0.5,label='Men',bins=bins)
plt.hist(female_secs,color='r',alpha=0.5,label='Women',bins=bins)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Nr of Skiers')
plt.legend(loc='upper right')


plt.tight_layout()
plt.show()
	

the end 😉

Posted in Data Analytics, development, Python, Web

Gender is not a social construct, but a biological reality, clearly demonstrated in sports

Just a quick demo to debunk the contemporary notion that “Gender is a social construct”.

Data taken from today’s Tour de Ski sprint qualification times for the top 30 women vs men, where both men & women used the same track and the same distance.

Turns out that there is a consistent time difference of about 15% between men and women, and that even the best woman skier is far from the worst male skier in performance: the difference between the best woman and the worst man is about 8% in time.

Tour_de_ski

tour_de_ski_scatter

And if we include all qualification participants, 68 women and 80 men, the graphs look as follows:

Tour_de_ski

tour_de_ski_scatter

And finally, a Bayesian Linear Regression demonstrating the variability within the two datasets, using an 89% credible interval, where it can be visually observed that the variability in the female group is much larger than in the male group:

tour_de_ski_scatter
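For readers wondering how an 89% interval is obtained from posterior samples: a common choice is the central (equal-tailed) interval, which cuts 5.5% of the draws off each end. The sketch below assumes that choice and uses made-up stand-in draws, not the fitted model behind the graph; the function name is hypothetical.

```python
import random


def percentile_interval(samples, mass=0.89):
    """Central interval holding roughly `mass` of the samples (equal tails)."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    lo_idx = int(round(tail * (len(ordered) - 1)))
    hi_idx = int(round((1.0 - tail) * (len(ordered) - 1)))
    return ordered[lo_idx], ordered[hi_idx]


# Stand-in "posterior" draws: made-up numbers, not the Tour de Ski fit
random.seed(42)
narrow = [random.gauss(0.0, 1.0) for _ in range(10000)]
wide = [random.gauss(0.0, 3.0) for _ in range(10000)]

print(percentile_interval(narrow))
print(percentile_interval(wide))
```

A group with more variability shows up as a visibly wider interval, which is the comparison made between the female and male groups above.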

Posted in Bayes, Culture, Data Analytics, Politik, Society, Statistics

Bayesian Linear Regression with PYMC

Python, Pandas & PYMC example of Bayesian Linear Regression, adapted from Richard McElreath’s “Statistical Rethinking” class, where he uses R as the modeling language instead of PYMC.

The data in a csv-file describe various attributes such as weight, height, age, gender etc. of an indigenous people, the !Kung. Below is an extract:

"height";"weight";"age";"male"
151.765;47.8256065;63;1
139.7;36.4858065;63;0
136.525;31.864838;65;0
156.845;53.0419145;41;1
145.415;41.276872;51;0
163.83;62.992589;35;1
149.225;38.2434755;32;0
168.91;55.4799715;27;1

Here, I’m using weight as a predictor variable, and the objective is to predict height based on weight.
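As a small illustration of the data preparation step, the sketch below parses the semicolon-separated extract shown above with the standard library and separates predictor (weight) from outcome (height). Centring the predictor is a common step before fitting a regression and is my assumption here, not necessarily what the post's code does. The names are hypothetical.

```python
import csv
import io

# The extract shown above, as a stand-in for the full csv-file
EXTRACT = '''"height";"weight";"age";"male"
151.765;47.8256065;63;1
139.7;36.4858065;63;0
136.525;31.864838;65;0
156.845;53.0419145;41;1
145.415;41.276872;51;0
163.83;62.992589;35;1
149.225;38.2434755;32;0
168.91;55.4799715;27;1
'''


def load_rows(text):
    """Parse semicolon-separated rows into dicts of floats."""
    reader = csv.DictReader(io.StringIO(text), delimiter=';')
    return [{key: float(value) for key, value in row.items()} for row in reader]


rows = load_rows(EXTRACT)
weights = [r["weight"] for r in rows]  # predictor
heights = [r["height"] for r in rows]  # outcome

# Centre the predictor, so the intercept becomes "height at mean weight"
mean_weight = sum(weights) / len(weights)
centred_weights = [w - mean_weight for w in weights]

print(len(rows), "rows, mean height %.1f cm" % (sum(heights) / len(heights)))
```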

First, a histogram on population heights:

pymc_basic_linear_0

Next, a histogram of regression priors:

pymc_linear_basic_1

Next, posterior distributions:

pymc_basic_linear_2

Finally, a regression plot including a sampling from the posterior.

pymc_basic_linear_3

Continue reading

Posted in Bayes, Data Analytics, Numpy, Pandas, Probability, PYMC, Python, Statistics

Climate Change – Man made or…? Perhaps not…

Posted in Climate, Complex Systems

Python & Pandas to map gps coordinates to known locations

Assume you have a gps log file with time and position info (Lat, Lon in columns 9 and 10, counting from zero), like:

2018.12.12 00:41:20;0;0;0;0;0;1;0;25.8;59.348978;17.969643;0;0;
2018.12.12 01:41:21;0;0;0;0;0;1;0;25.7;59.348962;17.969627;0;0;
2018.12.12 02:41:21;0;0;0;0;0;1;0;25.7;59.349;17.969688;0;0;
2018.12.12 03:41:21;0;0;0;0;0;1;0;25.7;59.349;17.96966;0;0;
2018.12.12 04:41:22;0;0;0;0;0;1;0;25.6;59.349007;17.969618;0;0;
2018.12.12 04:48:50;0;0;0;0;1;1;1;25.2;59.349007;17.969635;0;0;
2018.12.12 04:49:51;0;0.001;0;0;1;1;1;28.3;59.349;17.969642;0;0;

Assume further that you’d like to map each of these positions to a set of known locations, e.g.

       Latitud  Longitud
Loc_1   59.650    17.721
Loc_2   59.649    17.702
Loc_3   59.621    17.772
Loc_4   59.628    17.775
Loc_5   59.627    17.860
Loc_6   59.650    17.930
Loc_7   59.349    17.970

So, we have our 7 known locations, and we want to know if and when we have been in close proximity to any of them, i.e. we’d like to obtain info as below:

                     Loc_1  Loc_2 Loc_3 Loc_4  Loc_5  Loc_6  Loc_7
Time
2018.12.12 00:41:20                                          Loc_7
2018.12.12 05:59:01  Loc_1
2018.12.12 06:58:19                            Loc_5
2018.12.12 09:36:41  Loc_1
2018.12.12 14:50:32                                   Loc_6
2018.12.12 15:01:48                            Loc_5
2018.12.12 16:32:16  Loc_1
2018.12.12 17:37:00  Loc_1
2018.12.12 18:43:00  Loc_1
2018.12.13 06:02:20  Loc_1
2018.12.13 07:18:32  Loc_1
2018.12.13 07:50:17                            Loc_5
2018.12.13 08:24:53         Loc_2
2018.12.13 09:26:11  Loc_1
2018.12.13 10:33:06  Loc_1
2018.12.13 14:09:08  Loc_1
2018.12.13 15:57:42                                   Loc_6
2018.12.13 17:40:26  Loc_1
2018.12.13 19:29:03  Loc_1
2018.12.14 04:23:28                                          Loc_7
2018.12.14 05:55:37  Loc_1
2018.12.14 06:48:52                                   Loc_6
2018.12.14 06:59:52                            Loc_5
2018.12.14 08:29:56  Loc_1
2018.12.14 09:35:01  Loc_1
2018.12.14 10:32:58  Loc_1
2018.12.14 14:16:56  Loc_1
2018.12.14 14:44:02                            Loc_5
2018.12.14 16:58:59                                   Loc_6
2018.12.14 17:35:18  Loc_1

The Python/Pandas hack below does the trick.
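The post's own Pandas code is behind the link. Independently of it, the core idea can be sketched with the standard library alone: compute the great-circle (haversine) distance from a logged position to every known location and keep the nearest one if it is close enough. The 0.5 km threshold, the function names and the use of haversine are my assumptions, not taken from the post.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# The known locations from the table above: name -> (lat, lon)
KNOWN = {
    "Loc_1": (59.650, 17.721), "Loc_2": (59.649, 17.702),
    "Loc_3": (59.621, 17.772), "Loc_4": (59.628, 17.775),
    "Loc_5": (59.627, 17.860), "Loc_6": (59.650, 17.930),
    "Loc_7": (59.349, 17.970),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_known(lat, lon, max_km=0.5):
    """Return the name of the closest known location within max_km, else None."""
    name, dist = min(
        ((n, haversine_km(lat, lon, la, lo)) for n, (la, lo) in KNOWN.items()),
        key=lambda pair: pair[1],
    )
    return name if dist <= max_km else None


def parse_log_line(line):
    """Return (timestamp, lat, lon) from one ';'-separated log line."""
    fields = line.split(";")
    return fields[0], float(fields[9]), float(fields[10])


line = "2018.12.12 00:41:20;0;0;0;0;0;1;0;25.8;59.348978;17.969643;0;0;"
when, lat, lon = parse_log_line(line)
print(when, nearest_known(lat, lon))
```

For a full log, the same lookup would be applied row by row, keeping only the rows where a location is returned.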

Continue reading

Posted in Maritime Technology, Nautical Information Systems, Numpy, Pandas, Python