Ugly soup with Python, requests & Beautiful Soup

Web scraping has never been a favorite discipline of mine; in fact, for me it is an unfortunate but sometimes necessary evil. Scraping web pages, at least for me, is a very unstructured process, basically pure trial and error. Perhaps, with more experience and more interest in HTML and the web in general, it might become more likable, akin to the other types of programming I really enjoy doing… Who knows…

Anyway, I wanted to scrape some data for further analysis down the line. Normally I’d put Pandas to heavy use for the data munging, but my neighborhood is currently experiencing the mother of all power outages after the storm of the century last Tuesday, so I don’t have power to run my computer. Thus, this stuff is done with Pythonista (2!) on my old and tired iPad. Btw, Pythonista is a really great Python environment for iOS devices, and I guess some day I should upgrade to Pythonista 3. But I sure miss Pandas; it makes structured data manipulation so much more convenient than using lists or even numpy arrays!

So, for this exercise: I wanted to collect the results from a very famous long-distance cross-country ski race, the Marcialonga, which takes place each January in my favorite place on this earth, Val di Fiemme, in the Dolomites. I’d have preferred the organisers to publish the results in a format easier to grab than HTML, but I couldn’t find anything other than the web page, which furthermore splits the 5600 result entries across 56 pages. It took me a while to figure out how to scrape multiple linked pages; the trick turned out to be passing the page number as the pagenum query parameter.
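
As a minimal sketch of the paging idea: each results page is assumed to be reachable by putting the page number into a pagenum query parameter, which is simply what worked for me and not anything the site documents. The helper name page_urls is made up for illustration. The sketch only builds the 56 URLs with the standard library; it does not fetch anything.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, e.g. Pythonista 2

BASE_URL = 'https://www.marcialonga.it/marcialonga_ski/EN_results.php'


def page_urls(base_url, first_page, last_page):
    """Return one URL per results page (hypothetical helper)."""
    return ['%s?%s' % (base_url, urlencode({'pagenum': page}))
            for page in range(first_page, last_page + 1)]


urls = page_urls(BASE_URL, 1, 56)
print(len(urls))
print(urls[0])
print(urls[-1])
```

With requests there is no need to build the query string by hand: requests.get(url, params={'pagenum': page}) appends it to the URL for you.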

Below are a couple of screenshots of an initial, very basic analysis. I’ll do quite a bit more statistical analysis if and when the power grid resumes operations…

# coding: utf-8
import re
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import requests
from bs4 import BeautifulSoup as bs

comp_list = []
# results are in 56 separate pages

for page in range(1,57):
	print page
	
	url = 'https://www.marcialonga.it/marcialonga_ski/EN_results.php'
	payload = {'pagenum':page}
	print url
	print payload
	
	r = requests.get(url,params=payload)
	print r.status_code
	#print r.text
	c = r.content
	soup = bs(c,'html.parser')
	#print soup.prettify()

	# the results live in the first <table> on the page
	main_table = soup.find('table')

	competitors = main_table.find_all(class_='SP')

	for comp in competitors:
		comp_list.append(comp.get_text())
		
# Python 2: encode the unicode strings to UTF-8 byte strings
comp_list = [text.encode('utf-8') for text in comp_list]
#print comp_list

print len(comp_list)

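def parse_item(i):
	"""Parse one result row's text into (position, name, nationality, gender, age, seconds)."""
	res_pattern = r'[0-9]+'
	char_pattern = r'[A-Z]+'
	num_pattern = r'[0-9]+:[0-9]+:[0-9]+\.[0-9]'
	age_pattern = r'[0-9]+/'

	# finishing position: the digits at the very start of the row
	res = re.match(res_pattern, i).group()
	# all runs of letters (case-insensitive): name parts, gender, nationality
	chars = re.findall(char_pattern, i, flags=re.IGNORECASE)
	# nationality: the last run of letters, minus its first character
	nat = chars[-1][1:]
	nums = re.findall(num_pattern, i)
	# age class: the first number followed by a slash, slash stripped
	age = re.findall(age_pattern, i)
	age_over = age[0][:-1]

	# finishing time; the pattern expects an hour field starting with 0 (under 10 hours)
	time_pattern = r'0[0-9]:[0-9]+:[0-9]+\.[0-9]$'
	time = re.findall(time_pattern, nums[0])
	t = datetime.strptime(time[0], '%H:%M:%S.%f')
	# strptime defaults the date to 1900-01-01, so this leaves the time in seconds
	td = (t - datetime(1900, 1, 1)).total_seconds()

	# name: only the first two runs of letters are used
	name = chars[0] + ' ' + chars[1]
	gender = chars[-2]
	return (res, name, nat, gender, age_over, td)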
	
results = []
for comp in comp_list:
	results.append(parse_item(comp))

'''
results = np.array(results,dtype=[('res','i4'),('name','U100'),('nat','U3'),('gender','U1'),('age','i4'),('time','datetime64[us]')])
'''
gender = np.array([row[3] for row in results])
pos = np.array([row[0] for row in results])
ages = np.array([row[4] for row in results]).astype(int)
secs = np.array([row[5] for row in results])
print ages.size
print secs.size
# two friends' times in seconds, drawn as vertical lines in the first plot
friend_1 = 18844
friend_2 = 24446

male_mask = gender=='M'
male_secs = secs[male_mask]
female_secs = secs[~male_mask]

male_mean = male_secs.mean()
female_mean = female_secs.mean()


bins=range(10000,38000,1000)
plt.subplot(211)
plt.hist(male_secs,color='b',weights=np.zeros_like(male_secs) + 1. / male_secs.size,alpha=0.5,label='Men',bins=bins)
plt.hist(female_secs,color='r',weights=np.zeros_like(female_secs) + 1. / female_secs.size,alpha=0.5,label='Women',bins=bins)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Relative Frequency')

plt.axvline(friend_1,ls='dashed',color='cyan',label='Friend_1',lw=5)
plt.axvline(friend_2,ls='dashed',color='magenta',label='Friend_2',lw=5)
plt.axvline(male_mean,ls='dashed',color='darkblue',label='Men mean',lw=5)
plt.axvline(female_mean,ls='dashed',color='darkred',label='Women mean',lw=5)
plt.legend(loc='upper right')

def colors(x):
	if gender[x] == 'M':
		return 'b'
	else:
		return 'r'
print secs.min(),secs.max()	
colormap = list(map(colors,range(len(gender))))

plt.subplot(212)
plt.hist(male_secs,color='b',alpha=0.5,label='Men',bins=bins)
plt.hist(female_secs,color='r',alpha=0.5,label='Women',bins=bins)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Nr of Skiers')
plt.legend(loc='upper right')


plt.tight_layout()
plt.show()
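
The time handling in parse_item can be tried in isolation. This is a minimal sketch, assuming finishing times look like HH:MM:SS.f; the sample string is made up for the example and not taken from the results, and the helper name to_seconds is hypothetical. strptime fills in 1 January 1900 for the missing date, so subtracting that date leaves just the elapsed time.

```python
from datetime import datetime


def to_seconds(time_string):
    """Convert 'HH:MM:SS.f' to seconds as a float (hypothetical helper)."""
    t = datetime.strptime(time_string, '%H:%M:%S.%f')
    # strptime defaults the date to 1900-01-01; subtracting it gives a timedelta
    return (t - datetime(1900, 1, 1)).total_seconds()


sample = '05:14:04.0'  # made-up time string, chosen to equal 18844 seconds
print(to_seconds(sample))
```

For scale: 18844 seconds, the value drawn as the Friend_1 line in the plots, is 5 h 14 min 4 s.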
	
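
A note on the weights argument in the first subplot: giving every skier a weight of 1/N makes the bar heights add up to 1 within each group (for times that fall inside the bins), so men and women can be compared even though the group sizes differ. Below is a dependency-free sketch of the same arithmetic, with made-up times and a hypothetical helper name, relative_frequencies.

```python
def relative_frequencies(values, bin_edges):
    """Share of values in each half-open bin [edge, next_edge) (hypothetical helper)."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for k in range(len(counts)):
            if bin_edges[k] <= v < bin_edges[k + 1]:
                counts[k] += 1
                break
    # each value contributes 1/N, which is what the weights do in plt.hist
    total = float(len(values))
    return [c / total for c in counts]


times = [10500, 11200, 11800, 12500]  # made-up times in seconds
edges = [10000, 11000, 12000, 13000]
freqs = relative_frequencies(times, edges)
print(freqs)
```

One small difference from numpy and matplotlib: there the last bin also includes its right edge, while this sketch leaves it out.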

the end 😉

About swdevperestroika

High tech industry veteran, avid hacker reluctantly transformed to mgmt consultant.