Statistics is an interesting topic – it can illuminate as well as it can confuse. Sometimes, statistics do mislead unintentionally, i.e. the presenter has done a (honest) mistake. Other times, less honest presenters can take advantage of the fact that most people are not very comfortable with numbers, math nor statistics, and twist statistics to reflect their point of view.

To illustrate this, let’s consider a (hypothetical) newspaper article, stating that some particular city, or county, or in general, any specific geographical area close to you, has a rate of cancer that is 4 times greater than national average.

The question is: is there reason for alarm, i.e. should you immediately start planning for moving out of the vicinity…? After all, four times the national cancer rate sounds like there’s something in your area that is causing all these “way-over-the-bar” cancer occurences… Could there be some area specific factors, environmental, social, genetic, economic etc, that causes the difference in cancer rates ? Or could the difference be due to pure random variability…?

In order to illustrate that despite the alarmist article, perhaps you despite it should calm down, not overly worry about what the probably statistically illiterate newspaper journalist has written, I wrote a simulation to examine whether pure random variability could explain such a large difference in cancer rates.

The scenario thus is: imagine a state/country/region etc divided up into a number of equal areas. Furthermore, imagine that this “grid” of areas is populated with citizens more or less uniformly, i.e the number of citizens in each area is roughly the same. Now, the question is, with a single, nationwide cancer rate (which in the simulation I’ve set to 338/100000, based on cancer statistics in the developed world I found on the net), will randomness alone suffice to explain a difference of factor 4 in cancer rates between the national average and a specific area ?

Turns out the answer to this question is YES – that is, the alarming cancer rate could very well be caused by pure random variability: the graphs below illustrate a grid of 2000 areas, onto which a total population of 1.000.000 has been randomly assigned, by a uniform distribution, resulting in each area having roughly the same number, ~500, people. This is illustrated by the first subplot.

The second subplot shows the resulting population frequency, nicely following the Normal Distribution, with a mean of 500, and a standard deviation of 22.

On top of the area grid, we allocate a number of cancer cases, the number of such cases being #population * #cancer_rate, which in the simulation I ran was 1.000.000 * 338/100.000, which gives 3380 cancer cases, to be randomly assigned to the people in the 2000 areas.

The third subplot shows the distribution of the cancer cases onto the 2000 areas. The mean number of cancer cases per area is 1.69, with a standard deviation of 1.27, as can be seen from subplot 3. But from the same subplot you can also see a number of spikes, with cancer occurences way higher than the mean, in fact, the maximum number of cancer in a specific area is 8, that is, a factor 4.7 above the national average.

Subplot 4 illustrates this more clearly: most areas have 1 case of cancer among it’s roughly 500 inhabitants, others have 0 cases, but on the far right of the graph there are a handful of areas with up to 8 cases.

Thus, we can conclude that before we decide to pack and leave the area with the seemingly exceptionally high occurence of cancer, we need more information about the causes – after all, it might be the case that the high frequency is fully caused by pure random variation.

import numpy as np

import matplotlib.pyplot as plt

from math import erf

def normpdf(x, mu, sigma):

factor = np.array(x-mu).astype(float)

u = ((factor))/np.abs(sigma)

y = (1/(np.sqrt(2*np.pi)*np.abs(sigma)))*np.exp(-u*u/2.)

return y

def normcdf(x,mu,sigma):

a = np.array(x-mu).astype(float)

b = a/(sigma*2**0.5)

elist = []

for i in b:

elist.append((1. + erf(i))/2)

cdf = np.array(elist)

return cdf

nr_areas = 2000

people = 1000000

cancer_rate = 338./100000.

total_with_cancer = int(np.round(people * cancer_rate,0))

locations = np.zeros(nr_areas)

locations_cancer = np.zeros(nr_areas)

for p in range(people):

locations[np.random.randint(0,len(locations))] += 1

max_pop = max(locations)

min_pop = min(locations)

pop_mean = np.mean(locations)

pop_sd = np.std(locations)

print (‘max area population:’,max_pop)

print (‘min area population:’,min_pop)

print (‘area mean population:’, pop_mean)

print (‘area population std:’,pop_sd)

c = 0

while c < total_with_cancer:

allocated = False

while not allocated:

area = np.random.randint(0,len(locations))

if ((locations_cancer[area] < locations[area]) and (area >= 0)):

locations_cancer[area] += 1

allocated = True

c += 1

cancer_mean = np.mean(locations_cancer)

cancer_std = np.std(locations_cancer)

cancer_max = max(locations_cancer)

cancer_min = min(locations_cancer)

stds_over_mean = (cancer_max-cancer_mean)/cancer_std

print (‘min cancer:’,cancer_min)

print (‘max cancer:’,cancer_max)

print (‘area cancer mean:’,cancer_mean)

print (‘area cancer std:’, cancer_std)

print (‘max cancer std dev above mean:’,stds_over_mean)

print (‘p-value for max cancer:’,1.-normcdf(np.array([stds_over_mean]),0,1))

print (‘p-value for 2sd over mean:’, 1.-normcdf(np.array([2.0]),0,1))

high_cancer_rate = np.where(locations_cancer > cancer_mean+2*cancer_std)

print (high_cancer_rate)

#location_cancer_rate = np.array(locations_cancer/locations)

bins = 10

plt.subplot(4,1,1)

plt.bar(range(len(locations)),locations,alpha=0.5)

plt.xlabel(‘Area #’)

plt.ylabel(‘population #’)

plt.ylim(min_pop-10,max_pop+10)

#for i in high_cancer_rate[0]:

#plt.axvline(i,color=’r’)

plt.subplot(4,1,3)

plt.bar(range(len(locations_cancer)),locations_cancer)

plt.xlabel(‘Area #’)

plt.ylabel(‘Cancer case #’)

plt.axhline(cancer_mean,color=’r’,lw=3,label=’Mean’)

plt.axhline(cancer_mean-cancer_std,color=’r’,lw=3,ls=’dashed’,label=’std’)

plt.axhline(cancer_mean-2*cancer_std,color=’r’,lw=3,ls=’dashed’)

plt.axhline(cancer_mean+cancer_std,color=’r’,lw=3,ls=’dashed’)

plt.axhline(cancer_mean+2*cancer_std,color=’r’,lw=3,ls=’dashed’)

plt.legend(loc=’upper left’)

plt.subplot(4,1,2)

plt.hist(locations,normed=True,bins=bins)

plt.xlabel(‘Population in area’)

plt.ylabel (‘Density’)

plt.axvline(pop_mean,color=’r’,lw=3)

plt.axvline(pop_mean-1.96*pop_sd,color=’r’,lw=3,ls=’dashed’,label=’95%CI’)

plt.axvline(pop_mean+1.96*pop_sd,color=’r’,lw=3,ls=’dashed’,label=’95%CI’)

plt.plot(np.sort(locations),normpdf(np.sort(locations),pop_mean,pop_sd),color=’orange’,label=’cdf’)

plt.legend(loc=’upper left’)

plt.title(‘Area Population Frequency’)

plt.subplot(4,1,4)

plt.hist(locations_cancer,normed=True,bins=bins)

plt.xlabel(‘number of cancer cases in each area’)

plt.ylabel(‘Density’)

plt.axvline(cancer_mean,color=’r’,lw=3)

plt.axvline(cancer_mean-1.96*cancer_std,color=’r’,lw=3,ls=’dashed’,label=’95%CI’)

plt.axvline(cancer_mean+1.96*cancer_std,color=’r’,lw=3,ls=’dashed’,label=’95%CI’)

plt.plot(np.sort(locations_cancer),normpdf(np.sort(locations_cancer),cancer_mean,cancer_std),color=’orange’,label=’cdf’)

plt.legend(loc=’upper left’)

plt.title(‘Area Cancer Frequency’)

plt.tight_layout()

plt.show()