Confidence Intervals in Python

Suppose you are interested in finding out the mean weight of all Sumo wrestlers in Japan. Or the average gas consumption of Korean-made automobiles… Why…? No idea, but that sort of statistic might be of interest to someone, at some point in time… 🙂

Anyways: the problem is that it will most likely be tricky as well as expensive, in terms of money, time and effort, to collect that sort of data for each and every individual in the full population of Sumo wrestlers or Korean-made cars; at least for the latter, the number of Korean-made automobiles will be in the hundreds of thousands, if not more…

Enter statistical sampling. That is, by examining a carefully chosen (“representative”) subset of the entire population of wrestlers, or cars (or voters/consumers, or whatever population you might be interested in), you can, applying statistical methods, infer information about the entire population, with some level of certainty that what you infer from the sample is a true reflection of the characteristic of interest in the entire population.

You might for instance have seen political polls with a disclaimer in small print saying something along the lines of “…the findings of this poll are statistically significant at a 95% confidence level, with a margin of error of 3.4%”.

But what exactly does the above disclaimer mean, and what does a 95% (or 90%, or 99%) confidence interval really mean…?

Let’s try to get some grip on this statistical jargon, which IMHO often is unnecessarily complicated. To do so, let’s conduct an experiment, rolling regular, six-sided, fair dice:

Let’s pretend that I have a number of dice, let’s say 1.000.000 dice. I roll all of these once, and I’m interested in finding out the mean, or average, number of pips (“points”) from this massive roll. So I sum up the points from each die, and divide by the number of dice, that is, 1.000.000. Let’s say the total sum of the points was 3499137. This gives me a mean of 3499137/1000000 == 3.499137

Furthermore, I’m also interested in the “spread” of the points of each die, that is, how much “variability”, or distance from the mean, the points of each die had. That type of “spread” from the mean is usually measured by what the stats folks call “standard deviation”, and it turns out that in my example, the standard deviation for my throw was 1.707. That number tells me (among other things) that about 68% of the dice ended up with points in the range of 3.499 (the mean) plus or minus 1.707, that is, in the range 1.79 to 5.21.

But wait a minute…: surely a typical die must land on a whole number, not a fraction…? Absolutely right, but let’s keep this simple, not delving into a lot of detail, by rounding the range to 2..5. So, according to the stats folks, about 68% of the dice should then land on either 2, 3, 4 or 5 pips, which means that the remaining two outcomes, 1 or 6, must share the remaining 32% of the outcomes. That seems reasonable, since rolling a fair die, each and every possible outcome has a probability of 1/6, which is about 17%, and twice that number makes roughly a third of the outcomes, close enough to the 32% above.
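By the way, if you want to double-check those two theoretical numbers yourself, a couple of lines of NumPy will do; this is just a sanity check on the values quoted above, not the simulation itself:

```python
import numpy as np

faces = np.arange(1, 7)     # the six possible outcomes of a fair die
print(faces.mean())         # 3.5, the theoretical mean
print(faces.std())          # ~1.708, the theoretical standard deviation
```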

Anyways, I’m losing track of the topic of this post… The key point is that with a population of 1.000.000, it will take a lot of time and effort to conduct the experiment (unless you, as we are about to do, run a simulation). The whole idea of sampling is to infer information about a potentially very large “superset”, i.e. the full population, from a relatively small sample. That is, in normal, real life situations, I would not have access to the above statistics about the population mean and population standard deviation – after all, those parameters are what I’m trying to find out, using sampling, that is, by looking at a carefully selected subset of the population.

Now, there are lots of interesting (?) theorems in statistics, and one of the most fundamental ones is the Central Limit Theorem, which basically states that if you repeatedly draw sufficiently large samples from any population, regardless of its distribution, the means of those samples will be distributed according to the Normal Distribution. (Look up the Central Limit Theorem and the Normal Distribution for more details.) For our purposes, it suffices to know that thanks to the CLT, we can use sampling to infer things like the population mean from a relatively small sample, say 100 individuals, of our total population of 1.000.000.

Not going into all the nitty gritty theoretical details about how and why this works, here’s the essence:

Let’s say that I have rolled all the 1.000.000 dice, and carefully recorded the resulting mean and standard deviation – these will serve as a check on the results of my sampling in the next step.

I now randomly choose a subset, or sample, of the dice, say 100 of them, and calculate the mean and standard deviation of the sample. Let’s say I get a sample mean of 3.5 and a sample standard deviation of 1.70. Now, the sample mean as well as the sample standard deviation seem to be fairly close to the population mean and population standard deviation (both of which we in real life don’t know). But how certain can we be that the numbers from the sample truly reflect the corresponding parameters of the entire population…?

No worries, the stats people have figured out how to use these two sample statistics to infer, with a “reasonable” level of confidence, information about the full population.

Basically, what we want to know, after having drawn our sample, is whether the sample mean of 3.5 is a “good enough” indicator of the true mean, that is, the population mean. To figure that one out, we proceed as follows:

  • first, we decide upon how “certain” we want to be that our sample mean is “reasonably” close to the real population mean. This is called the confidence level, and it is expressed in %. Typical values are 90%, 95% or 99%. Here, we settle for a confidence level of 95%. Basically, what a 95% confidence level tells us is: “if I were to perform many samplings from the same population, in 95% of the samplings the resulting sample mean will be ‘close enough’ to the real population mean”. But now we have to figure out what “close enough” means…
  • “Close enough” can be seen as a “range” around the mean, and the stats folks have figured out a formula for “close enough” for the various confidence levels. “Close enough” in this context is called the “margin of error” (E), and it is calculated by E = z * s / sqrt(n), where s is the sample standard deviation and n the sample size. z is called the “z score”, and there are tables for z given the confidence level we have chosen. In our case, with a confidence level of 95%, z turns out to be 1.960.
  • Plugging in the numbers in the above formula, we get E = (1.960 * 1.70) / sqrt(100), that is, our margin of error, E, is about 0.33 (see the short code sketch right after this list).
  • That means we can say, with 95% confidence, that our expectation for the population mean is 3.5 plus/minus 0.33, that is, the 95% confidence interval runs from 3.17 to 3.83.
  • That confidence level of 95%, with the corresponding margin of error of 0.33, tells us that if we were to take repeated samples from the same population, say 100 times, then in about 95 of those cases the resulting interval (sample mean plus/minus the margin of error) should cover the true population mean. In the remaining cases, roughly five of them, the true population mean falls outside that range.
  • The key point is that with this procedure, we have good reason, “confidence”, to expect that the sample mean found is a reasonably good estimate of the true population mean, and that a claim like the one above will be right about 95% of the time.
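To make the arithmetic in the list above concrete, here is a minimal sketch of the margin-of-error calculation; the sample mean, sample standard deviation and sample size are the numbers assumed above:

```python
import math
from scipy import stats

confidence = 0.95
sample_mean, sample_std, n = 3.5, 1.70, 100

# z score for a two-sided 95% confidence level (~1.960)
z = stats.norm.ppf(1 - (1 - confidence) / 2)

E = z * sample_std / math.sqrt(n)                            # margin of error, ~0.33
print(f"{sample_mean - E:.2f} .. {sample_mean + E:.2f}")     # ~3.17 .. 3.83
```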

Below is a screen shot of a Python-based simulation of the above scenario: a roll of 1.000.000 dice, recording the true population mean and standard deviation, then taking 100 samples, each with sample size 100, and plotting the results. As can be seen, for 95 of the 100 samples the resulting confidence interval covers the true population mean, in line with the chosen 95% confidence level.

IMG_1538
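The screenshot itself is not reproduced here, but a minimal sketch of such a simulation could look something like this (variable names and structure are my own, not necessarily those of the original hack):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng()

# the "population": one roll of 1.000.000 fair dice
population = rng.integers(1, 7, size=1_000_000)
pop_mean = population.mean()

n_samples, sample_size, confidence = 100, 100, 0.95
z = stats.norm.ppf(1 - (1 - confidence) / 2)

covered = 0
for _ in range(n_samples):
    sample = rng.choice(population, size=sample_size, replace=False)
    mean, std = sample.mean(), sample.std(ddof=1)
    E = z * std / np.sqrt(sample_size)            # margin of error for this sample
    if mean - E <= pop_mean <= mean + E:          # does the interval cover the true mean?
        covered += 1

print(f"{covered} of {n_samples} intervals cover the population mean {pop_mean:.4f}")
```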


How many North (or South) poles are there…?

https://www.scientificamerican.com/article/the-earth-has-more-than-one-north-pole/


NASA aerosol visualization


Estimation of tides

Tide estimation made simple by using the “rule of twelfths”.
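The rule says that during the six hours between low and high water, the tide moves 1/12, 2/12, 3/12, 3/12, 2/12 and 1/12 of its total range in each successive hour. A minimal sketch, where the tidal range of 4.0 m is just an assumed example:

```python
# Rule of twelfths: fraction of the tidal range moved during each of the six hours
twelfths = [1, 2, 3, 3, 2, 1]

tidal_range = 4.0      # assumed range in metres, from low water to high water
height = 0.0
for hour, t in enumerate(twelfths, start=1):
    height += tidal_range * t / 12
    print(f"after hour {hour}: {height:.2f} m above low water")
```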


Central Limit Theorem illustrated

IMG_1516

A fundamental theorem of statistics is the Central Limit Theorem, which basically states that the distribution of the averages (means) of a large number of samples, drawn from any population, converges towards a normal distribution.

The graph above demonstrates this theorem: the blue bars show a “population” of 1.000.000 instances, with a range of values of [0..9]. As can be seen by the blue histogram, these 1.000.000 values are very uniformly distributed.

From these 1.000.000 values, I randomly draw 10 values, 1000 times, compute the averages of the samples, and plot the resulting histogram in green. Voilà, the famous Bell Curve appears!

The red line is the corresponding probability density function for a normal distribution with the same mean and standard deviation.
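A minimal sketch of the simulation described above (a uniform “population” of 1.000.000 values in 0..9, then 1000 samples of 10 values each) could look like this:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng()

population = rng.integers(0, 10, size=1_000_000)           # uniformly distributed over 0..9
sample_means = rng.choice(population, size=(1000, 10)).mean(axis=1)

plt.hist(population, bins=10, density=True, alpha=0.5, label="population")
plt.hist(sample_means, bins=30, density=True, alpha=0.5, label="sample means")

# normal pdf with the mean and stddev of the sample means
x = np.linspace(sample_means.min(), sample_means.max(), 200)
plt.plot(x, stats.norm.pdf(x, sample_means.mean(), sample_means.std()), "r-")
plt.legend()
plt.show()
```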


Wind, Temperature, Air Pressure statistics

[Images: frequency charts and Q-Q charts for air pressure, temperature and wind speed]

The above graphs show statistics from SMHI, the Swedish Meteorological and Hydrological Institute, on air pressure, temperature and wind speed. For each category (pressure, temperature, wind speed) there are two graphs: the first a frequency chart, the second a Q-Q chart, showing the correlation between the observed data and a specific theoretical distribution, which I’ve chosen to be the Normal distribution for temperature and pressure, and the Rayleigh distribution for wind speed. Note: that’s not necessarily to imply that wind, temperature & pressure ARE distributed according to these theoretical models… more on this below.

As can be seen, the best match occurs for wind speed distribution, where the observed data fit the Rayleigh distribution very well – almost all observations fall onto a straight line.

Both air pressure and temperature are mapped against the Normal distribution, and as can be seen, neither of them is a perfect fit, with temperature clearly the least conforming to the normal distribution.

To summarize the observations from this data:

Wind speed follows the Rayleigh distribution quite well. Air pressure could perhaps be modelled by the normal distribution, despite it not being a “perfect” match – but then, in reality, no model is a perfect match for reality! Temperature, at least at my latitudes, where there is large seasonal variation, fits neither the Rayleigh nor the Normal distribution.
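The Q-Q charts themselves are straightforward to produce with SciPy; a minimal sketch, where the wind_speed array below is just synthetic placeholder data standing in for the SMHI observations:

```python
import matplotlib.pyplot as plt
from scipy import stats

# placeholder data - in the real post the observations come from SMHI
wind_speed = stats.rayleigh.rvs(scale=4.0, size=5000)

# Q-Q plot of the observations against the Rayleigh distribution
stats.probplot(wind_speed, dist=stats.rayleigh, plot=plt)
plt.show()
```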


Wind Speed Distribution – is it Normal…?

[Images: wind time series for Svenska Högarna and Söderarm, and the corresponding wind speed distributions]

As a sailor, I’ve become obsessed by wind. Trying to understand its behavior has kept me busy for decades, and still, I can’t claim I fully understand wind behavior.

Anyways, one of the things I’ve wondered is whether wind is governed by the normal (Gauss) distribution. It’s not, as can be seen from the graphs above, based on data graciously provided by the Swedish Meteorological Service, SMHI.

Graphs 1 & 3 show a time series of wind, in terms of direction and speed, from two different stations on the Swedish east coast, the lighthouses ‘Svenska Högarna’ (about 600 data points) and ‘Söderarm’ (about 300.000 data points). For Svenska Högarna, the time series is short, just June 2017 to October 2017, while the time series for Söderarm runs from 1951 to 2017. As an aside: if you look carefully at the data for Söderarm, you can see in the smoothed plots that something seems to happen to the data in 1995… I got in touch with the good folks at SMHI, and they confirmed that in November 1995 the measurements went from manual, 3 times a day, to fully automatic, once per hour, which explains the larger fluctuations of the data after 1995. Also, these two graphs display the data in smoothed form, where I’ve used a Hanning window and convolution to detect the overall trend for TWS and TWD.
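The smoothing mentioned above is essentially a convolution with a normalised Hanning window; here is a minimal sketch, where the window length of 51 is an arbitrary choice of mine:

```python
import numpy as np

def smooth(signal, window_len=51):
    """Smooth a 1-D signal by convolving it with a normalised Hanning window."""
    window = np.hanning(window_len)
    window /= window.sum()
    return np.convolve(signal, window, mode="same")

# e.g. smoothed_tws = smooth(tws_values), where tws_values is an array of wind speeds
```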

Graphs 2 & 4 show the relative distribution of wind speed (TWS), measured in m/s. From these histograms it’s obvious that wind speed does not follow the normal distribution – instead, wind is characterised by a long tail, typical for phenomena governed by different types of power laws. Apparently the wind pros use the Weibull distribution to model wind speed behavior, at least according to this reference.

On top of the wind speed distribution of graphs 2 & 4, I’ve overlaid a normal probability density function, in orange, as well as a random sampling with the same parameters as the wind data, as a red, dashed line. Furthermore, on the graphs the mean is marked by the red dashed vertical line, and the 1st, 2nd & 3rd standard deviations by dashed orange lines.

The raw data from SMHI came as .csv’s, and I used Python Pandas to read and manipulate the data. Pandas is great for processing large amounts of more or less structured data; it takes only about 60 lines of code to transform the raw data from the csv’s to the Matplotlib graphs.
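A hedged sketch of how such a pipeline might begin; the file name, the column names and the number of header rows are assumptions about the SMHI files, not verified here:

```python
import pandas as pd

# SMHI csv's typically carry a number of metadata rows before the actual observations;
# skiprows and the column names below are illustrative assumptions
df = pd.read_csv("smhi_wind.csv", sep=";", skiprows=10,
                 names=["date", "time", "speed", "quality"],
                 usecols=[0, 1, 2, 3])

df["speed"] = pd.to_numeric(df["speed"], errors="coerce")
df = df.dropna(subset=["speed"])

print(df["speed"].describe())      # quick look at the wind speed statistics
```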


Map Projections using Python & Basemap

The numerous external libraries available for Python continue to impress me, in scope, number and ease of use. A couple of hours ago I googled for cartographic tools for Python, and found the Basemap library, an add-on to Matplotlib. For whatever reason, it’s not available via pip install, so I had to install it from source, which was totally painless. Then, after finding this basemap tutorial, it took just about 15 lines of code, and about as many minutes, to draw the different map projections above, all showing a flight path Stockholm – San Francisco.
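A hedged sketch of the kind of code the tutorial leads to; the coordinates for Stockholm and San Francisco are approximate, and the styling choices are mine:

```python
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

# orthographic projection, centred roughly between the two cities
m = Basemap(projection="ortho", lat_0=60, lon_0=-40)
m.drawcoastlines()
m.drawmapboundary(fill_color="lightblue")
m.fillcontinents(color="beige", lake_color="lightblue")

# great circle flight path Stockholm (ca 59.3N, 18.1E) - San Francisco (ca 37.8N, 122.4W)
m.drawgreatcircle(18.1, 59.3, -122.4, 37.8, linewidth=2, color="red")

plt.show()
```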



Great Circle Navigation

SYD-LHR

IMG_1500

The shortest distance between two points is a straight line, right…?

Nope, not on the surface of a sphere, like our earth. On a sphere, the shortest distance between two points is a Great Circle. Let’s say you are a pilot about to embark on a journey from Sydney to London. You take your Mercator chart, and plot a straight line from Sydney to London (like the red line on the map above). Turns out that following this straight line – also called a “Rhumb Line” or “Loxodrome” – will make your flight distance about 360 nautical miles longer than had you followed the orange path, which is the Great Circle route. That’s about 45 minutes of extra flight time in a typical commercial jet, which is a lot of extra fuel, or a day and a half of extra sailing if you were to sail the same route.

Loxodromes (or rhumb lines) have the nice characteristic that on Mercator charts they are presented as straight lines, with a constant course; in the example Sydney-London, the rhumb line course is 303 degrees.

For Great Circle navigation, your course is going to vary during the journey. For instance, on the plotted journey from Sydney to London, the initial course is 318 degrees, gradually increasing to 326, and then gradually decreasing to reach the final course of 242 degrees. For practical reasons, it’s not possible to change course constantly, so onboard ships, for instance, navigators typically used to change course once a day to follow an approximation of a Great Circle route. Today, with modern technology, it should be feasible to tell the autopilot to follow a Great Circle path with much more frequent course changes. (In case any commercial pilots or professional mariners read this blog, I’d be interested in learning how you do it onboard your ships/aircraft in order to follow a great circle.)
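The distance and the initial course of a great circle come straight out of spherical trigonometry. Here is a minimal sketch using the standard haversine and bearing formulas on a spherical earth; the coordinates for Sydney and London are approximate, and this is not the code behind the plots above:

```python
import math

def great_circle(lat1, lon1, lat2, lon2):
    """Return (distance in nautical miles, initial course in degrees), spherical earth."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)

    # central angle via the haversine formula
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    distance_nm = 2 * math.asin(math.sqrt(a)) * 180 * 60 / math.pi   # 1 arc minute ~ 1 NM

    # initial bearing
    y = math.sin(dlon) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlon)
    course = (math.degrees(math.atan2(y, x)) + 360) % 360
    return distance_nm, course

# Sydney -> London: roughly 9200 NM, initial course close to the 318 degrees mentioned above
print(great_circle(-33.9, 151.2, 51.5, -0.1))
```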

The lower graph shows the True Vector at selected points.

(I haven’t found a way to include code in a decent way, preserving indentation etc,  in a WordPress post – without having to pay for it – so the Python hack code is in the pdf doc below).

Kudos to pythonforoceanographers, gpxpy, and movabletype for inspiration, insights and new knowledge.

great_circle

 


Machine Learning – results

So, I reached the point where my Neural Network, after a couple of thousand training iterations, is capable of correctly identifying and separating different categories of input into their corresponding classes.

More specifically, as “targets”, I’ve chosen a number of sampled continuous waveforms, each one 1000 samples long, whereas as “noise”, I’ve simply used equally many samples of random noise. One could imagine that the targets represent sonar clips of submarines, sea life or something else, and the noise samples just represent the background noise of the marine environment.

Anyways, a visual illustration of the 100 target samples is below:

sinus-100.png

The next image illustrates the noise, likewise 100 samples, each of size 1000:

noise-100.png

I use these data as training data for my neural network, in this case a 1000 x 100 x 10 x 1 network, and after a couple of hundred learning iterations the results are as below:

learning_run

So, the network is capable of providing a perfect separation between the two classes of training data, as can be seen from the upper graph. The lower graph shows how the network fairly quickly converges towards a solution: the y-axis shows the cumulative error over all the training data per iteration. As can be seen, the network is able to adjust its 101.010 (1000 x 100 + 100 x 10 + 10 x 1) weights towards convergence.
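My own hack isn’t reproduced here, but a comparable setup can be sketched with scikit-learn’s MLPClassifier: the same 1000 x 100 x 10 x 1 shape, sine waveforms as targets and random noise as the other class. This is an illustration of the idea, not the code that produced the plots:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng()
n_samples, n_points = 100, 1000
t = np.linspace(0, 2 * np.pi, n_points)

# 100 noisy sine waveforms ("targets") and 100 pure-noise samples
targets = np.array([np.sin(rng.uniform(1, 5) * t) + 0.1 * rng.standard_normal(n_points)
                    for _ in range(n_samples)])
noise = rng.standard_normal((n_samples, n_points))

X = np.vstack([targets, noise])
y = np.array([1] * n_samples + [0] * n_samples)

# two hidden layers of 100 and 10 units; the single output unit is handled by the classifier
clf = MLPClassifier(hidden_layer_sizes=(100, 10), max_iter=2000)
clf.fit(X, y)
print(clf.score(X, y))      # should separate the two classes more or less perfectly
```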

To test the network after the training, I pass in one single target waveform, randomly chosen, and similarly, one single noise input.  The below graph shows that the network is capable of correctly classifying these two inputs.

sample_test

I’ve done this experiment using my very modest laptop – despite my very limited number crunching and data processing power, I’m able to obtain good results. Imagine then what the Big Guys, the Googles, Facebooks, Amazons and others, are able to do with this technology, with their immensely more powerful machinery…

Artificial intelligence is already changing societies and human life, and it will continue doing so at an ever-increasing pace. Humans will need to think carefully about how to deal with this “Pandora’s Box” – a good intro to the potential dangers of Super-AI can be found here:
