Suppose you are interested in finding out the mean weight of all Sumo wrestlers in Japan. Or the average gas consumption of Korean-made automobiles… Why…? No idea, but that sort of statistic might be of interest, for someone, at some point in time… 🙂
Anyways: the problem is that it will most likely be tricky as well as expensive, in terms of money, time and effort, to collect that sort of data for each and every individual in the full population of Sumo wrestlers or Korean-made cars; at least for the latter, the number of Korean-made automobiles will be in the hundreds of thousands, if not more…
Enter statistical sampling. That is, by examining a carefully chosen (“representative”) subset of the entire population of wrestlers, or cars (or voters/consumers or whatever population you might be interested in), you can, applying statistical methods, infer information about the entire population – with some level of certainty that what you infer from the sample will be a true reflection of the characteristic of interest of the entire population.
You might for instance have seen political polls with a disclaimer in small print saying something along the lines of “…the findings of this poll are statistically significant at a 95% confidence level, with a margin of error of 3.4%”.
But what exactly does the above disclaimer mean, and what does a 95% (or 90%, or 99%) confidence level really mean…?
Let’s try to get a grip on this statistical jargon, which IMHO is often unnecessarily complicated. To do so, let’s conduct an experiment, rolling regular, six-sided, fair dice:
Let’s pretend that I have a number of dice, let’s say 1.000.000 dice. I roll all of these once, and I’m interested in finding out the mean, or average, number of points (“pips”) from this massive roll. So I sum up the points from each die, and divide by the number of dice, that is, 1.000.000. Let’s say the total sum of the points was 3499137. This gives me a mean of 3499137/1000000 == 3.499137
Furthermore, I’m also interested in the “spread” of the points of each die, that is, how much “variability”, or distance from the mean, the points of each die had. That type of “spread” from the mean is usually measured by what the stats folks call “standard deviation”, and it turns out that in my example, the standard deviation for my throw was 1.707. That number tells me (among other things) that about 68% of the dice ended up with points in the range of 3.499 (the mean) plus or minus 1.707, that is, in the range 1.79 to 5.21
But wait a minute…: surely a typical die must land on a whole number, not a fraction…? Absolutely right, but let’s keep this simple, without delving into a lot of detail, by rounding the range to 2..5. So, according to the stats folks, about 68% of the dice should then land on either 2, 3, 4 or 5 points, which means that the remaining two outcomes, 1 or 6, must share the remaining 32% of the outcomes. That seems reasonable, since rolling a fair die, each possible outcome has a probability of 1/6, which is about 16%, and twice that number makes 32%.
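The numbers above can be checked with a small simulation – here’s a minimal sketch using only Python’s standard library (the variable names and the seed are my own choices, not from the original program):

```python
import math
import random

random.seed(42)  # fixed seed so the run is reproducible

N = 1_000_000                                      # population size: one million dice
rolls = [random.randint(1, 6) for _ in range(N)]   # roll them all once

pop_mean = sum(rolls) / N
# population standard deviation: average squared distance from the mean, square-rooted
pop_sd = math.sqrt(sum((x - pop_mean) ** 2 for x in rolls) / N)

# fraction of dice landing within one standard deviation of the mean
within = sum(pop_mean - pop_sd <= x <= pop_mean + pop_sd for x in rolls) / N

print(pop_mean, pop_sd, within)   # roughly 3.5, 1.708, 0.667
```

For a fair die, the theoretical standard deviation is sqrt(35/12) ≈ 1.708, and the dice landing on 2, 3, 4 or 5 are 4 outcomes of 6, i.e. about 67% – close to the “about 68%” a normal distribution would predict.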
Anyways, I’m losing track of the topic of this post… The key point is that with a population of 1.000.000, it will take a lot of time and effort to conduct the experiment (unless you, as we are about to do, run a simulation). The whole idea of sampling is to infer information about a potentially very large “superset”, i.e. the full population, from a relatively small sample. That is, in normal, real-life situations, I would not have access to the above statistics about the population mean and population standard deviation – after all, those parameters are what I’m trying to find out, using sampling, that is, looking at a carefully selected subset of the population.
Now, there are lots of interesting (?) theorems in statistics, and one of the most fundamental ones is the Central Limit Theorem, which basically states that if you repeatedly draw sufficiently large samples from any population, regardless of its distribution, the means of those samples will be distributed according to the Normal Distribution. (Look up Central Limit Theorem and Normal Distribution for more details.) For our purposes, it suffices to know that thanks to the CLT, we can use sampling to infer things like the population mean from a relatively small sample, say 100 individuals, of our total population of 1.000.000.
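The CLT can be eyeballed with a quick sketch: a single die roll is anything but normally distributed (it’s uniform), yet the means of repeated samples cluster tightly around the population mean, with a spread of roughly sigma/sqrt(n). Sample sizes and counts below are my own picks for illustration:

```python
import math
import random
from statistics import mean, stdev

random.seed(1)

n = 100             # size of each sample
num_samples = 2000  # how many samples we draw

# each entry is the mean of one sample of n die rolls
sample_means = [mean(random.randint(1, 6) for _ in range(n))
                for _ in range(num_samples)]

# CLT prediction: the sample means center on 3.5 and have
# standard deviation sigma / sqrt(n), with sigma = sqrt(35/12) ≈ 1.708
predicted_se = math.sqrt(35 / 12) / math.sqrt(n)

print(mean(sample_means), stdev(sample_means), predicted_se)
```

Plot a histogram of `sample_means` and the familiar bell shape appears, even though the underlying die is flat-out uniform.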
Not going into all the nitty gritty theoretical details about how and why this works, here’s the essence:
Let’s say that I have rolled all the 1.000.000 dice, and carefully recorded the resulting mean and standard deviation – for checking the results of my sampling in the next step.
I now randomly choose a subset, or sample, of the dice, say 100 of them, and calculate the mean and standard deviation of the sample. Let’s say I get a sample mean of 3.5 and a sample standard deviation of 1.70. Now, the sample mean as well as the sample standard deviation seem to be fairly close to the population mean and population standard deviation (both of which we in real life don’t know). But how certain can we be that the numbers from the sample truly reflect the corresponding parameters of the entire population…?
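That sampling step might look something like this – a minimal sketch, assuming the million rolls sit in a list called `population` (my own name for it):

```python
import random
from statistics import mean, stdev

random.seed(7)

# the full population: one million die rolls
# (in real life we would not be able to see all of this)
population = [random.randint(1, 6) for _ in range(1_000_000)]

# draw a random sample of 100 dice, without replacement
sample = random.sample(population, 100)

sample_mean = mean(sample)
sample_sd = stdev(sample)   # sample standard deviation (n-1 in the denominator)
print(sample_mean, sample_sd)
```

Note the use of `stdev` rather than `pstdev` here: for a sample we divide by n-1, which is the estimate the margin-of-error formula below expects.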
No worries, the stats people have figured out how to use these two sample statistics to infer, with a “reasonable” level of confidence, information about the full population.
Basically, what we want to know, after having drawn our sample, is whether the sample mean of 3.5 is a “good enough” indicator of the true mean, that is, the population mean. To figure that out, we proceed as follows:
- first, we decide how “certain” we want to be that our sample mean is “reasonably” close to the real population mean. This is called the confidence level, and is expressed in %. Typical values are 90%, 95% or 99%. Here, we settle for 95%. Basically, what a 95% confidence level tells us is: “if I were to perform many samplings from the same population, in 95% of the samplings the resulting sample mean would be ‘close enough’ to the real population mean”. But now we have to figure out what “close enough” means…
- “Close enough” can be seen as a range around the mean, and the stats folks have figured out a formula for it for the various confidence levels. “Close enough” in this context is called the “margin of error” (E), and it is calculated by: E = z * standard deviation (sample) / square root(sample size). z is called the “z score”, and there are tables for z given the confidence level we have chosen. In our case, at 95%, z turns out to be 1.960.
- Plugging the numbers into the above formula, we get: E = (1.960 * 1.70) / sqrt(100), that is, our margin of error, E, is 0.33
- That means that we can say that, with 95% confidence, our expectation for the population mean is 3.5 plus/minus 0.33, that is, between 3.17 and 3.83
- That 95% confidence level with the corresponding margin of error of 0.33 thus tells us that if we were to take repeated samples of the same population, say 100 times, then in about 95 of those cases we should expect the resulting interval – sample mean plus/minus the margin of error – to cover the true population mean. In the remaining five or so cases, the true population mean falls outside of the range.
- The key point is that with this procedure, we can claim that it is fairly (“very”?) likely that our sample mean is a reasonably good estimate of the true population mean.
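The whole margin-of-error calculation above fits in a few lines of Python (the numbers 3.5, 1.70 and the z score 1.960 are the ones from the text):

```python
import math

sample_mean = 3.5
sample_sd = 1.70
n = 100
z = 1.960   # z score for a 95% confidence level

# margin of error: E = z * s / sqrt(n)
margin_of_error = z * sample_sd / math.sqrt(n)
ci_low = sample_mean - margin_of_error
ci_high = sample_mean + margin_of_error

print(round(margin_of_error, 2), round(ci_low, 2), round(ci_high, 2))
# → 0.33 3.17 3.83
```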
Below is a screenshot of a Python-based simulation of the above scenario: a roll of 1.000.000 dice, recording the true population mean and standard deviation, and then taking 100 samples, each with sample size 100, and plotting the results. As can be seen, 95 of the 100 samples produce a confidence interval that covers the true population mean.
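The simulation can be sketched roughly like this – my own minimal reimplementation of the scenario, without the plotting (names and seed are mine, not from the original program):

```python
import math
import random
from statistics import mean, stdev

random.seed(0)

# the population: one million die rolls
population = [random.randint(1, 6) for _ in range(1_000_000)]
pop_mean = mean(population)   # the "true" mean, recorded for checking

z = 1.960     # z score for a 95% confidence level
covered = 0
for _ in range(100):                       # 100 repeated samplings
    sample = random.sample(population, 100)
    m, s = mean(sample), stdev(sample)
    e = z * s / math.sqrt(len(sample))     # margin of error for this sample
    if m - e <= pop_mean <= m + e:         # does the interval cover the true mean?
        covered += 1

print(covered, "of 100 intervals cover the true population mean")
```

On any given run the count will hover around 95 – sometimes 93, sometimes 97 – which is exactly what “95% confidence” promises on average.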