Bayesian Multiple Regression to identify likely causal associations

I’m using the Johns Hopkins coronavirus dataset here to demonstrate how multiple regression can help unmask associations between variables, and help figure out which associations genuinely relate to the dependent variable and which are merely spurious correlations.

Let’s hypothesize that GDP per capita (GDPpc) and/or population density could have an impact on confirmed Corona cases (normalized to cases per million inhabitants). To see whether that is the case, let’s run a binary (single-predictor) regression on a set of 129 countries: first with GDPpc as the predictor and Confirmed_per_Million as the dependent variable, then with density as the predictor:

binary_regression_countries_gdp_per_capita_conf_per_M

The above graphs show the results of that binary regression over 129 countries. There is clearly a strong association between GDPpc and the number of confirmed cases, with a mean slope of 0.46 and a CI of [0.38, 0.54]: the higher the GDPpc, the higher the number of Corona cases [1]. That is, the number of confirmed cases is strongly correlated with GDPpc. Whether that is also a causal relationship cannot be decided by regression alone; determining causality requires additional scientific reasoning.
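For readers who want to try the single-predictor setup, here is a minimal sketch. It is not the code behind the plots: it uses a plain least-squares fit as a stand-in for the Bayesian model (under a Gaussian likelihood with flat priors, the posterior mean of the slope coincides with the least-squares slope), it standardizes only the predictor, and the six data points are made up for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def least_squares(x, y):
    """Slope and intercept of the least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical toy values, NOT taken from the Johns Hopkins data.
gdp_pc = [800, 2500, 9000, 17000, 42000, 60000]
confirmed_per_million = [5, 20, 60, 150, 400, 520]

x = standardize(gdp_pc)
slope, intercept = least_squares(x, confirmed_per_million)
print(f"slope per standard deviation of GDPpc: {slope:.1f}")
print(f"intercept (outcome at mean GDPpc): {intercept:.1f}")
```

Because the predictor is standardized, the slope reads as "change in the outcome per one standard deviation of GDPpc", and the intercept is the expected outcome at the mean GDPpc.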

Next, let’s look at whether population density has any impact on the number of Corona cases for these 129 countries:

binary_regression_countries_density_conf_per_M

It looks like there is a positive association, albeit weaker than that of GDPpc. Looking closer at the 89% credible interval for the slope (inside the brackets in the legend), we see that the interval runs from -0.03 to 0.37, with a mean value of 0.17.

So, based on this binary regression, we might conclude that there is a weak positive association between density and the number of confirmed cases, keeping in mind that the interval includes zero.

After running binary regressions for both GDPpc and population density, the conclusion at this point is that for these 129 countries both predictors are positively correlated with the number of Corona cases: GDPpc strongly so, population density weakly so.
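The 89% credible interval quoted for the density slope can be read off posterior draws roughly as follows. This is a sketch under assumptions: the draws below are simulated from a normal distribution chosen to resemble the reported mean of 0.17 (they are not the actual posterior samples), and the interval is an equal-tailed percentile interval, which may differ slightly from a highest-density interval if that is what the plots show.

```python
import random

def percentile_interval(samples, mass=0.89):
    """Equal-tailed interval containing `mass` of the samples."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    last = len(ordered) - 1
    lower = ordered[round(tail * last)]
    upper = ordered[round((1.0 - tail) * last)]
    return lower, upper

# Stand-in for MCMC draws of the slope (assumed shape, illustrative only).
random.seed(1)
slope_draws = [random.gauss(0.17, 0.125) for _ in range(20000)]

lower, upper = percentile_interval(slope_draws)
share_positive = sum(d > 0 for d in slope_draws) / len(slope_draws)
print(f"89% interval: [{lower:.2f}, {upper:.2f}]")
print(f"share of draws above zero: {share_positive:.2f}")
```

The share of draws above zero is a useful companion to the interval: an interval that just barely includes zero can still have most of its posterior mass on the positive side.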

Let’s do the same analysis for the set of US states.

First GDPpc –> Confirmed_per_Million:

binary_regression_US_States_gdp_per_capita_conf_per_M

Here we see a weak positive association, with a mean slope of 0.28 and a CI of [0.02, 0.53].

So while GDPpc is strongly associated with confirmed Corona cases for the 129 countries above, for the US states the relationship is much weaker. What could be the reason for the difference?

One might speculate that GDPpc stops mattering for the ability to detect Corona once it has exceeded a certain threshold. All US states are past any such threshold: their mean GDPpc is almost 60,000 USD, which makes every one of them (very) rich compared to most countries of the world. The mean for the 129 countries above is about 17,000 USD, and the set contains a whole bunch of countries with a GDPpc under 1,000 USD. Very poor countries simply can’t afford to test much at all, and thus have far fewer confirmed cases.

What about density for the US states?

binary_regression_US_States_density_dead_per_M

Binary regression shows that for US states there is a very strong association between population density and confirmed cases: the mean slope is now a whopping 1.52, meaning that moving one unit on the x-scale (one standard deviation) corresponds to moving 1.52 units upwards on the y-scale. The 89% CI for the slope is [0.78, 2.26].

What these separate binary regressions have told us is that the association between GDPpc and confirmed cases is:

  • Strong positive for the set of 129 countries
  • Weak positive for US states

For density, we’ve found that the association with confirmed cases is:

  • Weak positive for the set of 129 countries
  • Very strong positive for US states

Now, let’s run a multiple regression with both predictor variables present, first for the 129 countries:

multi_regression_countries_gdp_per_capita_conf_per_M

Here we see that the association between GDPpc and the number of confirmed cases is the same as in the binary regression, while the association between density and confirmed cases is a bit weaker than before: the mean slope is now 0.12 instead of 0.17. Overall, though, the result is pretty much the same.
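In symbols, a two-predictor regression of this kind is commonly written as below. This is a generic sketch rather than the exact model behind the plots: the symbols are mine, the Normal likelihood is an assumption, and the priors on the parameters are not stated in this post, so they are left out.

```latex
\begin{aligned}
y_i   &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_G \, G_i + \beta_D \, D_i
\end{aligned}
```

Here $y_i$ is confirmed cases per million for country or state $i$, $G_i$ is its standardized GDPpc and $D_i$ its standardized population density. The slope $\beta_G$ now describes the association with GDPpc among units of equal density, and $\beta_D$ the association with density among units of equal GDPpc. Dropping one of the two terms gives back the binary regressions above.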

Let’s look at US states:

multi_regression_US_States_gdp_per_capita_conf_per_M

Now we see that density is still strongly positive, with a mean slope of 1.05 and a CI of [0.34, 1.80], but GDPpc has fallen to a mean slope of 0.14, and its CI now ranges from negative to positive values, [-0.11, 0.40]. This implies that GDPpc doesn’t have a clear, well-defined association with the number of confirmed cases at all, contrary to what we might have thought after running the corresponding binary regression.

So, multiple regression helped us identify the “true” associations in the case of US states: of the two predictors used here, density is the key predictor of the number of Corona cases, and GDPpc most likely has no impact once density is accounted for.

That is something we could not have concluded from the separate binary regressions, which on their own suggested that GDPpc is a predictor of the number of Corona cases in US states.
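The way a slope can shrink towards zero once a second predictor enters the model is easy to reproduce on synthetic data. The sketch below makes several assumptions: the data is simulated so that only density drives the outcome while GDPpc is merely correlated with density, the sample size is far larger than 50 states to keep the noise down, and the fit is plain least squares rather than the Bayesian model used for the plots. All names and coefficients are illustrative.

```python
import random
from statistics import mean

def center(values):
    m = mean(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def single_slope(x, y):
    """Least-squares slope of y on one predictor."""
    xc, yc = center(x), center(y)
    return dot(xc, yc) / dot(xc, xc)

def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 jointly (normal equations)."""
    a, b, c = center(x1), center(x2), center(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, c), dot(b, c)
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Simulated world: density causes cases; GDPpc only correlates with density.
random.seed(42)
n = 2000
density = [random.gauss(0, 1) for _ in range(n)]
gdp = [0.6 * d + random.gauss(0, 0.8) for d in density]
cases = [1.0 * d + random.gauss(0, 1.0) for d in density]

gdp_alone = single_slope(gdp, cases)
gdp_joint, density_joint = two_predictor_slopes(gdp, density, cases)
print(f"GDPpc slope, binary regression:   {gdp_alone:.2f}")
print(f"GDPpc slope, multiple regression: {gdp_joint:.2f}")
print(f"density slope, multiple regression: {density_joint:.2f}")
```

In this simulation the binary regression reports a clearly positive GDPpc slope even though GDPpc has no effect by construction, and the multiple regression pulls that slope back towards zero while keeping the density slope near its true value of 1.0.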

 

[1] This could be because richer countries are able to do more testing than poorer countries.

 

About swdevperestroika

High tech industry veteran, avid hacker reluctantly transformed to mgmt consultant.