Bayesian Inference – Corona (Technical)

[This is a technical post. Those of you mainly interested in the daily Corona numbers may want to skip this]

Now, after about a month of basically full time Data Analysis of the Corona statistics, perhaps it’s a good point to write a bit more technical post on my main method, Bayesian Inference, that I use in my analysis of the data.

As an example of Bayesian Inference, let’s tackle the following problem/question (inspiration to this example coming from this book):

“For each individual country, is there a detectable “switchpoint” in the data?” That is, is there a specific date, when a major change in the trend occurs ?

For example, let’s take the number of confirmed cases of infection per day: We’ve all seen that once the virus spread starts, that number tends to grow, day by day.  But what we here are interested in, is if there is any point in time, when that growth makes a distinctive change, in direction or magnitude.

How could we go about and detect such a change ? Well, first step could be doing an occular inspection of the data, that is, looking at the data or graphs, and trying to see if there is a significant change somewhere along the timeline.  That might work, if there is a change large enough, but in most cases, it is difficult to detect a change by visual observation only.

Instead, what we can do is to instruct our Bayesian Inference Engine to look for such changes, and let the Bayesian machinery come up with a probability distribution for a significant change occuring on any particular day.

In order to do so, we assume that there is some probability  that determines the outcome of value of the parameter we are interested in, in our case, the number of infected per day. For instance: “the probability p for number of infected per day being 100 is …? “.

The problem is that we don’t know what that probability is.  But we can assume that there is some underlying probability distribution, and let the Bayesian Inference Engine, using Markov Chain Monte Carlo, figure out what the most likely value for p then is.

So, in the code  snippet below, I’ve chosen DiscreteUniform probability distribution for the switchpoint, Exponential probability distribution for generation of the parameter values we are tracking, that is, the number of confirmed per day, and I’ve used a Normal probability distribution for fitting the samples to the actual COVID-19 data:

def switchpoint(data,country):
    
    
    alpha = 1 / pm.Normal('alpha',mu=data.mean(),tau = 1 / data.std() ** 2)

    try:
        lambda_1 = pm.Exponential('lambda_1',alpha)
    except:
        print ('cant assign priors to {}'.format(country))
        return None
    
    lambda_2 = pm.Exponential('lambda_2',alpha)

    obs_sigma = pm.Normal('obs_sigma',data.mean(),1 / data.std() ** 2)

    tau = pm.DiscreteUniform('tau',lower=0,upper=data.size) #switchpoint-day

    @pm.deterministic()
    def lambda_ (tau=tau,lambda_1=lambda_1,lambda_2=lambda_2):
        out = np.zeros(data.size)
        out[:tau] = lambda_1 # param (eg inc) value before switch
        out[tau:] = lambda_2 # param value after switch
        return out


    obs = pm.Normal('obs',lambda_,obs_sigma,observed=True,value=data)

    model = pm.Model([lambda_1,lambda_2,lambda_,tau,obs,alpha,obs_sigma])

    try:
        map_ = pm.MAP(model)
        map_.fit()
    
    except:
        print ('cant fit {}'.format(country_name))
    
    mcmc = pm.MCMC(model)
    mcmc.sample(100000,40000,2)

    lambda_1_post = mcmc.trace('lambda_1')[:]
    lambda_2_post =  mcmc.trace('lambda_2')[:]
    tau_post = mcmc.trace('tau')[:]

    result = pd.DataFrame({'lambda_1_post' : lambda_1_post,
                      'lambda_2_post': lambda_2_post,
                      'tau_post': tau_post})

    print ()
    print (result.describe())
    print (result.head())

    return result



Running this code on the current Johns Hopkins CSSE dataset for a subset of the countries gives some interesting, and perhaps somewhat surprising results:

Let’s compare just to countries, US vs South Korea, and start by looking at the US:

Corona_switchpoint_params_US_

The graph above shows the λ-posterior distributions before the switchpoint we are trying to identify (green) and after (red). On the x-axis, we see the daily number of infected, on the y-axis, the probability density.  It’s obvious from this graph that there indeed is a switchpoint involved.

Let’s see if we can figure out on what day(s) such a switch occured:

Corona_switchpoint_EV_US_

These two plots share the same x-axis, which shows the number of days passed since the number of infected (‘confirmed’) exceeded 500.  The upper histogram shows the probability for a switchpoint occuring on any particular day, here we can see that around day 15, there is a high probability for a switch to have occured; that is, the Inference Engine has found that onwards from that point in time, the second exponential distribution, ‘Lambda_2’, illustrated in red on the top graph, provides a better fit than the initial,  green ‘Lambda_1″ distribution.

The bottom graph shows the actual data points for US in blue, and the Expected Value for the data, before and after the Switchpoint.

Basically, on day 15 after the number of confirmed in US hit 500, the number of daily confirmed accelerated quickly (which is to be expected, since we are dealing with exponential growth).

The interesting thing is to compare US with South Korea:

Corona_switchpoint_params_Korea, South_

Already here, looking at the posteriors for λ, we should notice something interesting – the order of colors is now switched!  That is, the green distribution is now to the right of the red one.

Let’s look at the next graph:

Corona_switchpoint_EV_Korea, South_inc_

On the top graph, we can see that the model thinks that there is a high probability for a switchpoint occuring around day 15 here, for South Korea as well, but look at the bottom graph with the actual data:

While for US (and all the other 10+ countries I’ve looked at)  what happened at the switchpoint was that the number of infected accelerated, for South Korea, it instead slowed down!  Pretty much on the same day of the process timeline!

I find this quite interesting, and thus far, the only country I’ve found in the data that exhibits this behavior, is South Korea.

So, South Korea apparently did something before or at day 15, that resulted in a very observable drop of the number of confirmed cases.  I don’t have enough info to take a quess on what the did, so I leave it to you guys to figure that one out.

 

About swdevperestroika

High tech industry veteran, avid hacker reluctantly transformed to mgmt consultant.
This entry was posted in Bayes, Data Analytics, Epidemics, Probability, PYMC, Statistics and tagged , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s