top of page

A little about our amateur COVID19 model

(Notes by an armchair virologist)

Several people receiving my daily forecasts have asked about the process of estimating the so- called “true cases”. Indeed, how do we know what true COVID19 cases are if testing is partial and random, and none of the available statistics provides us with an answer to a pretty basic question: how many people in a given pool are actually infected? Fortunately, something called “Compartmental Epidemiology Modeling” comes to the rescue and can help even lay modelers like myself.

What follows is a brief (and hopefully not too boring) description of the process.

1. INTRO TO BASIC COMPUTATIONAL EPIDEMIOLOGY (Don’t be alarmed, it’s not that scary).

In devising the forecasting model, I am relying on a basic epidemiology model called SIR (also known as Kermack-McKendrick model).

Briefly, the SIR model suggests that the population is compartmentalized into three groups: Susceptible (“S”) – people who have not yet been infected; Infected (“I”) – people who are currently infected; and Recovered (“R”) – formerly infected people who are now recovered. Together, they represent the entire population, so that:

S + R + I = 100%

There are several fundamental assumptions we must make to keep the model relatively simple.

First, the total population is assumed constant (seems easy enough over short time periods).

Second, recovered patients acquire full immunity and cannot get re-infected. I do have a potential problem with this assumption, since we do not yet know whether SARS-CoV-2 mutates and re-infects formerly recovered people. I will try to address this in the second installment. But for now, let us stick with the simplistic assumption.

The SIR model has a few straight-forward postulates: S (% of susceptible individuals) will decrease with time; R (% of individuals with a resolved disease) will increase with time; I (% of Infected individuals) grows to a certain point before reaching a peak and then starts to decline.

When I speak of “true cases” in my nightly updates, “I” above is that number. One of the adjustments I had to make to the classic SIR model is to include both recoveries and deaths in R. This preserves the condition that R individuals cannot be infected again (either immune or dead). So we are going to call R “Resolved”, rather than “Recovered”.

Each of the variables can be calculated as follows:

dS(t) = -b*I*S dI(t) = b*I*S – g*I dR(t) = g*I

where: dS(t), dI(t) and dR(t) are incremental changes of S, I and R over time t b = rate of disease transmission over time t through secondary exposure g = rate of disease resolution over time t

[note that b and g are really "beta" and "gamma" but I can't figure out how to embed greek characters in this blog :(].


The basic reproduction number, R(0), which we hear about so much lately, is simply a ratio of b/g. Given no external intervention, each virus type has constant R(0) which is empirically derived. For example, common cold has R(0) of 2-3, flu is around 1.8, polio is 5-7, mumps 4-7 and so on.

The model at its most basic, uses the classic SIR model with a constant R(0). Multiple sources estimate that R(0) for SARS-CoV-2 is around 2.2, which is what I am using on day 1 of the simulation. This means that if left unchecked, a single infectious individual will spread the virus to an average 2.2 people over the course of his/her infection.

The SIR model conveniently provides a calculation of the % population unaffected by the disease by the time the epidemic has completely run its course. The formula is below:

S = exp(-R(0)*(1-S))

This formula does not have a closed form solution but can be easily iterated to find the value of S. Thus, for value of R0 equal 2.2, S is 15.7%, meaning that 84.3% of the population will become carriers of SARS-CoV-2.

Now we can construct the hypothetical picture of a life of the SARS-CoV-2 epidemic from start to finish, given no external intervention:

Assuming 10% of the carriers eventually develop COVID19, that is 8.4% of the total population, or 28 million in the US. With COVID19 mortality rates at 1-2%, this would be between 2.8 and 5.6 million deaths in the US. Obviously, a terrible outcome which explains the efforts to artificially reduce R0. In the ideal world (or in South Korea), when everyone is perfectly disciplined and self-isolates, R0 rapidly falls to below 1, and the epidemic ends.

There are only three ways to alter the course of an epidemic: 1) Increase g by reducing time to recovery through effective treatment; 2) Decrease the timing of the cycle by rapidly increasing b through herd immunity while reducing the viral load, typically by vaccinations; 3) Reduce b by increasing the average time between social contacts (aka social isolation). It’s clear that the world is following the third option until the first two become available.


Obviously an estimation model needs reliable inputs to avoid the GIGO problem. My early version of the model used confirmed cases as input, but I had to dismiss this data as pretty random.

We can rely, however, on a relative accuracy of resolved cases (recoveries + deaths), or dR(t), and the relationship between dR(t) and confirmed cases. Of course, lack of reliable data on comortality (e.g. how many deaths were caused by multiple underlying conditions rather than solely COVID-19) adds another layer of randomness to this exercise.

Knowing dR(t) can help estimate average time for resolution, i.e. how many days ago did we have the number of confirmed cases which eventually completely resolved to dR today. In absence of interventional treatment, time to resolution is relatively stable. Empirically this number converges to 8.5 days. Add to that time lag from when the symptoms appear to when the test was performed, and we get somewhere in the 10-12 day range.

Now to the other part of the equation, the rate of infection, which can be implied from the average time during which a susceptible individual may come in contact with an infected individual and also becomes infected. We’ll call this number T(i), and it is measured in days. We can back into T(i) from an observed R(0). R(0) of 2.2 implies that T(i) is roughly 4.5 days (e.g. a healthy susceptible individual has a 1 in 4.5 chances to get infected on any given day).

T(i) is a critical variable and has dramatic impact on the overall course of the epidemic. Increasing T(i) by only 2 days lowers the peak of infection from 20% to 10% of total population, and aggregate I over the life of the epidemic from 84% to 60% of total population. Doubling T(i) to 9 days lowers the peak to about 3% and implies max 11% infected until the virus completes its cycle. Even though 11%, or 35 million seems like a lot, this is actually an acceptable number. This spreads patient load over a much longer time, reduces the burden on the health care system, significantly reduces mortality, and buys us time to develop the other two solutions – treatment or vaccine, or both. This is what is meant by “flattening the curve” that everyone is talking about.

Three scenarios, all assuming absence of treatment or vaccine.

Another unknown but possibly positive effect is seasonality, where we expect that warm weather reduces virus propagation (uncertain but possible). Pushing the apex out by 30-60 days will get us into the summer and may help reduce the peak even further.


The challenge for a modeler is to keep estimating the ever-changing value of R(0) – as we implement social isolation measures, R(0) rapidly declines, so that the infection curve is not static, but is constantly changing shape.

So this is what our model does. We attempt to estimate the dynamics of R(0) on a daily basis which can be roughly done by inferring dI(t-2) and dI(t-1). To calibrate the model, I used statistics from countries with the largest % tested and the most stringent social isolation procedures at the time, and then applied to countries which were behind. Of course, the path of R(0) from 2.2 to below 1 is not linear, so I had to combine both short-term linear and long-term polynomial interpolations to come up with the next day’s estimate.

On any given day, R(0) is different for every country (and for large countries, different for every region). It depends on the time since inception, specific policies to combat propagation, and compliance of general population. So R(0) is calculated separately for each country. Once R(0) is estimated, a forecast of I(t+1) can be made. That is a forecast of “true” cases. Because there is no way to empirically check the accuracy of this number, we back into the assumed confirmed cases tomorrow by interpolation.


The main indicator that things are moving in the right direction is the change in the shape of the R(0) curve, as monitored by the second derivative. R(0) declines slowly at first (this is the stage where the infections grow exponentially), and then picks up the speed as it approaches 1. At some point, the decline becomes linear. That is the signal that the second phase, what we call “flat top”, is near and true daily cases level off. Before R(0) hits 1, its descent begins to slow down. That second kink in the R(0) curve indicates the end of the “flat top”, and daily cases and total active cases gradually begin to decrease as recoveries go up.


There is no way to check the accuracy of the forecast of I because this value is not observable. The only data we have is confirmed cases, a notoriously unreliable statistics. So the model performs a simple interpolation trick, looking at past discrepancies between I and confirmed cases over different time periods and interpolating this error forward. This exercise bears no practical use other than to perform a sanity check for the overall model.

Having carried daily calculations all the way to the apex in every modeled country, I am now satisfied that this process gets us what we need – a general understanding of important stages of the epidemics, specifically the timing and magnitude of the top and the bottom of the infection curve. The simplified SIR model does not include complications such as temporary immunity, variable contact rates and an ugly thing called pluriannual epidemic. As the name indicates, this last one is about an epidemic rearing its ugly head every season, simply because there are still some infected people left at the bottom of each cycle.

Here's the overlay of the implied "true daily cases" line vs. reported confirmed cases in the US, shown as % of the total population. The confirmed cases are shown on a logarithmic scale on the right:

I am now extending the model to address this last effect, as it seems likely that mass vaccination will not come this year and thus we should expect another wave of infections after the current cycle ends. In my next installment, I will describe this iteration and, hopefully, get a glimpse of what this will look like for us in the 2020-21 season.

Featured Posts
Check back soon
Once posts are published, you’ll see them here.
Recent Posts
Search By Tags
Follow Us
  • LinkedIn Social Icon
  • Twitter Basic Square
bottom of page