Data Collection/Analysis and Covid-19: TL;DR Edition
Not to encourage people to skip the longer version, but here's a key point.
Let me emphasize a key point that I really wanted to make in my previous, much longer post. It comes from a FiveThirtyEight piece (Why It’s So Freaking Hard To Make A Good COVID-19 Model) I linked therein:
Numbers aren’t facts. They’re the result of a lot of subjective choices that have to be documented transparently and in detail before you can even begin to consider treating the output as fact. How data is gathered — and whether it is gathered the same way each time — matters.
There’s also the issue of uncollected or inaccurate data. To determine the fatality rate, you have to divide the number of people who have died from the disease by the number of people infected with the disease. In this case, we don’t really have a reliable count for the number of people infected — so, to put it mathematically, we don’t know the denominator. (If we’re being honest, we probably don’t know exactly what the first number — the numerator — is, either, but we’re assuming it’s closer to correct.)
In other words, not only is there a lot we don’t know, but even the number of deaths as currently reported is an artifact of uncertainty.
The real death tally is the reported tally +/- some level of error (which, of course, is true about the annual flu-related death rate, or anything else that requires judgment calls and/or has to account for human error).
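To make the arithmetic concrete, here is a minimal sketch of how the fatality rate moves when the denominator, and then the numerator, is corrected. Every number below is hypothetical and chosen only to keep the division easy to follow; the 1-in-5 detection share and the 20 missed deaths are assumptions for illustration, not estimates of anything real.

```python
def fatality_rate(deaths: float, infections: float) -> float:
    """Return deaths divided by infections, as a fraction."""
    if infections <= 0:
        raise ValueError("infections must be positive")
    return deaths / infections


# Hypothetical figures, for illustration only.
reported_deaths = 100
confirmed_infections = 2_000

# Rate computed from the reported numbers alone.
naive = fatality_rate(reported_deaths, confirmed_infections)

# Assumption: testing found only 1 in 5 infections, so the real
# denominator is five times the confirmed count.
actual_infections = confirmed_infections * 5
denominator_fixed = fatality_rate(reported_deaths, actual_infections)

# Assumption: 20 deaths were also missed, so the numerator grows too.
actual_deaths = reported_deaths + 20
both_fixed = fatality_rate(actual_deaths, actual_infections)

print(f"reported numbers only:    {naive:.1%}")
print(f"corrected denominator:    {denominator_fixed:.1%}")
print(f"corrected both numbers:   {both_fixed:.1%}")
```

With these made-up inputs the rate goes from 5.0% to 1.0% to 1.2%. The point is not the specific values but that an unknown denominator can move the rate far more than a modest error in the death count, even while the death count itself is still too low.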
The question at the moment is: what is more likely in these conditions? An over-count or an under-count?
I would argue that an under-count is more probable, first and foremost because of the lack of adequate testing. Second, this is a new phenomenon (unlike the flu), so there is no established experience with making decisions about how to classify morbidity (which also raises consistency problems in coding deaths). Third, we are placing a lot of stock in instant counts, but in the middle of a crisis we should expect some communication errors and lags.
On that last point, let me note that a lag in communication cannot lead to an over-count; it can only lead to an under-count.
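The logic of the lag point can be sketched in a few lines. This is a toy model, not a description of any real reporting system: the list of death days is invented, and it assumes each death is reported exactly once, some non-negative number of days after it occurs.

```python
from typing import Sequence


def reported_by(day: int, death_days: Sequence[int], lag: int) -> int:
    """Count deaths whose report (death day + lag) has arrived by `day`."""
    return sum(1 for d in death_days if d + lag <= day)


# Hypothetical days on which deaths occurred.
death_days = [1, 2, 2, 3, 4, 4, 4, 5]
today = 4

# With no lag, the tally matches the deaths that have actually occurred.
true_count = reported_by(today, death_days, lag=0)

# With a two-day lag, only deaths from day 2 or earlier have been reported.
lagged_count = reported_by(today, death_days, lag=2)

print(true_count, lagged_count)
```

Under these assumptions the lagged tally (3) can never exceed the true tally (7), because a delay only removes recent deaths from the count. Over-counts would have to come from some other source, such as misclassification or double entry, which this sketch deliberately leaves out.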
Another problem with getting a good sense of the available data is that there are various time-horizons in operation here. California is on one clock, NYC is on its own clock, and Louisiana is on yet another. It is difficult to really assess the effects of various policy choices right now. For example, Florida’s stay-at-home order is only just over a week old as I write this. (And their tests are lagging, as I noted in a link in my previous post.)
Fundamentally, I would argue that the criticisms of the estimates of the death toll assert far too much certainty prematurely, because they are not thinking through either the quality of the data at the moment or its incompleteness.