COVID-19 Visual Data Science #3 - A Primer for Analytics in the Back-to-Work Landscape - Spotfire

Follow along with this blog in Spotfire, and for live updates

Live Spotfire application available here

This blog and Spotfire application are authored by the Spotfire Data Science team

Contact: Michael O'Connell, @MichOConnell

May 28, 2020

Introduction

We are now in the middle of the first wave of worldwide (WW) COVID-19 regional outbreaks. WW confirmed cases have topped 5.5 million with more than 350 thousand deaths. New case additions have been arriving at ~80,000/day for the past month and deaths at ~3,000/day, with steady decline from a peak of ~6,000/day in mid-April (Figures 1, 8).

In the face of this, some countries and US states are starting to reopen, while the cases and fatalities continue to accrue. This poses some interesting data science problems such as how best to reopen country and state jurisdictions, stores and businesses, in a way that gets the economy moving while mitigating risks from reopening.

Retail stores have been hit hard, but tech-savvy companies are using this situation to study the interactions between ecommerce and instore channels - where is there cannibalization and where is there a halo effect. How will stores with similar product mix and historical traffic patterns, but different COVID-19 case status, fare upon reopening? How is media consumption changing? It's like a hurricane has landed on a crop science field trial - which traits are still standing? "The retailers that were already doing it successfully are the ones that are going to recover much more quickly," says Kimberly Becker, senior research director with Gartner. And in a future when shoppers are likely to be skittish about visiting stores, only the tech-savvy chains will survive.

A notable sideshow to all of this is the preponderance of commentators who think they are data scientists or epidemiologists, whose stance is often rooted in a political position, and who select data to suit their purpose. This is particularly dangerous in COVID-19 land: data are spotty due to reporting errors and artifacts, and analytics are nuanced due to the virus lifecycle, with time delays between infection, symptoms, cases and fatalities, and the fact that many infected individuals show minimal or no symptoms at all.

Indeed, the pandemic has highlighted the need for sound data science, visual analytics and data management methods, and the infusion of these skills and literacy into broader groups of users - in companies and the population at large.

In this context, Carlie Idoine, senior research director with Gartner, writes about AI-enabled analytics that draws Analytics, BI and Data Science together. "Organizations that seize the opportunities presented by this newly catalyzed market will be able to dramatically hasten their analytics-related maturation to make competitive breakthroughs in comparison to slower-maturing rivals." We expand this theme in section 8 below.

We are certainly seeing the need and opportunities for data management, analytics, BI and data science in this COVID-19 landscape. Our Spotfire Live Report is garnering interest in a number of industries - healthcare, pharmaceutical, consumer goods, retail and insurance. In all of these businesses, data scientists, business analysts and broad swaths of citizen and casual users from many functional areas, are getting involved. Data science and analytics have become a focal point for business leaders to find out what's going on right now, how to best tune the business in this new order of demand and supply, and how to forecast what lies ahead.

Our Spotfire Live Report Application includes

Ingesting data from many data sources worldwide; managing these data and providing data services to feed our analytics applications.
Modeling Reproduction number and effects of interventions (Re/Rt) with embedded Spotfire Data Science routines that call R-language data functions
Tracking case and fatality trajectories and case counts using supersmoother fits on the epidemic curves to remove reporting artifacts
Supersmoother-derived statistics such as trajectory velocities to highlight local hotspots
GeoSpatial Analyses: map layers, cartograms, choropleths, heatmaps and polygon analyses at county, region and country level

This paper provides an update on these analyses and some details on our modeling, simulation and analytics methodologies, in the context of the pandemic and reopening initiatives. These analyses are presented in our Spotfire Live Report visual analytics app in a publicly available cloud environment. The Spotfire app uses data from a number of public health department sources that we ingest and refresh regularly, depending on the data sources. Spotfire curated data and code are available for download from the Spotfire Data Science Community . The Spotfire COVID19 site provides a consolidated view of the work.

Figure 1 shows the Spotfire application Global Overview.

Figure 1. Spotfire Live Report Application - Global Overview as of 27 May 2020. Shows worldwide cases, fatalities, recoveries and country-level stats. Includes slider for stepping through time by date. Note the daily case counts lower right, along with supersmoother curve fit that takes out artifacts in regional case reporting. The smaller graphs can be expanded and compressed by clicking the top right corner of that graph.

Reproduction number modeling and effects of interventions

We now have clear evidence that the social distancing interventions have dramatically lowered the spread of local epidemics. See O'Connell (26 March - April 4) , for an outline of our analysis methodology, epidemiology modeling basics and estimation of the effective reproduction number (Re) over time (Rt). In summary :-

The reproduction number R0 (pronounced R-nought) is the average number of people infected by one infected person, without any interventions in place. This is a crucial parameter in describing an epidemic. The effective reproduction number Re includes intervention effects. If Re is bigger than 1, the disease spreads. Conversely, if Re, or the time-varying reproduction number Rt, can be pushed below 1 and held there, the disease can be contained. For COVID-19, R0 has been widely reported to be in the range 2-3.

Delamater et al. describe R0, its use and misuse. Inglesby provides a clear and simple explanation of Re, and Pan et al. show the effects of non-pharmaceutical interventions (NPIs), including social distancing, on Rt in a study of the outbreak in Wuhan (Figure 2). Results from these and other early studies guided the implementation of social distancing around the world.

Figure 2. From Inglesby and Pan et al. The effective reproduction number Rt calculated as a 5-day moving average since January 1, 2020. The horizontal line indicates Rt=1, below which sustained transmission is unlikely so long as anti-transmission measures are sustained, indicating that the outbreak is under control.

We have been estimating the time-varying reproduction number Rt in our live interactive Spotfire Live Report Application, across the US and worldwide, since early March.

Figure 3 shows some results of Rt modeling for European countries, and Figure 4 some results of Rt modeling of different US states. Models are fit using the package EpiEstim . This package is included in the online Spotfire Live Report App, and can be added to the Spotfire client via the Tools menu, and configured to run via a Spotfire data function. User-selected markings on maps and other visuals invoke the Rt estimates to run interactively, in context of exploratory visual data analysis.

EpiEstim (Cori et al, 2019) analyzes time series incidence data to estimate time-varying reproduction numbers as outlined in Cori et al 2013. EpiEstim incorporates uncertainty in the distribution of the serial interval - the time between the onset of symptoms in a primary case and the onset of symptoms in secondary cases.

There are five estimation methods in EpiEstim; these vary in the way the serial interval distribution is specified. In the first two methods, a unique serial interval distribution is considered, whereas in the last three, a range of serial interval distributions are integrated over:-

"non_parametric_si": the user directly specifies the discrete distribution of the serial interval
"parametric_si": the user specifies the mean and sd of the serial interval
"uncertain_si": the mean and sd of the serial interval are each drawn from truncated normal distributions, with parameters specified by the user
"si_from_data": the serial interval distribution is directly estimated, using Markov Chain Monte Carlo, from interval censored exposure data, with data provided by the user together with a choice of parametric distribution for the serial interval
"si_from_sample": the user directly provides the sample of serial interval distributions to use for estimation of R

Zhanwei et al. (CDC EID) estimate the distribution of serial intervals for 468 confirmed cases of COVID-19 reported in China as of February 8, 2020. They found a mean interval of 3.96 days (95% CI 3.53-4.39 days), with an SD of 4.75 days (95% CI 4.46-5.07 days).

We have been exploring all the above methods, following the logic and approach set out by Churches. Our live Spotfire application currently uses a serial interval (SI) with mean 4.7 days and standard deviation 2.9 days. Churches' reasoning for these values is that they better account for transmission before the onset of symptoms, which results in shorter serial intervals than expected, possibly even shorter than the incubation period. We have also explored approaches by Abbott et al., including application to estimates of Rt for US states. We are exposing these parameters to the R functions in Spotfire with ranges (3.7, 6.0) for the mean serial interval and (1.9, 4.9) for the standard deviation. We use a window length of 7 days.
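To make the estimation concrete, here is a minimal Python sketch of the sliding-window Rt calculation described in Cori et al 2013, using the serial interval values quoted above (mean 4.7 days, SD 2.9 days) and a 7-day window. It is not the EpiEstim code and not our Spotfire data function: the function names, the simplified discretization of the serial interval, and the gamma prior (shape 1, scale 5) are assumptions made for illustration only.

```python
import math


def discretized_gamma_si(mean, sd, max_days=30):
    """Approximate serial-interval weights for days 1..max_days.

    Simplification (assumption): the gamma density is evaluated at whole
    days and renormalized. EpiEstim uses a more careful discretization.
    """
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    log_norm = math.lgamma(shape) + shape * math.log(scale)
    weights = [
        math.exp((shape - 1) * math.log(day) - day / scale - log_norm)
        for day in range(1, max_days + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def estimate_rt(incidence, si_weights, window=7, prior_shape=1.0, prior_scale=5.0):
    """Posterior mean of Rt for each day, keyed by day index.

    With a gamma prior on Rt, the posterior mean over a window is
    (prior_shape + cases in window) / (1 / prior_scale + pressure in window).
    """
    # Infection pressure on day t: past incidence weighted by the serial interval.
    pressure = []
    for t in range(len(incidence)):
        pressure.append(sum(
            incidence[t - s] * si_weights[s - 1]
            for s in range(1, min(t, len(si_weights)) + 1)
        ))

    estimates = {}
    for t in range(window, len(incidence)):
        days = range(t - window + 1, t + 1)
        cases = sum(incidence[d] for d in days)
        exposure = sum(pressure[d] for d in days)
        if exposure > 0:
            estimates[t] = (prior_shape + cases) / (1.0 / prior_scale + exposure)
    return estimates


# Hypothetical daily case counts growing by 10% per day.
si = discretized_gamma_si(mean=4.7, sd=2.9)
rt = estimate_rt([int(10 * 1.1 ** t) for t in range(40)], si)
print(round(rt[39], 2))
```

A flat incidence series gives estimates close to 1, and a growing series gives estimates above 1, which is the behavior to look for in the Rt curves in Figures 3-5. The credible intervals shown in those figures come from the same gamma posterior and are omitted here for brevity.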

Figure 3. Rt modeling of WW countries as of May 24. The green colored band shows Rt < 1.0 (green). The dark line is the Rt estimate and the gray lines are 95% credible intervals. The models use the R package EpiEstim invoked through a Spotfire data function.

Figure 4 shows similar Rt modeling results for US states, and Figure 5 for UK counties.

Figure 4. Rt modeling of US states as of May 24. The green colored band shows Rt < 1.0 (green). The dark line is the Rt estimate and the gray lines are 95% credible intervals. The models use the R package EpiEstim invoked through a Spotfire data function.

Figure 5. Rt modeling of UK counties as of May 24. The green colored band shows Rt < 1.0 (green). The dark line is the Rt estimate and the gray lines are 95% credible intervals. The models use the R package EpiEstim invoked through a Spotfire data function.

These Rt estimates clearly show significant reductions as a result of the social distancing intervention measures around the world. This is amplified by our analysis of COVID-19 cumulative case and fatality trajectories shown in Figures 8-10 below. In particular, note the countries where the spread is still growing - including the US, Brazil, Russia, India, Sweden, Singapore and Japan. In Sweden, social distancing was only partly implemented, and in the UK and the US it was implemented slowly and varied among states and counties. In Singapore, it was implemented early, then relaxed, and then implemented again as cases had a second wave. In Russia, Brazil and India, the initial outbreaks arrived later.

The serial interval is an important parameter in the estimation of Re/Rt. The time sequence of virus and human host states is outlined in Figure 6. This shows a number of epidemiology parameters :-

The Latent Period is the time between the occurrence of infection and the onset of infectiousness (when the infected individual becomes infectious).
The Serial Interval is the duration of time between the onset of symptoms in a primary case and the onset of symptoms in a secondary case infected by the primary case.
The Incubation Period is the time between the occurrence of infection (or transmission) and the onset of disease symptoms.

Figure 6. Infection and transmission timeline of COVID-19. Based on supplement to: Anderson et al.. Lancet 2020.

Worldwide COVID-19 Cases, Fatalities and Trajectories

As outlined above, the non-pharmaceutical interventions (social distancing) have had a significant effect on the spread of SARS-CoV-2. These effects have varied widely over regions, depending on the timing of the interventions.

Growth in new cases and fatalities has slowed across the world, and daily counts have settled into a plateau.

Figure 7 shows COVID-19 case and fatality counts and supersmoother fits for global data. Note the arrows in top right corner of graphs - click this to expand the view from Figure 1 above.

The WW epidemic curves currently show a flat trajectory where case and fatality counts have been fairly consistent day to day throughout April.

Figure 7. Worldwide cases and fatalities daily counts, with supersmoother fit. As of May 3, WW confirmed cases have topped 3 million; new case additions are arriving at ~80,000/day and deaths at ~4,000/day, with steady decline from a peak of ~6,000/day in mid-April

The epidemic curves from individual countries vary widely. Figure 8 shows cumulative case trajectories for countries where the epidemic spread is still well underway - including Brazil, India, Russia and the US. Figure 9 shows cumulative case trajectories for countries where the spread has been contained. The graphs are automatically annotated with Natural Language via integration with Arria NLG.

For case trajectories, the y-axis is the cumulative number of confirmed cases, on the log scale, and the x-axis is the time in days after the first <100> confirmed cases. The dashed lines are at slopes representing 1-day, 2-day, 3-day, 5-day and 7-day doubling.
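The dashed doubling lines can be reproduced with a couple of lines of arithmetic. The sketch below is illustrative only (the function names are ours, not part of the Spotfire application): a reference line starting at 100 cases that doubles every d days follows 100 * 2^(t/d), and the doubling time implied by two points on a cumulative curve follows from the log ratio of the counts.

```python
import math


def doubling_reference(start_count, doubling_days, day):
    """Height of a reference line that doubles every `doubling_days` days."""
    return start_count * 2 ** (day / doubling_days)


def doubling_time(count_start, count_end, days_elapsed):
    """Doubling time in days implied by growth between two cumulative counts."""
    return days_elapsed * math.log(2) / math.log(count_end / count_start)


# A 3-day doubling line starting at 100 cases reaches 400 cases on day 6.
print(doubling_reference(100, 3, 6))
# Growth from 100 to 800 cases over 9 days corresponds to 3-day doubling.
print(doubling_time(100, 800, 9))
```

On a log-scale y-axis these reference lines are straight, which is why the slope of a country's curve can be read directly against them.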

Note that we use raw and cumulative cases rather than normalizing by total population. Normalized numbers are good at showing *relatively* how much strain a region is under, but they're not suited to tracking the extent/state of a country's outbreak, which spreads at approximately the same pace regardless of country size. Advantages of this presentation of raw counts include:

slopes (of tangents to the curves) reflect growth rates
heights reflect prevalence, per reported cases and fatalities
relative timing is maintained across the regions on the graphs

If we were to divide the counts by population, we would need to align the curves by percent of population infected rather than by number of cases, so that regions would start at [first <1%> of population reported infected] rather than [first <100> cases].

Also note that cases are a function of the number of tests performed; this varies considerably by country. As such, the number of confirmed cases should not be interpreted as reflective of actual infections.

Figure 8. COVID-19 cumulative case and fatality trajectories by countries where the spread is still growing. This includes US, Brazil, Russia, India, Sweden, Singapore and Japan. Other countries that have flattened (China, South Korea) are included for comparison. The y-axis is the number of confirmed cases or fatalities (log scale), and the x-axis is the number of days after the first <100> confirmed cases or first <10> confirmed fatalities. The <100> and <10> thresholds align the curves to a common starting point in the epidemic outbreaks, and are configurable in the Spotfire application. The dashed lines indicate various doubling rates in days.

Figure 9. COVID-19 cumulative case and fatality trajectories by countries where the spread is contained. This includes Germany, China, Netherlands, Switzerland, Israel, Austria, South Korea, Norway, Australia, Malaysia, Thailand, Greece, New Zealand, Taiwan and Vietnam. The y-axis is the number of confirmed cases or fatalities (log scale), and the x-axis is the number of days after the first <100> confirmed cases, or first <10> fatalities. Note that the cumulative cases have flattened more than the fatalities; this is due to deaths occurring some days/weeks after cases are confirmed.

Figure 10 shows cases and fatalities daily counts, with supersmoother fit, for select countries where the epidemic is spreading, and where it is contained. Countries still showing growth or flat trajectories in cases include Brazil, India, Sweden, UK, US. Countries showing containment in new cases include Italy, Germany. Note how the supersmoother adjusts for reporting artifacts, including inconsistent reporting on weekends v weekdays.

As noted above, social distancing was only partly implemented in Sweden, and in the UK and the US it was implemented slowly and varied among states and counties. In Singapore, it was implemented early, then relaxed, and then implemented again as cases had a second wave. In Russia, Brazil and India, the initial outbreaks arrived later.

Figure 10. Cases and fatalities daily counts, with supersmoother fit, for select countries where the epidemic is spreading. In order from top and from growth to containment: Growth: Brazil, India, Sweden, US, UK; Containment: Italy, Germany. Note how the supersmoother adjusts for reporting artifacts, including inconsistent reporting on weekends v weekdays.

US COVID-19 Cases, Fatalities and Testing: Re-opening Scenarios

As outlined above, worldwide and US cases have been in a plateau for most of April and into May, where daily case counts have been fairly similar. Approximately 80,000 cases and 4,000 deaths (down from a peak of 6,000 in mid-April) are being added per day worldwide, and ~20,000 cases (down from a peak of ~30,000 in early April) and 1,000 deaths (down from a peak of ~1,900 in early April) are being added per day in the US.

The ~20,000 new cases per day corresponds to an average Re of ~1 across the US. This gradual decline / plateau at a national level results from some states with rising case counts (and Rt greater than 1) and other states with falling case counts (and Rt less than 1). As we move forward into summer, transmission rate will be affected by:

Reopening efforts to get the economy moving
Better hygiene practice eg wearing of masks
Possible impact of seasonality

Most likely we are looking at a to and fro: local jurisdiction orders to reopen, along with people's changing behavior and perception of decreasing risk, result in increasing local case counts; these are followed by cycles of increased social distancing as fear of infection returns. This is occurring now in local regions: regions that are recovering from being hard hit, e.g. Michigan, and new regional outbreaks in some southeast and midwest states.

Note that even if infections were to continue at say ~300K per day, the US would have ~50M cumulative infections by September 1 and be at ~15% population immunity. This is still far short of herd immunity, which would require between 50% (for R0=2) and 66% (for R0=3) of the population to have recovered to prevent further epidemic spread (Bedford, April 7).
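The 50% and 66% figures follow from the standard herd immunity threshold, 1 - 1/R0: the immune fraction at which each infection leads to one new infection on average. A minimal sketch, assuming homogeneous mixing, complete immunity after recovery, and a US population of roughly 330 million (our assumption, not a figure from the sources above):

```python
def herd_immunity_threshold(r0):
    """Immune fraction at which Re = R0 * (1 - immune fraction) falls to 1."""
    if r0 <= 1:
        return 0.0  # an epidemic with R0 <= 1 does not sustain itself
    return 1.0 - 1.0 / r0


US_POPULATION = 330e6            # assumed round figure
cumulative_infections = 50e6     # scenario quoted in the text

immune_fraction = cumulative_infections / US_POPULATION
print(f"immune fraction: {immune_fraction:.0%}")
for r0 in (2, 3):
    print(f"R0={r0}: threshold {herd_immunity_threshold(r0):.0%}")
```

The gap between the ~15% immune fraction and the 50-67% threshold is the point of the paragraph above.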

Recent data released by Apple and Google on community mobility (Figure 11) show the implementation of social distancing. Coupled with the Rt estimates we have been providing over this period, the effects of social distancing have clearly reduced Re significantly across the world.

However, note the recent uptick in Retail (purple) and Parks (green) mobility.

Figure 11. Mobility trends of Google Android devices, as of May 24.

Figure 12 shows US cases by county as a map, with sorted bar chart counts and daily case counts with supersmoother curve fit. The map enables hotspot detection via localized contour and heatmap calculations via interactive markings (See Figure 15 below).

Figure 13 shows case velocities by US county, using first derivatives of supersmoother curves; these are shown along with the map. Individual velocity curves may be obtained via interactive marking (combinations of counties) on the map. Note the high velocities in some parts of Minnesota and Nebraska, and some southeast and midwest counties.

Figure 14 shows cases by county, normalized by population (cases per 1M population). This is particularly useful for identifying case trends in low population areas. For example, note the high density of cases in the four corners area, reflecting an infection going through the Navajo tribe in that area.

Figure 12. Cases by county, with sorted bar chart counts and daily case counts from supersmoother curve fit. The smaller graphs can be expanded and compressed by clicking the top right corner of that graph. The map enables hotspot detection via localized contour and heatmap calculations via interactive markings (See Figure 13 below)

Figure 13. Case velocities by county using first derivatives of supersmoother curves. These are shown on the right, along with velocity curves over time by county. Velocity curves v time for regions may be obtained via interactive marking (combinations of counties) on the map. Note the high velocities in parts of Minnesota, Nebraska and some midwest and southeastern counties.

Figure 14. Cases by county, normalized by population (cases per 1M population). This is particularly useful for identifying case trends in low population areas. For example, note the high density of cases in the four corners area, reflecting an infection going through the Navajo tribe in that area.
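The idea behind the smoothed daily counts and the velocities in Figures 12 and 13 can be sketched in a few lines of Python. Note the substitution: our application uses a supersmoother fit, whereas the sketch below uses a plain 7-day centered moving average as a stand-in, because it is short and removes the same weekday/weekend reporting pattern. The function names and the example counts are hypothetical.

```python
def moving_average(values, window=7):
    """Centered moving average; the window shrinks near the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def velocity(curve):
    """First derivative (cases per day per day) by central differences."""
    n = len(curve)
    if n < 2:
        return [0.0] * n
    out = []
    for i in range(n):
        lo, hi = max(0, i - 1), min(n - 1, i + 1)
        out.append((curve[hi] - curve[lo]) / (hi - lo))
    return out


# Hypothetical county: a trend of +10 cases/day with under-reporting on two days a week.
daily = [100 + 10 * d + (-30 if d % 7 in (5, 6) else 12) for d in range(28)]
smooth = moving_average(daily)
print(velocity(smooth)[10])
```

Away from the ends of the series the weekly artifact averages out, so the velocity recovers the underlying trend of 10 additional cases per day.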

Testing

Testing is critical in the path to recovery and reopening. There are 2 main types of tests:

Tests for presence of virus; these aim to establish whether an individual is currently infected. The most common and best test is the PCR test. This uses DNA amplification technology and can detect as low as 10 copies of the virus in the sample.
Tests for presence of antibodies; these aim to establish whether an individual has been infected some time in the past.

The Spotfire Live Report currently analyzes test data provided by the COVID Tracking Project . As of May 24, the COVID Tracking Project had logged 14,604,942 tests with 1,654,829 positive cases and 92,464 deaths.

The COVID Tracking Project reports all data provided by the states, as outlined on https://covidtracking.com/data . Our World in Data on the other hand, restricts test data to RT-PCR tests as outlined at https://ourworldindata.org/coronavirus-testing

RT-PCR tests have good properties :-

SENSITIVITY measures how well we can detect patients with disease. Imagine a nasal swab from someone with SARS-CoV-2 infection. Sensitivity is the probability this sample will test positive.
With RT-PCR, the specimen is declared positive if viral RNA is detected. Patients with COVID19 often have high viral loads in their throats. Thus it is relatively easy to detect virus, and the test is typically highly sensitive.
When sensitivity is low, we get FALSE NEGATIVES. These are swabs from which no virus is detected even though the person is infected. This could occur if the specimen was degraded, virus didn't amplify well, etc.
SPECIFICITY measures how well we rule out infection for people who are tested but aren't infected. Imagine a nasal swab from someone without SARS-CoV-2 infection. This should come back negative on RT-PCR. Specificity measures how often these come back negative.
Because RT-PCR measures viral RNA, it is rare for someone to test positive if they aren't infected. It can happen, though e.g. if there is random contamination in the lab. In general, RT-PCR is usually highly specific and there are few FALSE POSITIVES.
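As a worked example of these two definitions, the sketch below computes sensitivity and specificity from a table of test outcomes. The counts are made up for illustration and do not describe any particular RT-PCR assay.

```python
def sensitivity(true_positives, false_negatives):
    """Share of infected people whose sample tests positive."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Share of uninfected people whose sample tests negative."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical validation set: 100 infected and 1,000 uninfected people.
print(sensitivity(true_positives=95, false_negatives=5))
print(specificity(true_negatives=990, false_positives=10))
```

In this made-up example, 5 of 100 infected people are missed (false negatives) and 10 of 1,000 uninfected people are flagged (false positives).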

Figure 15 shows test data from Colorado. The graph on bottom left shows new confirmed cases in dark blue, deaths in light blue and recoveries in light blue. The graph on bottom right shows negative test results in light blue bars, positive results in dark blue bars and positive test percentage as dark line. Note the separate scales on left and right axes.

Figure 15. Test data from Colorado. Selectors for total cases, case growth rate, cases/100k people, case velocity and Re are included. The graph on bottom left shows new confirmed cases in dark blue, deaths in light blue and recoveries in light blue. The graph on bottom right shows negative test results in light blue bars, positive results in dark blue bars and positive test percentage as dark line. Note the separate scales on left and right axes.

There is much current interest in in-home tests as a way of conducting wider surveillance and assisting in back-to-work programs. Since April 21, the FDA has approved in-home tests from LabCorp, Everlywell and Rutgers University in New Jersey. The FDA also, surprisingly, shut down the SCAN program (Seattle Coronavirus Assessment Network), as reported by Maxmen in Nature.

Case Fatality Rate and Infection Fatality Rate

There is much debate about COVID-19 fatality rates. When considering COVID-19 fatality rates it is important to distinguish :-

The risk of dying from COVID-19, among people who are diagnosed with COVID-19 (Case Fatality Rate, CFR)
The risk of dying from COVID-19, among people who get it (Infection Fatality Rate, IFR).
The risk of dying from COVID-19, among people who do not currently have it (Population Fatality Rate, PFR).

The best way to calculate CFR would be to track a large group of people from the point when they develop symptoms until they later die or recover, and to then calculate the proportion of all these cases who had died. This is not possible in the real world. It is incorrect to just divide the total number of deaths by total number of cases as this does not account for unreported cases or the delay from illness to death.
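To see how much the naive calculation depends on case ascertainment, the sketch below applies it to the COVID Tracking Project totals quoted in the Testing section (92,464 deaths and 1,654,829 positive cases as of May 24), and then rescales it under a purely hypothetical assumption that only 1 in 10 infections was confirmed. It ignores the delay from illness to death, so neither number should be read as an estimate.

```python
def naive_cfr(deaths, confirmed_cases):
    """Deaths divided by confirmed cases, with no adjustment of any kind."""
    return deaths / confirmed_cases


def implied_fatality_rate(deaths, confirmed_cases, ascertainment):
    """Deaths divided by total infections, if only `ascertainment` of them were confirmed."""
    total_infections = confirmed_cases / ascertainment
    return deaths / total_infections


deaths, cases = 92_464, 1_654_829
print(f"naive CFR: {naive_cfr(deaths, cases):.2%}")
# Hypothetical ascertainment of 10%, chosen only to show the sensitivity.
print(f"implied rate at 10% ascertainment: {implied_fatality_rate(deaths, cases, 0.10):.2%}")
```

A tenfold change in assumed ascertainment moves the result tenfold, which is why CFR and IFR figures from different sources are so hard to compare.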

The CDC recently issued guidance for five planning scenarios that "are being used by mathematical modelers throughout the federal government," according to the CDC. Four of those scenarios represent "the lower and upper bounds of disease severity and viral transmissibility." The fifth scenario is the CDC's "current best estimate about viral transmission and disease severity in the United States." In that scenario, CDC lists a symptomatic case fatality rate of 0.01, meaning that 1% of people with COVID-19 and symptoms would die. In the least severe scenario, the CDC puts that number at 0.2%. For people age 65 and older, the CDC puts that number at 1.3%. For people 49 and under, the agency estimated that 0.05% of symptomatic people will die.

Meyerowitz-Katz and Merone recently published (preprint) "A systematic review and meta-analysis of published research data on COVID-19 infection-fatality rates". This is a very comprehensive analysis of many studies.

Table 1 shows the table "symptomatic case fatality rates" from the CDC report. Table 2 shows a summary of the studies reported by Meyerowitz-Katz and Merone, along with the CDC guidance rates. As discussed by Carl Bergstrom (@CT_Bergstrom) and others, the CDC rates seem reasonable if they were being reported as IFRs, but are low for CFRs as compared to other studies.

Table 1. Case Fatality Rates, from CDC COVID-19 Pandemic Planning Scenarios .

Table 2. Meta-analysis of Infection Fatality Rates from Meyerowitz-Katz and Merone . Individual studies are shown with error bars. Diamonds represent the meta-analysis estimates. The pale orange zone is the CDC parameter range and the blue vertical line indicates the CDC best estimate.

Effects of Age on Fatality

David Spiegelhalter, University of Cambridge, wrote two blogs about fatality risk in COVID-19 and non-COVID-19 populations as a function of age. In the first blog (March 21, Medium), he posited that mortality of people infected with SARS-CoV-2 was similar to the average risk that people the same age experience over a whole year. With an additional assumption that this average property holds at the individual level, he suggests that COVID-19 can be considered as packing one's current annual risk into a few weeks.

In the second blog (April 11, Medium), he offers another interpretation - that COVID-19 increases individual short-term risk by a common multiplicative factor, whatever their baseline risk; apart from health-care workers or others on the front lines who have increased exposure. To support this thesis, he analyzed 539 deaths from COVID-19 through March 27. Note that this data is registered deaths; there were far more deaths that occurred and were registered later.

Death rates per 100,000 people by age-group (15-44, 45-64, 65-74, 75-84, 85+) are presented in Figure 16. It is remarkable how closely the observed COVID mortality rates follow a parallel line (log scale) to the non-COVID-19 mortality rates versus age. This means that COVID-19 risk of death increases exponentially with age, in parallel to non-COVID-19 risk; so COVID-19 can be considered as a common risk-multiplier for all age groups.
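The link between "parallel lines on a log scale" and "common risk multiplier" is simple arithmetic: multiplying every baseline rate by the same factor k shifts each point by log10(k), whatever the baseline. The sketch below shows this with made-up baseline rates and a made-up multiplier; these are not Spiegelhalter's data.

```python
import math


def log_gaps(baseline_rates, multiplier):
    """Vertical gap on a log10 axis between baseline and multiplied rates."""
    return [
        math.log10(rate * multiplier) - math.log10(rate)
        for rate in baseline_rates
    ]


# Hypothetical weekly death rates per 100,000 for five age groups.
baseline = [1.0, 8.0, 30.0, 90.0, 300.0]
print(log_gaps(baseline, multiplier=0.05))
```

Every gap is the same, so the two curves are parallel even though the absolute number of additional deaths is far larger in the older age groups.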

Similarly, one can consider the proportion of all deaths that are related to COVID-19. This shows 5.8% of male deaths and 3.8% of female deaths are due to COVID-19. So males have approximately 50% higher mortality rate due to COVID-19 (Figure 16).

This is a great way to understand COVID-19 mortality; in that COVID-19 does not only affect old people. It raises the risk of everyone. In older people this results in more deaths. I find it fascinating how this applies by analogy to other risks we experience in day to day life. I see people closing their small businesses, breaking up long-standing relationships. Life attributes are all generally affected by COVID-19; things that were on the higher risk side to begin with, are closing down. COVID-19 is like a fast-forward button on many societal phenomena.

Figure 16. Weekly death rates per 100,000 people of different ages. Non-Covid and additional Covid death rates are for deaths registered in Week 13 for England and Wales, and are provided for age-groups 15-44, 45-64, 65-74, 75-84, 85+, which are plotted at 30, 55, 70, 80 and 90. From Medium, April 11.

GeoSpatial Data Science

Spotfire's map charts display multiple layers of information - including points, lines, WKB objects like shapefiles and polylines, and TMS and WMS layers that show e.g. geology, live weather, or customized image, terrain, or other information. Map layers with points, lines, and WKB objects can be configured to respond to marking, and refreshed by Spotfire data functions including model fitting in R and Python. This provides a convenient means of injecting calculations and predictions into interactive map presentations e.g. interactive contour lines, heatmaps, polygons, territory calculations, and route optimization.

Figure 17 shows US county level case data, with drill-down into hotspots in the Southeast. The hotspot colorings are relative within the markings. The companion visuals show confirmed cases sorted by county, and combination daily cases and cumulative cases from the marking.

Figure 17. Hotspot analysis by counties in southeast US. Note the increasing case counts for the highlighted region with supersmoother curve in lower right, counties sorted by cases in upper right, and totals for the highlighted region in top left. All of these summaries update on the user marking action on the map. Timeline trends may be obtained from the slider at the top.

Figure 18 shows an area cartogram (Dorling 1996) of confirmed cases in the US. This is a set of non-overlapping regions with state areas proportional to the number of cases, using a rubber sheet distortion algorithm (Dougenik et al. 1985). The cartogram is invoked via a data function in Spotfire, with the R package Cartogram (Jeworutzki et al) run inside Spotfire on a mouse marking, using the built-in TERR engine.

Figure 18. Cartogram of COVID-19 confirmed cases from May 11. This shows a shifting dominance of cases from WA and CA to NY and the northeast.

Summary

The COVID-19 pandemic has highlighted the need for sound data science, visual analytics and data management methods, and the infusion of these skills and literacy into broader groups of users - in companies and the population at large.

Our Spotfire Live Report Application includes:

Ingesting data from many data sources worldwide; managing these data and providing data services to feed our analytics applications.
Modeling Reproduction number and effects of interventions (Re/Rt) with embedded Spotfire Data Science routines that call R-language data functions
Tracking case and fatality trajectories and case counts using supersmoother fits on the epidemic curves to remove reporting artifacts
Supersmoother-derived statistics such as trajectory velocities to highlight local hotspots
GeoSpatial Analyses: map layers, cartograms, choropleths, heatmaps and polygon analyses at county, region and country level
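
As a rough illustration of the smoothing and velocity items in the list above, the Python sketch below smooths a daily case series and takes the day-over-day change of the smoothed curve as a "velocity". It is not the application's implementation: a centered moving average stands in for the supersmoother here, the function names are hypothetical, and the case counts are made up to mimic a weekend reporting dip.

```python
def moving_average(series, window=7):
    """Centered moving average; the window shrinks near the ends.

    A simple stand-in for a supersmoother fit, used here only to damp
    day-of-week reporting artifacts.
    """
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        chunk = series[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def velocity(smoothed):
    """First difference of the smoothed curve (change in cases per day)."""
    return [b - a for a, b in zip(smoothed, smoothed[1:])]


# Made-up daily counts with a dip every sixth and seventh day.
toy_daily_cases = [100, 110, 120, 125, 130, 60, 55,
                   140, 150, 160, 165, 170, 80, 75]

trend = moving_average(toy_daily_cases, window=7)
speed = velocity(trend)
print([round(x, 1) for x in trend])
print([round(x, 1) for x in speed])
```

A region whose smoothed curve has a persistently positive velocity is a candidate hotspot, while raw counts alone would mostly show the weekly reporting pattern.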

Acknowledgements

Special thanks to the Spotfire Data Science team who are working on these analyses using Spotfire (Visual Analytics; R, Python): Neil Kanungo, Prem Shah, Ian Pestell, Colin Gray, Andrew Berridge and David Katz did the heavy lifting, and were well supported by Peter Shaw, Vinoth Manamala, Eric Hsu, Heleen Snelting, Mike Alperin and Dan Rope.

Blog contact author: Michael O'Connell, @MichOConnell

References

Basu, A. Estimating The Infection Fatality Rate Among Symptomatic COVID-19 Cases In The United States. Health Affairs. May 7 2020
Bendavid E and Bhattacharya J. Is the Coronavirus as Deadly as They Say? WSJ March 27 2020
CDC. Severe Outcomes Among Patients with Coronavirus Disease 2019 (COVID-19) - United States, February 12-March 16, 2020. MMWR Morb Mortal Wkly Rep 2020;69:343-346. DOI: http://dx.doi.org/10.15585/mmwr.mm6912e2
CDC (2007). Interim pre-pandemic planning guidance: community strategy for pandemic influenza mitigation in the United States: early, targeted, layered use of nonpharmaceutical interventions. CDC, 2007. https://stacks.cdc.gov/view/cdc/11425
CDC (2020). COVID-19 Pandemic Planning Scenarios. https://www.cdc.gov/coronavirus/2019-ncov/hcp/planning-scenarios.html
Churches, T. Analyzing COVID-19 outbreak data with R - part 1. Published online February 7, 2020
Community mitigation guidelines to prevent pandemic influenza. https://stacks.cdc.gov/view/cdc/45220 United States, 2017
Cori A, Cauchemez S, Ferguson NM, Fraser C, Dahlqwist E, Demarsh A, Jombart T, Kamvar ZN, Lessler J, Li S, Polonsky JA, Stockwin J, Thompson R, van Gaalen R. EpiEstim, 2019.
Cori A, Ferguson NM, Fraser C, Cauchemez S. A New Framework and Software to Estimate Time-Varying Reproduction Numbers During Epidemics. Am J Epidemiology, 2013
Delamater PL, Street EJ, Leslie TF, Yang T and Jacobsen KH. (2019). Complexity of the Basic Reproduction Number (R₀). CDC Emerging Infectious Diseases, 25, 1 - January 2019
Dorling, D. (1996). Area Cartograms: Their Use and Creation. In Concepts and Techniques in Modern Geography. Catmog, 59.
Dougenik JA, Chrisman NR, Niemeyer DR. (1985). An Algorithm to Construct Continuous Area Cartogram. Professional Geographer, 37(1), 75-81.
Fauci AS, Lane HC, Redfield RR. Covid-19 - Navigating the Uncharted. NEJM March 26, 2020; 382:1268-1269. DOI: 10.1056/NEJMe2002387
Faust JS (2020). Assessment of Deaths From COVID-19 and From Seasonal Influenza. JAMA Internal Medicine
FDA (2020). Approval of Rutgers in-home test for SARS-CoV-2. May 7 2020.
Idoine, C. (2020). Worlds Collide as Augmented Analytics Draws Analytics, BI and Data Science Together. Gartner ID: G00463513
Inglesby, T. (2020). Public Health Measures and the Reproduction Number of SARS-CoV-2. JAMA Insights, April 29 2020.
Jeworutzki S, Giraud T, Lambert N, Bivand R, Pebesma E, Nowosad J. Cartogram R package. Version 0.2. CRAN 2019-12-07
Lauer et al. The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application. Pubmed, March 10, 2020
Maxmen, A. (2020). Scientists baffled by decision to stop a pioneering coronavirus testing project. Nature May 22 2020
Meyerowitz-Katz G, Merone L. (2020.05.03) A systematic review and meta-analysis of published research data on COVID-19 infection-fatality rates
Unwin HJT, Mishra S, Bhatt S. MRC Centre for Global Infections. Report 23 - State-level tracking of COVID-19 in the United States.
Flaxman S, Ferguson N, Bhatt S. MRC Centre for Global Infections. Report 13 - Estimating the number of infections and the impact of non-pharmaceutical interventions on COVID-19 in 11 European countries.
O'Connell M. COVID-19: A Visual Data Science Analysis and Review. Spotfire Blog, 18 March 2020
Ridenhour, B., Kowalik, J. and Shay, D. Unraveling R0: Considerations for Public Health Applications. Am J Public Health. DOI: 10.2105/AJPH.2013.301704. Published online February 2014
Riou J, Hauser A, Counotte MJ, Althaus CL. Adjusted Age-Specific Case Fatality Ratio during the COVID-19 Epidemic in Hubei, China, January and February 2020. 3 March 2020, Preprint.
Ruan S. Likelihood of survival of coronavirus disease 2019. March 30, 2020. DOI: https://doi.org/10.1016/S1473-3099(20)30257-7
Spiegelhalter D. How much 'normal' risk does Covid represent? Medium
Stanway, A. Real Time COVID-19 Tracking. Medium, March 14
Wilson N, Kvalsvig A, Barnard LT, Baker MG. Case-Fatality Risk Estimates for COVID-19 Calculated by Using a Lag Time for Fatality. CDC EID Journal. Volume 26, Number 6, June 2020.

Websites with data updates

Johns Hopkins: Coronavirus Resource Center
KCDC: Daily cases update from Korea
Our World in Data: Coronavirus Testing - Source Data
Wikipedia: Case data for US States
World Health Organization: Coronavirus situation reports

	Michael O'Connell, Ph.D., is the chief analytics officer at Spotfire, where he helps clients with analytics software applications that drive business value. He has written a bunch of scientific papers and software packages on statistical methods. He also likes listening to electronic music; watching basketball, football and cricket; going to art galleries and walking around neighborhoods.
	Neil Kanungo is a Data Scientist at Spotfire and specializes in data visualization and business analytics. He helps deliver unique solutions to industry's biggest challenges. Neil takes a special interest in operationalizing analytics across organizations at multiple levels, and in fostering user engagement. In his free time, Neil enjoys hiking with his dog, live music, and playing pinball.
	Ian Pestell is a Senior Data Scientist at Spotfire based in the UK. With a specialty in Data Engineering, Ian focuses on solutions to prepare and manage data for Machine Learning and Analytics no matter where it resides. Ian also works on the integration of Spotfire analytics into cloud solutions, with special interests in Data Virtualization and Spark. In his spare time Ian enjoys playing music, guitar and synthesizers and designs installations focused on electronic music.
	Prem Shah is a data scientist working in the Data Science Team at Spotfire based out of their Seattle office. He has a strong inclination to figure out data driven and automated solutions and wants to work with new technologies to get insights. He likes to play the keyboard in his spare time and usually is working on pet projects that involve combining deep learning with his interests.
