COVID-19 Visual Data Science #3 - A Primer for Analytics in the Back-to-Work Landscape




    Follow along with this blog in Spotfire, and for live updates

    Live Spotfire application available here

    This blog and Spotfire application are authored by the Spotfire Data Science team

    Contact: Michael O'Connell, @MichOConnell

    May 28, 2020


    Introduction 

    We are now in the middle of the first wave of worldwide COVID-19 regional outbreaks. Worldwide (WW) confirmed cases have topped 5.5 million, with more than 350 thousand deaths. New cases have been arriving at ~80,000/day for the past month and deaths at ~3,000/day, in steady decline from a peak of ~6,000/day in mid-April (Figures 1, 7).

    In the face of this, some countries and US states are starting to reopen, while the cases and fatalities continue to accrue. This poses some interesting data science problems such as how best to reopen country and state jurisdictions, stores and businesses, in a way that gets the economy moving while mitigating risks from reopening.

    Retail stores have been hit hard, but tech-savvy companies are using this situation to study the interactions between ecommerce and in-store channels: where is there cannibalization, and where is there a halo effect? How will stores with similar product mix and historical traffic patterns, but different COVID-19 case status, fare upon reopening? How is media consumption changing? It's like a hurricane has landed on a crop science field trial - which traits are still standing? "The retailers that were already doing it successfully are the ones that are going to recover much more quickly," says Kimberly Becker, senior research director with Gartner. And in a future when shoppers are likely to be skittish about visiting stores, only the tech-savvy chains will survive.

    A notable sideshow to all of this is the preponderance of commentators who think they are data scientists or epidemiologists, whose stance is often rooted in a political position, and who select data to suit their purpose. This is particularly dangerous in COVID-19 land: data are spotty due to reporting errors and artifacts, and analytics are nuanced due to the virus lifecycle, including time delays between infection, symptoms, cases and fatalities, and the fact that many infected individuals show minimal or no symptoms at all.

    Indeed, the pandemic has highlighted the need for sound data science, visual analytics and data management methods, and the infusion of these skills and literacy into broader groups of users - in companies and the population at large.

    In this context, Carlie Idoine, senior research director with Gartner, writes about AI-enabled analytics that draws Analytics, BI and Data Science together. "Organizations that seize the opportunities presented by this newly catalyzed market will be able to dramatically hasten their analytics-related maturation to make competitive breakthroughs in comparison to slower-maturing rivals." We expand this theme in section 8 below.

    We are certainly seeing the need and opportunities for data management, analytics, BI and data science in this COVID-19 landscape. Our Spotfire Live Report is garnering interest in a number of industries - healthcare, pharmaceutical, consumer goods, retail and insurance. In all of these businesses, data scientists, business analysts and broad swaths of citizen and casual users from many functional areas, are getting involved. Data science and analytics have become a focal point for business leaders to find out what's going on right now, how to best tune the business in this new order of demand and supply, and how to forecast what lies ahead.

    Our Spotfire Live Report Application includes:

    • Ingesting data from many data sources worldwide; managing these data and providing data services to feed our analytics applications.

    • Modeling Reproduction number and effects of interventions (Re/Rt) with embedded Spotfire Data Science routines that call R-language data functions

    • Tracking case and fatality trajectories and case counts using supersmoother fits on the epidemic curves to remove reporting artifacts

    • Supersmoother-derived statistics such as trajectory velocities to highlight local hotspots

    • GeoSpatial Analyses: map layers, cartograms, choropleths, heatmaps and polygon analyses at county, region and country level

    This paper provides an update on these analyses and some details on our modeling, simulation and analytics methodologies, in the context of the pandemic and reopening initiatives. These analyses are presented in our Spotfire Live Report visual analytics app in a publicly available cloud environment. The Spotfire app uses data from a number of public health department sources that we ingest and refresh regularly, at a frequency that depends on the data source. Spotfire curated data and code are available for download from the Spotfire Data Science Community. The Spotfire COVID19 site provides a consolidated view of the work.

    Figure 1 shows the Spotfire application Global Overview.


    Figure 1. Spotfire Live Report Application - Global Overview as of 27 May 2020. Shows worldwide cases, fatalities, recoveries and country-level stats. Includes a slider for stepping through time by date. Note the daily case counts at lower right, along with the supersmoother curve fit that takes out artifacts in regional case reporting. The smaller graphs can be expanded and compressed by clicking the top right corner of that graph.


    Reproduction number modeling and effects of interventions

    We now have clear evidence that the social distancing interventions have dramatically lowered the spread of local epidemics. See O'Connell (26 March - April 4) for an outline of our analysis methodology, epidemiology modeling basics and estimation of the effective reproduction number (Re) over time (Rt). In summary:

    The reproduction number R0 (pronounced R-nought) is the average number of people infected by one infected person, without any interventions in place. This is a crucial parameter in describing an epidemic. The effective reproduction number Re includes intervention effects. If Re is bigger than 1, the disease spreads. Conversely, if Re, or the time-varying reproduction number Rt, can be reduced below 1 and held there, the disease can be contained. For COVID-19, R0 has been widely reported to be in the range 2-3.
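
    As a rough illustration of why the threshold of 1 matters, the sketch below treats the reproduction number as a fixed multiplier per generation of infection. This is a deliberately simplified assumption (no depletion of susceptibles, no time-varying Rt), and the function name and the value 0.8 are ours, chosen for illustration only.

```python
def infections_by_generation(r, generations, seed=1.0):
    """Expected new infections per generation if each case infects r others."""
    return [seed * r ** g for g in range(generations + 1)]

# R0 in the 2-3 range quoted above: each generation is larger than the last.
growing = infections_by_generation(2.5, 5)

# An illustrative Re below 1: each generation is smaller than the last.
shrinking = infections_by_generation(0.8, 5)

print([round(x, 2) for x in growing])
print([round(x, 2) for x in shrinking])
```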

    Delamater et al. describe R0, its use and misuse. Inglesby provides a clear and simple explanation of Re. Figure 2, from Pan et al., shows the effects of non-pharmaceutical interventions (NPIs) such as social distancing on Rt in a study of the outbreak in Wuhan. Results from these and other early studies guided the implementation of social distancing around the world.


    Figure 2. From Inglesby and Pan et al. The effective reproduction number Rt calculated as a 5-day moving average since January 1, 2020. The horizontal line indicates Rt=1, below which sustained transmission is unlikely so long as anti-transmission measures are sustained, indicating that the outbreak is under control.

    We have been estimating the time-varying reproduction number Rt in our live interactive Spotfire Live Report Application, across the US and worldwide, since early March.


    Figure 3 shows some results of Rt modeling for European countries, and Figure 4 some results of Rt modeling of different US states. Models are fit using the R package EpiEstim. This package is included in the online Spotfire Live Report App, and can be added to the Spotfire client via the Tools menu, and configured to run via a Spotfire data function. User-selected markings on maps and other visuals invoke the Rt estimates to run interactively, in the context of exploratory visual data analysis.

    EpiEstim (Cori et al., 2019) analyzes time series incidence data to estimate time-varying reproduction numbers as outlined in Cori et al. 2013. EpiEstim incorporates uncertainty in the distribution of the serial interval - the time between the onset of symptoms in a primary case and the onset of symptoms in secondary cases.

    There are five estimation methods in EpiEstim; these vary in the way the serial interval distribution is specified. In the first two methods, a unique serial interval distribution is considered, whereas in the last three, a range of serial interval distributions are integrated over:

    • "non_parametric_si": the user directly specifies the discrete distribution of the serial interval

    • "parametric_si": the user specifies the mean and sd of the serial interval

    • "uncertain_si": the mean and sd of the serial interval are each drawn from truncated normal distributions, with parameters specified by the user

    • "si_from_data": the serial interval distribution is directly estimated, using Markov Chain Monte Carlo, from interval censored exposure data, with data provided by the user together with a choice of parametric distribution for the serial interval

    • "si_from_sample": the user directly provides the sample of serial interval distributions to use for estimation of R

    Du et al. (CDC EID) estimate the distribution of serial intervals for 468 confirmed cases of COVID-19 reported in China as of February 8, 2020. They found a mean interval of 3.96 days (95% CI 3.53-4.39 days), and SD 4.75 days (95% CI 4.46-5.07 days).

    We have been exploring all the above methods, following the logic and approach set out by Churches. Our live Spotfire application currently uses a serial interval (SI) with mean 4.7 days and standard deviation 2.9 days. Churches' reasoning for these values is that they better account for transmission before the onset of symptoms, which results in shorter serial intervals than expected, possibly even shorter than the incubation period. We have also explored approaches by Abbott et al., including application to estimates of Rt on US states. We are exposing these parameters to the R functions in Spotfire with ranges (3.7, 6.0) for the mean serial interval and (1.9, 4.9) for the standard deviation. We use a window length of 7 days.
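
    To make the estimation step concrete, here is a minimal Python sketch of the sliding-window estimator described in Cori et al. 2013, using the serial interval (mean 4.7, sd 2.9) and 7-day window quoted above. It is not the EpiEstim code that the Spotfire application runs in R. The function names are ours, the serial interval is discretized by simply evaluating a gamma density at whole days and renormalizing (EpiEstim uses a more careful discretization), and the Gamma prior with shape 1 and scale 5 is an assumption. It returns posterior means only, without credible intervals.

```python
import math

def discretized_gamma_si(mean, sd, max_days=30):
    """Approximate serial-interval weights w_1..w_max_days from a gamma density.

    Simplification: the density is evaluated at whole days and renormalized.
    """
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    dens = [d ** (shape - 1) * math.exp(-d / scale) for d in range(1, max_days + 1)]
    total = sum(dens)
    return [x / total for x in dens]  # index 0 holds the weight for a 1-day interval

def estimate_rt(incidence, si_weights, window=7, prior_shape=1.0, prior_scale=5.0):
    """Posterior mean of Rt over a trailing window, assuming a Gamma prior."""
    n = len(incidence)
    # Total infectiousness on day t: past incidence weighted by the serial interval.
    pressure = [
        sum(incidence[t - k] * si_weights[k - 1]
            for k in range(1, min(t, len(si_weights)) + 1))
        for t in range(n)
    ]
    rt = [None] * n  # no estimate until a full window is available
    for t in range(window, n):
        cases = sum(incidence[t - window + 1:t + 1])
        infectiousness = sum(pressure[t - window + 1:t + 1])
        rt[t] = (prior_shape + cases) / (1.0 / prior_scale + infectiousness)
    return rt

si = discretized_gamma_si(mean=4.7, sd=2.9)

# Synthetic incidence for illustration only (not real case data).
flat = [100.0] * 60
growing = [10 * 1.1 ** t for t in range(60)]

rt_flat = estimate_rt(flat, si)
rt_growing = estimate_rt(growing, si)
print(round(rt_flat[-1], 2), round(rt_growing[-1], 2))
```

    With constant daily incidence the estimate settles near 1, and with 10% daily growth it sits well above 1, which is the behavior the green Rt < 1.0 band in the figures is meant to highlight. Early estimates are unreliable because little infection history is available.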


    Figure 3. Rt modeling of WW countries as of May 24. The green band shows Rt < 1.0. The dark line is the Rt estimate and the gray lines are 95% credible intervals. The models use the R package EpiEstim invoked through a Spotfire data function.


    Figure 4 shows similar Rt modeling results for US states, and Figure 5 for UK counties.


    Figure 4. Rt modeling of US states as of May 24. The green band shows Rt < 1.0. The dark line is the Rt estimate and the gray lines are 95% credible intervals. The models use the R package EpiEstim invoked through a Spotfire data function.


    Figure 5. Rt modeling of UK counties as of May 24. The green band shows Rt < 1.0. The dark line is the Rt estimate and the gray lines are 95% credible intervals. The models use the R package EpiEstim invoked through a Spotfire data function.


    These Rt estimates clearly show significant reductions as a result of the social distancing intervention measures around the world. This is corroborated by our analysis of COVID-19 cumulative case and fatality trajectories shown in Figures 8-10 below. In particular, note the countries where the spread is still growing, including the US, Brazil, Russia, India, Sweden, Singapore and Japan. In Sweden, social distancing was only partly implemented, and in the UK and the US it was implemented slowly and varied among states and counties. In Singapore, it was implemented early, then relaxed, and then implemented again as cases rose in a second wave. In Russia, Brazil and India, the initial infections arrived later.

    The serial interval is an important parameter in the estimation of Re/Rt. The time sequence of virus and human host states is outlined in Figure 6. This shows a number of epidemiology parameters:

    • The Latent Period is the time between the occurrence of infection and the onset of infectiousness (when the infected individual becomes infectious).

    • The Serial Interval is the duration of time between the onset of symptoms in a primary case and the onset of symptoms in a secondary case infected by the primary case.

    • The Incubation Period is the time between the occurrence of infection (or transmission) and the onset of disease symptoms.


    Figure 6. Infection and transmission timeline of COVID-19. Based on supplement to: Anderson et al., Lancet 2020.


    Worldwide COVID-19 Cases, Fatalities and Trajectories

    As outlined above, the non-pharmaceutical interventions (social distancing) have had a significant effect on the spread of SARS-CoV-2. These effects have varied widely over regions, depending on the timing of the interventions.

    While growth in new cases and fatalities has slowed across the world, daily cases and fatalities have reached a plateau rather than a sustained decline.

    Figure 7 shows COVID-19 case and fatality counts and supersmoother fits for global data. Note the arrows in the top right corner of each graph; click these to expand the view from Figure 1 above.

    The WW epidemic curves currently show a flat trajectory where case and fatality counts have been fairly consistent day to day throughout April.


    Figure 7. Worldwide cases and fatalities daily counts, with supersmoother fit. As of May 3, WW confirmed cases have topped 3 million; new cases are arriving at ~80,000/day and deaths at ~4,000/day, in steady decline from a peak of ~6,000/day in mid-April.

    The epidemic curves from individual countries vary widely. Figure 8 shows cumulative case trajectories for countries where the epidemic spread is still well underway, including Brazil, India, Russia and the US. Figure 9 shows cumulative case trajectories for countries where the spread has been contained. The graphs are automatically annotated with Natural Language via integration with Arria NLG.

    For case trajectories, the y-axis is the cumulative number of confirmed cases, on the log scale, and the x-axis is the time in days after the first <100> confirmed cases. The dashed lines are at slopes representing 1-day, 2-day, 3-day, 5-day and 7-day doubling.
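
    The alignment and the dashed doubling lines can be sketched in a few lines of Python. This is an illustrative sketch rather than the code behind the Spotfire charts, and the function names are ours. A constant doubling time d corresponds to cumulative counts of start × 2^(t/d), which plots as a straight line on a log axis.

```python
import math

def days_since_threshold(cumulative, threshold=100):
    """Drop leading days below the threshold so curves share a common origin."""
    for start, value in enumerate(cumulative):
        if value >= threshold:
            return cumulative[start:]
    return []

def reference_line(doubling_days, n_days, start=100):
    """Cumulative counts under a constant doubling time (straight on a log axis)."""
    return [start * 2 ** (t / doubling_days) for t in range(n_days + 1)]

def doubling_time(count_start, count_end, days):
    """Doubling time implied by growth from count_start to count_end over `days`."""
    return days * math.log(2) / math.log(count_end / count_start)

# Synthetic cumulative counts for illustration only.
aligned = days_since_threshold([5, 40, 120, 300, 650])
three_day_line = reference_line(doubling_days=3, n_days=6)
implied = doubling_time(100, 800, days=9)
print(aligned, round(three_day_line[-1]), round(implied, 2))
```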

    Note that we use raw and cumulative cases rather than normalizing by total population. Normalized numbers are good at showing *relatively* how much strain a region is under, but they're not suited to tracking the extent/state of a country's outbreak, which spreads at approximately the same pace regardless of country size. Advantages of this presentation of raw counts include: 

    • slopes (of tangents to the curves) reflect growth rates

    • heights reflect prevalence, per reported cases and fatalities

    • relative timing is maintained across the regions on the graphs

    If we were to divide the counts by population, we would also need to align the curves by percent of population infected, rather than by number of cases, so that regions would start at [first <1%> of population reported infected] rather than [first <100> cases].

    Also note that cases are a function of the number of tests performed; this varies considerably by country. As such, the number of confirmed cases should not be interpreted as reflective of actual infections.


    Figure 8. COVID-19 cumulative case and fatality trajectories by countries where the spread is still growing. This includes US, Brazil, Russia, India, Sweden, Singapore and Japan. Other countries that have flattened (China, South Korea) are included for comparison. The y-axis is the number of confirmed cases or fatalities (log scale), and the x-axis is the number of days after the first <100> confirmed cases or first <10> confirmed fatalities. The <100> case and <10> fatality thresholds align the curves to a common starting point in the epidemic outbreaks, and are configurable in the Spotfire application. The dashed lines indicate various doubling rates in days.



    Figure 9. COVID-19 cumulative case and fatality trajectories by countries where the spread is contained. This includes Germany, China, Netherlands, Switzerland, Israel, Austria, South Korea, Norway, Australia, Malaysia, Thailand, Greece, New Zealand, Taiwan and Vietnam. The y-axis is the number of confirmed cases or fatalities (log scale), and the x-axis is the number of days after the first <100> confirmed cases, or first <10> fatalities. Note that the cumulative cases have flattened more than the fatalities; this is due to deaths occurring some days/weeks after cases are confirmed.


    Figure 10 shows cases and fatalities daily counts, with supersmoother fit, for select countries where the epidemic is spreading, and where it is contained. Countries still showing growth or flat trajectories in cases include Brazil, India, Sweden, UK and US. Countries showing containment in new cases include Italy and Germany. Note how the supersmoother adjusts for reporting artifacts, including inconsistent reporting on weekends versus weekdays.
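
    To illustrate why smoothing removes the weekend artifact, the sketch below uses a centered 7-day moving average. Note that this is a much simpler stand-in for the supersmoother used in the application, chosen only because it is easy to show in a few lines. The daily counts are made up for illustration.

```python
def centered_moving_average(values, window=7):
    """Average each point with its neighbors; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Made-up counts: a true level of 1,000 cases/day, under-reported at weekends
# and caught up on Mondays (Mon..Sun pattern repeated for four weeks).
weekly_pattern = [1400, 1000, 1000, 1000, 1000, 800, 800]
daily = weekly_pattern * 4

smooth = centered_moving_average(daily)
print(min(daily), max(daily))
print(min(smooth[3:-3]), max(smooth[3:-3]))
```

    Any full 7-day window contains exactly one of each weekday, so the weekly reporting cycle averages out in the interior of the series.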

    As noted above, social distancing was only partly implemented in Sweden, and in the UK and the US it was implemented slowly and varied among states and counties. In Singapore, it was implemented early, then relaxed, and then implemented again as cases rose in a second wave. In Russia, Brazil and India, the initial infections arrived later.


    Figure 10. Cases and fatalities daily counts, with supersmoother fit, for select countries where the epidemic is spreading and where it is contained. In order from top, from growth to containment: Growth: Brazil, India, Sweden, US, UK; Containment: Italy, Germany. Note how the supersmoother adjusts for reporting artifacts, including inconsistent reporting on weekends versus weekdays.


    US COVID-19 Cases, Fatalities and Testing: Re-opening Scenarios

    As outlined above, worldwide and US cases have been in a plateau for most of April and into May, with fairly similar daily case counts. Approximately 80,000 cases and 4,000 deaths (down from a peak of 6,000 in mid-April) are being added per day worldwide, and ~20,000 cases (down from a peak of ~30,000 in early April) and 1,000 deaths (down from a peak of ~1,900 in early April) are being added per day in the US.

    The ~20,000 new cases per day corresponds to an average Re of ~1 across the US. This gradual decline / plateau at a national level results from some states with rising case counts (and Rt greater than 1) and other states with falling case counts (and Rt less than 1). As we move forward into summer, the transmission rate will be affected by:

    • Reopening efforts to get the economy moving 

    • Better hygiene practice, e.g. wearing of masks

    • Possible impact of seasonality 

    Most likely we are looking at a to and fro: local jurisdiction orders to reopen, along with people's changing behavior and perception of decreasing risk, result in increasing local case counts, followed by cycles of increased social distancing as fear of infection returns. This is occurring now in local regions: regions that are recovering from being hard hit, e.g. Michigan, and new regional outbreaks in some southeast and midwest states.

    Note that even if infections were to continue at, say, ~300K per day, the US would have ~50M cumulative infections by September 1 and be at ~15% population immunity. This is still far short of herd immunity, which would require between 50% (for R0=2) and 66% (for R0=3) of the population to have recovered to prevent further epidemic spread (Bedford, April 7).
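
    The 50% and 66% figures follow from the standard herd immunity threshold 1 - 1/R0. The short sketch below reproduces that arithmetic. The US population of 330 million is our assumption for illustration and is not taken from the sources above.

```python
def herd_immunity_threshold(r0):
    """Share of the population that must be immune for Re to fall below 1."""
    return 1.0 - 1.0 / r0

US_POPULATION = 330e6          # assumption: approximate 2020 US population
cumulative_infections = 50e6   # the scenario figure quoted above

immune_share = cumulative_infections / US_POPULATION

print(f"immune share: {immune_share:.1%}")
print(f"threshold at R0=2: {herd_immunity_threshold(2):.1%}")
print(f"threshold at R0=3: {herd_immunity_threshold(3):.1%}")
```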

    Recent data released by Apple and Google on community mobility (Figure 11) show the implementation of social distancing. Coupled with the Rt estimates we have been providing over this period, the effects of social distancing have clearly reduced Re significantly across the world.

    However, note the recent uptick in Retail (purple) and Parks (green) mobility.


    Figure 11. Mobility trends of Google Android devices, as of May 24.


    Figure 12 shows US cases by county as a map, with sorted bar chart counts and daily case counts with supersmoother curve fit. The map enables hotspot detection via localized contour and heatmap calculations driven by interactive markings (see Figure 17 below).

    Figure 13 shows case velocities by US county, using first derivatives of supersmoother curves; these are shown along with the map. Velocity curves for regions may be obtained via interactive marking (combinations of counties) on the map. Note the high velocities in some parts of Minnesota and Nebraska, and some southeast and midwest counties.
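
    A velocity of this kind is the slope of the smoothed curve at each day. As a minimal sketch, the function below approximates that slope with central differences on an already-smoothed series. The function name and the sample values are ours, and the application's own derivative calculation on the supersmoother fit may differ.

```python
def velocity(smoothed):
    """Approximate first derivative of a smoothed curve, one value per day."""
    n = len(smoothed)
    if n < 2:
        return [0.0] * n
    slopes = []
    for i in range(n):
        lo = max(0, i - 1)        # one-sided difference at the left edge
        hi = min(n - 1, i + 1)    # one-sided difference at the right edge
        slopes.append((smoothed[hi] - smoothed[lo]) / (hi - lo))
    return slopes

# Made-up smoothed values for illustration only.
curve = [0, 10, 30, 60]
print(velocity(curve))
```

    Positive and rising values flag a region where cases are accelerating, which is what makes velocity useful for spotting emerging hotspots.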

    Figure 14 shows cases by county, normalized by population (cases per 1M population). This is particularly useful for identifying case trends in low population areas. For example, note the high density of cases in the Four Corners area, reflecting the infection spreading through the Navajo tribe in that area.


    Figure 12. Cases by county, with sorted bar chart counts and daily case counts from supersmoother curve fit. The smaller graphs can be expanded and compressed by clicking the top right corner of that graph. The map enables hotspot detection via localized contour and heatmap calculations driven by interactive markings (see Figure 17 below).


    Figure 13. Case velocities by county using first derivatives of supersmoother curves. These are shown on the right, along with velocity curves over time by county. Velocity curves versus time for regions may be obtained via interactive marking (combinations of counties) on the map. Note the high velocities in parts of Minnesota, Nebraska and some midwest and southeastern counties.


    Figure 14. Cases by county, normalized by population (cases per 1M population). This is particularly useful for identifying case trends in low population areas. For example, note the high density of cases in the Four Corners area, reflecting the infection spreading through the Navajo tribe in that area.


    Testing

    Testing is critical in the path to recovery and reopening. There are two main types of tests:

    • Tests for presence of virus; these aim to establish whether an individual is currently infected. The most common and best test is the RT-PCR test. This reverse-transcribes viral RNA into DNA, amplifies it, and can detect as few as 10 copies of the virus in the sample.

    • Tests for presence of antibodies; these aim to establish whether an individual has been infected some time in the past.

    The Spotfire Live Report currently analyzes test data provided by the COVID Tracking Project. As of May 24, the COVID Tracking Project had logged 14,604,942 tests with 1,654,829 positive cases and 92,464 deaths.

    The COVID Tracking Project reports all data provided by the states, as outlined on https://covidtracking.com/data . Our World in Data, on the other hand, restricts test data to RT-PCR tests, as outlined at https://ourworldindata.org/coronavirus-testing

    RT-PCR tests have good properties:

    • SENSITIVITY measures how well we can detect patients with disease. Imagine a nasal swab from someone with SARS-CoV-2 infection. Sensitivity is the probability this sample will test positive.

    • With RT-PCR, the specimen is declared positive if viral RNA is detected. Patients with COVID-19 often have high viral loads in their throats. Thus it is relatively easy to detect virus, and the test is typically highly sensitive.

    • When sensitivity is low, we get FALSE NEGATIVES. These are swabs from which no virus is detected even though the person is infected. This could occur if the specimen was degraded, virus didn't amplify well, etc.

    • SPECIFICITY measures how well we rule out infection for people who are tested but aren't infected. Imagine a nasal swab from someone without SARS-CoV-2 infection. This should come back negative on RT-PCR. Specificity measures how often these come back negative.

    • Because RT-PCR measures viral RNA, it is rare for someone to test positive if they aren't infected. It can happen, though, e.g. if there is random contamination in the lab. In general, RT-PCR is highly specific and there are few FALSE POSITIVES.
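
    The two definitions above reduce to simple ratios of counts from a validation study. The sketch below uses hypothetical counts, invented purely for illustration and not measured RT-PCR performance.

```python
def sensitivity(true_positives, false_negatives):
    """Probability that a sample from an infected person tests positive."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Probability that a sample from an uninfected person tests negative."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical validation counts (not real RT-PCR results).
tp, fn = 90, 10     # swabs from infected people: detected vs missed
tn, fp = 995, 5     # swabs from uninfected people: cleared vs flagged

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity {sens:.1%}, false negative rate {1 - sens:.1%}")
print(f"specificity {spec:.1%}, false positive rate {1 - spec:.1%}")
```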

    Figure 15 shows test data from Colorado. The graph on bottom left shows new confirmed cases in dark blue, deaths in light blue and recoveries in light blue. The graph on bottom right shows negative test results in light blue bars, positive results in dark blue bars and positive test percentage as dark line. Note the separate scales on left and right axes.


    Figure 15. Test data from Colorado. Selectors for total cases, case growth rate, cases/100k people, case velocity and Re are included. The graph on bottom left shows new confirmed cases in dark blue, deaths in light blue and recoveries in light blue. The graph on bottom right shows negative test results in light blue bars, positive results in dark blue bars and positive test percentage as dark line. Note the separate scales on left and right axes.

    There is much current interest in in-home tests as a way of conducting wider surveillance and assisting in back-to-work programs. Since April 21, the FDA has approved in-home tests from LabCorp, Everlywell and Rutgers University in New Jersey. The FDA also, surprisingly, shut down the SCAN program (Seattle Coronavirus Assessment Network), as reported by Maxmen in Nature.


    Case Fatality Rate and Infection Fatality Rate

    There is much debate about COVID-19 fatality rates. When considering COVID-19 fatality rates it is important to distinguish:

    • The risk of dying from COVID-19, among people who are diagnosed with COVID-19 (Case Fatality Rate, CFR)

    • The risk of dying from COVID-19, among people who get it (Infection Fatality Rate, IFR).

    • The risk of dying from COVID-19, among people who do not currently have it (Population Fatality Rate, PFR).

    The best way to calculate CFR would be to track a large group of people from the point when they develop symptoms until they later die or recover, and then to calculate the proportion of all these cases who died. This is not possible in the real world. It is incorrect to simply divide the total number of deaths by the total number of cases, as this does not account for unreported cases or the delay from illness to death.
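
    For reference, the crude ratios that this warning refers to can be computed from the COVID Tracking Project totals quoted in the Testing section (as of May 24). The sketch below is only that crude arithmetic. It should not be read as an estimate of the CFR, for exactly the reasons given above.

```python
# Totals quoted above from the COVID Tracking Project, as of May 24.
tests = 14_604_942
positives = 1_654_829
deaths = 92_464

positivity = positives / tests   # share of logged tests that were positive
crude_ratio = deaths / positives  # deaths divided by confirmed cases

# The crude ratio ignores unreported infections (which would lower it) and
# deaths still to come among recently confirmed cases (which would raise it).
print(f"test positivity: {positivity:.1%}")
print(f"crude deaths/cases ratio: {crude_ratio:.1%}")
```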

    The CDC recently issued guidance for five planning scenarios that "are being used by mathematical modelers throughout the federal government," according to the CDC. Four of those scenarios represent "the lower and upper bounds of disease severity and viral transmissibility." The fifth scenario is the CDC's "current best estimate about viral transmission and disease severity in the United States." In that scenario, CDC lists a symptomatic case fatality rate of 0.004, meaning that 0.4% of people with COVID-19 and symptoms would die. In the least severe scenario, the CDC puts that number at 0.2%. In the best-estimate scenario, for people age 65 and older, the CDC puts that number at 1.3%. For people 49 and under, the agency estimated that 0.05% of symptomatic people will die.

    Meyerowitz-Katz and Merone recently published (preprint) "A systematic review and meta-analysis of published research data on COVID-19 infection-fatality rates". This is a very comprehensive analysis of many studies.

    Table 1 shows the table "symptomatic case fatality rates" from the CDC report. Table 2 shows a summary of the studies reported by Meyerowitz-Katz and Merone, along with the CDC guidance rates. As discussed by Carl Bergstrom (@CT_Bergstrom) and others, the CDC rates seem reasonable if they were being reported as IFRs, but are low for CFRs as compared to other studies.


    Table 1. Case Fatality Rates, from CDC COVID-19 Pandemic Planning Scenarios.


    Table 2. Meta-analysis of Infection Fatality Rates from Meyerowitz-Katz and Merone. Individual studies are shown with error bars. Diamonds represent the meta-analysis estimates. The pale orange zone is the CDC parameter range and the blue vertical line indicates the CDC best estimate.


    Effects of Age on Fatality

    David Spiegelhalter, University of Cambridge, wrote two blog posts about fatality risk in COVID-19 and non-COVID-19 populations as a function of age. In the first (March 21, Medium), he posited that the mortality of people infected with SARS-CoV-2 was similar to the average risk that people of the same age experience over a whole year. With the additional assumption that this average property holds at the individual level, he suggests that COVID-19 can be considered as packing one's current annual risk into a few weeks.

    In the second blog (April 11, Medium), he offers another interpretation - that COVID-19 increases individual short-term risk by a common multiplicative factor, whatever their baseline risk; apart from health-care workers or others on the front lines who have increased exposure. To support this thesis, he analyzed 539 deaths from COVID-19 through March 27. Note that this data is registered deaths; there were far more deaths that occurred and were registered later.

    Death rates per 100,000 people by age-group (15-44, 45-64, 65-74, 75-84, 85+) are presented in Figure 16. It is remarkable how closely the observed COVID-19 mortality rates follow a line parallel (on the log scale) to the non-COVID-19 mortality rates versus age. This means that COVID-19 risk of death increases exponentially with age, in parallel to non-COVID-19 risk; so COVID-19 can be considered as a common risk-multiplier for all age groups.
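
    The step from "parallel lines on a log scale" to "common risk-multiplier" can be checked numerically: a constant vertical gap between two curves on a log axis is the same thing as a constant ratio between them. The rates and the multiplier in this sketch are made-up values chosen for illustration and are not the figures plotted in Figure 16.

```python
import math

# Made-up weekly baseline death rates per 100,000 for five age groups.
baseline = [1.0, 8.0, 30.0, 90.0, 300.0]

# A single hypothetical multiplier applied to every age group.
multiplier = 0.1
covid = [multiplier * rate for rate in baseline]

# Vertical gap between the two curves on a log10 axis, per age group.
gaps = [math.log10(b) - math.log10(c) for b, c in zip(baseline, covid)]
print([round(g, 6) for g in gaps])
```

    The gap is the same for every age group, so the two curves are parallel on the log scale even though the absolute number of additional deaths is far larger in the oldest group.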

    Similarly, one can consider the proportion of all deaths that are related to COVID-19. This shows 5.8% of male deaths and 3.8% of female deaths are due to COVID-19. So males have approximately 50% higher mortality rate due to COVID-19 (Figure 16).

    This is a great way to understand COVID-19 mortality; in that COVID-19 does not only affect old people. It raises the risk of everyone. In older people this results in more deaths. I find it fascinating how this applies by analogy to other risks we experience in day to day life. I see people closing their small businesses, breaking up long-standing relationships. Life attributes are all generally affected by COVID-19; things that were on the higher risk side to begin with, are closing down. COVID-19 is like a fast-forward button on many societal phenomena.


    Figure 16. Weekly death rates per 100,000 people of different ages. Non-Covid and additional Covid death rates are for deaths registered in Week 13 for England and Wales, and are provided for age-groups 15-44, 45-64, 65-74, 75-84, 85+, which are plotted at 30, 55, 70, 80 and 90. From Medium, April 11.


    GeoSpatial Data Science

    Spotfire's map charts display multiple layers of information - including points, lines, WKB objects like shapefiles and polylines, and TMS and WMS layers that show e.g. geology, live weather, or customized image, terrain, or other information. Map layers with points, lines, and WKB objects can be configured to respond to marking, and refreshed by Spotfire data functions including model fitting in R and Python. This provides a convenient means of injecting calculations and predictions into interactive map presentations e.g. interactive contour lines, heatmaps, polygons, territory calculations, and route optimization.

    Figure 17 shows US county level case data, with drill-down into hotspots in the Southeast. The hotspot colorings are relative within the markings. The companion visuals show confirmed cases sorted by county, and combination daily cases and cumulative cases from the marking.


    Figure 17. Hotspot analysis by counties in southeast US. Note the increasing case counts for the highlighted region with supersmoother curve in lower right, counties sorted by cases in upper right, and totals for the highlighted region in top left. All of these summaries update on the user marking action on the map. Timeline trends may be obtained from the slider at the top.


    Figure 18 shows an area cartogram (Dorling 1996) of confirmed cases in the US. This is a set of non-overlapping regions with state areas proportional to the number of cases, using a rubber sheet distortion algorithm (Dougenik et al. 1985). The cartogram is invoked via a data function in Spotfire, with the R package cartogram (Jeworutzki et al.) run inside Spotfire on a mouse marking, using the built-in TERR engine.
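
    The core quantity behind any area cartogram is the target area for each region: its share of the total map area should equal its share of the total cases. The Python sketch below computes only that target and the corresponding linear scale factor. It is not the rubber sheet algorithm and not the code in the R package; the function name, region names and numbers are hypothetical.

```python
import math


def cartogram_targets(areas, values):
    """Target area per region so that area share equals value share.

    areas  : dict of region -> current map area
    values : dict of region -> quantity to represent (e.g. confirmed cases)
    Returns a dict of region -> target area and linear scale factor.
    """
    total_area = sum(areas.values())
    total_value = sum(values.values())
    out = {}
    for region, area in areas.items():
        target = total_area * values[region] / total_value
        out[region] = {
            "target_area": target,
            # Lengths scale with the square root of the area ratio.
            "linear_scale": math.sqrt(target / area),
        }
    return out


# Hypothetical regions (illustrative only, not real state data).
areas = {"A": 100.0, "B": 100.0, "C": 200.0}
cases = {"A": 10, "B": 30, "C": 40}

targets = cartogram_targets(areas, cases)
for region, t in targets.items():
    print(region, round(t["target_area"], 1), round(t["linear_scale"], 3))
```

    A contiguous cartogram algorithm then has the harder job of deforming shared boundaries so each region approaches its target area while neighbors stay attached.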


    Figure 18. Cartogram of COVID-19 confirmed cases from May 11. This shows a shifting dominance of cases from WA and CA to NY and the northeast.


    Summary 

    We are now in the middle of the first wave of worldwide COVID-19 regional outbreaks. WW confirmed cases have topped 5.5 million with more than 350 thousand deaths. New case additions have been arriving at ~80,000/day for the past month and deaths at ~3,000/day, with steady decline from a peak of ~6,000/day in mid-April.

    In the face of this, some countries and US states are starting to reopen, while the cases and fatalities continue to accrue. This poses some interesting data science problems such as how best to reopen country and state jurisdictions, stores and businesses, in a way that gets the economy moving while mitigating risks from reopening.

    The COVID-19 pandemic has highlighted the need for sound data science, visual analytics and data management methods, and the infusion of these skills and literacy into broader groups of users - in companies and the population at large.

    We are certainly seeing the need and opportunities for data management, analytics, BI and data science in this COVID-19 landscape. Our Spotfire Live Report is garnering interest in a number of industries - healthcare, pharmaceutical, consumer goods, retail and insurance. In all of these businesses, data scientists, business analysts and broad swaths of citizen and casual users from many functional areas, are getting involved. Data science and analytics have become a focal point for business leaders to find out what's going on right now, how to best tune the business in this new order of demand and supply, and how to forecast what lies ahead.

    Our Spotfire Live Report Application includes

    • Ingesting data from many data sources worldwide; managing these data and providing data services to feed our analytics applications.

    • Modeling Reproduction number and effects of interventions (Re/Rt) with embedded Spotfire Data Science routines that call R-language data functions

    • Tracking case and fatality trajectories and case counts using supersmoother fits on the epidemic curves to remove reporting artifacts

    • Supersmoother-derived statistics such as trajectory velocities to highlight local hotspots

    • GeoSpatial Analyses: map layers, cartograms, choropleths, heatmaps and polygon analyses at county, region and country level
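
    The smoothing and velocity items in the list above can be illustrated with a small Python sketch. It uses a plain centered moving average as a stand-in for the supersmoother, which is a different (adaptive) smoother, and takes the day-over-day change of the smoothed curve as the "velocity". The function names and the daily counts are hypothetical and for illustration only.

```python
def moving_average(series, window=7):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed


def velocity(smoothed):
    """Day-over-day change of the smoothed curve."""
    return [b - a for a, b in zip(smoothed, smoothed[1:])]


# Hypothetical daily case counts with a two-day reporting dip each week.
daily_cases = [100, 110, 120, 60, 50, 130, 140,
               150, 160, 170, 90, 80, 180, 190]

smooth = moving_average(daily_cases)
trend = velocity(smooth)

print([round(x, 1) for x in smooth])
print([round(x, 1) for x in trend])
```

    A seven-day window is a natural choice here because many reporting artifacts follow a weekly cycle; regions whose smoothed curve has a large positive velocity are candidates for hotspot flags.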

    This paper provides an update on these analyses and some details on our modeling, simulation and analytics methodologies, in the context of the pandemic and reopening initiatives. These analyses are presented in our Spotfire Live Report visual analytics app in a publicly available cloud environment. The Spotfire app uses data from a number of public health department sources that we ingest and refresh regularly, depending on the data sources. Spotfire curated data and code are available for download from the Spotfire Data Science Community. The Spotfire COVID19 site provides a consolidated view of the work.

    Acknowledgements

    Special thanks to the Spotfire Data Science team who are working on these analyses using Spotfire (Visual Analytics; R, Python) : Neil Kanungo, Prem Shah, Ian Pestell, Colin Gray, Andrew Berridge and David Katz did the heavy lifting, and were well supported by Peter Shaw, Vinoth Manamala, Eric Hsu, Heleen Snelting, Mike Alperin and Dan Rope.

    Blog contact author: Michael O'Connell, @MichOConnell

    References 

    Websites with data updates

    Michael O'Connell, Ph.D., is the chief analytics officer at Spotfire, where he helps clients with analytics software applications that drive business value. He has written a bunch of scientific papers and software packages on statistical methods. He also likes listening to electronic music; watching basketball, football and cricket; going to art galleries and walking around neighborhoods.
    Neil Kanungo is a Data Scientist at Spotfire and specializes in data visualization and business analytics. He helps deliver unique solutions to industry's biggest challenges. Neil takes a special interest in operationalizing analytics across organizations at multiple levels, and in fostering user engagement. In his free time, Neil enjoys hiking with his dog, live music, and playing pinball.
    Ian Pestell is a Senior Data Scientist at Spotfire based in the UK. With a specialty in Data Engineering, Ian focuses on solutions to prepare and manage data for Machine Learning and Analytics no matter where it resides. Ian also works on the integration of Spotfire analytics into cloud solutions, with special interests in Data Virtualization and Spark. In his spare time Ian enjoys playing music, guitar and synthesizers and designs installations focused on electronic music.
    Prem Shah is a data scientist working in the Data Science Team at Spotfire based out of their Seattle office. He has a strong inclination to figure out data driven and automated solutions and wants to work with new technologies to get insights. He likes to play the keyboard in his spare time and usually is working on pet projects that involve combining deep learning with his interests.
