Overview
Data for breaking news stories often arrives in awkward formats and needs careful wrangling before any analysis can begin; Spotfire® is a great tool to help with this. Here I use Spotfire® to look at the ongoing Ebola outbreak in central Africa, studying how the new cases are distributed in space and time.
The Ebola virus disease (EVD) is a severe and often fatal illness affecting humans and other primates; this disease is currently undergoing an outbreak in the Democratic Republic of Congo (DRC). It has been one year since this outbreak was declared in the North Kivu province of the DRC. The World Health Organization (WHO) has recently declared the current outbreak to be a public health emergency of international concern.
Outline
To study data from current events such as this, the data must often be gathered from sources such as web pages or downloaded documents. The data we use here comes in the form of downloaded pdf documents, and we'll walk through some useful tools for reading and formatting it. We'll demonstrate how to pull data from the weekly reports published by WHO in pdf format and bring that data into Spotfire® for visualization and analysis. We'll show how to use two important data wrangling tools from the R language:
- The "pdftools" package for reading data from pdf files;
- The "readr" package, part of the "tidyverse" collection of packages, for conveniently reading rectangular data.
Background
The extent of the 2018-2019 outbreak can be tracked in Spotfire® to understand both the mortality rate and the spread through the different provinces of DRC:
Figure 1: Geographic extent of the 2019 Ebola virus outbreak in central Africa. (left): Regional map. (right): detail, with a few cities identified. As of Aug 2, 2019, one new Ebola case has recently been reported near the city of Goma (southern portion of map), a concern owing to the city's large population (around two million) and its location near the border with Rwanda, and associated flow of travelers.
Figure 2: Animation of the new Ebola cases over the past year.
Figure 3: Progress of the disease over the past year showing survivors (blue) and deaths (red). The outbreak is still growing.
Figure 4: Growth of the Ebola outbreak across the affected provinces in Democratic Republic of Congo (DRC).
Methods: tabulating data from a series of pdf documents
This data was publicly available through the WHO in the form of weekly pdf-formatted "Situation Reports" (SitReps). Unfortunately, these datasets are no longer available in the same place, but the new WHO website does have this type of information in case you are interested in creating a similar analysis for Ebola now or for an outbreak of another disease. Each SitRep document that we used for this analysis contains an overall summary of the situation as well as a table of cases across the Provinces and WHO Health Zones.
For example, here is the table of interest, Table 1, as it appeared in the pdf Situation Report for the Ebola Virus Disease for DRC, Jul 16, 2019 on the WHO Website in 2019.
The first step in bringing the data into Spotfire® is to automate the process of reading the data from these pdf reports. Here we'll use TIBCO's R engine TERR® (TIBCO Enterprise Runtime for R) together with the R package "pdftools" which is designed for the purpose of reading text data from pdf files.
Below is a snippet of R code that we can execute using TERR®. The code can loop over all the files that have been downloaded; here we read just one file for illustration.
- We load the "pdftools" R package and use the function "pdf_text()" to efficiently read the text from the pdf file.
- Next we look for the specific string "Table 1:" and extract just this page.
- We break the continuous string into lines by splitting on the end-of-line delimiter.
- Finally we search for a few instances of alternating digits and white spaces within a line, to identify data rows.
library(pdftools)
setwd("Z:/Work/data/WHO World Health Org/2019/sitrep data/format 2")
# Read the text of the report: one character string per page.
text0 = pdf_text("SITREP_EVD_DRC_20190714-eng.pdf")
# Look for the string "Table 1:" in the text:
ipage = grep("Table 1:", text0)
# Isolate the text for just this page and split it into lines.
# "\r?\n" accepts both Windows (\r\n) and Unix (\n) line endings.
text1 = strsplit(text0[ipage], split = "\r?\n")[[1]]
# find lines of data (a few alternating digits and spaces)
idata = grep("\\d+\\s+\\d+\\s+\\d+\\s+\\d+", text1)
text2 = text1[idata]
Here is the result so far:
We've successfully imported the raw data from this table as character strings.
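To see the line-splitting and row-matching steps in isolation, here is a minimal sketch that runs without a pdf file. The page text, the province and zone names, and the counts are all invented placeholders standing in for one element of the pdf_text() output; only the regular expression is taken from the snippet above.

```r
# Invented page text standing in for one page returned by pdf_text().
page = paste(
  "Table 1: cases by province and health zone (placeholder values)",
  "Province      Health Zone     Confirmed  Probable  Total  Deaths",
  "Province A    Zone A                 10         2     12       5",
  "Province A    Zone B                  7         0      7       3",
  "Province B    Zone C                  4         1      5       2",
  "Data are subject to revision.",
  sep = "\r\n"
)

# Split the page into lines, accepting Windows or Unix line endings.
lines = strsplit(page, split = "\r?\n")[[1]]

# Keep only lines holding four whitespace-separated runs of digits.
is_data = grepl("\\d+\\s+\\d+\\s+\\d+\\s+\\d+", lines)
data_lines = lines[is_data]
print(data_lines)
```

The title, header, and footnote lines are dropped because none of them contains four consecutive numeric fields. A real report may contain other lines that happen to match, so the result is worth inspecting by eye.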
Next we go about reading this data into tabular form in TERR.
- We drop the first 19 characters of each row (keeping everything from character 20 onward) to eliminate the Province name column, so each data row has the same number of columns.
- We load the "readr" package and use the function "read_table2()" to conveniently read the data. (In readr 2.0 and later, "read_table2()" is deprecated in favor of "read_table()".)
library(readr)
# drop the first 19 characters (keep from character 20 onward)
# to eliminate the first column:
text3 = substring(text2, 20)
# use read_table2() to obtain data in rectangular format.
thistable = read_table2(text3, col_names = FALSE)
Here's the resulting data table in TERR. The object returned by read_table2() is a "tibble"; we convert it to a data frame for convenient display.
We still need to select and rename the specific columns and remove the last row of the table, but we've obtained the data we need in an automated script that can now be run against all of the downloaded files. Note that the format of the tables changes over the course of the year, so some care is needed in handling specific format changes.
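As a rough illustration of these clean-up steps, the sketch below parses a few invented rows using only base R, with read.table(text = ...) swapped in for read_table2() so that it runs without extra packages. The 28-character label width, the column names, and every number are assumptions made for the example, not values from the WHO reports.

```r
# Invented rows: a 28-character label field followed by four numeric columns.
labels = c("Province A   Zone A", "Province A   Zone B",
           "Province B   Zone C", "Total")
rows = sprintf("%-28s%6d%6d%6d%6d",
               labels,
               c(10L, 7L, 4L, 21L),   # confirmed (placeholder values)
               c(2L, 0L, 1L, 3L),     # probable
               c(12L, 7L, 5L, 24L),   # total
               c(5L, 3L, 2L, 10L))    # deaths

# Read the numeric part (character 29 onward) and name the columns.
cases = read.table(text = substring(rows, 29), header = FALSE,
                   col.names = c("confirmed", "probable", "total", "deaths"))

# Keep the label text alongside the numbers, then drop the final "Total" row.
cases$label = trimws(substring(rows, 1, 28))
cases = head(cases, -1)
print(cases)
```

Splitting by character position only works while the label column keeps a fixed width, which is one reason format changes between reports need individual attention.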
After looping through all of our downloaded pdf files, we assemble the results into a tall-format table in TERR, identifying each block with the time stamp it represents. The data is now ready to be returned to Spotfire®.
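One way to assemble such a tall table is sketched below. The first file name is hypothetical (only the 20190714 report is named in this article), parse_report() is a made-up stand-in that returns invented numbers instead of reading a pdf, and taking the time stamp from the file name is an assumption about how each block might be labeled.

```r
# File names following the pattern used above; the first one is hypothetical.
files = c("SITREP_EVD_DRC_20190707-eng.pdf",
          "SITREP_EVD_DRC_20190714-eng.pdf")

# Hypothetical stand-in for the per-file parsing step (invented numbers).
parse_report = function(file) {
  data.frame(zone = c("Zone A", "Zone B"),
             total_cases = if (grepl("20190707", file)) c(10L, 4L) else c(12L, 5L))
}

# Pull the 8-digit date out of the file name and convert it to a Date.
report_date = function(file) {
  stamp = sub(".*_(\\d{8})-.*", "\\1", basename(file))
  as.Date(stamp, format = "%Y%m%d")
}

# Parse each file, tag its rows with the report date, and stack the blocks.
blocks = lapply(files, function(f) {
  block = parse_report(f)
  block$report_date = report_date(f)
  block
})
tall = do.call(rbind, blocks)
print(tall)
```

Each row of the result carries its own date, so the table can be filtered or animated by time once it is back in Spotfire®.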
These regions are identified by their WHO Health Zones; to show them on a map we need a shapefile with the corresponding boundaries. The "Map for Environment" site provided these regions in a shapefile, which we use here (Figures 1 and 2). That link was no longer working as of April 2024; since this article was written, the original data source seems to have been replaced with https://data.humdata.org/dataset/drc-health-data?
With the data in Spotfire® we can also look at how these numbers change from one report to the next, which gives the number of new Ebola cases. This shows that the new cases come in waves:
Figure 5: New Ebola cases shown by date and by province. There have been a number of separate waves of new cases over the past year.
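The new-case calculation can also be sketched in R, assuming the reported counts are cumulative totals per health zone. The table below is invented, and the column names are the hypothetical ones used for illustration only; in Spotfire® the same differencing could be done with a calculated column instead.

```r
# Invented cumulative totals for two zones over three weekly reports.
tall = data.frame(
  zone = rep(c("Zone A", "Zone B"), each = 3),
  report_date = rep(as.Date(c("2019-07-01", "2019-07-08", "2019-07-15")),
                    times = 2),
  total_cases = c(10L, 12L, 17L, 4L, 4L, 9L)
)

# Order by zone and date so that differences run forward in time.
tall = tall[order(tall$zone, tall$report_date), ]

# New cases = change in the cumulative total since the previous report.
# The first report in each zone has no predecessor, so it is set to NA.
tall$new_cases = ave(tall$total_cases, tall$zone,
                     FUN = function(x) c(NA, diff(x)))
print(tall)
```

A negative difference would indicate that an earlier count was revised downward, which is worth checking before plotting.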
Figures 1, 2, 3, 4 and 5 are the result of ingesting the data from the original pdf documents, bringing the resulting data into Spotfire®, and then using Spotfire®'s visualizations to gain valuable insight into the developing outbreak.
Summary
Spotfire® together with TERR® is a powerful combination for conveniently reading data from sources such as these pdf documents, which are commonly associated with current events. We've used this combination to ingest data from a year's worth of weekly WHO pdf situation reports describing the ongoing Ebola virus disease (EVD) outbreak.
This is a developing situation that is being monitored by the World Health Organization, the Centers for Disease Control and Prevention, and other organizations; refer to their websites for updates.
6 Aug 2019