Data Requirements
Our sample data combine historical patient interactions, patient demographic and health profiles, and hospital treatment plans. The historical patient data include admission data from the current hospital visit as well as records of prior emergency room and primary care provider visits.
Joins and Feature Engineering
First template workflow, for joining the source datasets and engineering features.
Our first flow joins the various sources of patient data into a master dataset with one row per patient and diagnostic code combination. Before joining, we compute aggregations that summarize each patient's medical history. To the historical patient interactions dataset we apply two windowing functions, Aggregate and Lag/Lead. Aggregate computes the total number of admissions, the total duration in days of those admissions, and the most recent admit date within a five-year window. Lag/Lead calculates, for a given admission, the discharge date of the previous encounter and the admit date of the subsequent encounter.
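To make the windowing logic concrete, here is a minimal pandas sketch of equivalent aggregations. The Playbook itself uses visual operators, and the column names here (patient_id, admit_date, discharge_date) are illustrative assumptions, not the Playbook's actual schema.

```python
import pandas as pd

# Hypothetical historical-interactions table; all column names are illustrative.
visits = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "admit_date": pd.to_datetime(
        ["2015-01-10", "2017-03-02", "2019-06-20", "2018-11-05", "2019-01-15"]),
    "discharge_date": pd.to_datetime(
        ["2015-01-14", "2017-03-08", "2019-06-25", "2018-11-09", "2019-01-20"]),
}).sort_values(["patient_id", "admit_date"])

# Aggregate: summarize each patient's history within a five-year window
# ending at that patient's most recent admission (window logic simplified).
cutoff = (visits.groupby("patient_id")["admit_date"].transform("max")
          - pd.DateOffset(years=5))
recent = visits[visits["admit_date"] >= cutoff].copy()
recent["stay_days"] = (recent["discharge_date"] - recent["admit_date"]).dt.days
history = recent.groupby("patient_id").agg(
    total_admissions=("admit_date", "count"),
    total_stay_days=("stay_days", "sum"),
    most_recent_admit=("admit_date", "max"),
).reset_index()

# Lag/Lead: previous discharge date and next admit date for each encounter.
visits["prev_discharge"] = visits.groupby("patient_id")["discharge_date"].shift(1)
visits["next_admit"] = visits.groupby("patient_id")["admit_date"].shift(-1)

# Join the aggregated history back onto the per-encounter rows.
master = visits.merge(history, on="patient_id", how="left")
print(master)
```

The gap between prev_discharge and the current admit_date is what makes readmission intervals visible downstream, which is why the Lag/Lead step runs before the join.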
Data Exploration
Second template workflow, for exploratory data visualization.
The second workflow performs several forms of exploratory data visualization on the ETLed dataset produced by the previous flow. These visualizations expose several trends: patients with long stays in past hospital visits, higher hospital expenses, or a long stay in the current visit are at elevated risk of readmission. We also use a Gradient Boosting Classification operator to explore which features are likely to be most useful for classification, knowledge that we apply in the next workflow.
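As a rough illustration of that feature-importance step, the following sketch fits a scikit-learn GradientBoostingClassifier on a stand-in dataset. The feature names and values are invented for the example and do not come from the Playbook's data.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for the ETLed dataset from the previous flow; names are illustrative.
df = pd.DataFrame({
    "total_stay_days":   [2, 30, 5, 45, 1, 28, 3, 40],
    "expenses":          [1000, 9000, 1500, 12000, 800, 8500, 1200, 11000],
    "current_stay_days": [1, 14, 2, 20, 1, 12, 2, 18],
    "readmitted":        [0, 1, 0, 1, 0, 1, 0, 1],
})
X, y = df.drop(columns="readmitted"), df["readmitted"]

gbc = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances suggest which features to carry into modeling.
print(pd.Series(gbc.feature_importances_, index=X.columns)
        .sort_values(ascending=False))
```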
Modeling
Third template workflow, for predictive modeling.
In the final workflow, we build logistic regression and random forest classification models to predict readmission risk. Random forest performs best with 92% accuracy, but its precision for the readmission class is only 43%, meaning more than half of the predicted readmissions are false positives. Readmissions are comparatively rare in the dataset (about 5% of cases), and achieving high precision on rare events is difficult.
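For readers who want to reproduce the flavor of this comparison outside the visual workflow, here is a scikit-learn sketch on a synthetic dataset with a roughly 5% positive class. The data are generated for illustration, so the metrics it prints will not match the Playbook's 92% and 43% figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the readmission data: roughly 5% positive class.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    preds = model.fit(X_tr, y_tr).predict(X_te)
    print(type(model).__name__,
          "accuracy:", round(accuracy_score(y_te, preds), 3),
          "precision:", round(precision_score(y_te, preds, zero_division=0), 3))
```

Note how accuracy alone flatters both models on imbalanced data; a classifier that never predicts readmission would already score about 95%, which is why the per-class precision matters here.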
Key Technique - Resampling
Because readmissions are underrepresented, we use the Resampling operator to overrepresent those cases in the training set. After resampling, the training set contains an equal share of each readmission class, which ensures that the classification algorithms see sufficient samples from each category. The Resampling operator lets you choose the relative balance of the categories, which is one way to influence the tradeoff between false positives and false negatives.
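Here is a minimal sketch of the same idea in scikit-learn, upsampling the minority class with replacement to a 50/50 split. This approximates what the Resampling operator does but is not its actual implementation, and the column names are invented for the example.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training set with a rare positive class (5% readmissions).
train = pd.DataFrame({"feature": range(100),
                      "readmitted": [1] * 5 + [0] * 95})

majority = train[train["readmitted"] == 0]
minority = train[train["readmitted"] == 1]

# Upsample the minority class with replacement to match the majority,
# yielding the 50/50 class balance described above.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["readmitted"].value_counts(normalize=True))
```

Resampling should be applied only to the training set, never to the evaluation set, so that reported metrics still reflect the true class distribution.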
Check It Out!
For access to this Playbook, including its workflows, sample data, a PowerPoint summary, and expert support from Spotfire® Data Science data scientists, contact your Spotfire® Data Science sales representative.