Overview

This section documents the data preparation for a machine learning model that predicts hospital readmission within 30 days.

Hospital readmission is a real-world problem and an ongoing topic in improving health care quality and patient experience while ensuring cost-effectiveness. Information on the Hospital Readmissions Reduction Program (HRRP) is publicly available on the web site of CMS, the Centers for Medicare & Medicaid Services.

The dataset, Diabetes 130-US hospitals for years 1999-2008 Data Set, was downloaded from the UCI Machine Learning Repository. It represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks, with over 100,000 observations (101,766 as imported) and 50 features representing patient and hospital outcomes.

The machine learning model was developed in R and employed the SuperLearner package, which uses ensemble learning to optimize results. To meet the computational needs, most of the ensemble learning ran on an E16 virtual machine in the Microsoft Azure public cloud with 16 vCPUs and 128 GB RAM, as shown below. For a training set of 10,000 observations and 21 predictors, the model generally took about 2 to 3 hours to train and more than 6 hours to carry out 10-fold cross-validation with three algorithms. The demand for computing resources was significant.

Virtual Machine Hardware Configuration

Some variables had high missingness and were unusable. A few considered missing at random (MAR) were imputed using the Multivariate Imputation by Chained Equations (mice) package.

Feature selection was largely based on the output of Boruta. In several test runs, Boruta took about 30 minutes and reached a decision on every variable (21 confirmed important and 5 unimportant) within the 100 iterations initially set.

Dataset

The dataset was first downloaded from the above link and imported into RStudio.

## Diabetes data set imported ( 101766  observations with  50  variables )

Removed the ID field, encounter_id.

## Removed encounter_id. The data set now has  49  variables.
## 'data.frame':    101766 obs. of  49 variables:
##  $ patient_nbr             : int  8222157 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 ...
##  $ race                    : Factor w/ 6 levels "?","AfricanAmerican",..: 4 4 2 4 4 4 4 4 4 4 ...
##  $ gender                  : Factor w/ 3 levels "Female","Male",..: 1 1 1 2 2 2 2 2 1 1 ...
##  $ age                     : Factor w/ 10 levels "[0-10)","[10-20)",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weight                  : Factor w/ 10 levels "?","[0-25)","[100-125)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ admission_type_id       : int  6 1 1 1 1 2 3 1 2 3 ...
##  $ discharge_disposition_id: int  25 1 1 1 1 1 1 1 1 3 ...
##  $ admission_source_id     : int  1 7 7 7 7 2 2 7 4 4 ...
##  $ time_in_hospital        : int  1 3 2 2 1 3 4 5 13 12 ...
##  $ payer_code              : Factor w/ 18 levels "?","BC","CH",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ medical_specialty       : Factor w/ 73 levels "?","AllergyandImmunology",..: 39 1 1 1 1 1 1 1 1 20 ...
##  $ num_lab_procedures      : int  41 59 11 44 51 31 70 73 68 33 ...
##  $ num_procedures          : int  0 0 5 1 0 6 1 0 2 3 ...
##  $ num_medications         : int  1 18 13 16 8 16 21 12 28 18 ...
##  $ number_outpatient       : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ number_emergency        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ number_inpatient        : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ diag_1                  : Factor w/ 717 levels "?","10","11",..: 126 145 456 556 56 265 265 278 254 284 ...
##  $ diag_2                  : Factor w/ 749 levels "?","11","110",..: 1 81 80 99 26 248 248 316 262 48 ...
##  $ diag_3                  : Factor w/ 790 levels "?","11","110",..: 1 123 768 250 88 88 772 88 231 319 ...
##  $ number_diagnoses        : int  1 9 6 7 5 9 7 8 8 8 ...
##  $ max_glu_serum           : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1Cresult               : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ metformin               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
##  $ repaglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ nateglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ chlorpropamide          : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ glimepiride             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
##  $ acetohexamide           : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ glipizide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 3 2 2 2 3 2 ...
##  $ glyburide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 3 2 2 ...
##  $ tolbutamide             : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pioglitazone            : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ rosiglitazone           : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 3 ...
##  $ acarbose                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ miglitol                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ troglitazone            : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tolazamide              : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ examide                 : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ citoglipton             : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ insulin                 : Factor w/ 4 levels "Down","No","Steady",..: 2 4 2 4 3 3 3 2 3 3 ...
##  $ glyburide.metformin     : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ glipizide.metformin     : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ glimepiride.pioglitazone: Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metformin.rosiglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metformin.pioglitazone  : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ change                  : Factor w/ 2 levels "Ch","No": 2 1 2 1 1 2 1 2 1 1 ...
##  $ diabetesMed             : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ readmitted              : Factor w/ 3 levels "<30",">30","NO": 3 2 3 3 3 2 3 2 3 3 ...

Missingness

In the data set, many variables contained ‘?’ as a value. This evidently indicated a missing value, so each ‘?’ was replaced with NA.

## Missing values indicated by "?" =  192849
## Replacing  192849  values stored as "?" with NA.
## Total NA count =  192849
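
The replacement step above can be sketched in base R as follows. This is a minimal illustration on a toy data frame with made-up values, not the author's actual code; the names `df` and `n_unknown` are assumptions introduced here.

```r
# Toy data frame standing in for the imported data (hypothetical values).
df <- data.frame(
  race             = c("Caucasian", "?", "AfricanAmerican"),
  weight           = c("?", "?", "[75-100)"),
  time_in_hospital = c(1L, 3L, 2L),
  stringsAsFactors = TRUE
)

# Count the "?" placeholders across all columns before replacing them.
n_unknown <- sum(vapply(
  df,
  function(col) sum(as.character(col) == "?", na.rm = TRUE),
  numeric(1)
))

# Replace "?" with NA column by column and drop the now-unused "?" level.
df[] <- lapply(df, function(col) {
  col[which(as.character(col) == "?")] <- NA
  if (is.factor(col)) droplevels(col) else col
})

cat("Missing values indicated by \"?\" = ", n_unknown, "\n")
cat("Total NA count = ", sum(is.na(df)), "\n")
```

An alternative is to treat ‘?’ as missing at import time by passing `na.strings = "?"` to `read.csv`.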

Examining the missingness of the dataset revealed that a disproportionate amount of data was missing in a few variables, particularly weight, medical_specialty, and payer_code.

Percentage of Values Missing

  • Each of the aforementioned variables had a high percentage of missing values, which made them essentially unusable. In the histogram, the red line indicates a threshold of 30%; variables with more than 30% of values missing were removed from consideration at this point.
##              race            weight        payer_code medical_specialty 
##              2.23             96.86             39.56             49.08 
##            diag_1            diag_2            diag_3 
##              0.02              0.35              1.40

## Removing the three variables:  weight payer_code medical_specialty
## At this time, the data set has  101766  observations with  46  variables.
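
A minimal base R sketch of the threshold rule described above: compute the percentage of NA values per column and drop every column above 30%. The toy data and the names `pct_missing`, `threshold`, and `to_drop` are assumptions for illustration.

```r
# Toy data: weight is 75% missing, race 25%, diag_1 0% (hypothetical values).
df <- data.frame(
  weight = c(NA, NA, NA, "[75-100)"),
  race   = c("Caucasian", NA, "Asian", "Other"),
  diag_1 = c("250", "428", "786", "414"),
  stringsAsFactors = TRUE
)

# Percentage of missing values per variable.
pct_missing <- round(100 * colMeans(is.na(df)), 2)
print(pct_missing[pct_missing > 0])

# Drop every variable whose missingness exceeds the threshold.
threshold <- 30
to_drop   <- names(pct_missing)[pct_missing > threshold]
df        <- df[, setdiff(names(df), to_drop), drop = FALSE]

cat("Removing:", to_drop, "\n")
cat("Variables remaining:", ncol(df), "\n")
```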

Near Zero-Variance Variables

Two variables, examide and citoglipton, had only one level and no missing values. These two variables therefore had zero variance, were not informative, and could contribute nothing to predicting an outcome. All other near zero-variance (nzv) variables were also removed from the dataset.

Although some may argue that near zero-variance variables may in fact have some influence, in the diabetes dataset a few of them were factor variables with multiple levels. Keeping them would later generate a considerable number of dummy variables and increase computational complexity and resource requirements. Consequently, all nzv variables were removed.

## caret reports  18 near zero-variables as the following:
##  [1] "max_glu_serum"            "repaglinide"             
##  [3] "nateglinide"              "chlorpropamide"          
##  [5] "glimepiride"              "acetohexamide"           
##  [7] "tolbutamide"              "acarbose"                
##  [9] "miglitol"                 "troglitazone"            
## [11] "tolazamide"               "examide"                 
## [13] "citoglipton"              "glyburide.metformin"     
## [15] "glipizide.metformin"      "glimepiride.pioglitazone"
## [17] "metformin.rosiglitazone"  "metformin.pioglitazone"
## The listed,  18  near-zero variables have been removed.
## 
##  At this time, the data set has  101766  observations with  28  variables remaining.
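
The output above comes from caret. To show the idea without the package, here is a hedged base R sketch of a near zero-variance test. It flags a variable when the most common value dominates the second most common one and there are few distinct values. The function `is_nzv` is hypothetical, and its cutoffs (95/5 and 10%) are assumed here to mirror the defaults of caret's `nearZeroVar`; the author's exact settings are not stated.

```r
# Hypothetical helper: TRUE when a variable has zero or near-zero variance.
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)  # table() ignores NA by default
  counts <- counts[counts > 0]                 # ignore unused factor levels
  if (length(counts) <= 1) return(TRUE)        # a single value: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]      # most common vs. runner-up
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique < unique_cut
}

# Toy data (hypothetical values): one constant, one dominated, one balanced.
df <- data.frame(
  examide  = factor(rep("No", 100)),
  acarbose = factor(c(rep("No", 99), "Steady")),
  insulin  = factor(rep(c("No", "Steady"), 50))
)

nzv_names <- names(df)[vapply(df, is_nzv, logical(1))]
df_kept   <- df[, setdiff(names(df), nzv_names), drop = FALSE]

print(nzv_names)
```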

Multiple Encounters of a Patient

The data set contained multiple rows with the same patient_nbr, i.e. patient number. It was unclear whether these encounters (visits) were independent. Multiple visits by the same patient might be related and therefore introduce bias through correlated observations. To eliminate this risk, exactly one encounter per patient was kept, the one with the maximum time_in_hospital, on the assumption that time_in_hospital is characteristic of readmission and would preserve sufficient variance in the training data.

## patient_nbr with multiple encounters =  30248
## Before eliminating multiple encounters of a patient, total  101766  observations
## After eliminating multiple encounters of a patient, total  71518  observations
## Multiple encounters of a patient now =  0
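
One way to keep a single encounter per patient, sketched in base R on toy data. The source does not say how ties in time_in_hospital were broken; this sketch keeps the earliest row among ties, which is an assumption.

```r
# Toy encounters (hypothetical values): patients 101 and 103 appear repeatedly.
df <- data.frame(
  patient_nbr      = c(101L, 101L, 102L, 103L, 103L, 103L),
  time_in_hospital = c(2L, 5L, 3L, 4L, 4L, 1L)
)

# Sort so that each patient's longest stay comes first,
# then keep only the first row seen for each patient.
ord       <- order(df$patient_nbr, -df$time_in_hospital)
df_sorted <- df[ord, ]
df_unique <- df_sorted[!duplicated(df_sorted$patient_nbr), ]

cat("Before:", nrow(df), "observations\n")
cat("After: ", nrow(df_unique), "observations\n")
cat("Remaining duplicates:", sum(duplicated(df_unique$patient_nbr)), "\n")
```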

After handling the multiple encounters of a patient, removed the patient ID, patient_nbr, from the dataset; race was dropped in the same step, as the output below shows.

## Dropping the ID fields,  patient_nbr  and  race
## At this time, the data set has  71518  observations with  26  variables.

Categorical Variables

The next step was to prepare the categorical variables. For feature descriptions, refer to IDs_mapping.csv from the original dataset downloaded from the UCI Machine Learning Repository.

Gender

  • Eliminated the ‘unknown’ type.
## "gender" is now  factor  with the levels:  Female Male

Age

Consolidated the 10-level factor into 3 numeric values:

  • [0-10), [10-20), [20-30), [30-40), [40-50) , [50-60) as 1
  • [60-70), [70-80) as 1.5
  • [80-90), [90-100) as 2
## Considering those older than 60 are twice more likely to be readmitted.
## "age" is now  numeric  with the unique values:  1.5 1 2 
##  where age<60 is assigned as 1, 60<=age<80 1.5, and age>80 as 2
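
The age consolidation listed above can be expressed as a named lookup vector. A minimal sketch; the names `age_weight` and `age_num` are assumptions for illustration.

```r
# A few age brackets as they appear in the raw data (toy sample).
age <- factor(c("[0-10)", "[50-60)", "[60-70)", "[70-80)", "[80-90)", "[90-100)"))

# Lookup from the original 10 brackets to the 3 numeric values.
age_weight <- c(
  "[0-10)"  = 1,   "[10-20)" = 1,   "[20-30)" = 1,
  "[30-40)" = 1,   "[40-50)" = 1,   "[50-60)" = 1,
  "[60-70)" = 1.5, "[70-80)" = 1.5,
  "[80-90)" = 2,   "[90-100)" = 2
)

# Index the lookup by bracket label; unname() drops the labels again.
age_num <- unname(age_weight[as.character(age)])
print(age_num)
```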

Admission Type

  • Changed from 8 levels to 2.
## "admission_type_id" is now  factor  with levels:  k u 
##  where u: unknown, k: known

Admission Source

  • Consolidated from 25 levels to 5.
## "admission_source_id" is now  factor  with levels:  b o r t u 
##  where r: referral, t: transfer, u: unknown, o: other, b: birth

Disposition Ids

  • Removed the disposition codes associated with ‘Expired’, since those patients cannot be readmitted.
  • Consolidated the remaining disposition codes into 4 levels.
## Before removing "discharge_disposition_id" of 11,19,20,21, 
## there were  71518  observations with unique ids: 
##  1 3 25 6 2 11 5 8 13 4 18 17 22 14 23 7 28 27 15 10 16 9 20 24 12 19
## After removing "discharge_disposition_id" of 11,19,20,21,
## there were  70245  observations with unique ids: 
##  1 3 25 6 2 5 8 13 4 18 17 22 14 23 7 28 27 15 10 16 9 24 12
## "discharge_disposition_id)" is now  factor  with levels:  d h o u 
##  where d: discharge, h: hospice, u: unknown, o: other
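
A hedged sketch of the two steps above. The expired codes (11, 19, 20, 21) come from the output; the assignment of the remaining codes to d, h, o, and u is a hypothetical mapping chosen for illustration, since the author's exact grouping is not given.

```r
# Toy disposition codes (hypothetical sample).
df <- data.frame(discharge_disposition_id = c(1L, 11L, 13L, 25L, 3L, 19L, 6L))

# Step 1: remove encounters whose disposition indicates the patient expired.
expired_codes <- c(11L, 19L, 20L, 21L)
df <- df[!(df$discharge_disposition_id %in% expired_codes), , drop = FALSE]

# Step 2: collapse the remaining codes into four levels.
# ASSUMPTION: this code-to-level mapping is illustrative only.
group_disposition <- function(id) {
  out <- rep("o", length(id))            # default: other
  out[id %in% c(1, 6, 8)]    <- "d"      # discharged home
  out[id %in% c(13, 14)]     <- "h"      # hospice
  out[id %in% c(18, 25, 26)] <- "u"      # unknown or not mapped
  factor(out, levels = c("d", "h", "o", "u"))
}

df$discharge_disposition_id <- group_disposition(df$discharge_disposition_id)
print(table(df$discharge_disposition_id))
```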

Diagnostic Information

There were three variables for diagnostic information, each with more than 700 levels. Following Table 2 of Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records, each was converted into 9 categories.

diag_1

## *** diag_1 with  717  levels
## *** diag_1 is now converted to  9  levels as the following: 
##  Circulatory Diabetes Digestive Genitourinary Injury Musculoskeletal Neoplasms Other Respiratory

diag_2

## *** diag_2 with  749  levels
## *** diag_2 is now converted to  9  levels as the following:
##  Circulatory Diabetes Digestive Genitourinary Injury Musculoskeletal Neoplasms Other Respiratory

diag_3

## *** diag_3 with  790  levels
## *** diag_3 is now converted to  9  levels as the following: 
##  Circulatory Diabetes Digestive Genitourinary Injury Musculoskeletal Neoplasms Other Respiratory
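
The grouping can be sketched as a function over ICD-9 codes. The ranges below are this sketch's reading of the cited table and should be checked against it before reuse; the function name `group_icd9` and the choice to leave missing codes as NA are assumptions.

```r
# Hypothetical helper: map ICD-9 codes to the 9 diagnosis groups.
group_icd9 <- function(code) {
  code <- as.character(code)
  num  <- suppressWarnings(as.numeric(code))   # E and V codes become NA here
  grp  <- rep("Other", length(code))           # default group
  # TRUE where the numeric code lies in [lo, hi + 1), so 250.83 matches 250.
  in_range <- function(lo, hi) !is.na(num) & num >= lo & num < hi + 1

  grp[in_range(390, 459) | in_range(785, 785)] <- "Circulatory"
  grp[in_range(460, 519) | in_range(786, 786)] <- "Respiratory"
  grp[in_range(520, 579) | in_range(787, 787)] <- "Digestive"
  grp[in_range(250, 250)]                      <- "Diabetes"
  grp[in_range(800, 999)]                      <- "Injury"
  grp[in_range(710, 739)]                      <- "Musculoskeletal"
  grp[in_range(580, 629) | in_range(788, 788)] <- "Genitourinary"
  grp[in_range(140, 239)]                      <- "Neoplasms"
  grp[is.na(code)] <- NA                       # ASSUMPTION: keep missing as NA
  factor(grp)
}

diag_1  <- c("250.83", "428", "786", "V57", "8", "715", NA)
grouped <- group_icd9(diag_1)
print(as.character(grouped))
```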

Response Variable

  • The response variable, readmitted, originally had three levels: <30, >30, and NO.
  • For classification, converted to 2 levels, no and yes, then to numeric, 0 and 1, respectively.
## "readmitted" levels:  <30 >30 NO
## "readmitted" is now a  numeric  with unique values:  0 1
## At this time, the dataset has  70245  observations with  26  variables.
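
A minimal sketch of the response conversion. It assumes that only ‘<30’ counts as a readmission (yes, 1), consistent with the stated goal of predicting readmission within 30 days, and that both ‘>30’ and ‘NO’ map to no (0); the source does not spell out this mapping.

```r
# Toy response values as they appear in the raw data.
readmitted <- factor(c("NO", ">30", "<30", "NO"))

# ASSUMPTION: "<30" is the positive class; everything else is negative.
readmitted_num <- as.numeric(readmitted == "<30")

print(readmitted_num)
```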

Data Types

For convenience, the variables are separated here by data type, i.e. factor, numeric, and integer.

## $factor
##  [1] "gender"                   "admission_type_id"       
##  [3] "discharge_disposition_id" "admission_source_id"     
##  [5] "diag_1"                   "diag_2"                  
##  [7] "diag_3"                   "A1Cresult"               
##  [9] "metformin"                "glipizide"               
## [11] "glyburide"                "pioglitazone"            
## [13] "rosiglitazone"            "insulin"                 
## [15] "change"                   "diabetesMed"             
## 
## $numeric
## [1] "age"        "readmitted"
## 
## $integer
## [1] "time_in_hospital"   "num_lab_procedures" "num_procedures"    
## [4] "num_medications"    "number_outpatient"  "number_emergency"  
## [7] "number_inpatient"   "number_diagnoses"
##  factor numeric integer 
##    "16"    " 2"    " 8"
## Total  26  variables
##  time_in_hospital num_lab_procedures num_procedures num_medications
##  Min.   : 1.000   Min.   :  1.00     Min.   :0.00   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 32.00     1st Qu.:0.00   1st Qu.:10.00  
##  Median : 4.000   Median : 45.00     Median :1.00   Median :15.00  
##  Mean   : 4.751   Mean   : 43.95     Mean   :1.48   Mean   :16.34  
##  3rd Qu.: 6.000   3rd Qu.: 58.00     3rd Qu.:2.00   3rd Qu.:21.00  
##  Max.   :14.000   Max.   :132.00     Max.   :6.00   Max.   :81.00  
##  number_outpatient number_emergency  number_inpatient  number_diagnoses
##  Min.   : 0.000    Min.   : 0.0000   Min.   : 0.0000   Min.   : 1.000  
##  1st Qu.: 0.000    1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 6.000  
##  Median : 0.000    Median : 0.0000   Median : 0.0000   Median : 8.000  
##  Mean   : 0.302    Mean   : 0.1233   Mean   : 0.3203   Mean   : 7.325  
##  3rd Qu.: 0.000    3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 9.000  
##  Max.   :40.000    Max.   :63.0000   Max.   :19.0000   Max.   :16.000
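
Grouping variable names by data type, as in the listing above, takes only a few lines of base R. A sketch on a toy data frame; `vars_by_type` is a name introduced here.

```r
# Toy data frame with one factor, two numeric, and one integer column.
df <- data.frame(
  gender           = factor(c("Female", "Male")),
  age              = c(1, 1.5),
  time_in_hospital = c(3L, 2L),
  readmitted       = c(0, 1)
)

# First class of each column, then the column names split by that class.
col_class    <- vapply(df, function(col) class(col)[1], character(1))
vars_by_type <- split(names(df), col_class)

print(vars_by_type)
print(lengths(vars_by_type))
```

The integer group can then be passed directly to the centering and scaling step.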

Numeric Variables

All numeric variables were centered and normalized. The correlation plot showed two correlations at about the 0.4 level, which was considered acceptable overall. Given the correlation coefficients and general healthcare practice, it is logical to assume that the more procedures are performed, the more medications are likely used and the longer the observation and recovery time required. A more advanced model might consider interactions among the four variables that capture procedures, medications, and time in hospital. No interactions were modeled here, however.

## [1] "time_in_hospital"   "num_lab_procedures" "num_procedures"    
## [4] "num_medications"    "number_outpatient"  "number_emergency"  
## [7] "number_inpatient"   "number_diagnoses"

Centering and Normalization

## Centering and normalizing the  8 integer variables
##   time_in_hospital num_lab_procedures num_procedures num_medications
## 1         2.918277       -0.800474000      0.8571450       1.3588927
## 2         2.918277        1.709052148     -0.2705126       0.3100768
## 3         2.918277       -0.147997201     -0.2705126       0.1935417
## 4         2.918277        0.002574368      0.2933162      -0.5056689
## 5         2.918277        1.357718487      0.8571450       1.9415682
## 6         2.918277        1.357718487     -0.2705126       1.8250331
##   number_outpatient number_emergency number_inpatient number_diagnoses
## 1        -0.2745189       -0.1937564       -0.3935239        0.3412069
## 2        -0.2745189       -0.1937564       -0.3935239        0.3412069
## 3        -0.2745189       -0.1937564       -0.3935239       -1.6794122
## 4        -0.2745189       -0.1937564       -0.3935239       -2.1845669
## 5        -0.2745189       -0.1937564       -0.3935239        0.8463617
## 6        -0.2745189       -0.1937564       -0.3935239        0.8463617
##  time_in_hospital  num_lab_procedures num_procedures    num_medications  
##  Min.   :-1.1836   Min.   :-2.15562   Min.   :-0.8343   Min.   :-1.7876  
##  1st Qu.:-0.8681   1st Qu.:-0.59971   1st Qu.:-0.8343   1st Qu.:-0.7387  
##  Median :-0.2370   Median : 0.05276   Median :-0.2705   Median :-0.1561  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3941   3rd Qu.: 0.70524   3rd Qu.: 0.2933   3rd Qu.: 0.5431  
##  Max.   : 2.9183   Max.   : 4.41934   Max.   : 2.5486   Max.   : 7.5353  
##  number_outpatient number_emergency  number_inpatient  number_diagnoses 
##  Min.   :-0.2745   Min.   :-0.1938   Min.   :-0.3935   Min.   :-3.1949  
##  1st Qu.:-0.2745   1st Qu.:-0.1938   1st Qu.:-0.3935   1st Qu.:-0.6691  
##  Median :-0.2745   Median :-0.1938   Median :-0.3935   Median : 0.3412  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2745   3rd Qu.:-0.1938   3rd Qu.:-0.3935   3rd Qu.: 0.8464  
##  Max.   :36.0839   Max.   :98.8311   Max.   :22.9485   Max.   : 4.3824
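
The zero means in the summary above are consistent with standardizing each variable, i.e. subtracting its mean and dividing by its standard deviation. A base R sketch under that assumption, using `scale()` on toy values; the author may have used a different function to the same effect.

```r
# Toy integer variables (hypothetical values).
df <- data.frame(
  time_in_hospital = c(1L, 3L, 2L, 14L),
  num_medications  = c(1L, 18L, 13L, 16L)
)

# Standardize every integer column: (x - mean(x)) / sd(x).
int_vars <- names(df)[vapply(df, is.integer, logical(1))]
df[int_vars] <- lapply(df[int_vars], function(col) as.numeric(scale(col)))

# After standardizing, each column has mean 0 and standard deviation 1.
print(round(colMeans(df), 10))
print(vapply(df, sd, numeric(1)))
```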

Visualization

Overall, nothing immediately raised a concern. There were, however, a few outliers in number_outpatient, number_inpatient, and number_emergency. Boxplots (not shown here) confirmed that these outliers had relatively extreme values compared with the other observations of each variable.