Data Preparation of Diabetes Dataset

Overview

This document describes the data preparation for a Machine Learning model that predicts hospital readmission within 30 days.

Hospital readmission is a real-world problem and an ongoing topic in improving health care quality and patient experience while keeping care cost-effective. Information on the Hospital Readmissions Reduction Program (HRRP) is publicly available on the web site of the Centers for Medicare & Medicaid Services (CMS).

The dataset, Diabetes 130-US hospitals for years 1999-2008 Data Set, was downloaded from the UCI Machine Learning Repository. It represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks, with more than 100,000 observations and 50 features representing patient and hospital outcomes.

The Machine Learning model was developed in R with the SuperLearner package, which uses ensemble learning to optimize the results. Most of the ensemble learning ran in the Microsoft Azure public cloud on an E16 Virtual Machine with 16 vCPUs and 128 GB RAM, as shown below. For a training set of 10,000 observations and 21 predictors, the model generally took about 2 to 3 hours to train and more than 6 hours to carry out 10-fold cross-validation with three algorithms. The demand for computing resources was significant.

Virtual Machine Hardware Configuration

Some variables had too many missing values to be usable. A few, considered missing at random (MAR), were imputed using the Multivariate Imputation by Chained Equations (mice) package.

Feature selection was largely based on the output from Boruta. In several test runs, Boruta took about 30 minutes and resolved all variables, 21 important and 5 unimportant, within the 100 iterations initially set.

Dataset

The dataset was first downloaded from the UCI Machine Learning Repository and imported into RStudio.

## Diabetes data set imported ( 101766  observations with  50  variables )

Removed the ID field, encounter_id.

## Removed encounter_id. The data set now has  49  variables.

## 'data.frame':    101766 obs. of  49 variables:
##  $ patient_nbr             : int  8222157 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 ...
##  $ race                    : Factor w/ 6 levels "?","AfricanAmerican",..: 4 4 2 4 4 4 4 4 4 4 ...
##  $ gender                  : Factor w/ 3 levels "Female","Male",..: 1 1 1 2 2 2 2 2 1 1 ...
##  $ age                     : Factor w/ 10 levels "[0-10)","[10-20)",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weight                  : Factor w/ 10 levels "?","[0-25)","[100-125)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ admission_type_id       : int  6 1 1 1 1 2 3 1 2 3 ...
##  $ discharge_disposition_id: int  25 1 1 1 1 1 1 1 1 3 ...
##  $ admission_source_id     : int  1 7 7 7 7 2 2 7 4 4 ...
##  $ time_in_hospital        : int  1 3 2 2 1 3 4 5 13 12 ...
##  $ payer_code              : Factor w/ 18 levels "?","BC","CH",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ medical_specialty       : Factor w/ 73 levels "?","AllergyandImmunology",..: 39 1 1 1 1 1 1 1 1 20 ...
##  $ num_lab_procedures      : int  41 59 11 44 51 31 70 73 68 33 ...
##  $ num_procedures          : int  0 0 5 1 0 6 1 0 2 3 ...
##  $ num_medications         : int  1 18 13 16 8 16 21 12 28 18 ...
##  $ number_outpatient       : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ number_emergency        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ number_inpatient        : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ diag_1                  : Factor w/ 717 levels "?","10","11",..: 126 145 456 556 56 265 265 278 254 284 ...
##  $ diag_2                  : Factor w/ 749 levels "?","11","110",..: 1 81 80 99 26 248 248 316 262 48 ...
##  $ diag_3                  : Factor w/ 790 levels "?","11","110",..: 1 123 768 250 88 88 772 88 231 319 ...
##  $ number_diagnoses        : int  1 9 6 7 5 9 7 8 8 8 ...
##  $ max_glu_serum           : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1Cresult               : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ metformin               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
##  $ repaglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ nateglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ chlorpropamide          : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ glimepiride             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
##  $ acetohexamide           : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ glipizide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 3 2 2 2 3 2 ...
##  $ glyburide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 3 2 2 ...
##  $ tolbutamide             : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pioglitazone            : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ rosiglitazone           : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 3 ...
##  $ acarbose                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ miglitol                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ troglitazone            : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tolazamide              : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ examide                 : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ citoglipton             : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ insulin                 : Factor w/ 4 levels "Down","No","Steady",..: 2 4 2 4 3 3 3 2 3 3 ...
##  $ glyburide.metformin     : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ glipizide.metformin     : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ glimepiride.pioglitazone: Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metformin.rosiglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metformin.pioglitazone  : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ change                  : Factor w/ 2 levels "Ch","No": 2 1 2 1 1 2 1 2 1 1 ...
##  $ diabetesMed             : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ readmitted              : Factor w/ 3 levels "<30",">30","NO": 3 2 3 3 3 2 3 2 3 3 ...

Missingness

In the data set, many variables contained ‘?’ as a value. This evidently indicated a missing value, so each ‘?’ was replaced with NA.

## Missing values indicated by "?" =  192849

## Replacing  192849  values stored as "?" with NA.

## Total NA count =  192849
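The replacement step can be sketched in base R as follows. The toy data frame and its values are made up for illustration; only the "?" convention comes from the dataset, and the actual code used for this report may differ.

```r
# Toy data frame standing in for the diabetes data (illustrative values only).
toy <- data.frame(
  race   = c("Caucasian", "?", "AfricanAmerican"),
  weight = c("?", "?", "[75-100)"),
  stringsAsFactors = TRUE
)

# Count the "?" placeholders before replacing them.
n_question <- sum(toy == "?", na.rm = TRUE)

# Replace "?" with NA column by column, then drop the now-unused "?" level.
toy[] <- lapply(toy, function(col) {
  col[col == "?"] <- NA
  droplevels(col)
})

# Total NA count after the replacement.
n_na <- sum(is.na(toy))
```

An alternative is to pass na.strings = "?" to read.csv() so that the placeholder is converted at import time.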

Examining the missingness of the dataset revealed that a disproportionate amount of data was missing in a few variables, particularly weight, medical_specialty, and payer_code.

Percentage of Values Missing

Each of these three variables had a high percentage of missing values, which made it essentially unusable. In the histogram, the red line indicated the threshold of 30%; variables with more than 30% of their values missing were removed from consideration at this point.

##              race            weight        payer_code medical_specialty 
##              2.23             96.86             39.56             49.08 
##            diag_1            diag_2            diag_3 
##              0.02              0.35              1.40

## Removing the three variables:  weight payer_code medical_specialty

## At this time, the data set has  101766  observations with  46  variables.
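A minimal sketch of the threshold rule, assuming a data frame in which missing values are already stored as NA. The toy data and the names pct_missing, threshold and to_drop are illustrative and not taken from the original code.

```r
# Toy data: weight is 75% missing, race 25%, age complete (illustrative values).
toy <- data.frame(
  weight = c(NA, NA, NA, "[75-100)"),
  race   = c("Caucasian", NA, "Asian", "Other"),
  age    = c("[50-60)", "[60-70)", "[70-80)", "[80-90)")
)

# Percentage of missing values per variable.
pct_missing <- round(colMeans(is.na(toy)) * 100, 2)

# Drop every variable whose missingness exceeds the 30% threshold.
threshold <- 30
to_drop   <- names(pct_missing)[pct_missing > threshold]
toy       <- toy[, !(names(toy) %in% to_drop), drop = FALSE]
```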

Near Zero-Variance Variables

Two variables, examide and citoglipton, had only one level and no missing values. They therefore had zero variance, were not informative, and would contribute little to predicting an outcome. All other near zero-variance (nzv) variables were also removed from the dataset.

Although some may argue that near zero-variance variables may in fact have some influence, in the diabetes dataset several of them were factor variables with multiple levels. Keeping them would later generate a considerable number of dummy variables and increase the computational complexity and resource requirements. Consequently, all nzv variables were removed.

## caret reports  18 near zero-variables as the following:

##  [1] "max_glu_serum"            "repaglinide"             
##  [3] "nateglinide"              "chlorpropamide"          
##  [5] "glimepiride"              "acetohexamide"           
##  [7] "tolbutamide"              "acarbose"                
##  [9] "miglitol"                 "troglitazone"            
## [11] "tolazamide"               "examide"                 
## [13] "citoglipton"              "glyburide.metformin"     
## [15] "glipizide.metformin"      "glimepiride.pioglitazone"
## [17] "metformin.rosiglitazone"  "metformin.pioglitazone"

## The listed,  18  near-zero variables have been removed.
## 
##  At this time, the data set has  101766  observations with  28  variables remaining.
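The report used caret to flag these variables. The sketch below is not caret's code but a base R approximation of the usual near zero-variance rule: the ratio of the most common value to the second most common is high, and the share of unique values is low. The helper is_nzv and the cut-offs 95/5 and 10 are assumptions that mirror the default arguments of caret's nearZeroVar().

```r
# Hypothetical helper approximating the near zero-variance rule.
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  # Tabulate observed values only (as.character drops unused factor levels).
  counts <- sort(table(as.character(x)), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)  # zero variance: a single value
  freq_ratio     <- counts[[1]] / counts[[2]]
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique <= unique_cut
}

# Toy data (illustrative values only).
toy <- data.frame(
  examide  = factor(rep("No", 100)),                        # one level only
  miglitol = factor(c(rep("No", 99), "Steady")),            # 99:1 imbalance
  insulin  = factor(rep(c("No", "Steady", "Up", "Down"), 25))
)

nzv_names <- names(toy)[vapply(toy, is_nzv, logical(1))]
toy       <- toy[, setdiff(names(toy), nzv_names), drop = FALSE]
```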

Multiple Encounters of a Patient

The data set contained multiple rows with the same patient_nbr, i.e. patient number. It was unclear whether these encounters, i.e. visits, were independent. Multiple visits by the same patient might be related, and such correlated encounters could introduce bias. To eliminate this risk, only one encounter per patient was kept, the one with the maximum time_in_hospital, on the assumption that time_in_hospital was characteristic for readmission and would present sufficient variance in the training data.

## patient_nbr with multiple encounters =  30248

## Before eliminating multiple encounters of a patient, total  101766  observations

## After eliminating multiple encounters of a patient, total  71518  observations

## Multiple encounters of a patient now =  0
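One way to keep a single encounter per patient is sketched below in base R. The toy data is illustrative, and the tie-breaking rule (keep the first row among equal maxima) is an assumption, since the report does not say how ties were resolved.

```r
# Toy encounters: patients 1 and 3 appear more than once (illustrative values).
toy <- data.frame(
  patient_nbr      = c(1, 1, 2, 3, 3, 3),
  time_in_hospital = c(2, 5, 4, 1, 7, 7)
)

# Sort so each patient's longest stay comes first ...
ord <- order(toy$patient_nbr, -toy$time_in_hospital)
toy <- toy[ord, ]

# ... then keep only the first row per patient.
kept <- toy[!duplicated(toy$patient_nbr), ]
```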

After processing the multiple encounters per patient, the patient ID was removed from the dataset.

## Dropping the ID fields,  patient_nbr  and  race

## At this time, the data set has  71518  observations with  26  variables.

Categorical Variables

The next step was to prepare the categorical variables. For feature descriptions, refer to IDs_mapping.csv in the original dataset downloaded from the UCI Machine Learning Repository.

The three variables diag_1, diag_2, and diag_3 each had more than 700 levels, which would require around 900 dummy variables and be computationally expensive to manage. To consolidate them, the levels of all three variables were converted into 9 categories following Table 2 of the research report, Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records. The programming was lengthy and tedious, but it reduced the complexity to a manageable level.

The levels of other factor variables were consolidated based on analysis of the dataset, general experience of receiving health care services, and common sense about what would likely deliver the most variance in the Machine Learning model.

Gender

Eliminated the ‘unknown’ type.

## "gender" is now  factor  with the levels:  Female Male

Age

Consolidated from a 10-level factor to 3 numeric values as:

[0-10), [10-20), [20-30), [30-40), [40-50), [50-60) as 1
[60-70), [70-80) as 1.5
[80-90), [90-100) as 2

## Considering those older than 60 are twice more likely to be readmitted.

## "age" is now  numeric  with the unique values:  1.5 1 2 
##  where age<60 is assigned as 1, 60<=age<80 1.5, and age>80 as 2
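The age mapping above can be written as a named lookup vector. This is a sketch; the names age_levels, age_score and toy_age are illustrative.

```r
# The ten original age brackets.
age_levels <- c("[0-10)", "[10-20)", "[20-30)", "[30-40)", "[40-50)",
                "[50-60)", "[60-70)", "[70-80)", "[80-90)", "[90-100)")

# Scores: under 60 -> 1, 60 to under 80 -> 1.5, 80 and above -> 2.
age_score <- c(rep(1, 6), rep(1.5, 2), rep(2, 2))
names(age_score) <- age_levels

# Toy factor standing in for the age column (illustrative values).
toy_age <- factor(c("[40-50)", "[60-70)", "[90-100)"), levels = age_levels)

# Look up each bracket's numeric score.
age_num <- unname(age_score[as.character(toy_age)])
```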

Admission Type

Changed from 8 levels to 2.

## "admission_type_id" is now  factor  with levels:  k u 
##  where u: unknown, k: known

Admission Source

Consolidated from 25 levels to 5.

## "admission_source_id" is now  factor  with levels:  b o r t u 
##  where r: referral, t: transfer, u: unknown, o: other, b: birth

Disposition Ids

Removed the disposition codes associated with ‘Expired’ since they are not relevant to readmission.
Consolidated the remaining codes into 4 levels.

## Before removing "discharge_disposition_id" of 11,19,20,21, 
## there were  71518  observations with unique ids: 
##  1 3 25 6 2 11 5 8 13 4 18 17 22 14 23 7 28 27 15 10 16 9 20 24 12 19

## After removing "discharge_disposition_id" of 11,19,20,21,
## there were  70245  observations with unique ids: 
##  1 3 25 6 2 5 8 13 4 18 17 22 14 23 7 28 27 15 10 16 9 24 12

## "discharge_disposition_id)" is now  factor  with levels:  d h o u 
##  where d: discharge, h: hospice, u: unknown, o: other
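The row filter for the ‘Expired’ codes can be sketched as below with toy data. The subsequent mapping of the remaining codes to d, h, o and u is not spelled out in this report, so it is not reproduced here.

```r
# Toy encounters (illustrative values only).
toy <- data.frame(
  discharge_disposition_id = c(1, 11, 3, 19, 6, 20, 21, 25),
  time_in_hospital         = c(2, 9, 4, 12, 3, 7, 5, 1)
)

# Codes associated with 'Expired', as listed in the output above.
expired_ids <- c(11, 19, 20, 21)

# Keep only encounters whose disposition is not one of those codes.
toy <- toy[!(toy$discharge_disposition_id %in% expired_ids), ]
```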

Diagnostic Information

There were three variables for diagnostic information, each with more than 700 levels. Per Table 2 of Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records, each was converted into 9 categories.

diag_1

## *** diag_1 with  717  levels

## *** diag_1 is now converted to  9  levels as the following: 
##  Circulatory Diabetes Digestive Genitourinary Injury Musculoskeletal Neoplasms Other Respiratory

diag_2

## *** diag_2 with  749  levels

## *** diag_2 is now converted to  9  levels as the following:
##  Circulatory Diabetes Digestive Genitourinary Injury Musculoskeletal Neoplasms Other Respiratory

diag_3

## *** diag_3 with  790  levels

## *** diag_3 is now converted to  9  levels as the following: 
##  Circulatory Diabetes Digestive Genitourinary Injury Musculoskeletal Neoplasms Other Respiratory
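A condensed sketch of such a grouping is shown below. The ICD-9 ranges are the commonly cited ones for this dataset and are an assumption here; they should be checked against Table 2 of the paper before use. The helper name group_icd9 is hypothetical, and the code actually used for this report was longer.

```r
# Hypothetical helper: map ICD-9 code strings to 9 diagnosis groups.
# ASSUMPTION: the ranges below match Table 2 of the cited paper.
group_icd9 <- function(code) {
  code <- as.character(code)
  num  <- suppressWarnings(as.numeric(code))  # V and E codes become NA
  out  <- rep("Other", length(code))
  # TRUE where the code's integer part lies in [lo, hi].
  in_range <- function(lo, hi) !is.na(num) & num >= lo & num < hi + 1
  out[in_range(390, 459) | in_range(785, 785)] <- "Circulatory"
  out[in_range(460, 519) | in_range(786, 786)] <- "Respiratory"
  out[in_range(520, 579) | in_range(787, 787)] <- "Digestive"
  out[in_range(250, 250)]                      <- "Diabetes"
  out[in_range(800, 999)]                      <- "Injury"
  out[in_range(710, 739)]                      <- "Musculoskeletal"
  out[in_range(580, 629) | in_range(788, 788)] <- "Genitourinary"
  out[in_range(140, 239)]                      <- "Neoplasms"
  out[is.na(code)] <- NA                       # keep missing values missing
  factor(out)
}

# Toy codes (illustrative values only).
grouped <- group_icd9(c("428", "250.83", "V57", "786", "E909", NA))
```

Codes that start with V or E, and numeric codes outside the listed ranges, fall into Other under this sketch.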

Response Variable

The response variable, readmitted, originally had three levels: <30, >30, and NO.
For classification, it was converted to 2 levels, no and yes, and then to numeric values, 0 and 1, respectively.

## "readmitted" levels:  <30 >30 NO

## "readmitted" is now a  numeric  with unique values:  0 1

## At this time, the dataset has  70245  observations with  26  variables.
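The response conversion can be sketched in one line. The mapping used here, <30 as 1 and both >30 and NO as 0, is an assumption that follows from the stated goal of predicting readmission within 30 days; the report does not state the mapping explicitly.

```r
# Toy response values (illustrative only).
toy_readmitted <- factor(c("NO", ">30", "<30", "NO"))

# ASSUMPTION: only "<30" counts as a readmission (1); everything else is 0.
readmitted_num <- as.integer(toy_readmitted == "<30")
```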

Data Types

For convenience, the variables are separated here by data type, i.e. factor, numeric, and integer.

## $factor
##  [1] "gender"                   "admission_type_id"       
##  [3] "discharge_disposition_id" "admission_source_id"     
##  [5] "diag_1"                   "diag_2"                  
##  [7] "diag_3"                   "A1Cresult"               
##  [9] "metformin"                "glipizide"               
## [11] "glyburide"                "pioglitazone"            
## [13] "rosiglitazone"            "insulin"                 
## [15] "change"                   "diabetesMed"             
## 
## $numeric
## [1] "age"        "readmitted"
## 
## $integer
## [1] "time_in_hospital"   "num_lab_procedures" "num_procedures"    
## [4] "num_medications"    "number_outpatient"  "number_emergency"  
## [7] "number_inpatient"   "number_diagnoses"

##  factor numeric integer 
##    "16"    " 2"    " 8"

## Total  26  variables
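Grouping variable names by class, as in the listing above, can be sketched with split(). The toy data frame is illustrative.

```r
# Toy data frame with one column of each type (illustrative values).
toy <- data.frame(
  gender           = factor(c("Female", "Male")),
  age              = c(1, 1.5),
  time_in_hospital = c(3L, 5L)
)

# First class of each column, then the column names grouped by that class.
col_class    <- vapply(toy, function(col) class(col)[1], character(1))
vars_by_type <- split(names(toy), col_class)

# Number of variables per type.
type_counts <- lengths(vars_by_type)
```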

##  time_in_hospital num_lab_procedures num_procedures num_medications
##  Min.   : 1.000   Min.   :  1.00     Min.   :0.00   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 32.00     1st Qu.:0.00   1st Qu.:10.00  
##  Median : 4.000   Median : 45.00     Median :1.00   Median :15.00  
##  Mean   : 4.751   Mean   : 43.95     Mean   :1.48   Mean   :16.34  
##  3rd Qu.: 6.000   3rd Qu.: 58.00     3rd Qu.:2.00   3rd Qu.:21.00  
##  Max.   :14.000   Max.   :132.00     Max.   :6.00   Max.   :81.00  
##  number_outpatient number_emergency  number_inpatient  number_diagnoses
##  Min.   : 0.000    Min.   : 0.0000   Min.   : 0.0000   Min.   : 1.000  
##  1st Qu.: 0.000    1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 6.000  
##  Median : 0.000    Median : 0.0000   Median : 0.0000   Median : 8.000  
##  Mean   : 0.302    Mean   : 0.1233   Mean   : 0.3203   Mean   : 7.325  
##  3rd Qu.: 0.000    3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 9.000  
##  Max.   :40.000    Max.   :63.0000   Max.   :19.0000   Max.   :16.000

Numeric Variables

All numeric variables were centered and normalized. The correlation plot showed two pairs with coefficients at the 0.4 level, which overall was considered acceptable. Given the correlation coefficients and general healthcare practice, it is logical to assume that the more procedures are performed, the more medications are likely to be used and the longer the observation and recovery time required. More advanced modeling may consider the interactions among these four variables:

time_in_hospital
num_lab_procedures
num_medications
num_procedures

No interactions were modeled here, however.

## [1] "time_in_hospital"   "num_lab_procedures" "num_procedures"    
## [4] "num_medications"    "number_outpatient"  "number_emergency"  
## [7] "number_inpatient"   "number_diagnoses"

Centering and Normalization

## Centering and normalizing the  8 integer variables

##   time_in_hospital num_lab_procedures num_procedures num_medications
## 1         2.918277       -0.800474000      0.8571450       1.3588927
## 2         2.918277        1.709052148     -0.2705126       0.3100768
## 3         2.918277       -0.147997201     -0.2705126       0.1935417
## 4         2.918277        0.002574368      0.2933162      -0.5056689
## 5         2.918277        1.357718487      0.8571450       1.9415682
## 6         2.918277        1.357718487     -0.2705126       1.8250331
##   number_outpatient number_emergency number_inpatient number_diagnoses
## 1        -0.2745189       -0.1937564       -0.3935239        0.3412069
## 2        -0.2745189       -0.1937564       -0.3935239        0.3412069
## 3        -0.2745189       -0.1937564       -0.3935239       -1.6794122
## 4        -0.2745189       -0.1937564       -0.3935239       -2.1845669
## 5        -0.2745189       -0.1937564       -0.3935239        0.8463617
## 6        -0.2745189       -0.1937564       -0.3935239        0.8463617

##  time_in_hospital  num_lab_procedures num_procedures    num_medications  
##  Min.   :-1.1836   Min.   :-2.15562   Min.   :-0.8343   Min.   :-1.7876  
##  1st Qu.:-0.8681   1st Qu.:-0.59971   1st Qu.:-0.8343   1st Qu.:-0.7387  
##  Median :-0.2370   Median : 0.05276   Median :-0.2705   Median :-0.1561  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3941   3rd Qu.: 0.70524   3rd Qu.: 0.2933   3rd Qu.: 0.5431  
##  Max.   : 2.9183   Max.   : 4.41934   Max.   : 2.5486   Max.   : 7.5353  
##  number_outpatient number_emergency  number_inpatient  number_diagnoses 
##  Min.   :-0.2745   Min.   :-0.1938   Min.   :-0.3935   Min.   :-3.1949  
##  1st Qu.:-0.2745   1st Qu.:-0.1938   1st Qu.:-0.3935   1st Qu.:-0.6691  
##  Median :-0.2745   Median :-0.1938   Median :-0.3935   Median : 0.3412  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2745   3rd Qu.:-0.1938   3rd Qu.:-0.3935   3rd Qu.: 0.8464  
##  Max.   :36.0839   Max.   :98.8311   Max.   :22.9485   Max.   : 4.3824
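Centering and normalizing subtracts each variable's mean and divides by its standard deviation. A minimal base R sketch with scale() is shown below; the report does not say which function was actually used, and the toy data is illustrative.

```r
# Toy data: two integer variables and one factor (illustrative values).
toy <- data.frame(
  time_in_hospital = c(1L, 3L, 2L, 14L),
  num_medications  = c(1L, 18L, 13L, 16L),
  gender           = factor(c("F", "M", "M", "F"))
)

# Select the integer columns and replace them with their z-scores.
int_vars <- names(toy)[vapply(toy, is.integer, logical(1))]
toy[int_vars] <- lapply(toy[int_vars], function(x) as.numeric(scale(x)))
```

After this step each scaled column has mean 0 and standard deviation 1, which matches the zero means in the summary above.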

Visualization

Overall, nothing immediately raised a concern. There were, however, a few outliers in number_outpatient, number_inpatient, and number_emergency. Examined with boxplots (not shown here), these outliers had extreme values relative to the other observations of each variable.

Outliers

A closer look at the dataset showed that the extreme values were spread among a handful of observations. The three variables

number_outpatient
number_inpatient
number_emergency

had mean values very close to zero, and removing the few outliers reduced all of their summary statistics to zero, which caused computation issues in subsequent processing. Consequently, the few outliers were kept as they were.

##  time_in_hospital  num_lab_procedures num_procedures    num_medications  
##  Min.   :-1.1836   Min.   :-2.15562   Min.   :-0.8343   Min.   :-1.7876  
##  1st Qu.:-0.8681   1st Qu.:-0.59971   1st Qu.:-0.8343   1st Qu.:-0.7387  
##  Median :-0.2370   Median : 0.05276   Median :-0.2705   Median :-0.1561  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3941   3rd Qu.: 0.70524   3rd Qu.: 0.2933   3rd Qu.: 0.5431  
##  Max.   : 2.9183   Max.   : 4.41934   Max.   : 2.5486   Max.   : 7.5353  
##  number_outpatient number_emergency  number_inpatient  number_diagnoses 
##  Min.   :-0.2745   Min.   :-0.1938   Min.   :-0.3935   Min.   :-3.1949  
##  1st Qu.:-0.2745   1st Qu.:-0.1938   1st Qu.:-0.3935   1st Qu.:-0.6691  
##  Median :-0.2745   Median :-0.1938   Median :-0.3935   Median : 0.3412  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2745   3rd Qu.:-0.1938   3rd Qu.:-0.3935   3rd Qu.: 0.8464  
##  Max.   :36.0839   Max.   :98.8311   Max.   :22.9485   Max.   : 4.3824

Multicollinearity

The corrplot reported the following pairs with a correlation coefficient above 0.4:

time_in_hospital and num_medications
num_procedures and num_medications

##                    time_in_hospital num_lab_procedures num_procedures
## time_in_hospital        1.000000000        0.326984553     0.17795424
## num_lab_procedures      0.326984553        1.000000000     0.05212467
## num_procedures          0.177954240        0.052124670     1.00000000
## num_medications         0.471308790        0.272938439     0.40269545
## number_outpatient       0.007770235       -0.002151718    -0.01604211
## number_emergency        0.029281798        0.022392959    -0.02602895
## number_inpatient        0.212521455        0.094370649    -0.01162087
## number_diagnoses        0.258738703        0.168668329     0.08947737
##                    num_medications number_outpatient number_emergency
## time_in_hospital        0.47130879       0.007770235       0.02928180
## num_lab_procedures      0.27293844      -0.002151718       0.02239296
## num_procedures          0.40269545      -0.016042113      -0.02602895
## num_medications         1.00000000       0.042216488       0.02420536
## number_outpatient       0.04221649       1.000000000       0.09645106
## number_emergency        0.02420536       0.096451065       1.00000000
## number_inpatient        0.10364130       0.085257136       0.18444792
## number_diagnoses        0.27051709       0.087293152       0.05916346
##                    number_inpatient number_diagnoses
## time_in_hospital         0.21252145       0.25873870
## num_lab_procedures       0.09437065       0.16866833
## num_procedures          -0.01162087       0.08947737
## num_medications          0.10364130       0.27051709
## number_outpatient        0.08525714       0.08729315
## number_emergency         0.18444792       0.05916346
## number_inpatient         1.00000000       0.11520971
## number_diagnoses         0.11520971       1.00000000
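Pairs above a correlation threshold can also be pulled out of the matrix programmatically. The sketch below uses simulated data, so the numbers are not those of the diabetes dataset; the variable names are reused only for readability.

```r
# Simulated data: num_medications is built to correlate with time_in_hospital.
set.seed(42)
n <- 500
time_in_hospital <- rnorm(n)
num_medications  <- 0.6 * time_in_hospital + rnorm(n, sd = 0.8)
number_emergency <- rnorm(n)
toy <- data.frame(time_in_hospital, num_medications, number_emergency)

# Correlation matrix, then the upper-triangle cells above the 0.4 threshold.
cor_mat <- cor(toy)
high    <- which(abs(cor_mat) > 0.4 & upper.tri(cor_mat), arr.ind = TRUE)

# One row per flagged pair.
pairs_above <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
```

Restricting the search to the upper triangle avoids listing each pair twice and skips the diagonal.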

This seemed logical since the longer a patient stayed, the more medications the patient was likely to receive. Similarly, the more medications a doctor prescribed for a patient, the longer the patient was likely to stay in the hospital. Modeling the interactions is something to consider. In this project, due to very limited computing resources and time constraints, the interactions were not included in the modeling.

Imputation of Data

The missing values were largely in race, diag_1, diag_2 and diag_3. This information was considered essential and unlikely to have been withheld or never recorded, so the decision was to impute values based on the existing data.
The Multivariate Imputation by Chained Equations (mice) package was used to impute the values, which were then inspected with a stripplot (not shown here).

## 0  missing values of all numeric variables

## 2417  missing values of all factor variables
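A minimal sketch of imputing a factor variable with mice is shown below. The data is simulated, and m = 5, the seed and the choice of the first completed dataset are assumptions, not the settings used for this report. The call is guarded so that the snippet still runs when mice is not installed.

```r
# Simulated data with 20 missing values in a 3-level factor (illustrative).
set.seed(123)
n <- 200
toy <- data.frame(
  time_in_hospital = rnorm(n),
  num_medications  = rnorm(n),
  diag_1 = factor(sample(c("Circulatory", "Diabetes", "Other"), n, replace = TRUE))
)
toy$diag_1[sample(n, 20)] <- NA

completed <- NULL
if (requireNamespace("mice", quietly = TRUE)) {
  # mice picks a default method per column; for an unordered factor with
  # more than two levels that is polytomous regression ("polyreg").
  imp <- mice::mice(toy, m = 5, seed = 123, printFlag = FALSE)
  # Extract the first of the m completed datasets.
  completed <- mice::complete(imp, 1)
}
```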

Features Selection

Another tool, Boruta, was employed to facilitate feature selection.

A forest spirit in Slavic mythology, Boruta (also called Leśny or Lešny) was portrayed as an imposing figure with horns on its head, surrounded by packs of wolves and bears. In R, Boruta is a helpful package for feature selection.

Splitting Data

The dataset was then partitioned into a 70/30 split for training and testing. The training part was subsequently also used by Boruta to confirm features.
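A simple random 70/30 split can be sketched in base R as follows. Whether the original split was stratified on the response is not stated, so this unstratified version is an assumption; the toy data and seed are illustrative.

```r
# Toy dataset of 100 rows (illustrative values only).
set.seed(2024)
toy <- data.frame(id = 1:100, readmitted = rbinom(100, 1, 0.1))

# Draw 70% of the row indices at random for training; the rest is for testing.
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train     <- toy[train_idx, ]
test      <- toy[-train_idx, ]
```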

Oh, Boruta

By default, Boruta uses Random Forest. The method performs a top-down search for relevant features by comparing original attributes’ importance with importance achievable at random, estimated using their permuted copies, and progressively eliminating irrelevant features to stabilize that test.

Sample code to run Boruta is available.

A successful Boruta run resulted in a set of features labeled as confirmed (important), tentative, or rejected (unimportant), as applicable. During a run, Boruta adds shadow variables, which are permuted copies of the original predictors, and compares each predictor's importance with the importance achieved by the shadow variables. Predictors that performed significantly better than the best shadow variable were confirmed, and those that performed significantly worse were rejected. Unresolved variables, if any, were considered tentative.
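A minimal Boruta run on simulated data is sketched below. The data, the seed and maxRuns = 100 are illustrative; only the 100-iteration limit echoes the setting mentioned in the overview. The call is guarded so that the snippet still runs when Boruta is not installed.

```r
# Simulated data: one informative predictor and two pure-noise predictors.
set.seed(7)
n <- 300
toy <- data.frame(
  signal  = rnorm(n),
  noise_a = rnorm(n),
  noise_b = rnorm(n)
)
toy$readmitted <- factor(ifelse(toy$signal + rnorm(n, sd = 0.5) > 0, "yes", "no"))

selected <- character(0)
if (requireNamespace("Boruta", quietly = TRUE)) {
  # Compare each predictor's importance with that of its shadow copies.
  fit <- Boruta::Boruta(readmitted ~ ., data = toy, maxRuns = 100, doTrace = 0)
  # Decision per predictor: Confirmed, Tentative or Rejected.
  print(fit$finalDecision)
  # Names of the confirmed predictors only.
  selected <- Boruta::getSelectedAttributes(fit, withTentative = FALSE)
}
```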

Feature selection was then implemented based on Boruta’s output.

##  [1] "race"                     "age"                     
##  [3] "admission_type_id"        "discharge_disposition_id"
##  [5] "admission_source_id"      "time_in_hospital"        
##  [7] "num_lab_procedures"       "num_procedures"          
##  [9] "num_medications"          "number_outpatient"       
## [11] "number_emergency"         "number_inpatient"        
## [13] "diag_1"                   "diag_2"                  
## [15] "diag_3"                   "number_diagnoses"        
## [17] "A1Cresult"                "metformin"               
## [19] "insulin"                  "change"                  
## [21] "diabetesMed"

Finalizing Dataset

Finally, the prepared dataset was stored, ready for importing into a Machine Learning algorithm.

## 'data.frame':    70245 obs. of  22 variables:
##  $ race                    : Factor w/ 5 levels "AfricanAmerican",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ age                     : num  1.5 1.5 1 2 1.5 1.5 1.5 1.5 1 2 ...
##  $ admission_type_id       : Factor w/ 2 levels "k","u": 1 1 2 2 2 2 2 1 2 2 ...
##  $ discharge_disposition_id: Factor w/ 4 levels "d","h","o","u": 1 1 4 1 4 4 4 1 1 4 ...
##  $ admission_source_id     : Factor w/ 5 levels "b","o","r","t",..: 3 2 3 3 2 2 3 4 3 2 ...
##  $ time_in_hospital        : num  2.92 2.92 2.92 2.92 2.92 ...
##  $ num_lab_procedures      : num  -0.80047 1.70905 -0.148 0.00257 1.35772 ...
##  $ num_procedures          : num  0.857 -0.271 -0.271 0.293 0.857 ...
##  $ num_medications         : num  1.359 0.31 0.194 -0.506 1.942 ...
##  $ number_outpatient       : num  -0.275 -0.275 -0.275 -0.275 -0.275 ...
##  $ number_emergency        : num  -0.194 -0.194 -0.194 -0.194 -0.194 ...
##  $ number_inpatient        : num  -0.394 -0.394 -0.394 -0.394 -0.394 ...
##  $ diag_1                  : Factor w/ 9 levels "Circulatory",..: 7 1 6 2 3 9 1 1 2 1 ...
##  $ diag_2                  : Factor w/ 9 levels "Circulatory",..: 7 2 5 2 9 1 5 1 2 1 ...
##  $ diag_3                  : Factor w/ 9 levels "Circulatory",..: 3 2 2 1 1 2 1 1 3 6 ...
##  $ number_diagnoses        : num  0.341 0.341 -1.679 -2.185 0.846 ...
##  $ A1Cresult               : Factor w/ 4 levels ">7",">8","None",..: 3 4 3 4 1 3 3 3 1 3 ...
##  $ metformin               : Factor w/ 4 levels "Down","No","Steady",..: 3 2 3 3 2 2 2 3 2 2 ...
##  $ insulin                 : Factor w/ 4 levels "Down","No","Steady",..: 1 4 2 2 4 1 3 3 2 3 ...
##  $ change                  : Factor w/ 2 levels "Ch","No": 1 1 1 1 1 1 1 1 2 2 ...
##  $ diabetesMed             : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 1 2 ...
##  $ readmitted              : int  0 1 0 0 0 0 1 0 0 0 ...

##               race            age        admission_type_id
##  AfricanAmerican:13005   Min.   :1.000   k:62629          
##  Asian          :  509   1st Qu.:1.000   u: 7616          
##  Caucasian      :54004   Median :1.500                    
##  Hispanic       : 1539   Mean   :1.426                    
##  Other          : 1188   3rd Qu.:1.500                    
##                          Max.   :2.000                    
##                                                           
##  discharge_disposition_id admission_source_id time_in_hospital 
##  d:66503                  b:    5             Min.   :-1.1836  
##  h:  561                  o:37630             1st Qu.:-0.8681  
##  o:   20                  r:22497             Median :-0.2370  
##  u: 3161                  t: 5054             Mean   : 0.0000  
##                           u: 5059             3rd Qu.: 0.3941  
##                                               Max.   : 2.9183  
##                                                                
##  num_lab_procedures num_procedures    num_medications   number_outpatient
##  Min.   :-2.15562   Min.   :-0.8343   Min.   :-1.7876   Min.   :-0.2745  
##  1st Qu.:-0.59971   1st Qu.:-0.8343   1st Qu.:-0.7387   1st Qu.:-0.2745  
##  Median : 0.05276   Median :-0.2705   Median :-0.1561   Median :-0.2745  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.70524   3rd Qu.: 0.2933   3rd Qu.: 0.5431   3rd Qu.:-0.2745  
##  Max.   : 4.41934   Max.   : 2.5486   Max.   : 7.5353   Max.   :36.0839  
##                                                                          
##  number_emergency  number_inpatient          diag_1     
##  Min.   :-0.1938   Min.   :-0.3935   Circulatory:21209  
##  1st Qu.:-0.1938   1st Qu.:-0.3935   Diabetes   :11751  
##  Median :-0.1938   Median :-0.3935   Respiratory: 9492  
##  Mean   : 0.0000   Mean   : 0.0000   Neoplasms  : 8198  
##  3rd Qu.:-0.1938   3rd Qu.:-0.3935   Digestive  : 6650  
##  Max.   :98.8311   Max.   :22.9485   Injury     : 4827  
##                                      (Other)    : 8118  
##            diag_2                diag_3      number_diagnoses 
##  Circulatory  :22019   Diabetes     :27358   Min.   :-3.1949  
##  Diabetes     :21545   Circulatory  :21111   1st Qu.:-0.6691  
##  Neoplasms    : 7311   Neoplasms    : 6489   Median : 0.3412  
##  Respiratory  : 7141   Respiratory  : 4913   Mean   : 0.0000  
##  Genitourinary: 5586   Genitourinary: 4356   3rd Qu.: 0.8464  
##  Digestive    : 2967   Digestive    : 2758   Max.   : 4.3824  
##  (Other)      : 3676   (Other)      : 3260                    
##  A1Cresult     metformin       insulin      change     diabetesMed
##  >7  : 2882   Down  :  453   Down  : 7743   Ch:32390   No :16409  
##  >8  : 6208   No    :55348   No    :33410   No:37855   Yes:53836  
##  None:57336   Steady:13581   Steady:21756                         
##  Norm: 3819   Up    :  863   Up    : 7336                         
##                                                                   
##                                                                   
##                                                                   
##    readmitted    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.0721  
##  3rd Qu.:0.0000  
##  Max.   :1.0000  
##