This is the data preparation effort for developing a Machine Learning model for predicting hospital readmission within 30 days.
Hospital readmission is a real-world problem and an ongoing topic in improving health care quality and patient experience while ensuring cost-effectiveness. Information on the Hospital Readmissions Reduction Program (HRRP) is publicly available on the website of the Centers for Medicare & Medicaid Services (CMS).
The dataset, Diabetes 130-US hospitals for years 1999-2008 Data Set, was downloaded from the UCI Machine Learning Repository. It represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks, with about 100,000 observations and 50 features representing patient and hospital outcomes.
The Machine Learning model was developed in R and employed the SuperLearner package, using ensemble learning to optimize the results. Most of the ensemble learning ran in the Microsoft Azure public cloud on an E16 virtual machine with 16 vCPUs and 128 GB of RAM, as shown below. For a training set of 10,000 observations and 21 predictors, the model generally took about 2 to 3 hours to train and more than 6 hours to carry out 10-fold cross-validation with three algorithms. The demand for computing resources was significant.
Virtual Machine Hardware Configuration
Some variables had too many missing values to be usable. A few, considered missing at random (MAR), were imputed using the Multivariate Imputation by Chained Equations (mice) package.
Feature selection was largely based on the output from Boruta. In several test runs, Boruta took about 30 minutes and resolved all variables, 21 as important and 5 as unimportant, within the 100 iterations initially set.
The dataset was first downloaded from the above link and imported into RStudio.
## Diabetes data set imported ( 101766 observations with 50 variables )
The ID field, encounter_id, was removed.
## Removed encounter_id. The data set now has 49 variables.
## 'data.frame': 101766 obs. of 49 variables:
## $ patient_nbr : int 8222157 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 ...
## $ race : Factor w/ 6 levels "?","AfricanAmerican",..: 4 4 2 4 4 4 4 4 4 4 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 1 1 1 2 2 2 2 2 1 1 ...
## $ age : Factor w/ 10 levels "[0-10)","[10-20)",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weight : Factor w/ 10 levels "?","[0-25)","[100-125)",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ admission_type_id : int 6 1 1 1 1 2 3 1 2 3 ...
## $ discharge_disposition_id: int 25 1 1 1 1 1 1 1 1 3 ...
## $ admission_source_id : int 1 7 7 7 7 2 2 7 4 4 ...
## $ time_in_hospital : int 1 3 2 2 1 3 4 5 13 12 ...
## $ payer_code : Factor w/ 18 levels "?","BC","CH",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ medical_specialty : Factor w/ 73 levels "?","AllergyandImmunology",..: 39 1 1 1 1 1 1 1 1 20 ...
## $ num_lab_procedures : int 41 59 11 44 51 31 70 73 68 33 ...
## $ num_procedures : int 0 0 5 1 0 6 1 0 2 3 ...
## $ num_medications : int 1 18 13 16 8 16 21 12 28 18 ...
## $ number_outpatient : int 0 0 2 0 0 0 0 0 0 0 ...
## $ number_emergency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ number_inpatient : int 0 0 1 0 0 0 0 0 0 0 ...
## $ diag_1 : Factor w/ 717 levels "?","10","11",..: 126 145 456 556 56 265 265 278 254 284 ...
## $ diag_2 : Factor w/ 749 levels "?","11","110",..: 1 81 80 99 26 248 248 316 262 48 ...
## $ diag_3 : Factor w/ 790 levels "?","11","110",..: 1 123 768 250 88 88 772 88 231 319 ...
## $ number_diagnoses : int 1 9 6 7 5 9 7 8 8 8 ...
## $ max_glu_serum : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ A1Cresult : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ metformin : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
## $ repaglinide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ nateglinide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ chlorpropamide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ glimepiride : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
## $ acetohexamide : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ glipizide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 3 2 2 2 3 2 ...
## $ glyburide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 3 2 2 ...
## $ tolbutamide : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ pioglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ rosiglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 3 ...
## $ acarbose : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ miglitol : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ troglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ tolazamide : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ examide : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
## $ citoglipton : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
## $ insulin : Factor w/ 4 levels "Down","No","Steady",..: 2 4 2 4 3 3 3 2 3 3 ...
## $ glyburide.metformin : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ glipizide.metformin : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ glimepiride.pioglitazone: Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ metformin.rosiglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ metformin.pioglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ change : Factor w/ 2 levels "Ch","No": 2 1 2 1 1 2 1 2 1 1 ...
## $ diabetesMed : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ readmitted : Factor w/ 3 levels "<30",">30","NO": 3 2 3 3 3 2 3 2 3 3 ...
In the data set, many variables had ‘?’ as a value. This was evidently a marker for a missing value, so each ‘?’ was replaced with NA.
## Missing values indicated by "?" = 192849
## Replacing 192849 values stored as "?" with NA.
## Total NA count = 192849
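The replacement step can be sketched in base R as follows. This is a minimal illustration on a made-up toy data frame (the object name `toy` and its values are assumptions, not the author's code); it counts the "?" placeholders, replaces them with NA, and drops the unused factor level.

```r
# Toy data frame standing in for the diabetes data (values are made up).
toy <- data.frame(
  race = c("Caucasian", "?", "AfricanAmerican"),
  weight = c("?", "?", "[75-100)"),
  time_in_hospital = c(1L, 3L, 2L),
  stringsAsFactors = TRUE
)

# Count the "?" placeholders before replacing them.
n_placeholder <- sum(toy == "?")

# Replace every "?" with NA, then drop the now-unused "?" factor level.
toy[toy == "?"] <- NA
toy <- droplevels(toy)

n_na <- sum(is.na(toy))
cat("Placeholders found:", n_placeholder, "| NA after replacement:", n_na, "\n")
```

Dropping the unused level matters later on, because a leftover "?" level would otherwise still produce an (empty) dummy variable.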
Examining the missingness of the dataset revealed that a disproportionate amount of data was missing in a few variables, particularly weight, medical_specialty, and payer_code. The percentage of missing values in each affected variable was:
## race weight payer_code medical_specialty
## 2.23 96.86 39.56 49.08
## diag_1 diag_2 diag_3
## 0.02 0.35 1.40
## Removing the three variables: weight payer_code medical_specialty
## At this time, the data set has 101766 observations with 46 variables.
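A per-variable missingness table like the one above, and the removal of heavily missing variables, could be produced along these lines. The toy data and the 30% cutoff are assumptions made for illustration only; the source does not state the threshold that was used.

```r
# Toy data frame with NA already in place (values are made up).
toy <- data.frame(
  weight = c(NA, NA, NA, "[75-100)"),
  payer_code = c("MC", NA, NA, "HM"),
  race = c("Caucasian", "Asian", NA, "Other"),
  gender = c("Female", "Male", "Male", "Female"),
  stringsAsFactors = TRUE
)

# Percentage of missing values per column.
pct_missing <- round(colMeans(is.na(toy)) * 100, 2)
print(pct_missing)

# Hypothetical cutoff: drop columns with more than 30% missing values.
cutoff <- 30
drop_cols <- names(pct_missing)[pct_missing > cutoff]
toy_kept <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
cat("Removing:", drop_cols, "\n")
```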
Two variables, examide and citoglipton, had only one level and no missing values. These two variables therefore had zero variance, were not informative, and would contribute little to predicting an outcome. All other near-zero-variance (nzv) variables were also removed from the dataset.
Although some may argue that near-zero-variance variables may in fact have some influence, in the diabetes dataset a few factor variables with multiple levels were nzv. Keeping them would later generate a considerable number of dummy variables and increase computational complexity and resource requirements. Consequently, all nzv variables were removed.
## caret reports 18 near zero-variables as the following:
## [1] "max_glu_serum" "repaglinide"
## [3] "nateglinide" "chlorpropamide"
## [5] "glimepiride" "acetohexamide"
## [7] "tolbutamide" "acarbose"
## [9] "miglitol" "troglitazone"
## [11] "tolazamide" "examide"
## [13] "citoglipton" "glyburide.metformin"
## [15] "glipizide.metformin" "glimepiride.pioglitazone"
## [17] "metformin.rosiglitazone" "metformin.pioglitazone"
## The listed, 18 near-zero variables have been removed.
##
## At this time, the data set has 101766 observations with 28 variables remaining.
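The output above attributes the nzv list to caret. To show the idea without depending on that package, here is a base-R sketch of a similar heuristic: a variable is flagged when it has a single distinct value, or when the most common value dominates the second most common one and the share of distinct values is small. The function name `is_nzv`, the toy data, and the cutoffs (95/5 frequency ratio, 10% unique values) are assumptions for illustration and may not match the settings actually used.

```r
# Flag a variable as (near-)zero variance.
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  counts <- counts[counts > 0]               # ignore empty factor levels
  if (length(counts) <= 1) return(TRUE)      # zero variance: one distinct value
  freq_ratio <- counts[1] / counts[2]        # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

# Toy data (values are made up).
toy <- data.frame(
  examide = rep("No", 100),
  acarbose = c(rep("No", 98), "Steady", "Up"),
  insulin = rep(c("No", "Steady", "Up", "Down"), 25),
  stringsAsFactors = TRUE
)

nzv_flags <- vapply(toy, is_nzv, logical(1))
nzv_names <- names(nzv_flags)[nzv_flags]
toy_kept <- toy[, !nzv_flags, drop = FALSE]
print(nzv_names)
```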
The data set contained multiple rows with the same patient_nbr, i.e. patient number. It was unclear whether these encounters, i.e. visits, were independent. Multiple visits by the same patient might be related and could therefore introduce bias through correlated observations. To eliminate this risk, exactly one encounter per patient was kept, the one with the maximum time_in_hospital, on the assumption that time_in_hospital was characteristic of readmission and would present sufficient variance in the training data.
## patient_nbr with multiple encounters = 30248
## Before eliminating multiple encounters of a patient, total 101766 observations
## After eliminating multiple encounters of a patient, total 71518 observations
## Multiple encounters of a patient now = 0
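Keeping one encounter per patient, the one with the longest stay, can be sketched in base R by sorting and then dropping duplicates. The toy data and the tie-breaking rule (the first encounter among equal stays wins) are assumptions; the source does not say how ties were resolved.

```r
# Toy encounters (values are made up); patients 1 and 3 have several visits.
enc <- data.frame(
  patient_nbr = c(1, 1, 2, 3, 3, 3),
  time_in_hospital = c(2, 5, 1, 4, 4, 3),
  encounter = 1:6
)

# Sort by patient, longest stay first, then keep the first row per patient.
ord <- enc[order(enc$patient_nbr, -enc$time_in_hospital), ]
one_per_patient <- ord[!duplicated(ord$patient_nbr), ]

cat("Before:", nrow(enc), "rows | after:", nrow(one_per_patient), "rows\n")
```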
Once the multiple encounters per patient had been processed, the patient ID was removed from the dataset.
## Dropping the ID fields, patient_nbr and race
## At this time, the data set has 71518 observations with 26 variables.
The next step was to prepare the categorical variables. For feature descriptions, refer to IDs_mapping.csv from the original dataset downloaded from the UCI Machine Learning Repository.
The three diagnosis variables, diag_1, diag_2, and diag_3, each had more than 700 levels, which together would require more than 2,000 dummy variables and make the computation expensive to manage. To consolidate the levels, Table 2 of the research report Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records was followed, and the levels of all three variables were converted into 9 categories. The programming was lengthy and tedious, but it reduced the complexity to a manageable level.
The consolidated levels of the other factor variables were based on analysis of the dataset, general experience of receiving health care services, and common sense about which groupings would likely deliver the most variance in the Machine Learning model.
## "gender" is now factor with the levels: Female Male
The age variable was consolidated from a 10-level factor into three numeric values:
## Considering those older than 60 are twice more likely to be readmitted.
## "age" is now numeric with the unique values: 1.5 1 2
## where age<60 is assigned as 1, 60<=age<80 1.5, and age>80 as 2
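The age recoding described above can be sketched as follows, assuming the original levels are 10-year brackets written like "[60-70)". The variable names are illustrative, and the treatment of the bracket boundaries (a bracket is scored by its lower bound) is an assumption.

```r
# A few of the original 10-year age brackets.
age <- factor(c("[0-10)", "[50-60)", "[60-70)", "[70-80)", "[80-90)", "[90-100)"))

# Extract the lower bound of each bracket, e.g. "[60-70)" -> 60.
lower <- as.numeric(sub("^\\[([0-9]+)-.*$", "\\1", as.character(age)))

# Under 60 -> 1, 60 up to 80 -> 1.5, 80 and above -> 2.
age_score <- ifelse(lower < 60, 1, ifelse(lower < 80, 1.5, 2))
print(data.frame(age, age_score))
```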
## "admission_type_id" is now factor with levels: k u
## where u: unknown, k: known
## "admission_source_id" is now factor with levels: b o r t u
## where r: referral, t: transfer, u: unknown, o: other, b: birth
## Before removing "discharge_disposition_id" of 11,19,20,21,
## there were 71518 observations with unique ids:
## 1 3 25 6 2 11 5 8 13 4 18 17 22 14 23 7 28 27 15 10 16 9 20 24 12 19
## After removing "discharge_disposition_id" of 11,19,20,21,
## there were 70245 observations with unique ids:
## 1 3 25 6 2 5 8 13 4 18 17 22 14 23 7 28 27 15 10 16 9 24 12
## "discharge_disposition_id)" is now factor with levels: d h o u
## where d: discharge, h: hospice, u: unknown, o: other
There were three variables for diagnostic information, each with more than 700 levels. Per Table 2 of Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records, each was converted into 9 categories.
## *** diag_1 with 717 levels
## *** diag_1 is now converted to 9 levels as the following:
## Circulatory Diabetes Digestive Genitourinary Injury Musculoskeletal Neoplasms Other Respiratory
## *** diag_2 with 749 levels
## *** diag_2 is now converted to 9 levels as the following:
## Circulatory Diabetes Digestive Genitourinary Injury Musculoskeletal Neoplasms Other Respiratory
## *** diag_3 with 790 levels
## *** diag_3 is now converted to 9 levels as the following:
## Circulatory Diabetes Digestive Genitourinary Injury Musculoskeletal Neoplasms Other Respiratory
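The conversion of diagnosis codes into 9 groups might look like the sketch below. The ICD-9 ranges reflect my reading of Table 2 of the cited paper and should be checked against it before use; the function name `icd9_group` is hypothetical, and this is not the author's (lengthier) original code.

```r
# Map ICD-9 codes to 9 diagnosis groups (ranges are an assumption to verify
# against Table 2 of the cited paper).
icd9_group <- function(code) {
  code <- as.character(code)
  # Codes starting with E or V are not numeric and become NA here.
  num <- suppressWarnings(as.numeric(code))
  major <- floor(num)  # integer part, e.g. 250.83 -> 250

  between <- function(lo, hi) !is.na(major) & major >= lo & major <= hi
  is_one <- function(k) !is.na(major) & major == k

  grp <- rep("Other", length(code))
  grp[between(390, 459) | is_one(785)] <- "Circulatory"
  grp[between(460, 519) | is_one(786)] <- "Respiratory"
  grp[between(520, 579) | is_one(787)] <- "Digestive"
  grp[is_one(250)] <- "Diabetes"
  grp[between(800, 999)] <- "Injury"
  grp[between(710, 739)] <- "Musculoskeletal"
  grp[between(580, 629) | is_one(788)] <- "Genitourinary"
  grp[between(140, 239)] <- "Neoplasms"
  grp[is.na(code)] <- NA  # keep missing diagnoses missing
  factor(grp)
}

example <- icd9_group(c("250.83", "428", "786", "V57", "996", NA))
print(example)
```

The same function can be applied to diag_1, diag_2, and diag_3 in turn, so the three variables end up with an identical set of levels.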
## "readmitted" levels: <30 >30 NO
## "readmitted" is now a numeric with unique values: 0 1
## At this time, the dataset has 70245 observations with 26 variables.
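The recoding of readmitted into 0 and 1 can be sketched in one line. The mapping below assumes that only "<30" counts as a positive case, which fits the stated goal of predicting readmission within 30 days but is not spelled out in the output above.

```r
# Toy outcome values using the three original levels.
readmitted <- factor(c("NO", ">30", "<30", "NO"))

# Assumption: readmission within 30 days is the positive class.
readmitted_bin <- as.integer(readmitted == "<30")
print(readmitted_bin)
```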
For convenience, the variables were separated by data type, i.e. factor, numeric, and integer.
## $factor
## [1] "gender" "admission_type_id"
## [3] "discharge_disposition_id" "admission_source_id"
## [5] "diag_1" "diag_2"
## [7] "diag_3" "A1Cresult"
## [9] "metformin" "glipizide"
## [11] "glyburide" "pioglitazone"
## [13] "rosiglitazone" "insulin"
## [15] "change" "diabetesMed"
##
## $numeric
## [1] "age" "readmitted"
##
## $integer
## [1] "time_in_hospital" "num_lab_procedures" "num_procedures"
## [4] "num_medications" "number_outpatient" "number_emergency"
## [7] "number_inpatient" "number_diagnoses"
## factor numeric integer
## "16" " 2" " 8"
## Total 26 variables
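A grouping of column names by data type, like the listing above, can be obtained in base R as sketched here. The toy data frame and object names are illustrative assumptions.

```r
# Toy data frame with one factor, one numeric, and two integer columns.
toy <- data.frame(
  gender = factor(c("Female", "Male")),
  age = c(1, 1.5),
  time_in_hospital = c(3L, 7L),
  num_medications = c(12L, 20L)
)

# First class of each column, then column names grouped by that class.
cls <- vapply(toy, function(col) class(col)[1], character(1))
by_type <- split(names(toy), cls)
counts <- lengths(by_type)

print(by_type)
print(counts)
```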
## time_in_hospital num_lab_procedures num_procedures num_medications
## Min. : 1.000 Min. : 1.00 Min. :0.00 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 32.00 1st Qu.:0.00 1st Qu.:10.00
## Median : 4.000 Median : 45.00 Median :1.00 Median :15.00
## Mean : 4.751 Mean : 43.95 Mean :1.48 Mean :16.34
## 3rd Qu.: 6.000 3rd Qu.: 58.00 3rd Qu.:2.00 3rd Qu.:21.00
## Max. :14.000 Max. :132.00 Max. :6.00 Max. :81.00
## number_outpatient number_emergency number_inpatient number_diagnoses
## Min. : 0.000 Min. : 0.0000 Min. : 0.0000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 6.000
## Median : 0.000 Median : 0.0000 Median : 0.0000 Median : 8.000
## Mean : 0.302 Mean : 0.1233 Mean : 0.3203 Mean : 7.325
## 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 9.000
## Max. :40.000 Max. :63.0000 Max. :19.0000 Max. :16.000
All eight integer variables were centered and normalized, i.e. scaled to unit standard deviation. The correlation plot showed two pairs with coefficients around the 0.4 level, which was considered acceptable overall. Given the correlation coefficients and general healthcare practice, it is logical to assume that the more procedures are performed, the more medications are likely used and the longer the observation and recovery time required. More advanced modeling could consider the interactions among the four variables involved.
No interactions were modeled here, however.
## [1] "time_in_hospital" "num_lab_procedures" "num_procedures"
## [4] "num_medications" "number_outpatient" "number_emergency"
## [7] "number_inpatient" "number_diagnoses"
## Centering and normalizing the 8 integer variables
## time_in_hospital num_lab_procedures num_procedures num_medications
## 1 2.918277 -0.800474000 0.8571450 1.3588927
## 2 2.918277 1.709052148 -0.2705126 0.3100768
## 3 2.918277 -0.147997201 -0.2705126 0.1935417
## 4 2.918277 0.002574368 0.2933162 -0.5056689
## 5 2.918277 1.357718487 0.8571450 1.9415682
## 6 2.918277 1.357718487 -0.2705126 1.8250331
## number_outpatient number_emergency number_inpatient number_diagnoses
## 1 -0.2745189 -0.1937564 -0.3935239 0.3412069
## 2 -0.2745189 -0.1937564 -0.3935239 0.3412069
## 3 -0.2745189 -0.1937564 -0.3935239 -1.6794122
## 4 -0.2745189 -0.1937564 -0.3935239 -2.1845669
## 5 -0.2745189 -0.1937564 -0.3935239 0.8463617
## 6 -0.2745189 -0.1937564 -0.3935239 0.8463617
## time_in_hospital num_lab_procedures num_procedures num_medications
## Min. :-1.1836 Min. :-2.15562 Min. :-0.8343 Min. :-1.7876
## 1st Qu.:-0.8681 1st Qu.:-0.59971 1st Qu.:-0.8343 1st Qu.:-0.7387
## Median :-0.2370 Median : 0.05276 Median :-0.2705 Median :-0.1561
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3941 3rd Qu.: 0.70524 3rd Qu.: 0.2933 3rd Qu.: 0.5431
## Max. : 2.9183 Max. : 4.41934 Max. : 2.5486 Max. : 7.5353
## number_outpatient number_emergency number_inpatient number_diagnoses
## Min. :-0.2745 Min. :-0.1938 Min. :-0.3935 Min. :-3.1949
## 1st Qu.:-0.2745 1st Qu.:-0.1938 1st Qu.:-0.3935 1st Qu.:-0.6691
## Median :-0.2745 Median :-0.1938 Median :-0.3935 Median : 0.3412
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2745 3rd Qu.:-0.1938 3rd Qu.:-0.3935 3rd Qu.: 0.8464
## Max. :36.0839 Max. :98.8311 Max. :22.9485 Max. : 4.3824
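The centering and scaling that leads to a summary like the one above (mean 0 for every variable) can be sketched with base R's `scale()`. The toy values are made up, and selecting columns by `is.integer` is an assumption about how the eight variables were picked.

```r
# Toy data with two integer columns (values are made up).
toy <- data.frame(
  time_in_hospital = c(1L, 2L, 4L, 6L, 14L),
  num_medications = c(1L, 10L, 15L, 21L, 81L)
)

# Select the integer columns, then center to mean 0 and scale to sd 1.
int_cols <- names(toy)[vapply(toy, is.integer, logical(1))]
toy[int_cols] <- lapply(toy[int_cols], function(x) as.numeric(scale(x)))

print(summary(toy))
```

Note that after this step the columns are of type numeric rather than integer.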
Overall, nothing immediately raised a concern. There were, however, a few outliers in number_outpatient, number_inpatient, and number_emergency. Boxplots (not shown here) showed that these outliers had relatively extreme values compared with the other observations of each variable.
Looking further into the dataset, the extreme values were spread among a handful of observations. Because these three variables had mean values very close to zero, removing the few outliers zeroed all their summary statistics, which caused computation issues in subsequent processing. Consequently, the few outliers were kept as they were.
## time_in_hospital num_lab_procedures num_procedures num_medications
## Min. :-1.1836 Min. :-2.15562 Min. :-0.8343 Min. :-1.7876
## 1st Qu.:-0.8681 1st Qu.:-0.59971 1st Qu.:-0.8343 1st Qu.:-0.7387
## Median :-0.2370 Median : 0.05276 Median :-0.2705 Median :-0.1561
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3941 3rd Qu.: 0.70524 3rd Qu.: 0.2933 3rd Qu.: 0.5431
## Max. : 2.9183 Max. : 4.41934 Max. : 2.5486 Max. : 7.5353
## number_outpatient number_emergency number_inpatient number_diagnoses
## Min. :-0.2745 Min. :-0.1938 Min. :-0.3935 Min. :-3.1949
## 1st Qu.:-0.2745 1st Qu.:-0.1938 1st Qu.:-0.3935 1st Qu.:-0.6691
## Median :-0.2745 Median :-0.1938 Median :-0.3935 Median : 0.3412
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2745 3rd Qu.:-0.1938 3rd Qu.:-0.3935 3rd Qu.: 0.8464
## Max. :36.0839 Max. :98.8311 Max. :22.9485 Max. : 4.3824
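The corrplot reported two pairs with coefficients above the 0.4 level: time_in_hospital with num_medications (0.47) and num_procedures with num_medications (0.40). The full correlation matrix was: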
## time_in_hospital num_lab_procedures num_procedures
## time_in_hospital 1.000000000 0.326984553 0.17795424
## num_lab_procedures 0.326984553 1.000000000 0.05212467
## num_procedures 0.177954240 0.052124670 1.00000000
## num_medications 0.471308790 0.272938439 0.40269545
## number_outpatient 0.007770235 -0.002151718 -0.01604211
## number_emergency 0.029281798 0.022392959 -0.02602895
## number_inpatient 0.212521455 0.094370649 -0.01162087
## number_diagnoses 0.258738703 0.168668329 0.08947737
## num_medications number_outpatient number_emergency
## time_in_hospital 0.47130879 0.007770235 0.02928180
## num_lab_procedures 0.27293844 -0.002151718 0.02239296
## num_procedures 0.40269545 -0.016042113 -0.02602895
## num_medications 1.00000000 0.042216488 0.02420536
## number_outpatient 0.04221649 1.000000000 0.09645106
## number_emergency 0.02420536 0.096451065 1.00000000
## number_inpatient 0.10364130 0.085257136 0.18444792
## number_diagnoses 0.27051709 0.087293152 0.05916346
## number_inpatient number_diagnoses
## time_in_hospital 0.21252145 0.25873870
## num_lab_procedures 0.09437065 0.16866833
## num_procedures -0.01162087 0.08947737
## num_medications 0.10364130 0.27051709
## number_outpatient 0.08525714 0.08729315
## number_emergency 0.18444792 0.05916346
## number_inpatient 1.00000000 0.11520971
## number_diagnoses 0.11520971 1.00000000
This seemed logical, since the longer a patient stayed, the more medications the patient was likely to have received. Similarly, the more medications a doctor had prescribed for a patient, the longer the patient was likely to stay in the hospital. Modeling the interactions is something to be considered. In this project, due to very limited computation resources and time constraints, the interactions were not included in the modeling.
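Listing the variable pairs whose correlation exceeds a threshold can be done directly from the correlation matrix, as in this sketch. The toy values are constructed by hand so that exactly one pair is strongly correlated; the 0.4 threshold follows the text, while the object names are assumptions.

```r
# Toy data: the first two columns move together, the third is unrelated.
toy <- data.frame(
  time_in_hospital = c(1, 2, 3, 4, 5, 6, 7, 8),
  num_medications  = c(2, 1, 4, 3, 6, 5, 8, 7),
  number_emergency = c(1, -1, -1, 1, 1, -1, -1, 1)
)

cm <- cor(toy)

# Look only at the upper triangle so each pair is reported once.
hits <- which(abs(cm) > 0.4 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[hits[, "row"]],
  var2 = colnames(cm)[hits[, "col"]],
  r = cm[hits]
)
print(high_pairs)
```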
## 0 missing values of all numeric variables
## 2417 missing values of all factor variables
To facilitate feature selection, another tool was employed.
A forest spirit in Slavic mythology, Boruta (also called Leśny or Lešny) was portrayed as an imposing figure with horns on the head, surrounded by packs of wolves and bears. In R, Boruta is a helpful package for feature selection.
The dataset was then partitioned into a 70/30 split for training and testing. The training part was subsequently also used by Boruta to confirm features.
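A 70/30 partition can be sketched in base R as below. The seed, the toy data, and the use of simple random sampling are assumptions; the source does not say whether the split was stratified by outcome.

```r
set.seed(42)  # arbitrary seed, for reproducibility of the sketch only

# Toy data: 100 rows with a rare positive outcome (values are made up).
toy <- data.frame(id = 1:100, readmitted = rep(c(0, 1), times = c(90, 10)))

# Draw 70% of the row indices for training; the rest is the test set.
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

cat("Training rows:", nrow(train), "| test rows:", nrow(test), "\n")
```

With an outcome as imbalanced as readmitted, a stratified split that preserves the class proportions in both parts may be worth considering.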
By default, Boruta uses Random Forest. The method performs a top-down search for relevant features by comparing original attributes’ importance with importance achievable at random, estimated using their permuted copies, and progressively eliminating irrelevant features to stabilize that test.
Sample code to run Boruta is available.
A successful Boruta run classified each feature as confirmed important, tentative, or unimportant, as applicable. During a run, Boruta creates shadow variables, which are shuffled copies of the original variables, and fits them alongside the originals so that their importance shows what can be achieved at random. Variables whose importance was significantly higher than that of the best shadow variable were confirmed, and those significantly lower were rejected. Variables still unresolved when the run ended were considered tentative. The confirmed features were:
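The shadow-variable idea can be illustrated with a deliberately simplified, single-pass sketch in base R. It is not the Boruta algorithm itself: absolute correlation with the outcome is swapped in for Random Forest importance, and there is no repetition or statistical test. All names and the simulated data are assumptions.

```r
set.seed(7)  # arbitrary seed for the simulated data

n <- 500
signal <- rnorm(n)          # truly related to the outcome
noise <- rnorm(n)           # unrelated to the outcome
y <- signal + rnorm(n)
X <- data.frame(signal, noise)

# Shadow variables: each original column shuffled, which breaks any
# relationship with the outcome while keeping the column's distribution.
shadows <- as.data.frame(lapply(X, sample))
names(shadows) <- paste0("shadow_", names(X))

# Stand-in importance measure (Boruta itself uses Random Forest importance).
importance <- function(col) abs(cor(col, y))
imp_real <- vapply(X, importance, numeric(1))
imp_shadow <- vapply(shadows, importance, numeric(1))

# A real variable is a candidate only if it beats the best shadow variable.
threshold <- max(imp_shadow)
beats_shadow <- imp_real > threshold
print(round(c(imp_real, threshold = threshold), 3))
```

Boruta repeats this comparison over many Random Forest runs and confirms or rejects a variable only once the evidence is statistically significant, which is why a single pass like this one cannot settle borderline variables.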
## [1] "race" "age"
## [3] "admission_type_id" "discharge_disposition_id"
## [5] "admission_source_id" "time_in_hospital"
## [7] "num_lab_procedures" "num_procedures"
## [9] "num_medications" "number_outpatient"
## [11] "number_emergency" "number_inpatient"
## [13] "diag_1" "diag_2"
## [15] "diag_3" "number_diagnoses"
## [17] "A1Cresult" "metformin"
## [19] "insulin" "change"
## [21] "diabetesMed"
Finally, the prepared dataset was stored, ready for importing into a Machine Learning algorithm.
## 'data.frame': 70245 obs. of 22 variables:
## $ race : Factor w/ 5 levels "AfricanAmerican",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ age : num 1.5 1.5 1 2 1.5 1.5 1.5 1.5 1 2 ...
## $ admission_type_id : Factor w/ 2 levels "k","u": 1 1 2 2 2 2 2 1 2 2 ...
## $ discharge_disposition_id: Factor w/ 4 levels "d","h","o","u": 1 1 4 1 4 4 4 1 1 4 ...
## $ admission_source_id : Factor w/ 5 levels "b","o","r","t",..: 3 2 3 3 2 2 3 4 3 2 ...
## $ time_in_hospital : num 2.92 2.92 2.92 2.92 2.92 ...
## $ num_lab_procedures : num -0.80047 1.70905 -0.148 0.00257 1.35772 ...
## $ num_procedures : num 0.857 -0.271 -0.271 0.293 0.857 ...
## $ num_medications : num 1.359 0.31 0.194 -0.506 1.942 ...
## $ number_outpatient : num -0.275 -0.275 -0.275 -0.275 -0.275 ...
## $ number_emergency : num -0.194 -0.194 -0.194 -0.194 -0.194 ...
## $ number_inpatient : num -0.394 -0.394 -0.394 -0.394 -0.394 ...
## $ diag_1 : Factor w/ 9 levels "Circulatory",..: 7 1 6 2 3 9 1 1 2 1 ...
## $ diag_2 : Factor w/ 9 levels "Circulatory",..: 7 2 5 2 9 1 5 1 2 1 ...
## $ diag_3 : Factor w/ 9 levels "Circulatory",..: 3 2 2 1 1 2 1 1 3 6 ...
## $ number_diagnoses : num 0.341 0.341 -1.679 -2.185 0.846 ...
## $ A1Cresult : Factor w/ 4 levels ">7",">8","None",..: 3 4 3 4 1 3 3 3 1 3 ...
## $ metformin : Factor w/ 4 levels "Down","No","Steady",..: 3 2 3 3 2 2 2 3 2 2 ...
## $ insulin : Factor w/ 4 levels "Down","No","Steady",..: 1 4 2 2 4 1 3 3 2 3 ...
## $ change : Factor w/ 2 levels "Ch","No": 1 1 1 1 1 1 1 1 2 2 ...
## $ diabetesMed : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 1 2 ...
## $ readmitted : int 0 1 0 0 0 0 1 0 0 0 ...
## race age admission_type_id
## AfricanAmerican:13005 Min. :1.000 k:62629
## Asian : 509 1st Qu.:1.000 u: 7616
## Caucasian :54004 Median :1.500
## Hispanic : 1539 Mean :1.426
## Other : 1188 3rd Qu.:1.500
## Max. :2.000
##
## discharge_disposition_id admission_source_id time_in_hospital
## d:66503 b: 5 Min. :-1.1836
## h: 561 o:37630 1st Qu.:-0.8681
## o: 20 r:22497 Median :-0.2370
## u: 3161 t: 5054 Mean : 0.0000
## u: 5059 3rd Qu.: 0.3941
## Max. : 2.9183
##
## num_lab_procedures num_procedures num_medications number_outpatient
## Min. :-2.15562 Min. :-0.8343 Min. :-1.7876 Min. :-0.2745
## 1st Qu.:-0.59971 1st Qu.:-0.8343 1st Qu.:-0.7387 1st Qu.:-0.2745
## Median : 0.05276 Median :-0.2705 Median :-0.1561 Median :-0.2745
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.70524 3rd Qu.: 0.2933 3rd Qu.: 0.5431 3rd Qu.:-0.2745
## Max. : 4.41934 Max. : 2.5486 Max. : 7.5353 Max. :36.0839
##
## number_emergency number_inpatient diag_1
## Min. :-0.1938 Min. :-0.3935 Circulatory:21209
## 1st Qu.:-0.1938 1st Qu.:-0.3935 Diabetes :11751
## Median :-0.1938 Median :-0.3935 Respiratory: 9492
## Mean : 0.0000 Mean : 0.0000 Neoplasms : 8198
## 3rd Qu.:-0.1938 3rd Qu.:-0.3935 Digestive : 6650
## Max. :98.8311 Max. :22.9485 Injury : 4827
## (Other) : 8118
## diag_2 diag_3 number_diagnoses
## Circulatory :22019 Diabetes :27358 Min. :-3.1949
## Diabetes :21545 Circulatory :21111 1st Qu.:-0.6691
## Neoplasms : 7311 Neoplasms : 6489 Median : 0.3412
## Respiratory : 7141 Respiratory : 4913 Mean : 0.0000
## Genitourinary: 5586 Genitourinary: 4356 3rd Qu.: 0.8464
## Digestive : 2967 Digestive : 2758 Max. : 4.3824
## (Other) : 3676 (Other) : 3260
## A1Cresult metformin insulin change diabetesMed
## >7 : 2882 Down : 453 Down : 7743 Ch:32390 No :16409
## >8 : 6208 No :55348 No :33410 No:37855 Yes:53836
## None:57336 Steady:13581 Steady:21756
## Norm: 3819 Up : 863 Up : 7336
##
##
##
## readmitted
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.0721
## 3rd Qu.:0.0000
## Max. :1.0000
##