This is the second article of the series and highlights the development and assessment of an ensemble learning model.

In Part 1, I detailed the process of analyzing and preparing the data set, Diabetes 130-US hospitals for years 1999-2008, downloaded from the UCI Machine Learning Repository. Here, to continue, I developed a Machine Learning model with ensemble learning for predicting hospital readmissions. A locally run H2O cluster on my laptop served as the development environment. The developed model was a stacked ensemble with the following four algorithms, or learners: Random Forest, Gradient Boosting Machine (GBM), Generalized Linear Model (GLM), and Deep Learning.

The logical steps for constructing and conducting the ensemble learning, along with pertinent information, are highlighted in the sections below.

Data Set

The data set employed for developing the ensemble described in this article was slightly different from the finalized data set in Part 1. Nevertheless, the preparation process was essentially identical, other than that the feature set was based on the results of a different Boruta run.

## Imported file: dataimp.csv with 70245 obs. and 27 variables

The imported data set had the following structure, where ‘readmitted’ was the label:

## 'data.frame':    70245 obs. of  27 variables:
##  $ race                    : Factor w/ 5 levels "AfricanAmerican",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 1 1 1 1 1 1 ...
##  $ age                     : num  1.5 1.5 1 2 1.5 1.5 1.5 1.5 1 2 ...
##  $ admission_type_id       : Factor w/ 2 levels "k","u": 1 1 2 2 2 2 2 1 2 2 ...
##  $ discharge_disposition_id: Factor w/ 4 levels "d","h","o","u": 1 1 4 1 4 4 4 1 1 4 ...
##  $ admission_source_id     : Factor w/ 5 levels "b","o","r","t",..: 3 2 3 3 2 2 3 4 3 2 ...
##  $ time_in_hospital        : num  2.92 2.92 2.92 2.92 2.92 ...
##  $ num_lab_procedures      : num  -0.80047 1.70905 -0.148 0.00257 1.35772 ...
##  $ num_procedures          : num  0.857 -0.271 -0.271 0.293 0.857 ...
##  $ num_medications         : num  1.359 0.31 0.194 -0.506 1.942 ...
##  $ number_outpatient       : num  -0.275 -0.275 -0.275 -0.275 -0.275 ...
##  $ number_emergency        : num  -0.194 -0.194 -0.194 -0.194 -0.194 ...
##  $ number_inpatient        : num  -0.394 -0.394 -0.394 -0.394 -0.394 ...
##  $ diag_1                  : Factor w/ 9 levels "Circulatory",..: 7 1 6 2 3 9 1 1 2 1 ...
##  $ diag_2                  : Factor w/ 9 levels "Circulatory",..: 7 2 5 2 9 1 5 1 2 1 ...
##  $ diag_3                  : Factor w/ 9 levels "Circulatory",..: 3 2 2 1 1 2 1 1 3 6 ...
##  $ number_diagnoses        : num  0.341 0.341 -1.679 -2.185 0.846 ...
##  $ A1Cresult               : Factor w/ 4 levels ">7",">8","None",..: 3 4 3 4 1 3 3 3 1 3 ...
##  $ metformin               : Factor w/ 4 levels "Down","No","Steady",..: 3 2 3 3 2 2 2 3 2 2 ...
##  $ glipizide               : Factor w/ 4 levels "Down","No","Steady",..: 3 2 3 4 2 2 3 3 2 2 ...
##  $ glyburide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 2 2 2 2 2 2 ...
##  $ pioglitazone            : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ rosiglitazone           : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ insulin                 : Factor w/ 4 levels "Down","No","Steady",..: 1 4 2 2 4 1 3 3 2 3 ...
##  $ change                  : Factor w/ 2 levels "Ch","No": 1 1 1 1 1 1 1 1 2 2 ...
##  $ diabetesMed             : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 1 2 ...
##  $ readmitted              : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 2 1 1 1 ...

Subsetting and Partitioning the Data

Since the Machine Learning results and statistics presented in this article are generated dynamically, to reduce the wait time and computing resource requirements, I used a subset of the imported data set to demonstrate the development of an ensemble. I further partitioned the data into three parts for training, cross-validation, and testing. The actual data employed for developing and conducting the ensemble learning had the following configuration:
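
The split itself can be sketched along the following lines, where dataimp (the imported data frame), the subset size, and the sampling approach are illustrative assumptions rather than the project's exact code:

{ # SUBSETTING AND PARTITIONING (ILLUSTRATIVE SKETCH)
  set.seed(55)
  sub  <- dataimp[sample(nrow(dataimp), 14048), ]  # demonstration subset
  part <- sample(c('train','valid','test'), nrow(sub),
                 replace=TRUE, prob=c(.60, .20, .20))
  train <- sub[part=='train', ]
  valid <- sub[part=='valid', ]
  test  <- sub[part=='test', ]
}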

## Sampling 14048 obs. with indicated percentage into three partitions: 
## 
## Training data   ( 60 %) = 8302 obs. 
## Validation data ( 20 %) = 2688 obs. 
## Testing data    ( 20 %) = 3058 obs.

Class Imbalance

While examining the training data, it was apparent that the label, readmitted, had a highly disproportionate distribution of values. This was problematic, as class imbalance tends to overwhelm a model and lead to incorrect classification: during training, the model would learn much more about, and become prone to classifying, the over-sampled class, while knowing little about how to classify the under-sampled class. Consequently, a model trained with imbalanced class data could produce a high misclassification rate. Additional information on the Class Imbalance Problem is available elsewhere.
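
A quick tabulation of the label in the training partition makes the skew easy to see; for example:

# CHECKING THE LABEL DISTRIBUTION OF THE TRAINING PARTITION
table(train$readmitted)                        # raw counts per class
round(prop.table(table(train$readmitted)), 2)  # proportions per class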

SMOTE

To circumvent the imbalance issue, I used SMOTE from the package Data Mining with R (DMwR) to generate more balanced sets of label values for training. Below, one SMOTE’d sample set (right) is derived from a set with high class imbalance (left).

Notice that the balance between oversampling and undersampling is configurable with perc.over and perc.under, as detailed in the documentation.
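
As a sketch, a SMOTE’d training set can be produced along these lines, where the name train.smote and the perc.over/perc.under values are illustrative assumptions rather than the settings tuned for this project:

library(DMwR)  # Data Mining with R, provides SMOTE()

# BALANCING THE TRAINING PARTITION (ILLUSTRATIVE SETTINGS)
# perc.over=100 synthesizes one new minority case per existing one;
# perc.under=200 keeps two majority cases per synthetic minority case.
train.smote <- SMOTE(readmitted ~ ., data=train, perc.over=100, perc.under=200)
table(train.smote$readmitted)  # verify the improved balance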

Ensemble Learning

For a complex problem like hospital readmissions, realizing and optimizing the bias-variance tradeoff is a challenge. Using ensemble learning to complement some algorithms’ weaknesses with others’ strengths, by evaluating, weighting, combining, and optimizing their results, seemed the right strategy and a logical approach. The following illustrates the concept of ensemble learning.

H2O

As opposed to preparing the data with R/RStudio, I ran a local cluster on my laptop using H2O, which provides a user-friendly framework and essentially frees a Machine Learning developer from the mechanics of setting up a cluster and orchestrating cross-validation for each algorithm. One particularly important benefit of H2O for me was its speed and relatively low resource requirements. Overall, H2O worked well throughout this project.

Cluster Initialization
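
For reference, such a cluster can be started with h2o.init(); a call along these lines reproduces the connection summary below, where nthreads and max_mem_size are assumptions based on the reported cores and memory:

library(h2o)
h2o.init(port=13579, nthreads=4, max_mem_size='4g')  # start or attach to a local cluster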

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         40 minutes 9 seconds 
##     H2O cluster timezone:       America/Chicago 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.22.1.1 
##     H2O cluster version age:    1 month and 11 days  
##     H2O cluster name:           H2O_started_from_R_da_bfs143 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        13579 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.2 (2018-12-20)

Data Frames Conversion

For fitting a model, data frames must be loaded into an H2O cluster. Since the data were prepared and stored in memory as R objects, I first needed to convert them to H2O objects. I also set up the label, readmitted, as the response variable and the rest as predictors.

{ # CONVERTING DATA PARTITIONS TO H2O OBJECTS
  training_frame   <- as.h2o(train)
  validation_frame <- as.h2o(valid)
  testing_frame    <- as.h2o(test)

  # SETTING UP THE LABEL
  y <- 'readmitted'
  x <- setdiff(names(training_frame), y)
}

Algorithms/Learners

Hospital readmission is a binary classification problem: a patient is either readmitted or not. To develop ensemble learning, the task at this point was to investigate and select a set of algorithms, or learners, complementary to one another, to form an ensemble model. A number of algorithms are well known for solving classification problems, including Random Forest and Gradient Boosting. In this project, four algorithms conveniently included in H2O, namely Random Forest, Gradient Boosting Machine (GBM), Generalized Linear Model (GLM), and Deep Learning, were configured as learners with mostly default settings to form the ensemble.

{ #----------
  # LEARNERS
  #----------
  nfolds <- 10  # cross-validation folds shared by all learners
  seed   <- 55  # common seed for reproducibility

  rf <- h2o.randomForest(x, y, model_id='rf', nfolds=nfolds, seed=seed
    ,training_frame=training_frame, validation_frame=validation_frame
    ,fold_assignment='Modulo', keep_cross_validation_predictions=TRUE)
  gbm <- h2o.gbm(x, y, model_id='gbm', nfolds=nfolds, seed=seed
    ,training_frame=training_frame, validation_frame=validation_frame
    ,fold_assignment='Modulo', keep_cross_validation_predictions=TRUE)
  glm <- h2o.glm(x, y, model_id='glm', nfolds=nfolds, seed=seed, family='binomial'
    ,training_frame=training_frame, validation_frame=validation_frame
    ,fold_assignment='Modulo', keep_cross_validation_predictions=TRUE)
  dl <- h2o.deeplearning(x, y, model_id='dl', nfolds=nfolds, seed=seed
    ,training_frame=training_frame, validation_frame=validation_frame
    ,fold_assignment='Modulo', keep_cross_validation_predictions=TRUE)

  models   <- list(rf@model_id, gbm@model_id, glm@model_id, dl@model_id)
  learners <- c(rf, gbm, glm, dl)
  saveRDS(learners, paste0(save.dir, 'learners.rds'))
}

Stacked Ensemble

The four learners were stacked to form an ensemble and carry out the learning. For the stacking to work, all learners must be cross-validated with the same settings, i.e., the same nfolds value, fold assignment, and training frame, and must keep their cross-validation predictions.

  #-----------------
  # STACKED ENSEMBLE
  #-----------------
  stacked <- h2o.stackedEnsemble(x, y, seed=seed
      ,model_id='stacked', base_models=models
      ,training_frame=training_frame, validation_frame=validation_frame)
  saveRDS(stacked, paste0(save.dir, 'stacked.rds'))

Model Performance Assessment

For a binary classification task using data with the Class Imbalance Problem, as this model experienced, “accuracy” is not the best measure due to the “Accuracy Paradox.” There are various measures available in H2O for assessing model performance. Here, I examined logloss and AUC to evaluate the performance. The numbers shown below were rounded for presentation and readability.
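
The per-model numbers in the table below can be pulled with H2O’s standard accessors; for example, for the Random Forest learner:

# EXAMPLE: TRAINING VS. CROSS-VALIDATION METRICS FOR ONE LEARNER
h2o.logloss(rf, train=TRUE)  # train.logloss
h2o.logloss(rf, xval=TRUE)   # cv.logloss
h2o.auc(rf, train=TRUE)      # train.auc
h2o.auc(rf, xval=TRUE)       # cv.auc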

##              train.logloss cv.logloss train.auc  cv.auc    
## [1,] rf      0.24329862    0.31403355 0.98091478 0.86740404
## [2,] gbm     0.22481902    0.30448103 0.97968734 0.73887129
## [3,] glm     0.56802023    0.56947973 0.77977217 0.67792463
## [4,] dl      0.13465203    0.59842143 0.9913248  0.74567276
## [5,] stacked 0.03623253    0.24199489 0.99997423 0.85608126

For logloss, the smaller the value, the better, while the opposite holds for AUC. As expected, the stacked ensemble in general performed a little better than any individual learner. An ensemble generally should improve performance somewhat, yet the improvement should not be dramatic, say, from poor to exceptional. Regardless, a drastic performance change in an algorithm, in my opinion, always warrants further examination.

ROC Curves

The ROC curves below show how Random Forest was a strong contributor to the ensemble and performed closely to what the stacked model ultimately delivered. It may appear that the other three models were not actively contributing, and perhaps should even be excluded from a final ensemble. This is, however, not necessarily true. After all, the context of a learning environment, including the randomization, the composition, and the state of the data, can all influence an outcome.
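
For one learner, such a curve can be produced from its performance on the test frame; a minimal sketch:

# EXAMPLE: ROC CURVE OF ONE LEARNER ON THE TEST DATA
perf.rf <- h2o.performance(rf, newdata=testing_frame)
plot(perf.rf, type='roc')  # true positive rate vs. false positive rate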

Predictions and Confusion Matrix Visualization Example

Here, for demonstration, I show the distribution of the predictions and visualize the associated confusion matrix based on the test data, with a cutoff point at 0.5 that was arbitrarily chosen, as opposed to derived from an associated ROC curve.
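
A minimal sketch of generating the predictions and tabulating the confusion matrix at the 0.5 cutoff, where p1 is H2O’s predicted probability of the positive class for a binomial model:

{ # PREDICTIONS AND CONFUSION MATRIX AT THE ARBITRARY 0.5 CUTOFF
  pred      <- as.data.frame(h2o.predict(stacked, newdata=testing_frame))
  predicted <- ifelse(pred$p1 > 0.5, '1', '0')   # p1: probability of class '1'
  actual    <- as.data.frame(testing_frame)$readmitted
  table(Predicted=predicted, Actual=actual)      # confusion matrix
}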

Closing Thoughts

Finally, is it time to talk about Convolutional Neural Networks yet? Stay tuned. That’s coming soon.