Mining high-dimensional administrative claims data to predict early hospital readmissions

BACKGROUND
Current readmission models use administrative data supplemented with clinical information. However, the majority of these result in poor predictive performance (area under the curve (AUC)<0.70).


OBJECTIVE
To develop an administrative claim-based algorithm to predict 30-day readmission using standardized billing codes and basic admission characteristics available before discharge.


MATERIALS AND METHODS
The algorithm works by exploiting high-dimensional information in administrative claims data and automatically selecting empirical risk factors. We applied the algorithm to index admissions in two types of hospitalized patient: (1) medical patients and (2) patients with chronic pancreatitis (CP). We trained the models on 26,091 medical admissions and 3218 CP admissions from The Johns Hopkins Hospital (a tertiary research medical center) and tested them on 16,194 medical admissions and 706 CP admissions from Johns Hopkins Bayview Medical Center (a hospital that serves a more general patient population), and vice versa. Performance metrics included AUC, sensitivity, specificity, positive predictive values, negative predictive values, and F-measure.


RESULTS
From a pool of up to 5665 International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) diagnoses, 599 ICD-9-CM procedures, and 1815 Current Procedural Terminology codes observed, the algorithm learned a model consisting of 18 attributes from the medical patient cohort and five attributes from the CP cohort. Within-site and across-site validations had an AUC≥0.75 for the medical patient cohort and an AUC≥0.65 for the CP cohort.


CONCLUSIONS
We have created an algorithm that is widely applicable to various patient cohorts and portable across institutions. The algorithm performed similarly to state-of-the-art readmission models that require clinical data.


BACKGROUND AND SIGNIFICANCE
Administrative claims data are records of transactions that occurred between patients and healthcare providers. They are often the electronic versions of bills healthcare providers submit to payers for outpatient visits and inpatient hospital stays. The core data elements include services, diagnosis, procedures, beneficiary demographics, and admission characteristics. Administrative data are a viable source for research 1 because they are readily available, inexpensive, cover large populations, and can be available in near real time in some hospitals (depending on the source). 7][18][19] These predictive models can be used to make decisions about reimbursement, compare quality across hospitals, and perform comparative effectiveness research. 2 3
Hospitalizations account for almost 31% of healthcare expenditures. 20 Yet it is estimated that 20% of Medicare patients discharged from a hospital return within 30 days, and 34% are rehospitalized within 90 days. 21][23] The Medicare Payment Advisory Commission initiatives 24 aim to reduce hospital readmissions through payment policies that penalize excessive readmission rates. A first step to reducing preventable hospital readmissions is to identify predictors of early readmission and assess risk in individual patients. Reducing readmissions will reduce unnecessary costs and increase the value of the healthcare institution. 25
A recent systematic review 2 summarized 26 unique readmission prediction models based on administrative data, 3 electronic medical records, 26 27 or a combination of the two. 26 28 Of the 14 models using 30-day readmission as the outcome, most have poor predictive performance. Among the four models 3 26 27 29 with moderate discriminative power (area under the curve (AUC)>0.70), three 3 27 29 used clinical details and one 26 was based on small samples (700 training samples, 704 validation samples).

OBJECTIVE
In this study, we developed an algorithm to predict 30-day readmission using billing codes and basic administrative information available before discharge. We evaluated the performance using both within-site cross-validation and across-site validation at The Johns Hopkins Hospital (JHH) (a tertiary referral center) and at Johns Hopkins Bayview Medical Center (BMC) (a community hospital). We used two distinct application settings: (1) predicting 30-day outcome in medical patients and (2) predicting 30-day outcome in patients with chronic pancreatitis (CP) (known to be at high risk of readmission). 30 The resulting models demonstrated discriminative power and easy portability between the referral and community hospitals. The latter model demonstrated a disease-specific application of this modeling approach, which could be applied to a number of other clinical contexts.

Data source
The JHH's Casemix Information Management Department developed the Johns Hopkins Medicine (JHM) DataMart using Microsoft SQL Server. It includes JHH casemix and billing data (starting from 1993) and BMC casemix and billing data (starting from 1995). The core data elements at JHH and BMC include admission date, discharge date, admission type, admission source, length of stay, primary and varying numbers of secondary International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) diagnosis and procedure codes, revenue procedure codes mappable to Current Procedural Terminology (CPT) codes, and some demographic attributes.
For our study, we created two types of patient cohort from JHH and BMC (table 1). Medical patients were directly identified by an attribute in DataMart that classifies an index admission into either the 'medical' or 'surgical' service. We identified CP patients as those with an ICD-9-CM diagnosis code 577.1 at any position. For both patient cohorts, we required patients to be at least 18 years old, to have survived the hospitalization, and to have been discharged home. We defined a case of readmission as an admission within 30 days of the prior discharge date, in concordance with published research. 2 27 28 Since planned readmissions, possibly scheduled at the time of previous admissions, are unavoidable, only risk factors that discriminate unplanned readmissions from admissions not followed by a 30-day readmission are of interest. For medical patients, readmission cases that have diagnoses or procedures in chemotherapy and radiotherapy, treatment follow-up, rehabilitation, procedures not carried out, and planned surgical interventions were excluded from the study cohorts. Specific ICD-9-CM codes are provided in online supplementary table S1. The CP cohorts were more homogeneous, and therefore no exclusion was made. In the subsequent analyses, ICD-9-CM codes used to identify planned readmissions were excluded from consideration. The attributes embedded in our derived patient cohort include age, sex, ethnicity, marital status, length of stay, accumulated number of inpatient admissions in the past 5 years, admission date, discharge date, admission to the medical or surgical service, total number of diagnoses, total number of procedures, all individual ICD-9-CM diagnoses and procedures, and all CPT codes that occurred during an index inpatient admission. The JHM institutional review board approved the protocol (IRB#: NA_00082734).
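The 30-day readmission definition above can be illustrated with a minimal Python sketch. It assumes the outcome is attached to the index admission (1 if the same patient is admitted again within 30 days of that admission's discharge date). The record layout (`patient_id`, `admit`, `discharge`) and the function name `label_readmissions` are hypothetical and are not taken from the DataMart. The exclusion of planned readmissions is not shown.

```python
from datetime import date


def label_readmissions(admissions, window_days=30):
    """Return one 0/1 label per admission.

    A label of 1 means the same patient has a later admission that starts
    within `window_days` of this admission's discharge date (assumed layout).
    """
    # Group admission indices by patient.
    by_patient = {}
    for idx, adm in enumerate(admissions):
        by_patient.setdefault(adm["patient_id"], []).append(idx)

    labels = [0] * len(admissions)
    for indices in by_patient.values():
        # Order each patient's admissions chronologically.
        indices.sort(key=lambda i: admissions[i]["admit"])
        for current, following in zip(indices, indices[1:]):
            gap = (admissions[following]["admit"]
                   - admissions[current]["discharge"]).days
            if 0 <= gap <= window_days:
                labels[current] = 1
    return labels


example = [
    {"patient_id": "A", "admit": date(2012, 1, 1), "discharge": date(2012, 1, 5)},
    {"patient_id": "A", "admit": date(2012, 1, 20), "discharge": date(2012, 1, 22)},
    {"patient_id": "A", "admit": date(2012, 6, 1), "discharge": date(2012, 6, 3)},
    {"patient_id": "B", "admit": date(2012, 2, 1), "discharge": date(2012, 2, 2)},
]
labels = label_readmissions(example)
```

In this toy example only the first admission of patient A is followed by another admission within the window, so it is the only one labelled 1.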

Study design
We evaluated our approach using two applications: (1) predicting 30-day outcomes in medical patients; (2) predicting 30-day outcomes in CP patients. As shown in figure 1, for each application, we curated the patient cohort from the administrative database and supplied it to the algorithm. On the training dataset, the algorithm took as input all billing codes except those used to identify planned readmissions, as well as a set of administrative attributes, and learned the model fully automatically. The output of the algorithm (the model) was a set of selected attributes together with their estimated coefficients. For within-site performance evaluation, we used fivefold cross-validation. For across-site performance evaluation, the model learned from the entire JHH cohort was tested on the BMC cohort, and in a similar fashion, the model learned from the entire BMC cohort was tested on the JHH cohort (figure 1).
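The evaluation design described above (fivefold cross-validation within each site, plus training on one site and testing on the other) can be sketched as follows. This is an illustrative outline only: the helper names `kfold_indices` and `evaluation_plan` are hypothetical, and the assumption that folds are assigned at random per admission is ours, since the text does not state how folds were formed.

```python
import random


def kfold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    # Deal shuffled indices into k folds of near-equal size.
    folds = [order[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def evaluation_plan(site_sizes, k=5):
    """List every (description, n_train, n_test) pairing used in the design."""
    plan = []
    # Within-site: k folds per site.
    for site, n in site_sizes.items():
        for fold, (train, test) in enumerate(kfold_indices(n, k)):
            plan.append((f"{site} fold {fold + 1}", len(train), len(test)))
    # Across-site: train on the whole of one site, test on the other.
    sites = list(site_sizes)
    for train_site in sites:
        for test_site in sites:
            if train_site != test_site:
                plan.append((f"{train_site} -> {test_site}",
                             site_sizes[train_site], site_sizes[test_site]))
    return plan


# Cohort sizes for the medical patient cohorts as reported in the text.
plan = evaluation_plan({"JHH": 26091, "BMC": 16194})
```

With two sites and five folds, the plan contains ten within-site fits and two across-site fits per application.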

Branch and bound algorithm
We used a five-step procedure to achieve automatic feature selection and model learning (figure 2). The latest source code for the algorithm is available at https://my.vanderbilt.edu/readmission/.
1. Defining input attributes. Input attributes included age, sex, ethnicity, marital status, length of stay, accumulated number of inpatient admissions in the past 5 years, ICD-9-CM diagnosis and procedure codes, and CPT codes. For billing codes, an attribute takes a value of 1 if that code is observed during the index admission, or 0 otherwise.
2. Curating potential attributes by prevalence prioritization. This step excluded the attributes with low prevalence (<1%) from further analyses. Prevalence is the proportion of individuals who have a specific code. Although candidate attributes were of high dimension (table 1), the majority of them occurred in <1% of the population. In hierarchical coding, increasing the granularity decreases the prevalence, 19 suggesting that generic codes may be preferable. However, fine-level information serves as a better proxy of the conditions and procedures experienced by the patient, making it easier for healthcare providers to interpret models. We therefore kept all codes at their original level.
3. Univariate variable selection. This step identified top ranked attributes. The significance of each attribute's association with the 30-day outcome was defined as the likelihood ratio test (LRT) p value of a fitted logistic regression in which the response was the 30-day outcome and the explanatory factor was the tested attribute. Attributes more significant than the user-supplied threshold (ie, with p values below the cutoff) were retained for further analyses.
4. Multivariate variable selection. This step performed subset selection for the multivariate model by forward variable selection along a path generated by random ordering of the variables. An LRT was conducted to determine whether the addition of a new attribute significantly reduced the unexplained variance. An insignificant attribute was dropped, and the next attribute in the path was added. The process was repeated until all attributes had been tested once. Further details about this approach and discussion of its limitations are provided in online supplementary text and in the source code.
5. Final model. We estimated a risk score for each patient as the predicted probability of 30-day readmission conditional on attributes selected from step 4. We performed logistic regression on the selected attributes using the glm function in the statistical program R.
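Steps 2 and 3 above can be illustrated with a short Python sketch for binary code indicators. It is not the authors' R implementation. It relies on the fact that a logistic regression with a single 0/1 predictor reproduces the observed outcome rate in each group, so the LRT statistic can be computed in closed form. The function names and the toy data are hypothetical, continuous attributes (age, length of stay) would need an iterative fit, and step 4 (multivariate forward selection) is not shown.

```python
import math


def prevalence_filter(columns, min_prevalence=0.01):
    """Step 2: keep indicators observed in at least `min_prevalence` of admissions."""
    return {code: col for code, col in columns.items()
            if sum(col) / len(col) >= min_prevalence}


def _binomial_loglik(events, total):
    """Maximised Bernoulli log-likelihood for `events` successes in `total` trials."""
    if total == 0 or events == 0 or events == total:
        return 0.0
    p = events / total
    return events * math.log(p) + (total - events) * math.log(1 - p)


def univariate_lrt_pvalue(indicator, outcome):
    """Step 3: LRT p value for one binary attribute against a binary outcome."""
    n1 = sum(indicator)
    n0 = len(indicator) - n1
    e1 = sum(y for x, y in zip(indicator, outcome) if x == 1)
    e0 = sum(outcome) - e1
    full = _binomial_loglik(e1, n1) + _binomial_loglik(e0, n0)
    null = _binomial_loglik(e1 + e0, n1 + n0)
    statistic = max(0.0, 2.0 * (full - null))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2.0))


def univariate_selection(columns, outcome, p_cutoff):
    """Rank attributes by LRT p value and keep those below `p_cutoff`."""
    pvalues = {code: univariate_lrt_pvalue(col, outcome)
               for code, col in columns.items()}
    kept = [code for code in sorted(pvalues, key=pvalues.get)
            if pvalues[code] < p_cutoff]
    return kept, pvalues


# Toy data: 200 admissions, 50 of them followed by a readmission.
n = 200
outcome = [1 if (i < 40 or 100 <= i < 110) else 0 for i in range(n)]
columns = {
    "A": [1 if i < 100 else 0 for i in range(n)],   # associated with outcome
    "B": [1 if i % 2 == 0 else 0 for i in range(n)],  # unrelated to outcome
    "C": [1 if i == 0 else 0 for i in range(n)],    # prevalence 0.5%
}
common = prevalence_filter(columns)
kept, pvalues = univariate_selection(common, outcome, p_cutoff=0.01)
```

In the toy data, code C is removed by the prevalence filter, code B shows no association, and only code A passes the univariate cutoff.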

Issues related to parameter adjustment
The algorithm required two parameters: the p value for univariate variable selection and the p value for multivariate variable selection, which were chosen via internal fivefold cross-validation. The search of p values was on a log10 scale. The search spaces for the first p values were those that allow 0.5-1.5% of the total observed codes to be included in the logistic regression model, and the search spaces for the second p values were those that allow <0.5% of the total observed codes to be in the final model (table 2).
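As a rough illustration of how such a search space might be built for the univariate cutoff, the sketch below enumerates cutoffs of the form 10^-k and keeps those that retain between 0.5% and 1.5% of the total observed codes. The integer-exponent grid, the function name `candidate_cutoffs`, and the toy p values are assumptions made for illustration. The subsequent choice among candidates by internal fivefold cross-validation is not shown.

```python
def candidate_cutoffs(pvalues, total_codes, low=0.005, high=0.015, max_exponent=50):
    """Return (cutoff, retained) pairs on a log10 grid.

    A cutoff is kept when the number of attributes with p value below it
    is between `low` and `high` of `total_codes`.
    """
    cutoffs = []
    for exponent in range(1, max_exponent + 1):
        cutoff = 10.0 ** -exponent
        retained = sum(1 for p in pvalues if p < cutoff)
        if low <= retained / total_codes <= high:
            cutoffs.append((cutoff, retained))
    return cutoffs


# Toy p values for 1000 observed codes (hypothetical numbers).
toy_pvalues = [3e-30] * 4 + [3e-12] * 6 + [3e-4] * 10 + [0.5] * 980
grid = candidate_cutoffs(toy_pvalues, total_codes=len(toy_pvalues))
```

Here the cutoffs from 1e-4 to 1e-11 each retain 10 of 1000 codes (1%) and therefore fall inside the search space, while looser or stricter cutoffs do not.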

Evaluation of performance
Within-site performance was evaluated by fivefold cross-validation. The complete algorithm (step 1 through step 5) was applied to the four training folds, and the final model learned from the four folds was validated on the remaining fold. We reported performance metrics such as AUC, sensitivity, specificity, optimal F-measure, positive predictive values, and negative predictive values at the optimal cutoff (averaged over folds for cross-validation). We plotted receiver operating characteristic curves using the ROCR package. 31 We based our interpretation of the model on the final models learned from the entire JHH cohort because it was the larger cohort.
Table 2 footnote: The models refer to the ones learned from the entire patient cohort that were subject to across-site analyses using the p value cutoffs identified via internal cross-validation. BMC, Bayview Medical Center; CP, patients with chronic pancreatitis; JHH, Johns Hopkins Hospital; ME, all patients admitted to the medical service.
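The performance metrics named in this section can be computed as in the following Python sketch. It is not the ROCR-based code used in the study. AUC is obtained from the pairwise rank identity (quadratic in cohort size, so suitable only for small examples), and the optimal cutoff is taken to be the one that maximizes the F-measure. The function names and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve: probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def metrics_at(scores, labels, cutoff):
    """Confusion-matrix metrics when predicting readmission for score >= cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)

    def ratio(a, b):
        return a / b if b else 0.0

    sensitivity = ratio(tp, tp + fn)
    ppv = ratio(tp, tp + fp)
    return {
        "cutoff": cutoff,
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "ppv": ppv,
        "npv": ratio(tn, tn + fn),
        "f_measure": ratio(2 * ppv * sensitivity, ppv + sensitivity),
    }


def best_f_measure(scores, labels):
    """Metrics at the observed score cutoff that maximizes the F-measure."""
    return max((metrics_at(scores, labels, c) for c in sorted(set(scores))),
               key=lambda m: m["f_measure"])


# Toy predicted risks and observed 30-day outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
area = auc(scores, labels)
best = best_f_measure(scores, labels)
```

For cross-validation, these quantities would be computed on each held-out fold and then averaged.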

RESULTS
We list basic demographic and admission characteristics for the patient cohorts from JHH and BMC in table 1. The JHH cohorts had longer lengths of stay (Wilcoxon test p value=1.6E-120), possibly because the patient population included sicker patients seen at a referral center.

Evaluation on medical patients
In the JHH medical patient cohort, of 26 091 indexed admissions between January 2012 and April 2013 (corresponding to 18 974 individuals), 11.5% were unplanned readmissions within 30 days of the prior discharge. From a pool of 5665 ICD-9-CM diagnoses, 599 ICD-9-CM procedures, and 1815 CPT procedures (each with at least one occurrence in the population), the learned model contained 18 attributes (one ICD-9-CM procedure, 16 CPT procedures, and one administrative attribute). We list the coefficients' estimates and their p values for the top 10 attributes in table 3, and provide a full model in online supplementary table S2. Across-site validation on the BMC cohort had AUC=0.81 (figure 3A). We provide the model derived from the entire BMC cohort (n=16 194) in online supplementary table S3. Across-site validation on the JHH cohort had AUC=0.78 (figure 3B). In general, the algorithm showed similar performance for JHH and BMC. Across-site validations showed similar performance to within-site fivefold cross-validation (AUC=0.75±0.01 on the JHH cohort and AUC=0.79±0.02 on the BMC cohort). Other performance metrics are shown in table 4.
Of the 18 attributes in the model derived from the JHH cohort, the top predictor was accumulated number of inpatient admissions in the past 5 years (t test p value=3.13E-172). Consistent with previous studies of readmissions, 28 29 32 33 the accumulated number of prior admissions was an important predictor of early readmission.

Evaluation on CP patients
In the JHH CP cohort, of 3218 indexed admissions between January 2007 and April 2013 (corresponding to 1763 individuals), 15.6% were readmissions within 30 days of the previous discharge. There were 3696 ICD-9-CM diagnoses, 751 ICD-9-CM procedures, and 1333 CPT procedures that occurred at least once in the population, of which seven ICD-9-CM diagnoses, four ICD-9-CM procedures, 24 CPT procedures, and one administrative attribute passed univariate variable selection (we list their Spearman correlation coefficients in figure 4). The final model consisted of five attributes (one ICD-9-CM procedure, three CPT procedures, and one administrative attribute); we list their coefficients' estimates and p values in table 5. Testing the model on the BMC cohort produced an AUC=0.65 (figure 3C). We provide the model derived from the BMC CP cohort (n=706) in online supplementary table S4. Testing it on the JHH cohort produced an AUC=0.73 (figure 3D). We list various performance metrics in table 4.
Of the five attributes in the model derived from JHH, the strongest predictor was the accumulated number of inpatient admissions over the past 5 years (t test p value=1.41E-28), the same attribute as that of medical patients. Other predictors were disease specific. The next two predictors were lipase (CPT: 83690, t test p value=1.78E-07), which is related to pancreatic enzyme analysis, and radical pancreaticoduodenectomy (ICD-9-CM procedure: 527, LRT p value=6.45E-20), a surgical procedure for total pancreatic resection. The other two attributes were arterial blood gas (CPT: 82803, t test p value=3.56E-01) and surgical pathology (CPT: 88309, t test p value=6.26E-02).

Significance of the method
We developed an algorithm to predict hospital readmission based on standardized billing codes and basic administrative attributes that are readily accessible in most billing data. Our data-driven method was inspired by pharmacoepidemiologic studies, 19 34 which identified empirical covariates such as diagnoses and medications from administrative data to adjust for confounders. We subsequently evaluated the predictive performance of the algorithm on large patient cohorts at both a tertiary and community hospital. The proposed algorithm has four significant advantages. First, it is fully automatic and can be applied to a wide range of patient cohorts, from medical patients to patients with specific diseases such as CP. Second, a type of branch and bound feature selection was applied by exploiting high-dimensional candidate attributes. No assumptions were made about the potential predictors of interest, with the algorithm automatically selecting top ranked, independent attributes specific to the given patient cohort. Such automated routines are most appropriate for fast exploitation of a large number of attributes to identify candidate risk factors and their surrogates for further detailed analysis, since little is known about the potential risk factors at this preliminary stage of discovery and there is no a priori reason to prefer one candidate attribute over another. Third, the algorithm requires only basic administrative attributes and billing codes that are readily available and standardized in billing databases across institutions. We were able to develop the model using a referral center cohort, and successfully validate it on a community hospital cohort (and vice versa) because of its easy portability. Although it is possible that additional information that is not stored as part of billing records or within standard electronic medical records may add to the predictability of 30-day readmission, such as functional status or physiological status at the time of discharge, 29 these measures are not universally available. We aimed to create an algorithm that was widely applicable to various settings, and therefore did not include these measures as potential attributes. Fourth, our method demonstrated fairly good discriminatory power to predict the risk of readmission in medical patients (internal validation, AUC=0.75±0.01 and 0.79±0.02; across-site validation, AUC=0.78 and 0.81). In a recent systematic review 2 that summarized 14 unique prediction models for 30-day readmission, of eight models targeting medical patients, only two had an AUC >0.70, 3 26 and only one had an AUC >0.75 (this study was based on a much smaller cohort with a training size of 700 patients and validation size of 704 patients; internal validation, AUC=0.77 26). The most recently published study 29 on 30-day readmission in medical patients had an AUC of 0.71 via internal validation. We had hoped to compare our model with the most recent model, but we were not able to because three laboratory test values required by the model were not available in our administrative data. In summary, our model was created in a fully automatic fashion yet still showed comparable or even better performance when compared with state-of-the-art approaches that involve data not available in standard billing records.
Table 4 footnote: We report sensitivity, specificity, positive predictive values (PPVs), and negative predictive values (NPVs) at the cutoff thresholds that maximize the F-measure. For cross-validation, the table shows the mean±SD of the performance metric over the folds. AUC, area under the curve; BMC, Bayview Medical Center; CP, patients with chronic pancreatitis; JHH, Johns Hopkins Hospital; ME, all patients admitted to the medical service.
Admittedly, there are many other machine learning techniques serving similar purposes. There are several reasons why we prefer the current approach to many others. (1) Our approach is simpler and more intuitive than other sophisticated machine learning techniques; it is important for non-statisticians such as clinicians and administrative staff to understand the method in order to adopt it. (2) Our choice of approach is also related to the administrative data we used. Unlike clinical data, which contain fine-grained detail at high accuracy, administrative data only provide coarse-grained information about a patient. Given that the documented diagnoses and procedures themselves only serve as proxies for the true conditions experienced by the patients and that our goal is fast exploitation of all attributes to identify candidate risk factors or their surrogates for further detailed clinical investigation, we believe the current approach is sufficient for pivotal discovery. (3) We compared the computation and predictive performance of our approach with the standard stepwise forward approach: our approach has lower computational complexity, but similar predictive performance (using AUC as the performance metric). Since we aimed to produce an algorithm applied to high-dimensional data for cross-institutional analyses, speed is of key concern here. (4) The contribution of our study lies in the algorithmic workflow as a whole; further research can build on our proposed framework while replacing the simple approach with more sophisticated machine learning techniques.
Figure 4 Correlation matrix for the 37 attributes in Johns Hopkins Hospital chronic pancreatitis (CP) cohort after univariate variable selection. The matrix represents the set of attributes highly correlated with outcome. The final model contains five attributes that independently contribute to predicting outcome. Darker color corresponds to higher correlation, and lighter color corresponds to lower correlation. Diagonal entries represent self-correlation, which is always 1.0.

Significance of the discovery
While the large cohort of medical patients served as a good patient population for evaluating the performance of our algorithm relative to recent readmission prediction approaches, the additional application to a cohort of patients with a specific disease known to be at high risk of readmission may allow targeted interventions to reduce early readmissions. CP was selected for evaluation for several reasons. First, there is no current prediction model available for readmission in CP patients. Second, many patients with this condition need early unplanned readmission to hospital, resulting in significant healthcare costs. Finally, the incidence of CP is increasing in the USA. During hospitalizations, gastroenterologists often guide care. For an intervention to be implemented for patients with CP, the gastroenterologist must agree that these predictors have content validity before they are likely to support a readmission intervention as part of their care. Here, we implemented the algorithm in a CP patient cohort with the aim of identifying individual-level risk factors for preventable readmission. The risk factors, together with their highly correlated proxies (figure 4), differ from those for all patients admitted to the medical service and are specific to CP patients. Our model provides validated predictors for readmission on a going-forward basis, creating a foundation for identifying potential targeted interventions. For example, in our CP cohort, this could mean real-time monitoring of specific test utilization associated with disease-specific readmission (eg, lipase and blood gas) through the electronic medical record, while also advocating more aggressive follow-up in a specific sub-population (eg, patients who have had a pancreaticoduodenectomy). The gastroenterology co-investigators are in the process of designing an intervention that will facilitate the adoption of best practice to prevent costly readmission of CP patients identified as high risk on the basis of predictors from our model and their proxies. The intervention may include delaying discharge, scheduling early outpatient follow-up, or phone calls from or home visits by gastroenterology nurses soon after discharge. In contrast, if the patient is predicted to be at low risk of readmission, the clinician might feel comfortable discharging him or her without the need for these interventions to prevent readmission.

Study limitations
Our study has several limitations. First, models based on administrative claims data may contain coding irregularities and lack clinical details. In addition, they generally cannot be used early in the course of an initial hospitalization, limiting the time and interventions available to effectively prevent potentially avoidable readmissions, especially for admissions with a short length of stay. Second, we were not able to identify readmissions to hospitals outside of JHH and BMC because of the lack of a universal identifier shared across sites. Third, our identification of unplanned readmissions was based on clinical judgment, since there is no follow-up indicator variable available in our administrative data. It is likely that our exclusion of planned readmissions is incomplete. Fourth, our identification of a CP patient cohort based solely on ICD-9-CM diagnosis codes may not be accurate enough, and predictive performance may be compromised as a result. Except for certain medical conditions, such as hip fracture and cancer, diagnosis/procedure codes for many other medical conditions, such as peripheral vascular disease (sensitivity=0.58, positive values=0.53), are of limited accuracy. 35 For example, the absence of a chronic disease diagnosis code may be the result of more serious and acute illnesses that push the chronic condition off the list. 34 However, we found the average diagnosis ranking of CP was 4, and our administrative data allow up to 50 diagnosis positions, suggesting that this potential limitation had minimal impact on our cohort identification. Fifth, the predictive performance of CP-specific readmission is lower than that of medical patients. Possible reasons may be related to cohort identification (discussed in the section above) and the small size of the validation cohort at BMC (n=706) (a significant proportion of codes that occurred in the JHH CP cohort never occurred in the BMC CP cohort, whereas this is not the case for the medical patient cohort). Another possible reason is that, whereas the CP cohort is a homogeneous population, the medical patient cohort contains various disease types. Patients with some types of disease were more likely than average to be readmitted, and surrogates of such diseases increase the predictive performance of the model, although these predictors may not be the driving underlying reason for readmission itself. Sixth, we were unable to incorporate clinical laboratory data in our model and as a result were not able to directly compare its performance with other recent approaches. Nevertheless, our model had performance characteristics that were comparable to or better than existing alternatives. Seventh, there are many other more sophisticated machine learning techniques for variable selection than our proposed approach. However, given that the administrative data only provide coarse information on the patients, and the documented diagnoses and procedures themselves only serve as proxies for the true conditions experienced by the patients, we felt that a simple intuitive approach, which would be more acceptable for clinicians and administrative staff, would suffice for our purposes. In fact, a comparison with another variable-selection method reveals similar predictive performances (see online supplementary text).

CONCLUSION
This study presents a branch and bound algorithm for predicting 30-day readmission using administrative claims data. Through exploitation of high-dimensional yet universally available information in claims, the algorithm is widely applicable to various patient cohorts across both tertiary and community-based hospital centers, with good performance characteristics. In summary, the advantage of the algorithm is its ability to maintain performance while maximizing portability and adaptability. Future application of this approach could include the use of nationally available data such as Medicaid and Medicare billing records for cross-institution analyses.

Figure 1
Figure 1 Evaluation workflow. The figure shows the model learning on the entire Johns Hopkins Hospital (JHH) cohort and tested on the entire Bayview Medical Center (BMC) cohort. Training on BMC and testing on JHH were performed in a similar fashion (not pictured).

Figure 2
Figure 2 Overview of the algorithm. The flow chart outlines the five steps in the algorithm, which can be classified as branch and bound. CPT, Current Procedural Terminology; ICD9, International Classification of Diseases, 9th Revision.

Figure 3
Figure 3 Receiver operating characteristic curves for across-site analyses on medical patient (ME) cohort and chronic pancreatitis (CP) cohort. (A) Training ME cohort on Johns Hopkins Hospital (JHH) and testing on Bayview Medical Center (BMC) (area under the curve (AUC)=0.81). (B) Training ME cohort on BMC and testing on JHH (AUC=0.78). (C) Training CP cohort on JHH and testing on BMC (AUC=0.65). (D) Training CP cohort on BMC and testing on JHH (AUC=0.73). Line color: threshold cutoff.

Table 1
Characteristics of study cohorts. Number of observed diagnoses/procedures/CPTs is the total number of codes that appear at least once in the population cohort. BMC, Bayview Medical Center; CP, patients with chronic pancreatitis; CPT, Current Procedural Terminology; JHH, Johns Hopkins Hospital; LOS, length of stay; ME, all patients admitted to the medical service.

Table 2
Parameter adjustment and model size after univariate variable selection and multivariate variable selection

Table 3
Coefficients for the top 10 attributes included in model derived from the JHH ME cohort. We provide the complete model in online supplementary table S2. Estimate, estimated coefficients; Pr(>|z|), coefficient t test p values. CPT, Current Procedural Terminology; JHH, Johns Hopkins Hospital; ME, all patients admitted to the medical service.

Table 4
Model performance