The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression

Methods to correct class imbalance, i.e. imbalance between the frequency of outcome events and non-events, are receiving increasing interest for developing prediction models. We examined the effect of imbalance correction on the performance of standard and penalized (ridge) logistic regression models in terms of discrimination, calibration, and classification. We examined random undersampling, random oversampling and SMOTE using Monte Carlo simulations and a case study on ovarian cancer diagnosis. The results indicated that all imbalance correction methods led to poor calibration (strong overestimation of the probability to belong to the minority class), but not to better discrimination in terms of the area under the receiver operating characteristic curve. Imbalance correction improved classification in terms of sensitivity and specificity, but similar results were obtained by shifting the probability threshold instead. Our study shows that outcome imbalance is not a problem in itself, and that imbalance correction may even worsen model performance.


Introduction
When developing clinical prediction models for a binary outcome, the percentage of individuals with the event of interest (i.e. the event fraction) is often much lower than 50%. When the frequency of individuals with and without the event is unequal, the term 'class imbalance' is often used. 1 Class imbalance has been identified as a problem for the development of prediction models, particularly when the interest is in the classification of individuals into a high risk versus low risk group ('classifier'). [1][2][3] Commonly suggested solutions to address class imbalance include some form of resampling to create an artificially balanced dataset for model training. Common approaches are random undersampling (RUS), random oversampling (ROS), and SMOTE (Synthetic Minority Oversampling Technique). [2][3][4][5] The classification accuracy of a model that classifies individuals into a high risk vs low risk group is defined as the percentage of individuals that are either true positive (individuals that have the event and are correctly classified as high risk) or true negative (individuals that do not have the event and are correctly classified as low risk). To illustrate the possible impact of class imbalance, consider a simple model that classifies everyone as low risk. Such a classifier yields a classification accuracy of 50% if the event fraction is 50% (balanced), but a classification accuracy of 99% if the event fraction is 1% (highly imbalanced). That imbalanced datasets can easily lead to high classification accuracy is often labeled as problematic. For instance, He and Garcia write "we find that classifiers tend to provide a severely imbalanced degree of accuracy, with the majority class having close to 100 percent accuracy and the minority class having accuracies of 0-10 percent, for instance". 2 Fernandez and colleagues write "the truth is that classifiers ... 
tend to have great accuracy for the majority class while obtaining poor results (closer to 0%) for the minority class". 3 We argue that class imbalance is not a pervasive problem for prediction model development. First, the problem is specific to the classification accuracy measure. The limitations of focusing on classification accuracy as a measure of predictive performance are well known. 6,7 Second, if we consider models that produce estimated probabilities of the event of interest, an adjustment of the classification threshold probability can be used to ensure adequate classification performance (i.e. the probability threshold to classify individuals as high risk does not have to be 0.5). 8 A probability threshold to select individuals for a given treatment implies certain misclassification costs and should be determined using clinical considerations. 8 If we use a probability threshold of 0.1 to classify individuals as high risk and suggest a specific treatment, this means that we accept to treat up to 10 individuals in order to treat 1 individual with the event: we accept up to 9 false positives, or unnecessary treatments, per true positive. [9][10][11] As Birch and colleagues write, models should be able to accommodate differing attitudes regarding misclassification costs. 12 The problem then shifts from class imbalance to probability calibration: the model's probability estimates should be reliable in order to make optimal decisions. This raises the question of how class imbalance methods affect calibration.
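The accuracy arithmetic behind the "classify everyone as low risk" illustration can be checked with a minimal sketch (Python; the analyses in this paper were run in R, and the numbers here are generic, not from any study data):

```python
# Accuracy of a trivial model that classifies every individual as low risk,
# under a balanced and a highly imbalanced event fraction.
def trivial_accuracy(n, event_fraction):
    """Accuracy when every individual is classified as a non-event."""
    true_negatives = n * (1 - event_fraction)  # all non-events correctly classified
    true_positives = 0                         # all events missed
    return (true_positives + true_negatives) / n

print(trivial_accuracy(1000, 0.50))  # 0.5  -> the model looks useless
print(trivial_accuracy(1000, 0.01))  # 0.99 -> the model looks excellent, yet is equally useless
```

This is exactly why high accuracy on an imbalanced dataset carries little information on its own.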
In this study, we investigate the performance of standard and penalized logistic regression models developed in datasets with class imbalance. We hypothesize (1) that imbalance correction methods distort model calibration by leading to probability estimates that are too high, and (2) that shifting the probability threshold has similar impact on sensitivity and specificity as the use of imbalance correction methods.

Imbalance correction methods and logistic regression models
We examine RUS, ROS, and SMOTE, three common approaches to correct for class imbalance. [2][3][4][5] As lower sample size is well known to increase the risk of overfitting, we anticipated that RUS would require a larger sample size to perform well. [13][14][15] Prediction models were developed using standard maximum likelihood logistic regression (SLR) and using penalized logistic regression with the ridge (or L2) penalty (Ridge). 16 The lambda hyperparameter was tuned using a grid search based on 10-fold cross-validation. 17 See Supplementary Note for details.

Case study: estimating the probability of ovarian cancer
For illustration, we developed prediction models to estimate the risk of ovarian malignancy in premenopausal women presenting with at least one adnexal (ovarian, para-ovarian, or tubal) tumor. Prediction models for ovarian cancer diagnosis could be used to decide whether to operate and by whom (e.g. whether referral to an experienced gynecological oncologist is warranted or not). We use data from women who were recruited consecutively across three waves (1999-2005, 2005-2007, and 2009-2012) of the International Ovarian Tumor Analysis (IOTA) study. 18,19 We have ethics approval for secondary use of these data for methodological/statistical research (Research Ethics Committee University Hospitals KU Leuven, S64709). The study only included patients who were operated on, such that the reference standard could be based on histology. Borderline malignant tumors were considered malignant.
Overall, 5914 patients were recruited across the three waves, of whom 3369 were premenopausal patients aged between 18 and 59 years. The prevalence of malignancy was 20% (658/3369), reflecting moderate imbalance.
We used the following predictors: age of the patient (years), maximum diameter of the lesion (mm), and number of papillary structures (ordinal variable with values 0 to 4; 4 referring to four or more papillary structures). To investigate the performance of all models in combination with the different imbalance solutions, the data were first split into a training set and a test set using a 4:1 ratio. This yielded a training dataset of size 2695 (518 events), and a test dataset of size 674 (140 events). The training set was either unadjusted or pre-processed using RUS, ROS or SMOTE, resulting in four different datasets on which models were fitted: Dunadjusted, DRUS, DROS, and DSMOTE.
Subsequently, prediction models were developed using SLR and Ridge, resulting in 4 (datasets) x 2 (algorithms) = 8 different models. The continuous predictor variables were modeled using restricted cubic splines with 3 knots to address potential nonlinearity. The resulting models were applied to the test set to evaluate model performance in terms of the area under the ROC curve (AUROC), accuracy, sensitivity, specificity, calibration intercept and slope, flexible calibration curves, and Net Benefit (Table 1). 10,11,20,21 For classification, the 'default' risk threshold of 0.5 was used, as well as a risk threshold of 0.192 (518/2695, the prevalence of malignancy in the training dataset) when class imbalance was not corrected.
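The effect of shifting the probability threshold on sensitivity and specificity can be illustrated with a small sketch (Python; the risk estimates below are hypothetical and unrelated to the IOTA data):

```python
import random

random.seed(1)

# Hypothetical risk estimates from a model trained on imbalanced data:
# events tend to receive higher estimates than non-events, but most
# estimates sit below 0.5 because events are rare.
events     = [min(0.99, max(0.01, random.gauss(0.35, 0.15))) for _ in range(200)]
non_events = [min(0.99, max(0.01, random.gauss(0.12, 0.08))) for _ in range(800)]

def sens_spec(threshold):
    """Sensitivity and specificity when classifying 'high risk' at the given threshold."""
    sens = sum(p >= threshold for p in events) / len(events)
    spec = sum(p < threshold for p in non_events) / len(non_events)
    return sens, spec

# Default threshold 0.5 vs. a threshold near the event fraction (0.2).
print(sens_spec(0.5))  # low sensitivity, very high specificity
print(sens_spec(0.2))  # more balanced, without touching the training data
```

Lowering the threshold from 0.5 toward the event fraction raises sensitivity at the cost of specificity, achieving the same rebalancing that imbalance corrections aim for.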

Monte Carlo simulation study
We used the ADEMP (aim, data, estimands, methods, performance) guideline to design and report the simulation study. 22 Aim. The aim of this study was to investigate the impact of class imbalance corrections on model performance in terms of discrimination, calibration and classification.
Data generating mechanism. Twenty-four scenarios were investigated by varying the following simulation factors: original training set size (N) (2500 or 5000), number of predictors (p) (3, 6, 12, or 24), and outcome event fraction (0.3, 0.1, or 0.01). The values for p and the event fraction reflect common situations for clinical prediction models. 23 A sample size of 2500 will include 25 events on average when the event fraction is 1%. Smaller values for N may hence lead to computational issues. Candidate predictor variables were drawn from a multivariate standard normal distribution with zero correlation between predictors. Then, the outcome probability of each case was computed by applying a logistic function to the generated predictors. The coefficients of this function were approximated numerically for each scenario (Supplementary Note), such that the predictors were of equal strength, the c-statistic of the data generating model was approximately 0.75, and the expected outcome prevalence matched the simulation condition. The outcome variable was sampled from a binomial distribution.
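A minimal sketch of this data generating mechanism (Python; the intercept and coefficient values below are illustrative placeholders, whereas in the study they were solved numerically per scenario to hit the target event fraction and an AUROC of roughly 0.75):

```python
import math, random

random.seed(42)

def generate_data(n, p, intercept, beta):
    """Draw p independent standard-normal predictors per case and sample a
    binary outcome from a logistic model with equal-strength coefficients
    (the same beta for every predictor)."""
    data = []
    for _ in range(n):
        x = [random.gauss(0, 1) for _ in range(p)]
        lp = intercept + beta * sum(x)          # linear predictor
        prob = 1 / (1 + math.exp(-lp))          # logistic function
        y = 1 if random.random() < prob else 0  # binomial draw
        data.append((x, y))
    return data

# Illustrative values only; not the coefficients used in the study.
sample = generate_data(2500, 3, intercept=-2.3, beta=0.4)
print(sum(y for _, y in sample) / len(sample))  # realized event fraction
```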
Estimands/targets of analysis. The focus is on discrimination, calibration, and classification performance of the fitted models on a large out-of-sample dataset.
Methods. For each generated development data set, four prediction model development datasets were created: Dunadjusted, DRUS, DROS and DSMOTE. On each of these data sets, SLR and Ridge models were fit. This resulted in 8 different prediction models per simulation scenario.
Because we anticipated imbalance correction would lead to overestimation of probabilities (i.e. that the model intercept would be too high), we also implemented a logistic re-calibration approach for the models developed on DRUS, DROS and DSMOTE, resulting in another 6 models. 24 This re-calibration was done by fitting a logistic regression model on the development dataset with the logit of the estimated probabilities from the initial model as an offset variable and the intercept as the only free parameter. For each scenario, 2,000 simulation runs were performed. In each run, a newly simulated training dataset was used. To evaluate the performance of the resulting models for a given scenario, a single test set per scenario was simulated with size N = 100,000 using the same data generating mechanism.
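The intercept-only re-calibration can be sketched as a one-parameter Newton-Raphson fit (Python; the study itself used R, and the toy data below are hypothetical):

```python
import math

def recalibrate_intercept(lp, y, iters=25):
    """Fit logit(P(Y = 1)) = a + offset(LP) by Newton-Raphson: the linear
    predictor LP of the original model enters as an offset (coefficient
    fixed at 1) and only the intercept a is estimated."""
    a = 0.0
    for _ in range(iters):
        p = [1 / (1 + math.exp(-(a + l))) for l in lp]
        grad = sum(yi - pi for yi, pi in zip(y, p))   # score for the intercept
        hess = -sum(pi * (1 - pi) for pi in p)        # observed information (negated)
        a -= grad / hess
    return a

# Toy example: a model that systematically overestimates risk
# (mean predicted ~0.71 vs. observed event fraction 0.25).
lp = [-1.0, 1.0, 2.0, 3.0] * 50
y  = [0, 0, 0, 1] * 50
a = recalibrate_intercept(lp, y)
print(round(a, 2))  # clearly negative: the correction shifts all risks down
```

At convergence the score equation forces the mean re-calibrated probability to equal the observed event fraction, which is exactly the intercept correction anticipated in the text.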
Performance metrics. We applied each model on its respective test set, and calculated the AUROC, accuracy, sensitivity, specificity, calibration intercept and slope. To convert the estimated probabilities into a dichotomous prediction, a default risk threshold of 0.5 was used.
For models trained on unadjusted development datasets, we also used a threshold equal to the true event fraction. The primary metric was the calibration intercept. 20,21 Software and error handling. All analyses were performed using R version 3.6.2 (www.R-project.org). The simulation study was performed on a high-performance computing facility running on a Linux-based operating system (CentOS7). To fit the regression models, the R packages stats and glmnet version 4.0-2 were used. 25 Errors in the generation of the development data sets and estimation of the models were closely monitored (details in Supplementary Note). 27 A summary of the data sets in which data separation occurred is given in Table S1.

Case study
There was little variation in discrimination across algorithms and imbalance correction methods, with average AUROC of 0.79 to 0.80 (Table S2). The calibration curves indicate that all imbalance correction methods had a strong impact on calibration, yielding strongly overestimated probability estimates after imbalance correction but not without correction (Figure 1). This is confirmed by the calibration intercepts: these were 0.06 (95% CI -0.16 to …) without imbalance correction, with markedly negative intercepts after correction (Table S2). When using the 0.5 probability threshold on models trained on unadjusted data, specificity (96% for SLR and Ridge) was clearly higher than sensitivity (31% for SLR, 29% for Ridge). As expected, sensitivity increased and specificity decreased by changing the classification threshold for models based on unadjusted data or using the 0.5 threshold for models after imbalance correction (sensitivities between 69% and 75%, specificities between 74% and 78%).
Our results also show that the overestimation of the probability of a malignancy for models that were trained on imbalance corrected datasets could lead to overtreatment: too many individuals would exceed a given probability threshold and would be selected for treatment (for instance, referral to specialized gynecologic oncology centers for surgery). This is reflected in the Net Benefit measure of clinical utility (Figure 2). The decision curves show that models trained on imbalance corrected datasets had strongly reduced clinical utility, even to the extent that the Net Benefit was negative when using a probability threshold of 0.3 or higher to select individuals for treatment.

Simulation study
The simulation results did not provide evidence that imbalance correction methods systematically improved the AUROC compared to developing models on the original (imbalanced) training data (Figures 3 and S2-6, Table S3). The median AUROC of models trained on unadjusted data was never lower than the median AUROC of models after RUS, ROS, or SMOTE. For RUS, the median AUROC was often lower, with larger differences when event fraction was lower, training set size was lower, and number of predictors was higher.
Training models on imbalance corrected datasets resulted in severe overestimation of the probability of belonging to the minority class, as reflected in the calibration intercepts (Table S6).
This finding was more evident for lower event fraction, lower training set size, and a larger number of predictors. Median calibration slopes below 1 were also observed for SLR developed on unadjusted training data. The median slopes obtained for models developed on training data after ROS or SMOTE were still lower. When using SLR after RUS, overfitting was stronger than when the unadjusted training data were used, leading to very low calibration slopes in some scenarios.
Regarding classification, using a probability threshold of 0.5 for models trained on unadjusted data resulted in median sensitivities of 0% and median specificities of 100% when the true event fraction was 1% (Figures S23-40, Tables S7-8). More balanced results for sensitivity and specificity were obtained by either using imbalance correction methods or shifting the probability threshold (Figures S29-40).

Discussion
The key finding of our work is that training logistic regression models on imbalance corrected data did not lead to better AUROC compared to models trained on uncorrected data, but did result in strong and systematic overestimation of the probability for the minority class. This strong miscalibration reduces the clinical utility of the model: models yielding probability estimates that are clearly too high may lead to overtreatment. For example, if a model overestimates the risk of malignancy of a detected ovarian tumor, the decision to refer patients to advanced and specialized surgery may be taken too quickly.
Class imbalance is often framed as problematic in the context of prediction models that classify patients into low versus high risk groups. [1][2][3]28 Nevertheless, for clinical prediction models the accurate estimation of probabilities is essential to define such low risk and high risk groups. For instance, clinical staff using the model to support treatment decisions may choose probability thresholds that match the assumed misclassification costs. Hence, when probability estimation is important, calibration becomes a central performance criterion. 29,30 The link between correction for class imbalance and calibration of estimated probabilities is rarely made. For instance, it is not discussed in some key publications on class imbalance for prediction models. [1][2][3]29 A study from 2011 hinted at this link by stating that 'the predicted probability using a logistic regression model is closest to the true probability when the sample has the same class distribution as the original population', and that differences in class distribution between study sample and population should be avoided. 31 However, the authors did not systematically study typical imbalance correction methods, and the simulations were based on an unrealistic setting with only one predictor and a true AUROC around 0.99. Another study into imbalance corrections quantified calibration incorrectly by using the Brier score and class-specific Brier scores. 32 The Brier score is a statistically proper measure of overall performance that captures both discrimination and calibration. This study incorrectly claimed that RUS improved probability estimates in the minority class compared to using uncorrected data, based on lower observed values of the minority-class Brier score. This, however, does not mean that the probability estimates are well calibrated, but simply means that the probabilities in the minority class are closer to 1.
This is consistent with our findings: probability estimates under RUS are indeed miscalibrated toward too extreme values.
Another study did indicate that undersampling distorts probability estimates and increases the variance of the prediction model (which relates to the higher tendency of overfitting due to artificially reducing sample size). 33 However, that study focused on classification accuracy, concluding that the effect of undersampling on accuracy depends on many factors, such that it is difficult to know when it will lead to better accuracy. In contrast, our study suggests that, at least for logistic regression models, RUS (or ROS or SMOTE) is unlikely to lead to better discrimination or separability between the minority and majority classes.
It is well known that, when developing robust clinical prediction models, the sample size should be large enough to reduce overfitting. [13][14][15]17,34 Recent studies indicate that the most important factors to determine overall sample size are the event fraction, the number of considered parameters, and the expected performance of the model. From that perspective, undersampling is a very counterintuitive approach, because it deliberately decreases the sample size available for model training, artificially increasing the risk of overfitting. Our results are consistent with this expectation: RUS resulted in lower AUROC values on the test data.
Based on the results presented in this study, it is warranted to conduct follow-up studies that systematically study the impact of imbalance corrections on discrimination and calibration performance, in particular in the context of other algorithms. For instance, the calibration performance of increasingly popular approaches for prediction model development such as Random Forest, Support Vector Machines and Neural Networks remains to be investigated.
Also, other imbalance correction methods exist, such as weighting, cost-sensitive learning, or variants of RUS, ROS and SMOTE. 3,28,35 We anticipate that risk miscalibration will remain present regardless of type of model or imbalance correction technique, unless the models are recalibrated. However, class imbalance correction followed by recalibration is only worth the effort if imbalance correction leads to better discrimination of the resulting models.
In conclusion, our study shows that correcting class imbalance did not result in better prediction models based on standard or ridge logistic regression. The imbalance corrections resulted in inaccurate probability estimates without improving discrimination in terms of AUROC. We therefore warn researchers about the limitations of imbalance corrections when developing a prediction model.

Concordance (c) statistic
The c statistic estimates the probability that a model gives a higher prediction (e.g. estimated probability) for a random individual with the event than for a random individual without the event. For binary outcomes, this equals the area under the receiver operating characteristic curve (AUROC). The c statistic is 1 when all patients with an event have a higher risk estimate than all patients without an event, and is 0.5 when risk estimates cannot differentiate at all between patients with and without the event.

Classification accuracy
The accuracy is the proportion of patients that are classified correctly, i.e. the proportion of patients that are either true positives (TP) or true negatives (TN): (TP + TN) / N.

Sensitivity
The sensitivity is the proportion of patients with the event that are classified as high risk: TP / (TP + FN).

Specificity
The specificity is the proportion of patients without the event that are classified as low risk: TN / (TN + FP).

Calibration intercept
The calibration intercept quantifies whether risk estimates are on average too high (overestimation, calibration intercept < 0) or too low (underestimation, calibration intercept > 0). It is calculated as the intercept a of the following logistic regression analysis: logit(P(Y = 1)) = a + LP, where LP is the linear predictor (logit of the estimated risk from the model). The LP is added as an offset, meaning that its coefficient is fixed at 1.

Calibration slope
The calibration slope quantifies whether risk estimates are too extreme (too close to 0 or 1, calibration slope < 1) or too modest (too close to the event fraction, calibration slope > 1). It is calculated as the coefficient b of the following logistic regression analysis: logit(P(Y = 1)) = a + b * LP, where LP is the linear predictor as defined above.

Net Benefit
Assuming that we use a model to identify high risk patients, for whom a given clinical intervention (treatment) is warranted, Net Benefit quantifies the utility of the model to make such treatment decisions. It exploits the link between the risk threshold and misclassification costs. Using a risk threshold t = 0.1 means that we accept to treat at most 10 patients per true positive. In other words, we tolerate 9 false positives (unnecessary treatments) per true positive (necessary treatment). This implies that the benefit of 1 true positive is 9 times higher than the harm of a false positive. So as long as there are fewer than 9 false positives per true positive, the benefits outweigh the harms. Net Benefit is therefore calculated as (TP - w * FP) / N, with w equal to odds(t) = t / (1 - t). Net Benefit is conditional on the adopted misclassification costs, and can therefore be calculated for several potential risk thresholds. A plot of Net Benefit over a range of thresholds is a decision curve. Net Benefit can also be calculated for two default strategies: treating everyone (all) or treating no one (none). Whatever the misclassification costs, treating no one has a Net Benefit of 0 by definition. Treating everyone has a positive Net Benefit when misclassification costs clearly favor true positives (t is low). If, for a given t, the Net Benefit of a model is not higher than the Net Benefit of the two default strategies, the model has no clinical utility for the misclassification costs associated with t.
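The Net Benefit computation can be sketched in a few lines (Python; the counts below are hypothetical, chosen only to illustrate the comparison against the treat-everyone strategy):

```python
def net_benefit(tp, fp, n, t):
    """Net Benefit at risk threshold t: true positives minus false positives
    weighted by the odds of the threshold, per patient."""
    w = t / (1 - t)  # odds(t): harm of a false positive relative to a true positive
    return (tp - w * fp) / n

# 1000 patients, 100 of them with the event; at t = 0.1 a hypothetical model
# classifies 80 events and 300 non-events as high risk.
print(net_benefit(tp=80, fp=300, n=1000, t=0.1))   # model
print(net_benefit(tp=100, fp=900, n=1000, t=0.1))  # treat-everyone strategy
```

Here the treat-everyone strategy has a Net Benefit of 0 (the threshold equals the prevalence), so the model is clinically useful at t = 0.1 only if its Net Benefit is positive.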

Class imbalance methods
When using RUS, the size of the majority class (i.e. the group of individuals with observed events or non-events, whichever is larger) is reduced by discarding a random set of cases until the majority class has the same size as the minority class. When using ROS, the size of the minority class is increased by resampling cases from the minority class, with replacement, until the minority class has the same size as the majority class. This results in an artificially balanced dataset containing duplicate cases for the minority class.
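RUS and ROS can be sketched in a few lines (Python; generic illustration on made-up cases, not the implementation used in the study):

```python
import random

random.seed(7)

def undersample(majority, minority):
    """RUS: randomly discard majority cases until both classes have equal size."""
    return random.sample(majority, len(minority)) + minority

def oversample(majority, minority):
    """ROS: resample minority cases with replacement until both classes have equal size."""
    extra = random.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

# 80 non-events vs 20 events (hypothetical cases, here just labeled tuples).
majority = [("non-event", i) for i in range(80)]
minority = [("event", i) for i in range(20)]
print(len(undersample(majority, minority)))  # 40  (20 + 20)
print(len(oversample(majority, minority)))   # 160 (80 + 80)
```

Note that RUS shrinks the training set (here from 100 to 40 cases), which is the mechanism behind its increased overfitting risk discussed elsewhere in the paper.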
SMOTE is a form of oversampling that creates new, synthetic cases. Contrary to ROS, where the minority cases are simply duplicated, resulting in a data set with identical cases, SMOTE produces synthetic data points that are interpolations of the original minority class cases. The procedure is as follows: for every minority class case, the k nearest minority class neighbors in the predictor space are determined, based on the Euclidean distance. Then, the differences between the feature vector of the minority case and those of its k nearest neighbors are taken. These differences are then multiplied by a random number between 0 and 1 and added to the feature vector of the minority case. By creating synthetic data in this manner, there is more variation in the minority cases and hence, the models trained on this data set may be less prone to overfitting than when trained on ROS data. We used k = 5 when implementing SMOTE. Figure S1 illustrates these methods using two predictors.
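The SMOTE steps above can be sketched as follows (Python; a minimal stdlib-only illustration with k nearest neighbors by Euclidean distance, not the library implementation used in the study):

```python
import math, random

random.seed(0)

def smote(minority, n_new, k=5):
    """Generate n_new synthetic minority cases by interpolating between a
    sampled minority case and one of its k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        x = random.choice(minority)
        # k nearest minority neighbors of x (excluding x itself), by Euclidean distance
        neighbors = sorted(
            (m for m in minority if m is not x),
            key=lambda m: math.dist(x, m),
        )[:k]
        nb = random.choice(neighbors)
        gap = random.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

# 20 hypothetical minority cases with two predictors.
minority = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(20)]
new_cases = smote(minority, n_new=30)
print(len(new_cases))  # 30 synthetic cases
```

Because each synthetic case lies on a segment between two existing minority cases, all new points stay within the region spanned by the original minority class.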

Note: Logistic regression models
Let Y denote the outcome, taking the value 1 for events and 0 for non-events, and let X1, ..., Xp denote the predictors. The probability of an event is modeled as P(Y = 1 | X) = 1 / (1 + exp(-(b0 + b1*X1 + ... + bp*Xp))), with the intercept b0 and coefficients b1, ..., bp estimated by maximizing the log-likelihood. We will refer to this model as standard logistic regression (SLR).
Alternatively, when using a ridge (L2) penalty, a penalized version of the log-likelihood is maximized in order to shrink the coefficients towards zero: the log-likelihood minus lambda times the sum of the squared coefficients (the intercept is not penalized). We refer to this model simply as Ridge. Here, lambda is a hyperparameter that controls the amount of penalization. We tuned lambda using 10-fold cross-validation on the deviance from a grid of 251 possible values between 0 (no penalization) and 64 (very strong penalization). The non-null values in this grid were equidistant on the logarithmic scale.

Coefficient estimation
The intercepts and coefficients that result in the desired true AUROC and event fraction were estimated by numerical optimization using the optim() function from the stats R package. The method used for optimization was the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. The function to be minimized was the sum of the squared difference between the observed and desired AUROC and the squared difference between the observed and desired prevalence. As this method can lead to slightly varying results, the minimization procedure was run 20 times, after which the median coefficient values were chosen. This process was repeated for 20 generated data sets of size N = 10^5.

Error handling
Errors in the generation of the development datasets were closely monitored; a table summarizing the error occurrence per simulation cell is included in the article. Data separation in the development datasets was assumed when the apparent AUROC in the development data set was equal to 1, based on the maximum likelihood logistic regression model. Because, in practice, clinical prediction modelers should not develop prediction models on separated data, separated data sets were removed from the analysis. Very few cases of data separation, or none at all, were expected in the generated development data sets, given that the true AUROC was set to be approximately 0.75 and the minimum sample size was 2,500. However, data separation was more likely to occur when random undersampling was used.
If a development data set contained cases of only one class, this data set was excluded from the analysis; the simulation results are based on complete case analysis. Development datasets with fewer than 8 events or non-events can cause severe problems in estimating the tuning parameter lambda in ridge logistic regression using 10-fold cross-validation. In such cases, leave-one-out cross-validation was used to estimate lambda. When there were fewer than 6 minority-class events in the development data set, the SMOTE algorithm fails with the default setting because it searches for the k = 5 nearest neighbors. In such cases, k was set to the number of minority-class events minus 1. In the most extreme scenario with event fraction 0.01 and sample size 2,500, the probability of generating a data set with < 8 events was only 0.00002 (1 in 50,000).

Figure S1. Visualization of imbalance correction methods.
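The quoted probability of generating a data set with fewer than 8 events can be checked directly from the binomial distribution (Python sketch):

```python
from math import comb

def prob_fewer_events(n, p, threshold):
    """P(number of events < threshold) under a Binomial(n, p) distribution."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(threshold))

# Event fraction 0.01 with sample size 2,500 (expected 25 events).
print(prob_fewer_events(2500, 0.01, 8))  # ~0.00002, i.e. about 1 in 50,000
```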