Comparison of early warning scores for sepsis early identification and prediction in the general ward setting

Abstract
The objective of this study was to directly compare the ability of commonly used early warning scores (EWS) for early identification and prediction of sepsis in the general ward setting. For general ward patients at a large, academic medical center between early-2012 and mid-2018, common EWS and patient acuity scoring systems were calculated from electronic health records (EHR) data for patients who both met and did not meet Sepsis-3 criteria. For identification of sepsis at index time, National Early Warning Score 2 (NEWS 2) had the highest performance (area under the receiver operating characteristic curve: 0.803 [95% confidence interval [CI]: 0.795–0.811], area under the precision-recall curve: 0.130 [95% CI: 0.121–0.140]) followed by NEWS, Modified Early Warning Score, and quick Sequential Organ Failure Assessment (qSOFA). Using validated thresholds, NEWS 2 also had the highest recall (0.758 [95% CI: 0.736–0.778]) but qSOFA had the highest specificity (0.950 [95% CI: 0.948–0.952]), positive predictive value (0.184 [95% CI: 0.169–0.198]), and F1 score (0.236 [95% CI: 0.220–0.253]). While NEWS 2 outperformed all other compared EWS and patient acuity scores, due to the low prevalence of sepsis, all scoring systems were prone to false positives (low positive predictive value without drastic sacrifices in sensitivity), thus leaving room for more computationally advanced approaches.


BACKGROUND AND SIGNIFICANCE
Sepsis is the dysregulated host response to infection that can lead to life-threatening organ failure. 1 It is a deadly disease process that contributes to nearly 50% of all inpatient deaths and is the most expensive inpatient condition paid for by the US healthcare system, totaling $24 billion on an annual basis. 2,3 Early recognition and effective antimicrobial therapy are the cornerstones of sepsis management, but timely detection remains a clinical challenge. 4,5 Several approaches to early sepsis identification have been linked to key physiologic derangements commonly seen during disease progression. The previously used Systemic Inflammatory Response Syndrome (SIRS) criteria, which graded the host's response to an inflammatory insult, were easy to use at the bedside, but nearly half of all inpatients met these criteria during their hospitalization. 6 As a result, the SIRS criteria have been criticized for being overly sensitive, which greatly limited their utility as a sepsis surveillance tool. 6 The most recent sepsis consensus statement introduced the quick Sequential Organ Failure Assessment (qSOFA) as a mortality risk stratification tool, but qSOFA was not validated as a sepsis surveillance tool. 1,7 One emerging approach to sepsis screening is to implement early warning scores (EWS), such as the Modified Early Warning Score (MEWS), the National Early Warning Score (NEWS), or its successor, the NEWS 2. 5 These scores grade the severity of physiologic derangement and provide a well-validated means of assessing risk for all-cause clinical deterioration. Other patient acuity scoring systems, also based on physiological measurements, such as the Acute Physiology and Chronic Health Evaluation (APACHE II), have been used longitudinally for risk stratification. 8 Although many hospital systems are starting to deploy these EWS to aid in sepsis screening on the general ward, they have not been validated or directly compared for this purpose and their performances remain unknown. 5,9-11 The objective of this study was to evaluate and compare the performance of commonly used EWS on sepsis surveillance for patients admitted to the general ward.

MATERIALS AND METHODS
Study design, data sources, and population
All patients ≥18 years of age admitted to Washington University in St. Louis/Barnes-Jewish Hospital between January 1, 2012 and June 1, 2018 were eligible for inclusion. Patients were excluded if they were discharged <12 h after sepsis onset, total length of stay was <48 h, surgery was performed in the preceding 72 h, <1 set of vital signs was recorded in the 24 h preceding index time, or <1 set of common laboratory results (creatinine and white blood cell count) was recorded in the 24 h preceding index time. Patients were excluded if sepsis was present on admission or if the admission service was hospice, psychiatry, or obstetrics and gynecology, due to the highly variable rates of physiologic data collection. Patients were also excluded if they had no encounter billing code, vital sign, laboratory, service, room, or medication data to indicate a complete hospitalization. To ensure temporal similitude between cohorts, patient encounters <12 h or >14 days in duration were excluded. Electronic health record (EHR) data were extracted from the Research Data Core at Washington University in St. Louis School of Medicine. This project was approved with a waiver of informed consent by the Washington University in St. Louis Institutional Review Board (IRB#201804121).

Sepsis criteria
Sepsis was defined according to the Sepsis-3 consensus statement as suspicion of infection (SOI; culture collection followed by antibiotics within 72 h or antibiotics followed by culture procurement within 24 h, Supplementary Appendix I) accompanied by a qSOFA score ≥2. 12 Only the first sepsis event for each patient was evaluated. Time of onset was set as the time of SOI.
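The two time windows in the SOI definition can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's implementation (which is described in Supplementary Appendix I): the function name `suspicion_of_infection_time` is hypothetical, and taking the earlier event of a qualifying culture/antibiotic pair as the SOI time is an assumption.

```python
from datetime import datetime, timedelta

def suspicion_of_infection_time(culture_times, antibiotic_times):
    """Return the earliest SOI time, or None if no pair qualifies.

    Culture first:     antibiotics within 72 h after the culture.
    Antibiotics first: culture within 24 h after the antibiotics.
    ASSUMPTION: SOI time is the earlier event of the qualifying pair.
    """
    candidates = []
    for c in culture_times:
        for a in antibiotic_times:
            if c <= a <= c + timedelta(hours=72):
                candidates.append(c)  # culture came first
            elif a < c <= a + timedelta(hours=24):
                candidates.append(a)  # antibiotics came first
    return min(candidates) if candidates else None
```

For example, a culture drawn at 08:00 followed by antibiotics 71 h later qualifies, whereas antibiotics followed by a culture 25 h later does not.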

Index time for the nonsepsis cohort
Unlike the sepsis cohort, where a specific event (sepsis onset) can be used as the index event, there is no such event for nonsepsis patients. To minimize bias introduced by differences in time-to-index time, nonsepsis patients were subsampled at a ratio of 30:1 and assigned an index time such that the resultant histograms of time-to-index time (3-h bins) were equivalent (Supplementary Appendix II).
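One way to produce equivalent time-to-index histograms is sketched below. The actual procedure is given in Supplementary Appendix II and may differ; the function `assign_matched_index_times`, the eligibility rule (a nonsepsis encounter must be long enough to contain the bin), and the uniform draw within each 3-h bin are all assumptions made for illustration.

```python
import numpy as np

def assign_matched_index_times(sepsis_tti_h, nonsepsis_los_h,
                               ratio=30, bin_h=3, seed=0):
    """Assign index times (hours from admission) to nonsepsis encounters.

    sepsis_tti_h:    time-to-index (h) of each sepsis encounter
    nonsepsis_los_h: length of stay (h) of each nonsepsis encounter
    Returns {nonsepsis position: assigned index time in hours}.
    """
    rng = np.random.default_rng(seed)
    sepsis_bins = (np.asarray(sepsis_tti_h, dtype=float) // bin_h).astype(int)
    los = np.asarray(nonsepsis_los_h, dtype=float)
    available = np.ones(len(los), dtype=bool)
    assigned = {}
    # Fill the latest bins first, since fewer encounters are long enough for them.
    for b in np.sort(sepsis_bins)[::-1]:
        lo, hi = b * bin_h, (b + 1) * bin_h
        eligible = np.flatnonzero(available & (los >= hi))
        if len(eligible) == 0:
            continue  # no unused nonsepsis encounter can host this bin
        chosen = rng.choice(eligible, size=min(ratio, len(eligible)),
                            replace=False)
        for i in chosen:
            assigned[int(i)] = float(rng.uniform(lo, hi))
        available[chosen] = False  # each encounter is sampled at most once
    return assigned
```

Each sepsis encounter contributes up to `ratio` nonsepsis encounters to its own bin, so the two histograms have the same shape whenever enough eligible controls exist.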

Early warning scores
The SIRS, MEWS, NEWS, NEWS 2, qSOFA, Sequential Organ Failure Assessment (SOFA), and Acute Physiology and Chronic Health Evaluation (APACHE II) scores were calculated every hour from 12 h prior to index time to 12 h after index time. 7,13-16 Scores were calculated in two ways: using the most abnormal physiological measurement (the one contributing the most points to the scoring system) and using the most recent measurement in the 24 h preceding the time of calculation. If no values were present in the lookback period, missing values were assumed normal. Additional details on EWS calculations can be found in Supplementary Appendix III. Sensitivity analysis was performed using a lookback period of 12 h. Further, EWS were compared at index time on their ability to discriminate between sepsis and nonsepsis patients, using thresholds defined in previous validation studies. 1,7,11,14,17 Lastly, EWS were evaluated on their capability for early identification of secondary outcomes: in-hospital mortality within 48 h of index time and the composite outcome of in-hospital mortality or intensive care unit (ICU) transfer within 48 h of index time.
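The lookback logic can be illustrated with qSOFA, the simplest of the scores (one point each for respiratory rate ≥22/min, systolic blood pressure ≤100 mmHg, and Glasgow Coma Scale <15). The sketch below is not the study code (see Supplementary Appendix III for that); the observation format, the variable names, and the inclusive window boundaries are assumptions.

```python
from datetime import datetime, timedelta

# qSOFA component rules, 1 point each.
QSOFA_RULES = {
    "resp_rate": lambda v: v >= 22,  # breaths/min
    "sbp": lambda v: v <= 100,       # systolic blood pressure, mmHg
    "gcs": lambda v: v < 15,         # Glasgow Coma Scale
}

def qsofa_at(t, observations, lookback_h=24, mode="most_abnormal"):
    """Score qSOFA at time t from (timestamp, variable, value) tuples."""
    start = t - timedelta(hours=lookback_h)
    total = 0
    for var, is_abnormal in QSOFA_RULES.items():
        window = sorted((ts, val) for ts, name, val in observations
                        if name == var and start <= ts <= t)
        if not window:
            continue  # missing values are assumed normal (0 points)
        if mode == "most_abnormal":
            # the measurement contributing the most points in the window
            total += max(int(is_abnormal(val)) for _, val in window)
        else:  # "most_recent"
            total += int(is_abnormal(window[-1][1]))
    return total
```

With a respiratory rate of 26 recorded 10 h ago and 18 recorded 1 h ago, the most-abnormal mode still awards the respiratory point while the most-recent mode does not, which is the distinction the two scoring variants capture.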

Statistical analysis
Patient characteristics and outcomes were compared between the sepsis and nonsepsis cohorts using the two-sided Mann-Whitney U test or χ² test for numeric and categorical variables, respectively, where P < .01 was considered significant. Performance metrics such as the area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) were reported as the median and 95% confidence interval determined through a 1000-sample bootstrap.
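A bootstrap estimate of this kind can be sketched as follows. The text does not state how the interval was derived from the bootstrap distribution, so the percentile method used here is an assumption, as are the function names; the pairwise AUROC computation is chosen for clarity and would be memory-hungry on a cohort of this size.

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC as the Mann-Whitney probability P(score_pos > score_neg),
    counting ties as 0.5."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def bootstrap_auroc(y_true, scores, n_boot=1000, seed=0):
    """Median and percentile 95% CI of AUROC over n_boot resamples."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if y_true.all() or not y_true.any():
        raise ValueError("both classes must be present")
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample with replacement
        if y_true[idx].all() or not y_true[idx].any():
            continue  # AUROC is undefined without both classes
        stats.append(auroc(y_true[idx], scores[idx]))
    lo, med, hi = np.percentile(stats, [2.5, 50, 97.5])
    return med, (lo, hi)
```

The same resampling loop applies to AUPRC, recall, specificity, or any other metric by swapping the statistic computed on each resample.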

EWS performance
For the discrimination of sepsis versus nonsepsis, performance of NEWS was nearly identical to that of NEWS 2, both of which were superior to all other EWS (Figure 1). As expected, performance for all EWS declined the further ahead of index time the score was calculated, and continued to improve after index time. Using the most abnormal value in the lookback period was significantly better than using the most recent value for all EWS. There was minimal difference in performance when using an alternate lookback period of 12 h (

DISCUSSION
In this large retrospective analysis of EWS performance on sepsis discrimination in the general ward setting, patients who met Sepsis-3 criteria were older and had more medical comorbidities compared to other patients in the general ward. This sepsis cohort also had higher acuity, a longer length of stay, and higher rates of in-hospital mortality (Table 1).
Among the compared EWS and patient acuity scoring systems, NEWS 2 had the highest discriminatory capability throughout the assessed time points, including at the time of onset (Figure 1, Table 2). NEWS performed nearly identically to NEWS 2, which was followed by MEWS, qSOFA, and SIRS. Six hours prior to index time, a time when clinical action could change patient outcomes, NEWS 2 performance was 0.74 compared to 0.80 at onset. Due to the low prevalence of sepsis (3.3%), the AUPRC was <0.15 for all EWS at all time points preceding index time. This is reflected in the low positive predictive value (PPV) across all EWS, which represents a propensity for high rates of false positives (Table 2). While it is possible to improve PPV by changing the threshold, this comes at the expense of reduced sensitivity (Supplementary Table S2).
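The dependence of PPV on prevalence follows directly from Bayes' rule: PPV = (sensitivity × prevalence) / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)). The short sketch below makes the relationship concrete. The sensitivity and specificity passed in the usage note are hypothetical values chosen for illustration, not results of this study; only the 3.3% prevalence comes from the text.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from test characteristics and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def f1(precision, recall):
    """Harmonic mean of precision (PPV) and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical score with sensitivity 0.75 and specificity 0.80:
low_prev = ppv(0.75, 0.80, 0.033)  # prevalence as reported in the text
balanced = ppv(0.75, 0.80, 0.5)    # same score in a balanced population
```

With these illustrative inputs, the same score yields a PPV of roughly 0.11 at 3.3% prevalence versus roughly 0.79 in a balanced population, so most alerts at ward-level prevalence would be false positives.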
The relatively poor performance of SOFA and APACHE II likely reflects the lower rate of vital sign and laboratory data collection available to patients on the hospital floor, as these tools were originally designed for the ICU setting and as patient acuity scores, not EWS. Such scores relying on infrequently measured variables (eg, arterial blood gases) appear to translate poorly to the general ward setting, as would be expected.
As seen in Figure 1, time-to-onset has a significant impact on the predictability of sepsis, and thus the performance of prediction tools. However, the method for identifying sepsis onset time is not defined in the Sepsis-3 criteria and is prone to disagreement, which can significantly alter the results. 7,12,18 Studies comparing EWS are heterogeneous in their experimental design, especially in identifying the time-at-risk interval from which measurements are gathered for the control population. Methods include the usage of random time intervals, full encounters, or the first 24 h of admission. 19-21 To calculate the discriminatory ability of EWS surrounding sepsis onset, it was necessary to assign an index time for controls, and to minimize bias introduced by the duration of hospitalization, sepsis and nonsepsis cohorts were matched on time-to-index time. As a result, however, the ratio of sepsis to nonsepsis patients may not reflect the full set of hospital stays, favoring a sicker nonsepsis cohort than would be obtained if controls were sampled randomly or taken whole.
While none of the compared EWS were used for the study population during the study period, a locally developed sepsis alert tool was used during the study period. 22 Thus, the compared EWS that share variables with the tool may be biased towards better performance.
Surprisingly, the update from NEWS to NEWS 2 had a nearly unnoticeable impact on the performance. Many of the changes described in the report, however, address concerns not directly relating to the score calculations, but to the usage of the score.
The limitations of this study are as follows: first, this is a single-center study at a large academic medical center and its patient population and culture-of-practice may preclude widespread generalization. Second, the retrospective nature of this study may yield EWS performance metrics different from those obtained from a prospective trial. Third, the choice of sepsis definition used may have resulted in biased performance metrics of EWS, especially for qSOFA, which is used in the Sepsis-3 consensus definition. Fourth, this study evaluates only sepsis that developed on the general ward within 14 days of hospitalization and does not include patients with surgery within 72 h. Further evaluation of EWS in these specific populations may provide additional insight into their utility as a sepsis surveillance tool. Fifth,