Deep learning for acute rib fracture detection in CT data: a systematic review and meta-analysis

Abstract
Objectives: To review studies on deep learning (DL) models for classification, detection, and segmentation of rib fractures in CT data, to determine their risk of bias (ROB), and to analyse the performance of acute rib fracture detection models.
Methods: Research articles written in English were retrieved from PubMed, Embase, and Web of Science in April 2023. A study was only included if a DL model was used to classify, detect, or segment rib fractures, and only if the model was trained with CT data from humans. For the ROB assessment, the Quality Assessment of Diagnostic Accuracy Studies tool was used. The performance of acute rib fracture detection models was meta-analysed with forest plots.
Results: A total of 27 studies were selected. About 75% of the studies have ROB due to not reporting the patient selection criteria, including control patients, or using CT scans with 5-mm slice thickness. The sensitivity, precision, and F1-score of the subgroup of low ROB studies were 89.60% (95% CI, 86.31%-92.90%), 84.89% (95% CI, 81.59%-88.18%), and 86.66% (95% CI, 84.62%-88.71%), respectively. The ROB subgroup differences test for the F1-score led to a p-value below 0.1.
Conclusion: ROB in studies mostly stems from inappropriate patient and data selection. The studies with low ROB have a better F1-score in acute rib fracture detection using DL models.
Advances in knowledge: This systematic review will be a reference to the taxonomy of the current status of rib fracture detection with DL models, and upcoming studies will benefit from our data extraction, our ROB assessment, and our meta-analysis.


Introduction
Rib fractures are the most common injury in blunt chest trauma patients.1 Although a chest radiograph may suffice to diagnose displaced fractures, a multidetector CT (MDCT, or CT for simplicity) scan is recommended to ensure a more sensitive report. However, due to the complexity of CT scans, between 25% and 35% of non-displaced rib fractures are missed in diagnoses.2
Deep learning (DL) models, and other artificial intelligence models, can increase the diagnostic accuracy, reduce interreader variability, and shorten reading time.3,4 DL models consist of artificial neural networks with multiple layers to capture different levels of abstraction from the data. A particular configuration of neural networks is a convolutional neural network, which is powerful and efficient in computer vision thanks to the use of small kernels of parameters to capture local features.5
DL models have been used for the analysis of medical imaging in three main applications: classification (eg, benign vs malignant lesion), detection (eg, lesion localization), and segmentation (eg, organ contouring).6 In particular, DL models have been used in many studies for the detection of orthopaedic fractures, including rib fractures, with an accuracy close to that of experienced radiologists.7 Similarly, DL models have been applied to the analysis of postmortem CT (PMCT) scans to perform tasks such as automatic segmentation of organs, identification of mass disaster victims, or sex and age estimation in the investigation of unknown remains.8
Despite their promising results, DL models still face several challenges due to their strong dependence on data quantity and quality, not to mention their low interpretability.9,10
The objective of this systematic review is to study rib fracture classification, detection, and segmentation in CT data with DL models, both in clinical and in postmortem (PM) cases. In addition, the risk of bias (ROB) and the concerns about applicability (CAA) of the selected studies are assessed, and the impact of ROB on acute rib fracture detection performance is analysed. The review will be useful as a reference for radiologists and future research projects.

Methods
This systematic review was registered in the PROSPERO international prospective register of systematic reviews, and it followed the guidelines proposed by the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) 2020 statement.11
All steps of the methods were conducted by the first author, who has one year of research experience in artificial intelligence for medical imaging. For the ROB and CAA assessment, articles were further reviewed to balance the answers to each signalling question.

Literature search
Articles were retrieved between the 3rd and 5th of April 2023 from PubMed, Embase, and Web of Science using the query of keywords ((deep learning) OR (convolutional network)) AND (rib fracture).
Only the studies meeting all of the following criteria were included in the systematic review: (1) written in English, (2) published as a journal article or as a conference paper, (3) used a DL model, (4) the DL model was used to classify, detect, or segment rib fractures, (5) the DL model was trained on data coming from CT scans, and (6) the CT scans were taken from humans.
Studies using PMCT scans to train their models were also considered. Moreover, studies with patients with healing and old rib fractures were also included in the systematic review.

Data extraction
Study characteristics and model performance metrics were extracted without using automation tools. To complete unreported data, corresponding authors were contacted twice within three weeks. All data were collected in Excel spreadsheets.
For classification and detection models, the extracted performance metrics were the sensitivity (or recall), the precision (or positive predictive value), and the F1-score, all at lesion level (except for two studies on rib fracture classification, which only reported performance at scan level). For segmentation models, the extracted performance metrics were the Dice score and the intersection over union. See the Supplementary Material for the formulae of these performance metrics.
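For readers without access to the Supplementary Material, the following is a minimal sketch of these metrics using their standard definitions. The function names and the example counts are illustrative assumptions, not values taken from any included study.

```python
def detection_metrics(tp, fp, fn):
    """Lesion-level sensitivity, precision and F1-score from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = sensitivity + precision
    # F1 is the harmonic mean of sensitivity and precision.
    f1 = 2 * sensitivity * precision / denom if denom else 0.0
    return sensitivity, precision, f1


def overlap_metrics(pred, truth):
    """Dice score and intersection over union (IoU) of two voxel sets.

    pred, truth: iterables of voxel coordinates labelled as fracture.
    """
    pred, truth = set(pred), set(truth)
    inter = len(pred & truth)
    union = len(pred | truth)
    total = len(pred) + len(truth)
    # Two empty masks are treated here as a perfect match (a convention).
    dice = 2 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


# Hypothetical example: 80 detected fractures, 10 false alarms, 20 missed.
sens, prec, f1 = detection_metrics(tp=80, fp=10, fn=20)
print(f"sensitivity={sens:.3f} precision={prec:.3f} F1={f1:.3f}")

# Hypothetical example: two small 2D masks sharing two of their three voxels.
dice, iou = overlap_metrics({(0, 0), (0, 1), (1, 1)}, {(0, 1), (1, 1), (2, 1)})
print(f"Dice={dice:.3f} IoU={iou:.3f}")
```

Note that the Dice score and the IoU are monotonically related (Dice = 2 IoU / (1 + IoU)), so they rank segmentations identically.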
In studies comparing the performance of different models on the same testing dataset, only the model with best performance was considered.

Risk of bias assessment
The Quality Assessment of Diagnostic Accuracy Studies 12 tool was used to assess the ROB and the CAA of the included studies. The suggested list of signalling questions was modified to adapt it to the objectives of this systematic review. See the Supplementary Material for the full list of signalling questions used in each domain.
For each study, each signalling question was answered and classified into one of the following four categories, ordered by level of risk: low, no information, some concerns, and high. Then, each domain was also classified, assigning the category with the maximum level of risk of those obtained in the signalling questions in that domain. Finally, each study was given an overall assessment following the same procedure.
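The aggregation rule described above can be sketched as follows. The domain names are those of the assessment, but the answers of the example study are hypothetical and the function names are our own.

```python
# Risk categories in increasing order of risk, as listed in the text.
RISK_ORDER = ["low", "no information", "some concerns", "high"]


def worst(levels):
    """Return the category with the maximum level of risk."""
    return max(levels, key=RISK_ORDER.index)


def assess_study(answers):
    """Aggregate signalling-question judgements into domain and overall levels.

    answers: dict mapping each domain to the list of judgements given to
    its signalling questions.
    """
    domains = {domain: worst(levels) for domain, levels in answers.items()}
    overall = worst(domains.values())
    return domains, overall


# Hypothetical study: selection criteria not reported, everything else low.
example = {
    "patient selection": ["no information", "low"],
    "index test": ["low", "low"],
    "reference standard": ["low"],
    "flow and timing": ["low", "low"],
}
domains, overall = assess_study(example)
print(domains)
print("overall:", overall)
```

With this rule, a single unfavourable signalling question determines the judgement of its domain and of the whole study.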
The results of these assessments were presented in the form of traffic light plots (in the Supplementary Material) and summary plots, all generated with the robvis tool.13

Meta-analysis
For rib fracture detection studies, the sensitivity, the precision, and the F1-score of their models were analysed with forest plots. The purpose of these forest plots was to visualize and compare the results of the selected models and to compute a global performance of rib fracture detection with DL models. However, we remind the reader that each study used a different dataset and a different DL model to achieve the results. Therefore, we could not draw strong conclusions concerning which of these algorithms was the best suited for this task.
Only studies reporting performance metrics of acute rib fracture detection were included in the meta-analysis. That is, studies that trained the model with acute, healing, and old rib fracture annotations and only reported a global performance were excluded. In addition, only the studies providing the 95% CI of the sensitivity and the precision were selected. If the F1-score was not reported with 95% CI, it was simulated with the Monte Carlo method.
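The text does not detail the Monte Carlo procedure, so the following is only one plausible sketch. It assumes that the sensitivity and the precision are independent and normally distributed, with standard deviations derived from the width of their 95% CIs. The function name and the input values are hypothetical.

```python
import random


def simulate_f1_ci(sens, sens_ci, prec, prec_ci, n_draws=20000, seed=0):
    """Approximate the F1-score and its 95% CI by Monte Carlo sampling.

    sens, prec: point estimates in percent.
    sens_ci, prec_ci: (lower, upper) bounds of the 95% CIs in percent.
    Assumption: independent normal distributions for both metrics.
    """
    rng = random.Random(seed)
    z = 1.96  # normal quantile for a two-sided 95% interval
    sens_sd = (sens_ci[1] - sens_ci[0]) / (2 * z)
    prec_sd = (prec_ci[1] - prec_ci[0]) / (2 * z)
    draws = []
    for _ in range(n_draws):
        # Clip the samples to the valid range of a percentage.
        s = min(max(rng.gauss(sens, sens_sd), 1e-9), 100.0)
        p = min(max(rng.gauss(prec, prec_sd), 1e-9), 100.0)
        draws.append(2 * s * p / (s + p))
    draws.sort()
    lower = draws[int(0.025 * n_draws)]
    upper = draws[int(0.975 * n_draws) - 1]
    point = 2 * sens * prec / (sens + prec)
    return point, lower, upper


# Hypothetical study reporting only sensitivity and precision with 95% CIs.
f1, lo, hi = simulate_f1_ci(85.0, (80.0, 90.0), 80.0, (75.0, 85.0))
print(f"F1 = {f1:.2f}% (95% CI, {lo:.2f}%-{hi:.2f}%)")
```

The independence assumption is a simplification: sensitivity and precision share the same true positives, so a simulation from raw counts, when available, would be preferable.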
Heterogeneity was calculated with the I² statistic and complemented with the results obtained for the variance τ² of the random effects (RE) model, which was chosen over the fixed-effects model because the studies could not be considered to be coming from the same population.14
The meta-analysis was repeated with two subgroups of acute rib fracture detection studies: those judged as having a low ROB and those judged as having a higher level of ROB, namely no information, some concerns, or high. The p-value used for significance in the subgroup differences test was 0.1.15
All plots and statistical analyses were performed with the R (version 4.2.3) package metafor 16 (version 4.0.0). The data and scripts to generate the forest plots are available at https://github.com/manellopez13/dl4rf_meta_analysis.
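To make these quantities concrete, the following Python sketch pools estimates with an RE model and tests for subgroup differences. It is not the analysis script of this review, which was written in R with metafor. As simplifying assumptions, it uses the closed-form DerSimonian-Laird estimator instead of metafor's default REML estimator, the Higgins formula for I², and a normal approximation for the CIs. All input values are hypothetical.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval


def se_from_ci(lower, upper):
    """Standard error recovered from a 95% CI (normal approximation)."""
    return (upper - lower) / (2 * Z95)


def random_effects(estimates, ses):
    """Pool study estimates with a DerSimonian-Laird random-effects model."""
    k = len(estimates)
    w = [1 / se ** 2 for se in ses]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    # Random-effects weights add the between-study variance tau2.
    w_re = [1 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * y for wi, y in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return {"estimate": pooled, "se": se, "tau2": tau2, "i2": i2,
            "ci": (pooled - Z95 * se, pooled + Z95 * se)}


def subgroup_difference_p(a, b):
    """P-value of the between-subgroup test for two pooled subgroups.

    With two subgroups, Q_between follows a chi-square distribution with
    1 degree of freedom, which is equivalent to a two-sided z-test.
    """
    z = (a["estimate"] - b["estimate"]) / math.sqrt(a["se"] ** 2 + b["se"] ** 2)
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical sensitivities (%) with 95% CIs for two subgroups of studies.
low_rob = [(90.0, 87.0, 93.0), (88.0, 84.0, 92.0), (92.0, 89.0, 95.0)]
higher_rob = [(70.0, 64.0, 76.0), (85.0, 80.0, 90.0), (91.0, 88.0, 94.0)]

pooled = []
for name, group in (("low ROB", low_rob), ("ROB", higher_rob)):
    result = random_effects([y for y, _, _ in group],
                            [se_from_ci(lo, hi) for _, lo, hi in group])
    pooled.append(result)
    print(f"{name}: {result['estimate']:.2f}% "
          f"(I2 = {result['i2']:.1f}%, tau2 = {result['tau2']:.2f})")

p_value = subgroup_difference_p(*pooled)
print(f"subgroup differences p = {p_value:.3f}, significant at 0.1: {p_value < 0.1}")
```

Because the estimator and the CI construction differ from those of metafor, this sketch should not be expected to reproduce the values reported in this review.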

Results

Literature search
Using the search strategy stated in the methods, a total of 132 records were retrieved from the databases. From these, 68 records were not considered because they were duplicates, and 10 were excluded because they were either not written in English, not journal articles or conference papers, or not available online. Afterwards, the reports of the remaining records were assessed, which led to the following 27 exclusions: 8 studies did not use DL, 10 records of studies did not have rib fracture classification, detection, or segmentation as their objective, and 9 studies used chest radiographs instead of CT data to train their models.
The total number of studies included in the systematic review was n = 27. The PRISMA flow diagram of Figure 1 shows the study exclusion process.
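As a quick consistency check of the counts reported above, the arithmetic of the exclusion process can be written out as follows. The variable names are our own.

```python
# Counts reported in the literature search results.
records_retrieved = 132
duplicates = 68
excluded_at_screening = 10  # language, publication type, or availability

reports_assessed = records_retrieved - duplicates - excluded_at_screening

# Exclusions after assessing the reports of the remaining records.
report_exclusions = {
    "did not use DL": 8,
    "other objective": 10,
    "chest radiographs instead of CT": 9,
}

included = reports_assessed - sum(report_exclusions.values())
print(f"reports assessed: {reports_assessed}, studies included: {included}")
```

Under these counts, 54 reports were assessed, and the 27 exclusions leave the 27 included studies.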

Data extraction
The included studies were published between 2020 and 2023. The models of three studies only classified rib fractures, and the model of one study only performed rib fracture segmentation. The remaining 23 studies trained models that detected rib fractures. Most of the models were trained with medium to high-resolution CT scans, with slice thickness ranging from 1 to 5 mm. The ratio of female patients ranged from 30% to 40% in most of the study designs, and the average age of the patients was in the range of 50-60 years. The rest of the study characteristics are gathered in Table 1. For more information on the selection criteria of the patients, refer to the Supplementary Material.
All studies used clinical CT scans except one that used PMCT scans.23 This study applied specific selection criteria, such as excluding cases of bodies in an advanced state of decomposition or cases of bodies with severe trauma.
Tables 2-4 contain the average performance metrics obtained by the models in terms of acute rib fracture classification, detection, and segmentation, respectively. Some studies did not report the performance of the model on acute rib fractures alone, but the global performance of the model on acute, healing, and old rib fractures. These cases, which are highlighted in the tables, were excluded from the meta-analysis of acute rib fracture detection.

Risk of bias assessment
Figures 2 and 3 present the summary plots of the ROB and the CAA assessments, respectively.
In the ROB assessment, the patient selection domain was the most affected. About a quarter of the studies did not report the inclusion and exclusion criteria used to select patients. Some studies on rib fracture detection introduced ROB in their reports of precision and F1-score by including control patients in the testing dataset. Additionally, a quarter of the studies collected the CT scans with slice thickness at 5 mm.
For the domain of the index test, there was low ROB in most of the studies, but three quarters of the studies had high CAA, as their models were not publicly available, neither commercially nor as open-source tools.
Concerning the reference standard domain, there was low ROB in the majority of studies, with the exception of one study that used annotations that were not 100% sensitive, and one study in which labels of lesions were removed if they were not annotated by all experts. A model trained with these data may learn to ignore lesions, which would increase the number of FN.
Finally, for the flow and timing domain, no ROB was detected. The traffic light plots showing the detailed results of the ROB and the CAA assessments can be found in the Supplementary Material.

Meta-analysis
Only seven studies were selected for the meta-analysis. As some studies applied their models to more than one testing dataset, the sensitivity meta-analysis consists of 15 points, and the precision and the F1-score meta-analyses consist of 14 points each.
Figures 4 and 5 show the forest plots of the sensitivity and the precision, respectively. At first glance, one can see that, while models from the low ROB subgroup had both high sensitivity and high precision, some studies with ROB had either sensitivity or precision clearly lower than the total RE model average. Indeed, S10 had good precision but poor sensitivity, and S14 had a notable sensitivity but an improvable precision. This trade-off is no longer observed in Figure 6, where the forest plot of the F1-score separates studies S10 and S14 from the rest. However, two models from the ROB group, S24-1 and S24-3, presented high sensitivity, precision, and F1-score.
The subgroup analysis led to an averaged sensitivity of 89.60% (95% CI, 86.31%-92.90%) for studies with low ROB and 84.00% (95% CI, 71.37%-96.63%) for studies with ROB. The I² statistic was higher than 95% in both subgroups, indicating considerable heterogeneity in both cases. However, the low ROB subgroup was less heterogeneous than the ROB subgroup, as the variance τ² of the low ROB subgroup was much lower than that of the ROB studies. The subgroup differences test resulted in a p-value of 0.23, higher than the significance threshold of 0.1, meaning that there was no evidence that ROB had an impact on rib fracture detection sensitivity.
Similar results were obtained for the precision analysis, where the subgroup of low ROB studies had a precision of 84.89% (95% CI, 81.59%-88.18%), while for the ROB subgroup it was 80.26% (95% CI, 68.21%-92.32%). Again, the low ROB subgroup was less heterogeneous than the ROB subgroup, with I² and τ² lower in the low ROB subgroup. The subgroup differences test had a p-value of 0.35, above the significance threshold of 0.1. Therefore, no evidence was found that ROB had an impact on rib fracture detection precision. Finally, the F1-score analysis led to the least heterogeneous results, with estimates of 86.66% (95% CI, 84.62%-88.71%) and 81.14% (95% CI, 72.25%-90.03%) for the low ROB and the ROB subgroups, respectively. In this case, the subgroup differences test yielded a p-value lower than 0.1, pointing to the conclusion that ROB has an impact on the F1-score of acute rib fracture detection.

Discussion and recommendations
Although the treatment of rib fractures is mostly conservative, these lesions are an indicator of associated injuries in more than 90% of patients, and in around 10% of the cases the associated injuries are fatal.44 The age of the patient and the number of rib fractures increase the morbidity and mortality of the injuries,45,46 but single rib fractures may also lead to adverse outcomes in 20% of the cases.47 By reducing the diagnosis time and achieving a higher sensitivity than radiologists, DL models for rib fracture detection can improve healthcare.
The selected studies in this systematic review are heterogeneous, with different data and models. The inclusion criteria for patients in each study are also varied. Thus, it is difficult to make any recommendation among the tools presented, as each has its own advantages in a specific application. For instance, while most of the studies focus on acute rib fracture detection, some models can distinguish among acute, healing, and old rib fractures.18,29,32,36,38,40-42 In other studies, the models can also classify acute rib fractures into displaced, non-displaced, and buckle (or incomplete) rib fractures.17,27,29,30,32,34,36,38,42 One study analysed the performance of the rib fracture detection model depending on the number of rib fractures in the CT scan.27

Table 3. Average model performance on acute rib fracture detection. Columns: Study, Dim., Model architecture, Pretrained, CT scans, ST (mm), N annot., Sensitivity (%), Precision (%), F1-score (%).

As a proof of quality of the DL tool, many studies have compared the rib fracture detection performance of the model against that of experienced radiologists. With the assistance of a DL model, the sensitivity of radiologists (60%-80%) can increase by up to 20 percentage points while maintaining a similar level of precision (70%-90%) and considerably reducing reading time.17,18,21,24-27,29,32,35-38,40-42
The most common DL model architectures, used by around 10 studies each, are the U-Net 48 and the Faster R-CNN.49 A good example of how rib fracture detection can be resolved via various paths is the use of the U-Net: although this model is designed to perform object segmentation, its results can be postprocessed to output bounding boxes around the predicted object localizations.
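The postprocessing of a segmentation output into bounding boxes can be sketched as follows. This is a simplified 2D illustration in pure Python, assuming a binary mask and 4-connectivity. The included studies may have used other postprocessing steps, and in practice this is typically done in 3D with an image processing library. The function name and the example mask are hypothetical.

```python
from collections import deque


def mask_to_boxes(mask):
    """Bounding boxes of the connected components of a 2D binary mask.

    mask: list of rows containing 0 (background) or 1 (predicted fracture).
    Returns one (row_min, col_min, row_max, col_max) tuple per 4-connected
    component, in the order the components are met in a row-wise scan.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over the component starting at (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


# Hypothetical segmentation output with two separate predicted lesions.
predicted_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
print(mask_to_boxes(predicted_mask))  # [(0, 1, 1, 2), (2, 3, 3, 4)]
```

Each resulting box can then be compared against the annotated lesions to count TP, FP, and FN at lesion level.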
There is also considerable heterogeneity concerning the choice of input image dimensions.While the majority of studies decided to extract 3D patches from CT scans to train their models, a number of studies applied 2D models to each individual axial slice.Other researchers extended 2D models to aggregate the results of groups of adjacent slices, which we denote as 2.5D models.
No significant improvement has been found in the performance of a particular choice of architecture and input image dimensions over another. Additionally, we have not observed any significant difference between the performance of models pre-trained on natural image datasets, such as ImageNet 50 and COCO,51 and the performance of the rest of the models.
From the results of the ROB assessment, we recommend that future researchers in this topic make sure to report the patient selection criteria in detail. We believe that the appropriate cohort for a rib fracture detection model is blunt chest trauma patients, that is, patients who are suspected of having rib fractures. In such a cohort, there is no need for a control group of healthy patients (which can lead to a higher number of FP and to an underestimation of precision and F1-score). If patients with healing and old rib fractures are included, such lesions should be annotated accordingly, and the performance of the model should be split into each type of fracture. In addition, we advise not using CT data with 5 mm slice thickness for the training of the models, as such images might blur and hide rib fracture features due to longitudinal partial volume effects.52,53 With the CAA assessment, we remind readers that the developed DL models should be shared as open-source projects, so the results can be reproduced on different datasets.
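The effect of a control group on precision and F1-score mentioned above can be illustrated with a small numerical example. All counts are hypothetical and were chosen only to show the direction of the effect.

```python
def lesion_level_metrics(tp, fp, fn):
    """Sensitivity, precision and F1-score from lesion-level counts."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return sensitivity, precision, f1


# Hypothetical testing dataset of blunt chest trauma patients only.
trauma_only = lesion_level_metrics(tp=90, fp=10, fn=10)

# Same dataset plus control scans without fractures: controls cannot add
# TP or FN, but every detection raised on them is an additional FP.
with_controls = lesion_level_metrics(tp=90, fp=10 + 15, fn=10)

for name, (sens, prec, f1) in (("trauma only", trauma_only),
                               ("with controls", with_controls)):
    print(f"{name}: sensitivity={sens:.3f} precision={prec:.3f} F1={f1:.3f}")
```

In this example the sensitivity is unchanged, while the precision and the F1-score drop once the control scans are added to the testing dataset.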
The focus of our meta-analysis is acute rib fracture detection, which is the main goal of a rib fracture detection DL model in the emergency department.However, from our point of view, such a model should also have the capacity to distinguish acute from healing and old rib fractures.Otherwise, the model can produce a higher number of FP on patients who had rib fractures previously.Similarly, if the model is trained with CT scans presenting acute, healing, and old rib fractures but only the acute rib fractures are labelled, the model is prone to produce more FN.
The main limitation of our meta-analysis is the reduced number of selected studies, which is a consequence of the fact that the majority of the studies in this systematic review did not report the 95% CI of their results.In particular, the subgroup of studies with ROB had only four points in each forest plot, and in the F1-score forest plot one of the points had to be simulated.In our opinion, performance metrics should be written with their corresponding standard deviations or their 95% CIs.
Finally, it is of particular interest for our team to highlight the opportunities of transfer learning between clinical and PM cases. An advantage of PMCT scans is the absence of imaging artefacts due to breathing and motion of the body. However, PM cases with a high radiological alteration index (RA index) 54 should be excluded from training datasets. This is because a body with a high RA index presents signs of decomposition, and its CT scan shows air bubbles in many organs and cavities, including the bone marrow. By removing such cases, a properly annotated PMCT dataset can be used to train a DL model for rib fracture detection in a clinical context, and vice versa. We note that if the goal is to train a DL model for rib fracture detection in patients with suspected blunt chest trauma, such a PMCT dataset should only contain PM cases with rib fractures from blunt trauma.

Conclusion
With this systematic review, we have studied DL models for rib fracture classification, detection, and segmentation in CT scans. We have found that many studies do not properly report patient inclusion criteria, and only a few models are available commercially or as open-source tools. Moreover, our meta-analysis indicates that low ROB studies achieve a better F1-score in acute rib fracture detection with DL models, whereas no evidence of a difference was found for sensitivity or precision.

Figure 1. PRISMA flow diagram of included articles. DL: deep learning.

Figure 2. Summary plot of the risk of bias of the studies.

Figure 3. Summary plot of the concerns about applicability of the studies.

Figure 4. Forest plot of acute rib fracture detection sensitivity. Abbreviations: ROB = risk of bias, RE = random effects.

Table 1. Characteristics of the included studies.
Abbreviations: FR = female ratio, C = classification, D = detection, S = segmentation, DA = data availability, CA = code availability, Req = available upon request, Com = commercially available. a From testing dataset. b Currently not available. c Median, not average.

Table 2. Average model performance on acute rib fracture classification.
Abbreviations: Dim. = dimensions of the model, N annot. = number of annotations of acute rib fractures. a Scan-level performance. b Includes old rib fractures.

Table 3.
Abbreviations: Dim. = dimensions of the model, N annot. = number of annotations of acute rib fractures. a Postprocessing. b Includes healing and old rib fractures. c Monte Carlo simulated.

Table 4. Average model performance on acute rib fracture segmentation.
Abbreviations: Dim. = dimensions of the model, N annot. = number of annotations of acute rib fractures, IOU = intersection over union. a Includes old rib fractures.