Commercially available artificial intelligence tools for fracture detection: the evidence

Abstract Missed fractures are a costly healthcare issue: they not only negatively impact patients' lives, leading to potential long-term disability and time off work, but are also responsible for high medicolegal disbursements that could otherwise be used to improve other healthcare services. Overlooked fractures in children are particularly concerning, as opportunities for safeguarding may be missed. Assistance from artificial intelligence (AI) in interpreting medical images may offer a possible solution for improving patient care, and several commercial AI tools are now available for radiology workflow implementation. However, information regarding their development, evidence for performance and validation, and the intended target population is not always clear, yet is vital when evaluating a potential AI solution for implementation. In this article, we review the range of available products utilizing AI for fracture detection (in both adults and children) and summarize the evidence, or lack thereof, behind their performance. This will allow others to make better informed decisions when deciding which product to procure for their specific clinical requirements.


Introduction
Missed fractures impact both patients and healthcare providers. Between 2015 and 2018, the total cost of missed fracture claims in the NHS was over £1.1 million, with an average cost per claim of approximately £14 000.1 Although the number of claims was relatively small (n = 78) compared to over 1 million fracture attendances per year across the United Kingdom, they are seen as an avoidable cost, money which could be better spent improving other NHS services. Furthermore, missed fractures in young children pose an additional challenge, as these may reflect failed opportunities for safeguarding and referral to social services.
One means to reduce such errors may be to incorporate assistance from artificial intelligence (AI). Several systematic reviews and meta-analyses have investigated the use of AI for the detection of fractures. In adults, high sensitivity rates of 92% for plain radiographs and 89% for computerised tomography (CT) scans have been reported, with specificities of 91% for plain radiographs and 92% for CT scans,2-4 and accuracy rates of between 89% and 98% have been reported in children.5 Despite such promising results, over half of these studies had a high risk of bias,2 and the tools were only used in a research setting, were not ready for clinical deployment, and were not externally validated.6 This is particularly concerning given that 24% of AI solutions in one study yielded a substantial decline in performance when evaluated on external data (compared to their internal data set), and the majority (81% of algorithms) reported a negative impact.7 Nonetheless, several AI vendors have now developed fracture detection models for routine practice (Figures 1-3), ready for commercial integration into radiological workflows. Although this brings cutting-edge technology a step closer to direct patient benefit, it is vital that such tools are independently evaluated prior to adoption. Of concern, van Leeuwen et al8 found that in a review of 100 Conformité Européenne (CE) marked AI tools for different radiology use cases, only 36% had any associated peer-reviewed verification of performance, of which fewer than half were independent of the vendor (ie, without flagrant conflict of interest). Without independent evidence, it can be very challenging to know how well an AI product might perform in a given clinical setting, and whether it is "worth" purchasing.9
In this market research review, we assess the range of available commercial products utilizing AI for fracture detection (in both adults and children) and summarize the evidence behind the performance of these tools. This will allow others to make better informed decisions when deciding which AI product, if any, to procure for their specific clinical requirements.

Methods
A search of commercially available AI solutions for fracture detection using medical images was performed by the lead and senior author, using the methods detailed below to ensure as comprehensive a search as possible.
1) A search of both CE8,10 certified and Food and Drug Administration (FDA) approved sites11,12 was filtered for medical products that utilized machine learning (ML) algorithms, and then further refined to those that included the term "fractures" among the diseases targeted by the product.
2) AI exhibitors and sponsors at several large annual radiology conferences in 2022 relating to general, paediatric, and musculoskeletal imaging (ie, Radiological Society of North America-RSNA, European Congress of Radiology-ECR, European Society of Paediatric Radiology-ESPR, European Society for Skeletal Radiology-ESSR) were reviewed to determine whether they provided solutions for fracture detection.
3) Peer-reviewed publications within the PubMed, Scopus, and EMBASE databases published between January 1, 2012 and December 31, 2022 were searched using a Boolean strategy to find articles with keywords relating to "machine learning," "artificial intelligence," "imaging," and "fracture." The results were then manually reviewed for specific mention and evaluation of commercial AI solutions.
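The Boolean strategy in step 3 can be illustrated with a short sketch. The exact search string used in this review is not reproduced in the text, so the grouping of terms below is an assumption for illustration only, showing how keyword synonyms are typically ORed within a theme and the themes ANDed together.

```python
# Hypothetical reconstruction of a Boolean database query combining the
# keyword themes described ("machine learning," "artificial intelligence,"
# "imaging," "fracture"); the authors' actual string is not published here,
# so this grouping is an assumption.
keyword_groups = [
    ['"machine learning"', '"artificial intelligence"'],  # synonyms, ORed
    ['"imaging"'],
    ['"fracture"'],
]

# Each group is ORed internally; the groups are then ANDed together.
query = " AND ".join("(" + " OR ".join(group) + ")" for group in keyword_groups)
print(query)
# → ("machine learning" OR "artificial intelligence") AND ("imaging") AND ("fracture")
```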
There were no restrictions placed on the type of imaging modality, body parts targeted, type of AI/ML methodology, or the intended population for the AI tool. Products and applications returned through these search strategies were deemed eligible if they supported healthcare professionals in image diagnosis, triage, or classification/detection with respect to fractures. We excluded products that were advertised via software marketplace redistributions, and those that were categorized under medical image management and processing systems. Only software that was currently commercially available was included in the main analysis. Products which were still in development or at prototype stage were included separately in case of future relevance to readers. Applications that had been withdrawn from the market or were no longer available were not included.
For each product, information was gathered from numerous sources. Company websites, FDA/CE certification documents, user manuals, and scientific articles were first collated to collect data about the developer, technical specifications, specific functionalities relating to clinical application, and any evidence that exists to support the product's performance. Levels of evidence for each AI solution were further classified into 6 levels according to an adapted hierarchical model of efficacy by Fryback and Thornbury,13 as used in a prior article evaluating evidence for commercial AI products (Table 1).8 To ensure accuracy, timeliness, and comprehensiveness of online information, all relevant AI vendors were contacted directly and supplied with a survey of questions (see Supplementary Material) to complete. A timeframe of 2 weeks was provided for return of the survey, with the option of follow-up emails and online meetings to better discuss our survey queries, if preferred. We also contacted the MHRA directly with the vendor names and AI solutions to confirm whether United Kingdom Conformity Assessed (UKCA) certification had been awarded for the products in question.
Results

The majority of the commercial AI products (14/21) were intended for fracture detection on plain radiographs (Figures 1-3), with the remainder (7/21) related to CT evaluation. Only 3 products specified they were intended for use in both adults and children (all relating to radiographic interpretation), with the remainder intended solely for adult use. All products were intended to aid human interpretation or triage, not for autonomous usage (at this stage of their development or regulation).

Evidence levels
Predominantly, the AI products reviewed had evidence for their performance provided by the vendor for conformity certification, with 7 having independent, peer-reviewed publications available (total publications = 18), the greatest number originating from Gleamer (n = 8). The majority of the evidence levels for AI product performance were at Level 3 (ie, change in diagnosis with and without AI assistance), with some products (eg, BoneView Gleamer,36,38,39 Annalise Enterprise CXR v1.2 Annalise.AI,46 Qure.AI qXR48) potentially demonstrating evidence at Level 4 (ie, demonstrating improvement in time for diagnosis, which could arguably lead to swifter treatment or follow-up for the patient38). There was no evidence available to demonstrate a benefit in actual patient outcome (eg, reduced time for recovery, reduction in repeated hospital visits, etc.), nor any publications on health economic cost savings. Only one external validation article specifically mentioned changes to patient management (for Annalise Enterprise CXR v1.2 Annalise.AI).46 A summary of the available evidence associated with each product is reviewed in Table 4.
The largest independently published study to demonstrate improvement in human diagnostic accuracy with AI for fracture detection included 480 radiographs (60 radiographs across 8 body parts; 50% abnormal) and 24 readers (comprising radiologists, emergency physicians, orthopaedic surgeons, and other healthcare professionals).38 There was an overall improvement in sensitivity of 10.4% across all specialists with AI assistance, and reading time was shortened by 6.3 s per examination.
Only one of the AI products (HealthVCF, Nanox.AI) was reviewed by the National Institute for Health and Care Excellence (NICE) in a Medtech innovation briefing document,34 based on a published conference abstract57 and one peer-reviewed article,58 regarding the use of AI for the assessment of vertebral compression fractures on CT imaging. Whilst the NICE experts accepted that there would be clear patient benefit from the detection of vertebral compression fractures, and that the evidence was promising, it was nonetheless limited; the only published article was funded by the company.
Externally conducted studies that verify the performance of AI algorithms based on CT input are severely lacking.
Only 3 such studies were identified, of which 2 are for the same product (Aidoc C-Spine [CSF]51,52). The study that included the largest number of cases involved 1904 CT scans, and the performance of the AI algorithm was assessed against interpretation of the scans by a single attending neuroradiologist. The AI and the neuroradiologist agreed in 91.5% of all cases. The AI correctly identified 67 of 122 fracture cases (54.9%) and returned 106 false positives. The sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of the AI algorithm were 54.9% (95% CI, 45.7-63.9), 94.1% (95% CI, 92.9-95.1), 38.7% (95% CI, 33.1-44.7), and 96.8% (95% CI, 96.2-97.4), respectively. The researchers also analysed the misdiagnosed fractures, finding that chronic fractures were overrepresented, suggesting the AI algorithm is not well adapted to handle this presentation.52

Evidence for usage in children

All AI products were intended for use in adults, with independent peer-reviewed evidence for the accuracy of the product in children (and younger adults) available for 2 vendors (AZMed and Gleamer). In one study evaluating the performance of the Rayvolve product (AZMed),44 a retrospective review of 2634 radiographs across 2549 children (<18 years of age) from a single French centre was performed. This demonstrated an overall sensitivity of 95.7%, specificity of 91.2%, and accuracy of 92.6% for the presence/absence of a fracture (regardless of number and whether the fracture was correctly localized or not). There was some reduction in the accuracy of the product for children aged <4 years and those in a cast. While sensitivity for fracture detection was similar in patients with a cast compared to without (95.3% and 93.9%, respectively; a 1.4% difference), the difference in specificity was significant at 30.0% and 89.5%, respectively (a 59.5% difference). The accuracy of fracture detection also decreased to 83.0% in patients with casts, compared to 90.7% in patients without (a difference of 7.7%). Results from the 0-4 years and 5-18 years age subgroups also differed, with sensitivities of 90.5% (0-4 years) and 95.4% (5-18 years) (a difference of 4.9%). Specificity did not demonstrate a significant difference, at 88.9% (0-4 years) and 88.8% (5-18 years); however, accuracy decreased slightly to 89.3% (0-4 years), compared to 90.7% (5-18 years) (a difference of 1.4%).
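The headline diagnostic metrics quoted in the CT study above can be recovered directly from the reported counts (1904 scans, 67 of 122 fractures detected, 106 false positives). As a minimal sketch of that arithmetic, the confusion-matrix cells and the four metrics are:

```python
# Recomputing the CT study's reported metrics from its raw counts.
# The true-negative count is not stated explicitly and is derived here
# by subtraction, which is an assumption about how cases were counted.
tp = 67                       # fractures the AI correctly flagged
fn = 122 - 67                 # fractures the AI missed (55)
fp = 106                      # AI-flagged cases without a fracture
tn = 1904 - 122 - fp          # remaining scans (1676)

sensitivity = tp / (tp + fn)  # 67/122  ≈ 54.9%
specificity = tn / (tn + fp)  # 1676/1782 ≈ 94.1%
ppv = tp / (tp + fp)          # 67/173  ≈ 38.7%
npv = tn / (tn + fn)          # 1676/1731 ≈ 96.8%
```

All four values match the percentages reported in the study, which suggests the derived true-negative count is consistent with the published figures.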
Two publications evaluated the use of the BoneView product (Gleamer) in children and young adults,40,43 using the same data set of 300 musculoskeletal radiographs (half with fractures, in patients aged 2-21 years) across 5 body parts, acquired from a United States-based data provider. In the first study,43 an external validation of the AI product alone demonstrated a sensitivity of 91.3%, specificity of 90%, and patient-wise AUROC of 0.93. Avulsion fractures were noted to be challenging for the AI tool to detect (per-fracture sensitivity of 72.7%). In the second publication,40 differences in radiologist performance before and after AI assistance were evaluated across 8 radiologists (5 radiologists in training and 3 qualified paediatric radiologists). Across all 8 radiologists, the mean sensitivity was 73.3% without AI and increased by almost 10% (P < .001) to 82.8% with AI. There was a statistically significant improvement in sensitivity for radiologists in training (by 10.3% [P < .001]) compared to specialist paediatric radiologists (8.2% [P = .08]), demonstrating greater benefit for less experienced radiologists.

Conformity certification
For medical devices to be commercialized in different countries, different types and levels of conformity certification are mandatory. These do not necessarily guarantee the safety or efficacy of a product, but rather indicate that it has been assessed and found to meet a certain minimum requirement. The certifications for the products in this review are listed in Table 3.
Table 1. Hierarchical model of efficacy to assess the contribution of AI software to the diagnostic imaging process, reproduced from van Leeuwen et al,8 originally adapted from Fryback and Thornbury (1991).13

In Europe and Northern Ireland, "CE certification" is required; however, recent changes to this regulation were introduced for medical devices. Prior to May 26, 2021, medical devices were CE certified under the "Medical Devices Directive" (MDD); since then, new standards known as the "Medical Devices Regulation" (MDR) have come into force. The MDR introduces more stringent requirements for clinical evidence, safety, post-market surveillance, and the responsibilities of Notified Bodies (ie, the organizations designated to assess conformity with the regulations).59 CE certified medical devices under MDD will therefore be required to re-certify under MDR before December 31, 2028 (for medium and lower risk medical devices)60; it is therefore important to know what CE certification a product holds prior to purchase. In this review, 7/16 products have the more recent MDR CE certification, 6/16 have MDD CE certification, and 3 do not have current CE certification.
In Great Britain (ie, England, Wales, and Scotland), the UKCA (UK Conformity Assessed) marking is a new regulatory marking that applies to medical devices following the end of the Brexit transition period (December 31, 2021),61,62 although devices with CE marking will continue to be recognized until June 30, 2023, after which UKCA marking will be required for product use. At present, there is no central list of UKCA marked products; however, it is possible to contact the relevant regulatory authority for further information (eg, the MHRA for UKCA). Although we contacted both bodies, we did not receive a timely response regarding which AI-assisted fracture detection tools had this marking.
In the United States, FDA approval is required for medical devices and generally follows what is known as the "premarket notification (510(k)) process" for low to moderate risk devices (ie, Class 1 or 2 devices, which apply to the devices covered in this review). This pathway allows a vendor to demonstrate that their device is "substantially equivalent" to a legally marketed device (known as a "predicate device") already on the market, and the vendor must include enough information (eg, intended use, comparison to predicate device, safety data) to prove that it can be marketed without requiring the more extensive "premarket approval" (PMA) work-up.63 In this review, 11/16 products reported FDA certification.

Discussion
Our market review highlighted a range of commercially available AI products for fracture detection across a variety of body parts and imaging modalities, with most intended for radiographic assessment in an adult population. Relatively few products have published independent peer-reviewed evidence for their efficacy and diagnostic accuracy in children, although where tested, AI performance was found to be reduced for younger children.
For adults, there was a larger body of peer-reviewed evidence across different body parts, with more studies evaluating radiologists' imaging interpretation with and without AI and changes in speed of reporting for some products. It is hard to identify the best-performing product purely through sensitivity and specificity due to the varying levels and quality of evidence available; however, it is evident that Gleamer's BoneView is the most extensively externally validated product, whilst also reporting impressive sensitivities and specificities. Studies conducted on this product have also demonstrated reduced reading times for radiographs. This highlights possible future benefit for patients, especially if it leads to faster referral for specialist care and treatment, although evidence demonstrating downstream improvement in patient outcomes and cost savings for a hospital department is yet to be produced (ie, Level 5 and 6 evidence). It is important, however, to understand the type of conformity certification an AI product has prior to purchase, and we have tried to be as comprehensive yet concise as possible in our review of the market status. There have been some notable changes in the CE certification regulations and also in those governing sale in the UK market. Many products which have previously been awarded CE certification under MDD will require re-certification under MDR soon, and those wishing to use a product in the United Kingdom will need to check that their vendor has received, or will receive, UKCA certification in the near future.
Through conducting this investigation into commercially available tools that leverage AI to perform detection and diagnosis of fractures, we have been able to derive some key conclusions about this market.
First is the divide in modality between plain radiographs and CT scans. The latter constitute a significantly smaller portion of the products available for fracture detection, and all those found in this analysis target only sections of the axial skeleton. All CT-based products also specifically target one type of fracture, relating to either the spine or the ribs, whereas products based on radiographs often include many different body parts and pathologies. Whilst the results we provide in this review are overall summary diagnostic accuracy rates, readers should review the listed references and FDA documentation, where available, if they wish to obtain more detailed accuracy rates for specific fracture locations and types.
Second, the market for children is still significantly behind that for adults in terms of the range of available products, with 4 of the 16 products (25%) stating their applicability to children. There are understandably significant technical challenges in adapting AI solutions to be effective in paediatrics due to the variability in bone structure, predominantly between the ages of 0-16 years.5 Development in this more specialized field is also slowed and restricted by legal issues relating to the collection of data for training the algorithms, and by more complex and difficult procedures for obtaining certification or approval from the respective legal bodies.64

Third, independent external validation is also significantly lacking for many products, with many having only vendor-conducted validation for the purposes of achieving FDA approval or CE certification. Future work in this domain that independently verifies the performance of specific commercially available products would provide a much clearer basis on which to evaluate which product is best for a clinical/health institution. Of all the evidence discovered throughout this review, the highest level was Level 4 (according to the levels of evidence suggested by Fryback and Thornbury13). This means that study of the deeper impact of these tools is severely lacking, given that evidence Levels 5 and 6 assess the effect on patient outcomes through changes to quality of life and societal impact based on economic analysis. As interest in this field continues to grow, such assessments will be fundamental in determining the greater value these products are able to provide.
We acknowledge that our review has limitations due to the ever-increasing number of commercial AI products coming to market and newer versions of existing tools being developed. It is likely that by the time of publication we may not have included some very recent tools, recent conformity accreditation, or evidence to support usage in particular situations, which were unavailable at the time of our search. We did contact as many AI companies as possible, including those that advertised only prototype versions of their software, to ensure we captured emerging products as well as those already established. We also offered the AI companies an opportunity to let us know of any updates in development. Some AI companies did not engage or respond to our request for information within the timeframe provided; the MHRA likewise did not respond with details on products with UKCA certification.

Conclusion
Overall, there is a scarcity of rigorous, independent evaluation of commercially available AI tools for fracture detection in adults and children, and some products will need to update their current conformity certification. The information in this article may help departmental and hospital leaders, as well as local AI champions, in understanding whether the tools available are worth further investigation for their specific institution at this stage in their development.
Abbreviations: MDD = Medical Devices Directive (pre May 26, 2021), MDR = Medical Devices Regulation (post May 26, 2021). Deepnoid's Deep: Spine-CF-01 only has approval from the Korean Ministry of Food and Drug Safety (No. 19-550).
a Previous version of the software from the company.

Figure 1 .
Figure 1. This image demonstrates the results produced when using an artificial intelligence (AI) fracture detection tool by Gleamer, called BoneView. With this product, a summary image is sent to PACS, depicted in (A), which shows the number and type of pathologies detected on the radiograph. (B) A second image is also sent to PACS with bounding boxes and their associated labels (FRACT = fracture, DIS = dislocation) displayed as an "overlay" across the original radiographic image in question. In this example, the AI has flagged a fracture of the distal radius, with a scapholunate dislocation, in a child. Image provided by Daniel Jones, Gleamer.

Figure 2 .
Figure 2. This image demonstrates the results produced when using an artificial intelligence (AI) fracture detection tool by AZMed, called Rayvolve. In this example, an oblique left wrist view (A) and a DP wrist view (C) have been submitted for AI interpretation. The AI has flagged a fracture of the scaphoid and ulnar styloid (B, D) in a child by displaying bounding boxes as an "overlay" across the respective radiographic images. These are also sent to PACS for radiology reporter and clinician review. Image provided by Liza Alem, AZMed.

Figure 3 .
Figure 3. This image demonstrates an example of the results produced when using an artificial intelligence (AI) fracture detection tool by Milvue, called Smarturgences. In this example, a frog-leg view of the pelvis in a child has been submitted for AI interpretation (A). The AI has correctly identified a fracture of the left anterior inferior iliac spine and placed a bounding box around the abnormality, as well as stating the pathology below the image (B).

Table 2 .
Use cases and the intended population for commercial AI tools for fracture detection on medical imaging. Reproduced under the Creative Commons Attribution 4.0 International Licence.14

Table 3 .
Licencing and user details of commercial AI tools for fracture detection on medical imaging.

Table 4 .
Evidence for AI performance, based on FDA/CE conformity documentation or vendor endorsed studies.

Table columns: data set; summary of evidence; level of evidence; reference.
Preferential evidence is provided where MRMC (multireader, multicase) studies were conducted to evaluate clinical improvement (rather than standalone bench testing results). A hyphen denotes either unknown/not stated or not applicable information. Although the product from Annalise.AI does have FDA approval, this is only for the detection of pneumothoraces rather than fractures; therefore, the FDA clearance evidence is not included below. Similarly, Qure.AI has FDA approval for their qXR product for "breathing tube placement" analysis only; therefore, the FDA clearance evidence is not included below.
a Previous version of the software from the company.

Table 5 .
Evidence for AI performance, based on independent external peer reviewed publications.

a Previous version of the software from the company.