Can artificial intelligence-driven cephalometric analysis replace manual tracing? A systematic review and meta-analysis

Abstract Objectives This systematic review and meta-analysis aimed to investigate the accuracy and efficiency of artificial intelligence (AI)-driven automated landmark detection for cephalometric analysis on two-dimensional (2D) lateral cephalograms and three-dimensional (3D) cone-beam computed tomographic (CBCT) images. Search methods An electronic search was conducted in the following databases: PubMed, Web of Science, Embase, and grey literature, with the search timeline extending up to January 2024. Selection criteria Studies that employed AI for 2D or 3D cephalometric landmark detection were included. Data collection and analysis The selection of studies, data extraction, and quality assessment of the included studies were performed independently by two reviewers. The risk of bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies-2 tool. A meta-analysis was conducted to evaluate the accuracy of 2D landmark identification based on both mean radial error and standard error. Results Following the removal of duplicates, title and abstract screening, and full-text reading, 34 publications were selected. Amongst these, 27 studies evaluated the accuracy of AI-driven automated landmarking on 2D lateral cephalograms, while 7 studies involved 3D-CBCT images. A meta-analysis, based on the mean radial error of landmark placement on 2D images, revealed that the error was below the clinically acceptable threshold of 2 mm (1.39 mm; 95% confidence interval: 0.85–1.92 mm). For 3D images, a meta-analysis could not be conducted due to significant heterogeneity amongst the study designs. However, qualitative synthesis indicated that the mean error of landmark detection on 3D images ranged from 1.0 to 5.8 mm. Both automated 2D and 3D landmarking proved to be time-efficient, taking less than 1 min. Most studies exhibited a high risk of bias in data selection (n = 27) and reference standard (n = 29).
Conclusion The performance of AI-driven cephalometric landmark detection on both 2D cephalograms and 3D-CBCT images showed potential in terms of accuracy and time efficiency. However, the generalizability and robustness of these AI systems could benefit from further improvement. Registration PROSPERO: CRD42022328800.


Introduction
Cephalometric analysis provides important anatomical measurement data that are essential for orthodontic and craniomaxillofacial surgical workflows. It enables the morphometric quantification of craniofacial growth and the analysis of spatial relationships between hard and soft dentomaxillofacial structures for diagnostics, treatment planning, and outcome assessment [1,2]. A standard cephalometric analysis is performed on two-dimensional (2D) lateral cephalograms or three-dimensional (3D) cone-beam computed tomography (CBCT) images [3]. Both 2D and 3D cephalometric analyses require manual localization of anatomical landmarks, a time-consuming task that can take approximately 15 min per case for an orthodontist [4]. Furthermore, the accuracy of landmark identification is subject to variability depending on the observer's experience and image quality [5,6].
Recently, solutions driven by artificial intelligence (AI), specifically machine learning (ML) and deep learning (DL), have been increasingly used to enhance the reliability, consistency, and accuracy of landmark placement for 2D and 3D cephalometric analyses [7,8]. Machine learning, a subset of AI, creates algorithms that learn primarily from structured data, with decisions made based on intrinsic statistical patterns. Conversely, DL is a subset of ML built on convolutional neural networks (CNNs), multilayer learning algorithms that process data through neural networks and learn from data automatically, akin to the functioning of the human brain. In terms of performance, DL has demonstrated superiority over ML algorithms for various medical image analysis tasks. This is attributed to its capability to handle the high-dimensional data of radiographic images with multiple predictor variables, and its ability to automatically and adaptively learn hierarchical features such as corners, shapes, and edges [9,10].
As the identification of landmarks is one of the primary causes of error in cephalometric analysis owing to observer variability [6,11], it is important to consider whether AI-driven solutions could serve as an accurate and time-efficient alternative to their traditional manual counterparts [12]. Despite numerous studies on automated landmarking for both 2D and 3D cephalometric analyses, we believe a gap exists in the literature related to a comprehensive review of the accuracy of these AI-driven solutions. In this context, the accumulation of evidence could enhance our understanding of the accuracy of AI-driven solutions. Existing systematic reviews on this topic have either restricted their investigation to deep learning alone [8,13], or exclusively focused on 3D images [13].
In the field of orthodontics, 2D landmarking and cephalometric analysis are often favoured due to their capacity to yield substantial data, which aids in devising the most effective treatment strategies for a large portion of orthodontic patients. In these situations, 3D cephalometry derived from CBCT images is generally not advised, mainly because of the high radiation exposure risks [14,15]. On the other hand, 3D cephalometry has advantages in terms of precise anatomical recognition and intricate structural assessment. This is particularly useful when more comprehensive treatment planning is required, such as in the digital planning processes of orthognathic surgery and implantology. In these cases, traditional 2D landmarking may not provide adequate information [16]. Hence, both types of datasets are considered clinically significant, depending on the specific task [17]. Despite the significant differences in AI methodologies and algorithms applied for automated 2D and 3D landmarking, a comprehensive review encompassing both types of datasets can offer an integrated view of the discipline. This approach could highlight progress in both dimensions and identify areas necessitating additional research and development.
Therefore, the aim of this systematic review and meta-analysis was to report the accuracy and efficiency of AI-driven automated landmark detection on 2D lateral cephalograms and 3D-CBCT images.

Protocol and registration
The study protocol was registered under the number CRD42022328800 in the PROSPERO (Prospective Register of Systematic Reviews) database. The title and research question of the review were modified from their original version, as documented in PROSPERO (Supplementary File 1). However, the rest of the methodology remained unchanged. The systematic review and meta-analysis were conducted following the PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines [18].

Review question
The review question was formatted according to the PICO (Population, Intervention, Comparison, and Outcome) framework, as follows:
Population (P): 2D lateral cephalograms or 3D-CBCT images of human subjects.
Intervention (I): AI-based algorithms for automated cephalometric landmark identification.
Comparison (C): manual landmarking by experts (ground truth), where experts refer to experienced dentists, clinicians, or orthodontists with expertise in cephalometric landmarking.
Outcome (O): accuracy and time-efficiency of landmark identification.
Review question: Does AI-driven cephalometric analysis (I) on 2D cephalograms and 3D-CBCT images (P) offer improved accuracy and time-efficiency (O) compared to manual landmarking by an expert (C)?

Eligibility criteria
The review included all full-text diagnostic accuracy studies evaluating the performance of AI-driven algorithms for the automated detection of landmarks. The studies were selected based on the following inclusion criteria: (i) training and testing on 2D lateral cephalograms or 3D-CBCT images (with sufficient detail, e.g. dataset size, image modality, AI algorithm) for automated detection of relevant landmarks that are commonly applied for performing cephalometric analysis, such as nasion, orbitale, menton, pogonion, and subnasale; (ii) reporting of results as success detection rate (SDR) or mean radial error (MRE) in millimetres (mm) to determine clinical applicability; (iii) comparison of automated with manual landmarking as a clinical reference. No restrictions were applied regarding the year and language of the publication.
Case reports, review papers, book chapters, letters, conference papers, and commentaries were excluded from the review. Additionally, studies that solely included landmarks that do not contribute to standard cephalometric analysis, such as craniometric points (asterion, pterion, opisthion, etc.), were not considered for this review.
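The two outcome measures required for inclusion, SDR and MRE, are straightforward to compute from predicted and reference landmark positions. The following is a minimal Python sketch of both measures; the coordinates and helper names are hypothetical and not taken from any included study.

```python
import math

def mean_radial_error(predicted, ground_truth):
    """Mean radial (Euclidean) error in mm over all landmark pairs."""
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(errors) / len(errors)

def success_detection_rate(predicted, ground_truth, threshold_mm=2.0):
    """Percentage of landmarks whose radial error falls within the threshold."""
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    within = sum(1 for e in errors if e <= threshold_mm)
    return 100.0 * within / len(errors)

# Hypothetical landmark coordinates in mm (x, y)
pred = [(10.0, 20.0), (34.5, 41.0), (55.0, 63.0)]
truth = [(10.5, 20.0), (33.0, 40.0), (55.0, 66.5)]

mre = mean_radial_error(pred, truth)        # ≈ 1.93 mm
sdr = success_detection_rate(pred, truth)   # 2 of 3 within 2 mm ≈ 66.7%
```

An MRE closer to zero and an SDR closer to 100% within the 2 mm threshold both indicate higher landmarking accuracy, which is how the included studies reported their results.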

Information sources and search
An electronic search was performed in PubMed, Web of Science, and Embase up to January 2024. A two-pronged search strategy was applied, combining the technique of interest (AI, ML, DL) and the diagnostic target (landmark detection for cephalometric analysis). Each concept consisted of MeSH terms and keywords. The full search strategy is presented in Table 1.
A comprehensive grey literature search was executed using databases such as ProQuest, Google Scholar, OpenThesis, and OpenGrey to minimize the risk of selection bias. In addition, a thorough hand-search of references within original articles, reviews, and conference proceedings (collections of conference papers) was performed to identify any additional studies that were not retrieved from the chosen electronic databases. The articles identified were imported into Endnote X9 software (Thomson Reuters, Philadelphia, PA, USA) for the removal of duplicates and further selection.

Study selection and data extraction
Two reviewers (J.H., M.V.) independently screened the relevant articles based on their titles and abstracts, followed by full-text reading of the included studies against the eligibility criteria. Any disagreement was resolved through discussion. A third experienced reviewer (R.J.) was consulted if consensus could not be reached.
Data extracted from the selected articles included: title, author, year of publication, country of origin, aim of the study (algorithm's computational improvement or clinical validation), image type (2D lateral cephalograms or 3D-CBCT images), dataset source, total sample size, subsets (training, validation, test), characteristics of the applied AI-based algorithm, number of landmarks, and reported outcomes. The corresponding authors of the included studies were contacted for the provision of any further information or missing data.

Risk of bias assessment
The Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool was used to evaluate the risk of bias and applicability concerns. This tool was chosen due to its comprehensive coverage of aspects that need assessment in primary diagnostic accuracy studies, and its customizability, which allows for a more focused approach tailored to the specific review. It served two purposes: first, to assess the impact of potential bias sources on test accuracy estimates, and second, to evaluate the influence of hypothesized sources of clinical heterogeneity on these estimates [19].
The tool consists of a systematically developed checklist for determining the quality of diagnostic test accuracy (DTA) studies. The checklist is divided into four domains for evaluating the risk of bias: (i) data selection (consecutive or random inclusion, no case-control design, no inappropriate exclusions); (ii) index test, i.e. the test under evaluation (interpretation blinded to and independent of the reference standard); (iii) reference standard, i.e. how the ground truth was established (interpretation independent of and blinded to the index test, valid reference test); (iv) flow and timing (sufficient time between index test and reference standard, whether all data received the same reference standard, and whether all data were included in the analysis). The first three domains were also evaluated in relation to concerns about applicability (does each domain match the research question) [19]. The applicability concerns help to determine whether a study's findings can be applied to real-life clinical scenarios. If significant concerns arise in any of the domains, this could impact the overall applicability of the study's results to a broader patient population or clinical setting [20].
Two reviewers (J.H., M.V.) independently assessed the risk of bias using the QUADAS-2 checklist. Discrepancies were resolved through discussion. If consensus could not be reached, a third experienced reviewer (R.J.) was consulted.

Data analysis and synthesis
A meta-analysis was conducted using RStudio (version 2023.12.1, Posit Software, Boston, MA, USA) to evaluate the accuracy of 2D landmark identification based on MRE and standard error (SE), where an MRE value closer to zero corresponds to higher accuracy of automated landmark identification. When multiple test datasets were used in a study, they were assessed as separate groups to account for data variability. The summary measure was the MRE of the test datasets with its 95% confidence interval (CI). Heterogeneity was examined using the Q-value and the I² statistic. If the I² was less than 50%, indicating low heterogeneity, a fixed-effects model was employed; conversely, if the I² exceeded 50%, suggesting substantial heterogeneity, a random-effects model was utilized. The selected model was then used to generate the forest plot. The number of radiographs and cephalometric landmarks evaluated in each test dataset was considered when determining the weight of each study in the meta-analysis. A P-value of less than 0.05 was deemed statistically significant.
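The pooling procedure described above can be sketched as an inverse-variance fixed-effect model. The review's analysis was performed in RStudio; the following Python sketch only illustrates the mechanics of pooling MRE/SE pairs and deriving Cochran's Q and I², and the numeric estimates are hypothetical, not those of the included studies.

```python
import math

def fixed_effect_pool(estimates):
    """Inverse-variance fixed-effect pooling of (MRE, SE) pairs.

    Returns the pooled MRE, its 95% CI, Cochran's Q, and I² (%).
    """
    # Each estimate is weighted by the inverse of its variance (SE squared)
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * mre for (mre, _), w in zip(estimates, weights)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (mre - pooled) ** 2 for (mre, _), w in zip(estimates, weights))
    df = len(estimates) - 1
    # I²: share of total variability attributable to heterogeneity
    i2 = max(0.0, 100.0 * (q - df) / q) if q > 0 else 0.0
    return pooled, ci, q, i2

# Hypothetical test-set estimates: (MRE in mm, standard error)
data = [(1.2, 0.30), (1.5, 0.25), (1.4, 0.40), (1.6, 0.35)]
pooled, ci, q, i2 = fixed_effect_pool(data)
# If I² exceeded 50%, a random-effects model would be chosen instead.
```

With these hypothetical inputs, I² is 0%, so the fixed-effects branch of the decision rule described above would apply.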

Study selection
The electronic database search yielded 2082 articles. Of these, 1026 were duplicates and 971 did not meet the eligibility criteria based on their titles and abstracts. The full text of the remaining 76 articles was reviewed, resulting in further exclusion of 45 articles. Supplementary File 2 describes the reasons for exclusion. Ultimately, 34 studies were deemed eligible and included in the systematic review. The selection process is depicted in the PRISMA 2020 flow diagram (Fig. 1).

Study characteristics
The included studies covered a period of seven years, from 2017 until 2023. The majority of the studies originated from South Korea (n = 15), followed by China (n = 7), Japan (n = 3), USA (n = 3), Germany (n = 2), and one each from France, Hong Kong, the Netherlands, and Turkey. Automated AI-based landmark identification was applied on 2D lateral cephalograms in 27 studies and 3D-CBCT images in 7 studies. Most studies (n = 27) primarily investigated the computational improvement of algorithms for landmark detection, while seven studies focused on clinical validation of established methods. The characteristics of these 2D and 3D studies are summarized in Tables 2 and 3, respectively. Almost half of the 2D studies evaluated the accuracy of their AI algorithms using a public benchmark dataset from the IEEE International Symposium on Biomedical Imaging 2015 grand challenge [21]. This dataset consisted of 400 high-resolution lateral cephalograms (training set = 150, test set 1 = 150, test set 2 = 100) with 19 landmarks manually annotated by two experts (one junior and one senior orthodontic specialist) as the ground truth. These manually annotated landmarks serve as the reference against which an AI algorithm's performance is measured.
The original dimensions of the images were 1935 × 2400 pixels, with a resolution of 0.1 mm per pixel in both horizontal and vertical directions. The average intra-observer variability for these landmark points was found to be 1.73 mm for the junior expert and 0.90 mm for the senior expert, while the inter-observer variability between the two experts was 1.38 mm, suggesting a reasonable accuracy target for automated landmark detection techniques. To compensate for any inter-observer variability, the mean position of the two points from both experts was used as the ground truth [21]. Among the included studies, the total number of landmarks tested ranged from 7 [22] to 105 [23]. The amount of data used for training ranged from 15 [24] to 1983 images [25], while the test datasets ranged from 4 [26] to 400 images [2]. Figure 2 illustrates AI-driven automated landmark identification on a 2D cephalogram followed by manual correction by an expert, and manual identification on a 3D-CBCT image.
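The scoring convention described above — averaging the two experts' annotations into a single ground truth, then converting pixel distances to millimetres at the stated 0.1 mm/pixel resolution — can be sketched as follows. The pixel coordinates are hypothetical, chosen only to illustrate the conversion.

```python
import math

MM_PER_PIXEL = 0.1  # IEEE ISBI 2015 cephalograms: 0.1 mm per pixel

def ground_truth(junior_px, senior_px):
    """Mean of the two experts' annotations, per landmark (in pixels)."""
    return [((jx + sx) / 2, (jy + sy) / 2)
            for (jx, jy), (sx, sy) in zip(junior_px, senior_px)]

def radial_error_mm(pred_px, truth_px):
    """Radial error of each predicted landmark, converted to mm."""
    return [math.dist(p, t) * MM_PER_PIXEL
            for p, t in zip(pred_px, truth_px)]

# Hypothetical annotations for two landmarks, in pixel coordinates
junior = [(1000, 1200), (640, 900)]
senior = [(1010, 1194), (646, 908)]
truth = ground_truth(junior, senior)   # [(1005.0, 1197.0), (643.0, 904.0)]
errors = radial_error_mm([(1008, 1200), (650, 900)], truth)
```

Averaging the two annotations in this way dampens inter-observer disagreement, which is why the challenge adopted it as the reference standard.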

Qualitative synthesis
A qualitative synthesis of all reported data related to automated 2D and 3D landmark identification was conducted. The 2D studies that used only the IEEE dataset demonstrated that the accuracy on test set 1 ranged from 75.37% [27] to 87.61% [28], based on the SDR value within the 2 mm error threshold. Conversely, 18 studies that used their own datasets, either alone or in combination with the IEEE dataset, revealed SDRs ranging from 62.0% [11,29] to 97.30% [30] within the clinically acceptable 2 mm range.
Studies that applied automated landmarking on 3D-CBCT images reported their accuracy as either mean error (n = 7) or SDR (n = 2), with the highest observed error being 5.785 mm [31]. Of all the landmarks on 2D and 3D images, gonion was generally the most challenging to locate automatically, with the lowest SDR being 38.0% [25] and the highest 85.00% [32] within the 2 mm threshold. The computational time to automatically detect the landmarks was reported in 11 articles, all of which reported a timing of less than 1 min. Table 4 presents a list of cephalometric analyses that could potentially be performed using the automated landmarking proposed in the included studies. In terms of clinical applicability, the AI algorithms for automated landmark identification used in most studies could facilitate at least the Steiner and Down analyses. This was due to the algorithms' ability to identify the following landmarks: sella, nasion, point A, point B, pogonion, gnathion, menton, gonion, porion, orbitale, upper incisor, and lower incisor [33,34].

Quantitative synthesis
The meta-analysis was limited to the accuracy of 2D landmark identification due to the diverse range of study designs and reported outcomes used in 3D cephalometry. The accuracy of AI-based 2D landmark identification was evaluated in studies that reported the MRE and SE outcomes of test datasets. A total of 14 studies with 21 estimates were included, of which 3 studies tested their accuracy on 2 test sets and 2 studies on 3 sets. The statistical analysis revealed homogeneity among the included studies, as indicated by a Q-value of 2.53 (P > 0.05) and an I² of 0%, indicating no significant heterogeneity among the studies. A fixed-effects model was therefore employed, as the included studies demonstrated homogeneity with the same true effect size. The results indicated that the prediction of AI-based landmark placement generally fell below the 2 mm error threshold (1.43 mm; 95% CI: 0.95–1.91 mm), and only the results of two studies exceeded this threshold (Fig. 3).

Risk of bias assessment
When using the QUADAS-2 tool, 'AI-driven cephalometric landmark detection' acted as the index test domain and 'manual landmark placement by experts' was considered the reference standard domain. Most studies had a high risk of bias associated with data selection (93%), primarily because the authors did not employ a randomized selection process. Furthermore, a high risk also existed in relation to the use of the reference standard. Generally, the applicability concern associated with the included studies was high, with the exception of the index test usage. Fig. 4 provides a comprehensive overview of the risk of bias and applicability concerns.

Discussion
In the digital age and with the rise of precision dentistry, workflows in dentomaxillofacial practices are increasingly streamlined through the incorporation of AI-based technologies. This systematic review and meta-analysis were conducted to evaluate the accuracy of AI-powered tools in automating 2D and 3D cephalometric landmarking.

A significant portion of the studies included in the review originated from East Asia (76.5%), with less representation from Europe and America. This trend can be attributed to various factors, such as the rapid advancement of technology and significant investment in AI research in East Asia, as emphasized in reports by the Organisation for Economic Co-operation and Development (OECD) and the Government AI Readiness Index. The position of East Asia as a leading global centre for AI innovation is evident from its extensive production of AI-related publications and its high ranking in the Government AI Readiness Index. These factors highlight East Asia's crucial role in propelling AI research and development [35,36]. Nevertheless, it is crucial to ensure a broad spectrum of viewpoints and contributions in AI research, as this can result in more holistic and inclusive solutions. Therefore, there is a call for international collaborative research to ensure the universal applicability of AI technologies across varied patient demographics.

The findings of the review suggested variability in the accuracy of landmark detection amongst different studies. This could be attributed to differences in the sample size used in the training set, where large heterogeneous samples with anatomical variability are expected to provide a more comprehensive learning process, thereby ensuring accuracy [37]. Moreover, each study used a distinct dataset for testing, separate from the one used for training. This is normal practice in the evaluation of AI models: it ensures that the models are tested on data not encountered during the training process, thus minimizing the chance of overfitting and providing a solid assessment of their ability to generalize [38]. However, it is important to note that these studies did not provide detailed descriptions of the specific methods employed to select the subjects included in the test dataset.
Most 2D cephalometric studies used the publicly accessible IEEE dataset to train AI algorithms, with the aim of enhancing the accuracy and efficiency of automatic landmark identification through computational improvements. Although the IEEE dataset offers the advantage of standardized performance comparability, it also introduces a challenge due to limited generalizability, making the clinical applicability of the AI tool questionable [39]. This issue was corroborated by the included studies that emphasised clinical validation. These studies used their own datasets and demonstrated lower accuracy compared to those that focused on computational enhancement using the IEEE dataset [29]. Therefore, future research should utilize multi-centre datasets with varying acquisition parameters for clinical validation. This approach could enhance the consistency and robustness of AI-driven solutions and address generalisability issues, which are crucial for clinical applicability [40].
In the field of AI, particularly in the context of medical imaging and analysis, the complexity of a dataset is also determined by several other factors, such as the size and shape of the anatomical structures, age, gender, type of malocclusion, ethnic background, and bone density [41,42]. These characteristics introduce a wide range of variations that the AI system must be able to recognize and interpret correctly [43]. This requires sophisticated algorithms and a large amount of diverse training data. The more complex the dataset, the more challenging it can be for the AI to learn and make accurate predictions [37,38,44]. For instance, Tanikawa et al. [44] demonstrated that the performance of AI-driven automated landmarking was lower in patients with cleft palate compared to those without this condition. When training AI algorithms for cephalometric landmark detection, it is crucial to understand that the robustness and accuracy of the algorithm depend on its adaptability to the variations encountered in clinical practice.
When comparing the included studies, a negative correlation was found between the reported accuracy and the size of the training data. For instance, Hwang et al. trained their AI algorithm with 1983 2D images, each containing 19 annotated landmarks, and observed an accuracy of 73.2% [25]. In contrast, Lee et al. used a training set of 150 images and achieved a higher accuracy of 86% [22]. This inconsistency could be associated with the variability of the training and testing sets: Lee et al. relied on the IEEE dataset, consisting of patients without any craniofacial deformities and with similar radiological patterns in both training and testing sets [43]. On the other hand, Hwang et al.'s testing datasets included patients with variable, heterogeneous radiological patterns, which the AI algorithm might not have accurately identified based on the homogeneous dataset used for training.
Most 2D studies reported an accuracy of more than 80% within the 2 mm error threshold, while the mean error of 3D landmark detection ranged approximately between 1.0 mm and 5.8 mm. It is important to note that the accuracy of landmarking cannot be directly compared between these two types of datasets. In 2D imaging, landmarks are projected onto a single plane, simplifying the identification process and often leading to higher reported accuracy rates within the given error threshold. Conversely, 3D landmark detection involves identifying points within a volumetric space, which introduces additional complexity and challenges [24,45]. Despite achieving a high accuracy level, the performance of AI has not yet reached the level of an expert, and further improvements are anticipated, especially in the realm of 3D landmark detection. This is an area where a limited number of cases were used for the training and testing of AI algorithms in the reviewed studies. Given the challenge of accurately detecting landmarks in three dimensions using small datasets while still maintaining high accuracy within the 2 mm error threshold, it is advisable to conduct additional studies with larger sample sizes [31].
The findings of the included studies were compared against a threshold of 2 mm, which is generally accepted as clinically acceptable for most cephalometric measurements [6]. This tolerance for error is justified by the inherent limitations of 2D imaging, in which the majority of cephalometric points appear as projected images in the context of right-left asymmetry. Mostly, clinicians estimate a median between the projections of paired cephalometric points to establish the references for the cephalometric analysis. Although 3D imaging avoids geometrical distortion, precise segmentation from CBCT has yet to be standardised in semiautomatic workflows [46,47]. A discrepancy of even 2 mm can indeed have significant implications, especially when dealing with smaller patient sizes or specific landmarks, which is why it is crucial to strive for the highest accuracy possible in these situations. It is worth noting that while a certain level of error might be deemed acceptable by clinical standards, the goal should always be to minimize it as much as possible to ensure the best patient outcomes. Moreover, clinicians are cognizant of potential errors in the placement of landmarks, which are typically taken into account subjectively during the clinical interpretation of the analysis and the patient's treatment planning [6].
The selection of the cephalometric landmarks included in our review was primarily based on their widespread use in orthodontics and their clinical relevance [32,33]. Among the different annotated landmarks on both 2D and 3D images, gonion was generally one of the most difficult landmarks to localise automatically and had the lowest detection rate. The identification of gonion appears to pose a significant challenge not only for AI algorithms but also for human observers. This is primarily because this landmark is a constructed point on the 2D cephalogram, resulting from the imperfect overlay of the bilateral aspects of the mandible. Additionally, the 3D error may be a consequence of discrepancies in volumetric segmentation or difficulties in determining the definitive vertical position of gonion along broadly curved structures, a problem also commonly encountered by human observers [48]. Hence, it is important to take these limitations into consideration when training an AI algorithm so as to improve its performance. It is worth noting that the accuracy of landmark identification is heavily dependent on the expertise and anatomical knowledge of the experts [6]. Consequently, the experts responsible for creating training datasets should have substantial experience in this field. A low detection rate for the gonion point might diminish the overall measured performance of an AI tool; hence, it is essential to address this issue to ensure the effectiveness of the AI tool.
Regarding clinical applicability, the AI algorithms discussed in most studies have demonstrated the ability to identify key landmarks commonly used in two of the most prevalent cephalometric analyses, namely the Steiner and Down analyses [33,34]. This suggests that current AI tools could be considered clinically applicable for cases requiring orthodontic diagnosis and treatment planning. However, caution is advised, as their accuracy has not yet reached the level of an expert, which could lead to errors in cephalometric analysis for diagnostics, planning, or outcome evaluation. Furthermore, time consumption is another important parameter to be considered in clinical practice. While it takes an expert approximately 20 min to manually identify cephalometric landmarks [37], most AI-based algorithms can do so in less than a minute. Despite this, further research is needed to enhance AI's accuracy, as time efficiency alone is not sufficient justification until it can provide accuracy comparable to that of an expert. A beneficial addition to AI algorithms would be the ability to identify which landmarks increase time consumption and are incorrectly identified. This would allow for manual intervention to correct these errors and to train the algorithm on the corrected data [39]. The incorporation of such human-AI collaboration for error correction should be considered in future studies.
This review encountered several limitations. First, the number of studies included was relatively limited, particularly those related to 3D landmarking. Second, due to the variability in datasets, imaging parameters, and algorithms, the results of the quantitative synthesis should be interpreted with caution. Third, a significant risk of bias was observed in patient selection. While a few studies provided detailed information about patient selection, the majority relied on the IEEE dataset without explicitly outlining their sampling procedures [16,38]. Finally, the manual identification of landmarks for the training set is subject to both inter- and intra-observer variability [49]. Hence, it is advisable to specify the training and calibration protocol for landmark identification when creating the ground truth for training. Future studies should also adhere to AI reporting standards such as CONSORT-AI and SPIRIT-AI [50].

Conclusions
AI-driven cephalometric landmark detection on 2D and 3D images exhibited high accuracy and time efficiency. Although the majority of 2D studies indicated superior automated landmark detection performance, the error rates displayed by 3D studies were inconsistent, implying a need for further improvement. Moreover, clinicians are advised to remain vigilant due to the risk of inaccurate landmark identification. To enhance the generalisability and clinical applicability of AI models, it is suggested that datasets be broadened to include a more diverse range of data. The incorporation of AI-driven landmark identification in further studies could accelerate its refinement and overall development, thereby setting the stage for its potential to replace manual landmarking.

Figure 3. Forest plot of automated landmark identification on 2D cephalograms, reporting accuracy as mean radial error (MRE) and standard error (SE) (mm). Studies using multiple test datasets are indicated accordingly. Horizontal lines indicate 95% confidence intervals (CI), square shapes indicate SE, and the diamond shape indicates the pooled subtotal.

Figure 4. Risk of bias and applicability concerns based on the Quality Assessment of Diagnostic Accuracy Studies-2 tool.

Table 1. Search strategy on each database.

Table 2. Characteristics of included studies using 2D lateral cephalograms.

Table 3. Characteristics of included studies using 3D-CBCT images.

Table 4. Potential cephalometric analyses using annotated landmarks.