Reporting characteristics of meta-analyses in orthodontics: methodological assessment and statistical recommendations

Summary Ideally, meta-analyses (MAs) should consolidate the characteristics of orthodontic research in order to produce an evidence-based answer. However, severe flaws are frequently observed in them. The aim of this study was to evaluate the statistical methods, the methodology, and the quality characteristics of orthodontic MAs and to assess their reporting quality during the last years. Electronic databases were searched for MAs (with or without a proper systematic review) in the field of orthodontics, indexed up to 2011. The AMSTAR tool was used for quality assessment of the included articles. Data were analyzed with Student's t-test, one-way ANOVA, and generalized linear modelling. Risk ratios with 95% confidence intervals were calculated to represent changes over the years in the reporting of key items associated with quality. A total of 80 MAs with 1086 primary studies were included in this evaluation. Using the AMSTAR tool, 25 (31.3%) of the MAs were found to be of low quality, 37 (46.3%) of medium quality, and 18 (22.5%) of high quality.


Introduction
Meta-analysis (MA) is a set of statistical techniques used to combine results from two or more separate studies, while a systematic review consists of a clearly formulated question and explicit methods to identify, select, and critically appraise relevant research. Although the term "meta-analysis" is often used interchangeably with "systematic review", strictly speaking, an MA is an (optional) component of a systematic review. Ideally, articles summing evidence by means of MA should always be conducted within the framework of a proper systematic review, in order to identify and minimize the many sources of bias, such as reporting biases. However, this is not always the case (Chavalarias and Ioannidis, 2010; Polychronopoulou et al., 2010).
The evaluation of the reporting quality of published MAs is very useful, as it is directly related to a study's methodology and conclusions (Huwiler-Müntener et al., 2002; Moher et al., 1998). Attempts have been made over the years to assess the quality of systematic reviews and MAs by creating several relevant instruments for reporting and appraising them, with varying efficacy (Sampson et al., 2008a). The QUOROM (Quality of Reporting of Meta-analyses; Moher et al., 1999) and subsequently the PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) statement (Liberati et al., 2009) incorporate items deemed important for the transparent reporting of an MA. In the same direction, the AMSTAR (Assessment of Multiple Systematic Reviews; Shea et al., 2007b) tool was introduced, designed to critically appraise the methodological quality of systematic reviews, and has proved to be reliable and valid (Shea et al., 2007a, 2009). Existing biases in systematic reviews and MAs may be examined in complex systematic reviews (Whitlock et al., 2008) or "meta-epidemiological" studies, in which the influence of specific characteristics on treatment effect estimates is explored (Sterne et al., 2002). The characteristics of orthodontic systematic reviews (without meta-analytical synthesis), as well as their shortcomings, have been previously highlighted by the authors of the current article (Papageorgiou et al., 2011). These may originate from the suboptimal reporting quality of the reviews, from characteristics of the orthodontic literature itself associated with surrogate endpoints (Richards, 2005), or from studies of low power and reliability (Tulloch et al., 1989; Rinchuse et al., 2008). Indeed, MAs in orthodontics seem to present low reporting quality and include few high-quality primary studies (Papadopoulos and Gkiaouris, 2007). The methodology and reporting of orthodontic MAs have not been studied extensively so far.
The aims of this study were (a) to evaluate the methodology and quality characteristics of MAs (systematic review articles with meta-analytical data synthesis) related to orthodontics, (b) to assess whether their reporting quality has been improved over the years, and (c) to evaluate the appropriateness of the statistical methods being implemented in these MAs.

Study sample
Electronic search strategies were developed and executed to identify MAs relevant to orthodontics published in journals, dissertations, or conference proceedings (Supplementary Tables S1a and S1b). No restrictions were made concerning year, language, or publication status. The reference lists of acquired articles were also searched for relevant articles. Databases of research registers were searched to identify ongoing or unpublished reviews. All databases were searched on 4 March 2011 and were manually rechecked in July 2011. Authors previously identified to have published multiple systematic reviews were contacted for additional articles (Papageorgiou et al., 2011). Two of the authors (S.N.P. and M.A.P.) independently screened the titles and abstracts of the retrieved citations to exclude non-eligible articles. A copy of the full text was obtained for the remaining articles. The same authors read each full-text article in order to determine whether it met the inclusion criteria. Additional material included as an appendix in the original articles was acquired when needed.
In this paper, the term "meta-analysis" was used for articles summing evidence from two or more separate sources in a single estimate. This must be distinguished from the statistical procedure of MA, since an article may report the results of more than one MA procedure. An article was considered eligible for inclusion if it reviewed the literature appropriately in order to find relevant articles and combined data from two or more separate studies into a single numerical estimate. These articles could be labelled as "meta-analyses" or "systematic reviews", although the latter were included only if they reported a combined estimate. All other types of studies, such as case reports, case series without a control, case-control studies, cohort studies, other observational studies, randomized controlled clinical trials (RCTs), narrative reviews, and systematic reviews without meta-analytic procedures, were excluded.

Data extraction
Data to be collected were defined a priori from pilot searching of the literature and discussion among the authors, based on a previous report (Papageorgiou et al., 2011). A number of general and specific reporting items (characteristics) for each review were assessed. Two of the reviewers (S.N.P. and M.A.P.) completed data extraction in duplicate using a predesigned collection form. The reviewers were not blinded to journal and author names, since the effect of masked assessment is inconsistent (Jadad et al., 1996). Any disagreement was resolved by consulting the last author (A.E.A.) until a final consensus was achieved. Inter-reviewer agreement on study selection and data extraction was assessed by Cohen's kappa coefficient.
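Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance alone. A minimal sketch follows; the ten include/exclude screening decisions are hypothetical illustrations, not the study's data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters classifying the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters decided independently.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical screening decisions for ten retrieved records.
a = ["in", "in", "out", "out", "in", "out", "out", "in", "out", "out"]
b = ["in", "in", "out", "out", "out", "out", "out", "in", "out", "out"]
kappa = cohens_kappa(a, b)  # one disagreement out of ten records
```

Values above roughly 0.8, such as the 0.910 and 0.867 reported here before reconciliation, are conventionally read as almost perfect agreement.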
Epidemiological characteristics were based on the first author of each article in order to extract the country's data, continent's data, and the "academic source" of each article (i.e. whether the first author originated from an orthodontic academic department, a non-orthodontic academic department, or not from an academic department). The Cochrane Database of Systematic Reviews was classified as a general journal. Journals having published fewer than five orthodontic MAs, together with the Cochrane Database of Systematic Reviews, were unified in a journal category named "Other". The same was done for quality assessment tools of primary studies that were used in fewer than five MAs. Journal impact factors (IFs) were acquired from the ISI Journal Citation Reports® Science Edition, 2011. H-indices were acquired from the SCImago Journal & Country Rank (SCImago, 2007). Separate citation counts were acquired from Web of Science, Scopus, and Google Scholar, and their average value was used for the analyses. All citation counts and journal metrics were acquired during the third week of August 2011.

Quality assessment
Articles were evaluated using the 11-item AMSTAR tool (Shea et al., 2007b). Each item was assessed using a four-point scale: "Yes", "Can't tell", "No", and "Not applicable". A criterion was defined as "Can't tell" if it was half met. For example, the fifth criterion, "A list of included and excluded studies should be provided", was scored as "Can't tell" if either the included or the excluded studies were listed. Non-applicable items were excluded from the maximum scoring capability of each MA. Summary scores were extracted by giving one point for each "Yes" and half a point for each "Can't tell" (instead of giving only one point for each "Yes") in an attempt to maximize data output. Summary scores are reported as percentages. The "acceptance-to-publication time" was calculated as the publication date minus the acceptance date. Since the exact dates of acceptance and publication were not always available for all MAs, only the corresponding months were used for calculating the "acceptance-to-publication time".
An MA was considered to be of "low quality" if the AMSTAR score was between 0 and 4 points, of "moderate quality" if it was between 4.5 and 8 points, and of "high quality" if it was between 8.5 and 11 points.
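The modified scoring scheme just described (one point per "Yes", half a point per "Can't tell", non-applicable items removed from the denominator, and the point bands for low/moderate/high quality) can be sketched as follows; the ratings vector is a hypothetical example, not data from an included MA.

```python
def amstar_summary(ratings):
    """Modified AMSTAR summary: raw points and percentage score.
    ratings: list of 'yes' | 'cant_tell' | 'no' | 'na' for the 11 items."""
    points_map = {"yes": 1.0, "cant_tell": 0.5, "no": 0.0}
    applicable = [r for r in ratings if r != "na"]
    points = sum(points_map[r] for r in applicable)
    return points, 100 * points / len(applicable)

def quality_band(points):
    # Bands from the text (raw AMSTAR points on the 11-item scale):
    # 0-4 low, 4.5-8 moderate, 8.5-11 high.
    if points <= 4:
        return "low"
    return "moderate" if points <= 8 else "high"

# Hypothetical item ratings for one review (one item not applicable).
ratings = ["yes"] * 5 + ["cant_tell"] * 2 + ["no"] * 3 + ["na"]
points, pct = amstar_summary(ratings)  # 6.0 points out of 10 applicable
```

With these ratings the review scores 6.0 points (60%), placing it in the "moderate quality" band.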

Statistical analysis
The modified AMSTAR score was tested for normality with visual inspection of histograms and the Kolmogorov-Smirnov test (P = 0.005). A square root transformation was applied to achieve a distribution as close to normal as possible (P = 0.079). The data are presented as medians, interquartile ranges (IQRs), and 95% confidence intervals (CIs) in natural units. Student's t-test and one-way ANOVA were used on the transformed data to identify differences.
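This normality screening can be illustrated with a minimal Kolmogorov-Smirnov D statistic computed against a normal distribution fitted to the sample (a Lilliefors-style check; proper P-values require tables or simulation, which dedicated software provides). The right-skewed scores below are hypothetical; the square-root transform visibly reduces the discrepancy from normality.

```python
import math
from statistics import NormalDist, mean, stdev

def ks_statistic(data):
    """Two-sided KS distance between the empirical CDF and a normal
    distribution fitted to the sample (screening use only)."""
    xs = sorted(data)
    n = len(xs)
    dist = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        # Compare the fitted CDF with the empirical step before/after x.
        d = max(d, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return d

# Hypothetical right-skewed AMSTAR-like scores.
scores = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9, 11]
d_raw = ks_statistic(scores)
d_sqrt = ks_statistic([math.sqrt(s) for s in scores])  # smaller D
```

Here the transformed data sit closer to a fitted normal (smaller D), mirroring the improvement from P = 0.005 to P = 0.079 reported above.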
To evaluate the association between the modified AMSTAR score and specific methodological characteristics, generalized linear models (GZLMs) were used. The untransformed modified AMSTAR score was used, and the models were specified with a normal distribution and a 0.5 power link. For bivariate comparisons, a separate model was constructed for each characteristic. A multivariate model was then constructed to account for confounding factors, including the independent variables with significant results from the bivariate models. Because of the large variables-to-studies ratio, a threshold of P ≤ 0.05 was used for selecting variables for the multivariate model instead of the customary P ≤ 0.10, and a post hoc Sidak correction for multiple comparisons was applied to the comparisons with reference categories.
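To see what the 0.5 power link implies: the model relates the covariates linearly to the square root of the expected score. For a single binary covariate the model is saturated, so the fitted group means are reproduced exactly and the coefficient lives on the square-root scale. A minimal sketch with hypothetical scores (protocol use is just an illustrative covariate, and the numbers are invented):

```python
from statistics import mean

# Hypothetical modified AMSTAR scores (points) for MAs without and
# with a guiding protocol -- illustrative numbers only.
no_protocol = [3.0, 4.5, 5.0, 4.0, 6.0]      # group mean 4.5
with_protocol = [6.5, 8.0, 7.5, 9.0]          # group mean 7.75

# Gaussian GLM, 0.5 power link: sqrt(E[score]) = b0 + b1 * protocol.
# With one binary covariate the fit is saturated, so the coefficients
# follow directly from the group means on the square-root scale.
b0 = mean(no_protocol) ** 0.5
b1 = mean(with_protocol) ** 0.5 - b0

# Back-transforming through the inverse link recovers the group mean.
fitted_with_protocol = (b0 + b1) ** 2
```

For continuous covariates (e.g. journal IF) the same link applies, but the coefficients must be estimated iteratively, as statistical packages do.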
Risk ratios (RRs) with 95% CIs were used as summary statistics to compare the reporting of key items associated with quality of orthodontic MAs between specific publication years (Hopewell et al., 2010). The appropriateness of the statistical methods used was evaluated with items based on a relevant report on Cochrane reviews (Riley et al., 2011).
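An RR with its 95% CI is obtained on the log scale with a normal approximation. The sketch below uses invented counts (30 of 40 recent MAs versus 15 of 40 older MAs reporting some item), not the study's results.

```python
import math

def risk_ratio(events_a, n_a, events_b, n_b):
    """Risk ratio and 95% CI via the log-RR normal approximation."""
    rr = (events_a / n_a) / (events_b / n_b)
    # Standard error of log(RR) for two independent proportions.
    se_log = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lo = math.exp(math.log(rr) - 1.96 * se_log)
    hi = math.exp(math.log(rr) + 1.96 * se_log)
    return rr, lo, hi

# Hypothetical: 30/40 recent vs 15/40 older MAs reporting an item.
rr, lo, hi = risk_ratio(30, 40, 15, 40)  # RR = 2.0
```

A CI excluding 1 (as here, roughly 1.29 to 3.10) would indicate a statistically significant change in reporting between the two periods.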
In reporting the data, an alpha level of 0.05 was used as the criterion for the statistical significance of the estimated effects. All statistical analyses were performed with the IBM SPSS software package (version 19.0; SPSS, Chicago, IL, USA) except for the forest plots, which were drawn with the RevMan software version 5.1 (Review Manager, 2011).

Literature search
The electronic search yielded 9070 initial citations (Supplementary Table S1b), while two additional papers were identified through communication with authors of systematic reviews. From the 5698 records that remained after duplicate exclusion, a total of 5618 records were excluded with reasons, and thus 80 MAs were deemed eligible for data extraction (Figure 1). The kappa scores before reconciliation for the selection and data extraction procedures were 0.910 and 0.867, respectively, thus indicating almost perfect agreement.

Publishing characteristics
The 80 MAs included 1086 primary studies and 62 214 251 patients, and were published between 1991 and 2011 in 39 distinct journals (Supplementary Table S2). The MAs referred to a wide set of areas in the field of orthodontics (Table 1). The evaluation of the MAs according to AMSTAR is presented in Supplementary Table S3, while their general characteristics and the corresponding modified median AMSTAR scores are presented in Supplementary Table S4.
Although a citation analysis is not formally displayed in the tables, the following information was obtained. Out of the 80 MAs, 66 (82.5%) were cited at least once in one of the three databases searched. For 78 MAs (97.5%), the citations from Google Scholar were equal to or higher than those from Scopus or Web of Science. Also, the six most extreme citation counts (over 100 citations) for an MA were identified via Google Scholar. The average citations per MA from the three databases had a median of 13.5 citations (95% CI: 18.8-33.1, IQR = 35.8). MAs received a median of 1.7 citations (95% CI: 2.0-4.1, IQR = 3.8) per year. Chronologically, the highest citation counts per MA were found for those published in 1997 (61.0 citations), followed by 1999 (58.5 citations) and 1998 (42.0 citations). MAs from North America received more citations per MA than those from other continents (26.3 citations), while MAs from the Netherlands received more citations per MA than those from other countries (38.3 citations). At the journal level, the highest citation count per MA belonged to the Cleft Palate-Craniofacial Journal (28.3 citations), followed by the Angle Orthodontist (5.7 citations) and the American Journal of Orthodontics and Dentofacial Orthopedics (3.7 citations). MAs originating from a university department (based on the first author) received on average more citations than non-academic ones (14.2 versus 2.0 citations, respectively). Through Web of Science, a total of 986 citations could be tracked for 45 MAs (19 MAs were not indexed and 16 were not cited). At the country level, the US contributed the most to the citing of orthodontic MAs (21.7%), followed by Canada (7.1%) and the UK (7.0%). At the continent level, Europe contributed the most (41.8%), followed by North America (29.3%) and Asia (16.6%).

Methodological quality
Reporting quality varied among reviews, ranging from 13.6% to 100.0%, with a median of 50.0% and an IQR of 40.9% (median = 5.5 and IQR = 4.0 AMSTAR points). Twenty-five MAs (31.3%) were of low quality, 37 (46.3%) of moderate quality, and 18 (22.5%) of high quality. Supplementary Table S3 summarizes the evaluation of the 80 MAs according to the AMSTAR tool. Twenty-two reviews (27.5%) clearly reported only the review question or only the inclusion criteria, while 25 MAs (31.3%) conducted only study selection, but not data extraction, in duplicate. Grey literature was not searched at all for relevant articles in 55 reviews (68.8%). A list of excluded studies was not provided in 52 reviews (65.0%), while two reviews (2.5%) did not provide included or excluded studies in a list or a table at all. Table 2 provides the results from the univariate and multivariate regression analyses used to explore MA characteristics possibly related to an increase of the modified AMSTAR score. Comparisons were made between the baseline (reference category) and each of the remaining groups per characteristic.

Factors associated with reporting quality
After adjusting for potential confounding factors through the multivariate analysis, several characteristics remained significantly associated with the quality score (Table 2). More specifically, the following statistically significant observations were made. Every passing year was associated with a 0.5% (95% CI: 0.1-0.9%) increase in AMSTAR score. Internal financial support was associated with an increased quality score by 8.0% (95% CI: 2.1-13.9%) compared to no financial support. Use of a protocol seemed to increase the AMSTAR score by 6.4% (95% CI: −1.7 to 11.0%). Presentation of the complete Boolean search strategy was associated with an increased quality score by 5.9% (95% CI: 2.4-9.4%). The assessment of the reporting quality of included primary studies was associated with a net score increase of 13.0% (95% CI: 8.8-17.3%). Finally, the AMSTAR score was associated with a 1.5% (95% CI: 0.8-2.3%) increase for every additional IF point of the publishing journal (Figure 3B).

Statistical methods
The statistical methods used in the included studies are presented in Table 3. The 80 MAs included in the current investigation presented a large number of meta-analytic procedures for data combination. A median number of two forest plots per review (min = 0; max = 22) was given, while each plot presented one or more summary estimates. Such large numbers of forest plots per review can be explained by the variety of questions, outcomes, interventions, and subgroups tested. They also raise the concern of outcome hierarchy and multiple testing, which increase the risk of false significant results. Indeed, only 16 (20.0%) MAs clearly defined primary and secondary outcomes. A large number of MAs (48.8%) used sensitivity analyses or subgroup analyses, but only a part of them (22.5%) predefined them. One-fourth of the MAs (25%) planned to use a measure of heterogeneity to decide on the statistical model used, and almost all of them did so (22.5%), while only 10% of them reported an estimate of the between-study variance ("tau-squared"). Out of the 49 reviews that applied the random-effects model to calculate one or more summary estimates, only eight (16.3%) justified why this was the appropriate method or interpreted the random-effects result as the average of the intervention effects across studies. Out of the 26 MAs with one or more fixed-effects models, eight (30.8%) presented fixed-effects MA results even when potentially moderate (e.g. I² > 25%) or large (e.g. I² > 50%) heterogeneity was present, without justification of why the fixed-effects approach was still deemed appropriate. Overall, 21.3% of the MAs presented results with high heterogeneity without explanation, while four MAs (5.0%) used meta-regression to examine causes of heterogeneity.

Publication bias
Only 26 MAs (32.5%) reported how they would assess the problem of publication bias, while 15 (18.8%) reported publication bias in the Discussion section. A total of 21 (26.3%) MAs drew a funnel plot in order to visually assess asymmetry, while 12 MAs (15.0%) used a statistical test for publication bias (i.e. a test of funnel plot asymmetry or "small study effects"). A total of six (7.5%) MAs found some evidence of asymmetry, by the visual inspection of funnel plots, the use of statistical tests, or both.

Discussion
This study provides a comprehensive assessment of the design and reporting characteristics of the largest cohort of orthodontic MAs up to the collection of the present data and follows a previous evaluation of orthodontic systematic reviews without meta-analytical synthesis (Papageorgiou et al., 2011). The number of these reviews has increased over time, with variability in reporting quality. The examined MAs predominantly addressed questions about the effectiveness of therapeutic interventions and rare clinical entities (e.g. cleft lip and palate or obstructive sleep apnoea). Although MAs are considered by many clinicians to be the top of the "evidence pyramid" and attract the highest relative citation counts (Patsopoulos et al., 2005), many of the identified reports did not describe methods and bias assessment in sufficient detail. Certain characteristics of transparent design were not reported by a large number of reviews, including a comprehensive literature search, the validity of selection/extraction procedures, or the methodological assessment of included studies, all of which are important for the replication and evaluation of the review (Sutton et al., 1998). Moreover, suboptimal reporting of potential conflicts of interest was identified, despite the increasing concern that funding agencies influence the outcomes of biomedical research (Smith, 2005; Gøtzsche et al., 2009). Notably, articles disclosing sources of funding have been shown to be significantly more likely to be published than those without any disclosure (Lee et al., 2006), which indicates that transparent MAs could also be published more easily. About 39% of the 80 MAs were published in orthodontic journals. Although North America is regarded as the most prolific continent regarding orthodontic literature (Kanavakis et al., 2006), the majority of orthodontic MAs were produced in Europe (42.5%), with North America coming second (36.3%).
At the journal level, the American Journal of Orthodontics and Dentofacial Orthopedics and the Angle Orthodontist received MAs from three continents, and the Cleft Palate-Craniofacial Journal from two continents. The majority of MAs published by each of the three journals originated from Europe. The number of orthodontic journals has increased during the last years, and the quality of the MAs they accepted for publication was found to be higher than that of general biomedical journals, although not significantly so. The IF characteristics of orthodontic journals have been previously discussed (Eliades and Athanasiou, 2001). In this study, scientific impact was measured both by the journal's IF and by the h-index equivalent for journals (Braun et al., 2006), which seems to be quite robust (Vanclay, 2007). In this analysis, a higher AMSTAR score was associated to some extent with both the journal's h-index (Supplementary Table S4) and its IF (Table 2), suggesting a possible preference for submitting high-quality MAs to journals with major impact. However, MAs published in prestigious medical journals (such as those with higher IFs) have also been found to demonstrate exaggerated results compared with trials in other journals (Siontis et al., 2011).
The impact of orthodontic MAs was also assessed with the average of the citation counts from three databases. Citation counts differed among Google Scholar, Web of Science, and Scopus, possibly reflecting the quantitatively and qualitatively different coverage of each database (Kulkarni et al., 2009) and the small overlap among them (Meho et al., 2007). No association was observed between AMSTAR score and either average citations or the individual citation counts of each database. The same observation has been made for systematic reviews in orthodontics (Papageorgiou et al., 2011) and for original research articles in the field of psychiatry (Nieminen et al., 2006). In the present study, self-citations were not excluded. However, citation counts were not given undue weight, as a citation does not guarantee endorsement of the referenced article, but only that it is active in the scientific debate (Patsopoulos et al., 2005).
In this study, the reporting of orthodontic MAs was assessed with the AMSTAR tool, which is the most recent evidence-based appraisal instrument. AMSTAR has been validated to some extent (Shea et al., 2007a, 2009; Sampson et al., 2008a) and proposed by the Canadian Agency for Drugs and Technologies in Health and the World Health Organization as the best tool to critically appraise MAs (Proposed Evaluation Tools for COMPUS, 2005; Oxman et al., 2006; Pantoja and Campbell, 2009). The modifications made to the AMSTAR scoring system, which were also used in a previous report (Papageorgiou et al., 2011), may help to overcome the assessment difficulties reported for it (Faggion et al., 2012). Nevertheless, like every instrument, AMSTAR does present weaknesses. For example, there is no recommendation on how the scientific quality of studies should be assessed. Other criticisms concern the applicability of AMSTAR to reviews of non-randomized studies (Fedorowicz et al., 2011) or to mixed-methods reviews (Bouchard et al., 2011), which are reviews evaluating studies that employ qualitative, quantitative, and mixed methodology.
Improved quality was significantly related to certain MA characteristics. Financial support from the originating institution was associated with higher quality score. Also, MAs based on a protocol for guidance scored higher than others, something also observed for orthodontic systematic reviews (Papageorgiou et al., 2011).
Full provision of the Boolean search strategy was accompanied by an increase in AMSTAR score and is crucial for the reproducibility of the literature search (Maggio et al., 2011). Other reports have pointed out that the suboptimal reporting of search and selection procedures in orthodontics needs improvement (Flores-Mir et al., 2006). Evaluation of electronic searches in dental systematic reviews has yielded similar results. A survey of authors of systematic reviews reported a lack of comprehensive literature searches (Major et al., 2009), which was also found in this study. Indeed, only 45% of the included MAs reported undertaking an extensive literature search according to the AMSTAR tool. A comprehensive assessment of systematic reviews in dentistry by Glenny et al. (2003) found that 8 out of 15 proposed key items were often not assessed, with the literature search presenting the most problems. Better search and selection methodologies have been reported for certain dental specialties compared to others, although these specialties were also the most prolific ones in terms of publications (Major et al., 2007). As no consensus on search reporting methods exists (Sampson et al., 2008a), an evidence-based guideline for the peer review of electronic search strategies was recently developed (Sampson et al., 2009) in order to assure the quality of electronic searches and increase their effectiveness.
Methodological assessment of included studies was clearly associated with an increase in AMSTAR score. Regrettably, only half of the MAs formally assessed possible problems in the primary studies, while the most frequently used instrument was the Cochrane risk of bias tool. Evaluating the quality, or preferably the "risk of bias", as described in the PRISMA statement (Liberati et al., 2009) and the Cochrane Handbook (Higgins and Green, 2011), is a crucial component of a systematic review. Poor primary study quality has possibly been associated with effect overestimation or underestimation (Hopewell et al., 2007), although its role is to a large extent uncertain (Balk et al., 2002). Under existing evidence, it may not be necessary for risk of bias assessments in a systematic review to be conducted under blinded conditions (Morissette et al., 2011). A wide variety of checklists and scales exist (Moher et al., 1995; Sanderson et al., 2007; Higgins et al., 2011), although the use of scales has been cautioned against (Jüni et al., 1999; Greenland and O'Rourke, 2001). Systematic reviews should always evaluate and take into account the internal validity of included trials, but also their applicability and generalizability, or external validity (Dekkers et al., 2010). However, adjustment of the MA on the basis of quality scores is considered inappropriate and should be avoided (Herbison et al., 2006). Some additional issues should also be discussed, although their influence was non-significant in the present analysis. Firstly, the superior reporting of Cochrane reviews has been noted previously (Moher et al., 2007a; Papageorgiou et al., 2011) and may be aided by the guidelines of the Cochrane Collaboration, although criticisms of these reviews have also been issued (Lang et al., 2007). In addition, electronic publishing allows details to be provided more freely.
Cochrane reviews provide more details concerning the inclusion and exclusion criteria and are updated more often (Jadad et al., 1998; Shea et al., 2007a,b). Indeed, once the specific characteristics of a proper systematic review are taken into account, Cochrane reviews were not associated with a higher AMSTAR score (Table 2). Secondly, systematic reviews have to be up to date in order to be valid, but only six of the identified MAs were updates of previous ones. The low update rate of non-Cochrane MAs, which account for 80% of all systematic reviews (Moher et al., 2007a), may relate to the few methods or strategies that currently exist for the actual updating (Moher et al., 2007b). The rapid dissemination of the informative value of a systematic review can deteriorate due to publication lag, which may account for up to 20% of an MA's life span (Sampson et al., 2008b). Even worse, systematic reviews without an MA may be given even lower priority by editors. Thirdly, the quality improvement associated with the participation of statisticians or epidemiologists among the authors, reported for controlled clinical trials (Delgado-Rodriguez et al., 2001) and orthodontic systematic reviews (Papageorgiou et al., 2011), was not statistically significant in this study. However, the use of statistical expertise is associated with higher publication acceptance rates and is advocated because of the moderate statistical skills frequently found among clinical medical researchers (Perneger et al., 2004) and orthodontic postgraduate students (Polychronopoulou et al., 2011).
Although an annual minimal increase of AMSTAR score (0.5%) was identified, reporting seems not to have significantly improved in some areas of methodological importance. Any tendencies found by comparing 2001 and 2006 are not discussed further and should be interpreted with caution, due to the non-significance and the wide 95% CIs. However, between the years 2006 and 2011, a tendency for improvement was found regarding Boolean strategy reporting, flow diagram reporting, and the quality assessment of included studies. A decrease was noted in the number of MAs that provided complete search dates or included a statistician/epidemiologist.
The appropriateness of statistical methods for the pooling of data is essential for the validity and consistency of MAs and is often criticized (Feinstein, 1995; Sharpe, 1997). Although numerous articles assessing reports of MAs exist, only a few of them examine the actual statistical methods used (Minelli et al., 2009; Jude-Eze, 2011; Korevaar et al., 2011; Riley et al., 2011; Melchiors et al., 2012). Regarding the distinction between primary and secondary outcomes, which was seldom found in this analysis (Table 3), it should be mentioned that this is best made in the Methods section. It is often impossible for the reader to distinguish primary from secondary outcomes from the data provided in the Results section or the forest plots.
Many issues were raised when analysing the heterogeneity assessments of the MAs, which were often inadequate. A similar analysis of MAs of preclinical studies showed that the percentage of studies reporting on heterogeneity increased after 2005 (Korevaar et al., 2011). First of all, authors should predefine how the choice between a fixed-effects and a random-effects model for the MA is made. Selection between these two models is often based on either a "large" I² value or on the P-value of a chi-square test for heterogeneity. The size and impact of the between-study heterogeneity are properly measured by the between-study variance estimate ("tau-squared") and I², respectively. It is recommended, however, that both statistical and clinical reasoning are considered for this decision (Higgins and Green, 2011). Also, the pooled result from the two MA models is not the same and should be interpreted accordingly. A random-effects model produces the average of the intervention effects across studies, while a fixed-effects model produces the best estimate of a common intervention effect across studies (Riley et al., 2011). Sensitivity analyses, subgroup analyses, and meta-regressions can be prespecified to identify causes of heterogeneity, but the potential pitfalls of these approaches should be reported (Brookes et al., 2004; Lambert et al., 2002). Finally, for a random-effects MA, the 95% prediction interval for the underlying intervention effect and the "tau-squared" provide invaluable information to the readers.
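These quantities fit together in a few lines: Cochran's Q yields the DerSimonian-Laird tau-squared and I², which feed the random-effects weights, and the prediction interval then widens the CI by the between-study variance. A sketch with hypothetical study effects follows (the normal quantile 1.96 is used throughout for simplicity; strictly, the prediction interval uses a t distribution with k − 2 degrees of freedom):

```python
import math

def random_effects_meta(effects, ses):
    """DerSimonian-Laird tau^2, I^2, pooled random-effects estimate
    with 95% CI, and an approximate 95% prediction interval."""
    k = len(effects)
    w = [1 / se ** 2 for se in ses]                      # fixed-effects weights
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                   # between-study variance
    i2 = max(0.0, 100 * (q - (k - 1)) / q) if q > 0 else 0.0
    w_re = [1 / (se ** 2 + tau2) for se in ses]          # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    se_pred = math.sqrt(tau2 + se_pooled ** 2)           # prediction interval SE
    pi = (pooled - 1.96 * se_pred, pooled + 1.96 * se_pred)
    return pooled, ci, tau2, i2, pi

# Hypothetical log-odds-ratio effects from four primary studies.
effects = [0.10, 0.30, 0.35, 0.60]
ses = [0.12, 0.15, 0.20, 0.25]
pooled, ci, tau2, i2, pi = random_effects_meta(effects, ses)
```

With these invented data, tau-squared is positive, I² is in the low-to-moderate range, and the prediction interval is wider than the CI, illustrating exactly the distinction the text draws between the precision of the average effect and the spread of the underlying effects.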
Although publication bias is known to endanger the validity of MAs (Dwan et al., 2008) and lead to the further emergence of biasing evidence (Ioannidis and Lau, 2001; Trikalinos et al., 2004), few MAs investigated its presence. A similar evaluation in the field of preclinical studies showed an improvement in the reporting of publication bias (Korevaar et al., 2011). Firstly, publication bias (or the file-drawer problem) must be considered during both the planning and the interpretation of a review article. It should be noted that bias assessments are not always feasible (due to a small number of studies) or reliable (due to large heterogeneity), and that funnel plot asymmetry may exist for reasons other than publication bias (Sterne et al., 2011). As suggested in the Cochrane Handbook (Higgins and Green, 2011), for each MA with at least 10 primary studies, authors should produce a funnel plot and visually assess plot asymmetry or "small-study effects" (Harbord et al., 2006), i.e. the tendency for intervention effects estimated in smaller studies to differ from those estimated in larger studies. Secondly, authors should report the results of predefined statistical tests for publication bias (Begg and Mazumdar, 1994; Egger et al., 1997; Harbord et al., 2006; Peters et al., 2006). Bias identified in any way should be investigated by sensitivity analysis and placed in the context of the MA results and conclusions.
Among the limitations of the present study is the fact that there is still a possibility of missing existing MAs, despite the extensive search performed. In addition, it is also possible that some reporting details and characteristics of the MAs have been discarded during the peer-review process. Finally, it is feasible that a research project may undergo substantial changes between different stages (e.g. protocol, execution, article publication).
Provision of reporting guidelines is a validated means of improving the quality of published material, as the reporting quality of RCTs and MAs has clearly improved (Plint et al., 2006; Al Faleh and Al-Omran, 2009) following the introduction of the CONSORT (Consolidated Standards of Reporting Trials) and QUOROM statements (Moher et al., 1999, 2001). A recent masked randomized trial identified that additional journal peer review based on reporting guidelines resulted in a moderate improvement in manuscript quality, although authors still had difficulties in adhering to high standards of reporting during the writing phase (Cobo et al., 2011). In addition, guidance from the GRADE (Grading of Recommendations Assessment, Development, and Evaluation) approach (Guyatt et al., 2008) or other actions may improve the interpretation of systematic reviews. Regarding the evidence basis of the identified MAs, it should be emphasized that many lacked certain procedures of an extensive systematic review (PLoS Medicine Editors, 2007).

Conclusion
The critical appraisal of MAs in the field of orthodontics suggests that the average quality of MAs could be characterized as low to medium. Although a minimal trend for improvement characterized the last decade, significant flaws were found. Without complete and transparent reporting and appropriate use of statistical methods, it is difficult for readers to assess the validity of MAs or identify MAs with misleading conclusions. The endorsement (or even enforcement) of the PRISMA statement (Liberati et al., 2009) would hopefully improve the conduct and reporting of systematic reviews and consequently of the MAs in orthodontics.

Supplementary material
Additional data are available as a supplement at European Journal of Orthodontics online.