Sample size in orthodontic randomized controlled trials: are numbers justified?

SUMMARY Sample size calculations are advocated by the Consolidated Standards of Reporting Trials (CONSORT) group to justify sample sizes in randomized controlled trials (RCTs). This study aimed to analyse the reporting of sample size calculations in trials published as RCTs in orthodontic speciality journals. The performance of sample size calculations was assessed and calculations were verified where possible. Related aspects, including number of authors; parallel, split-mouth, or other design; single- or multi-centre study; region of publication; type of data analysis (intention-to-treat or per-protocol basis); and number of participants recruited and lost to follow-up, were considered. Of 139 RCTs identified, complete sample size calculations were reported in 41 studies (29.5 per cent). Parallel designs were typically adopted (n = 113; 81 per cent), with 80 per cent (n = 111) involving two arms and 16 per cent having three arms. Data analysis was conducted on an intention-to-treat (ITT) basis in a small minority of studies (n = 18; 13 per cent). According to the calculations presented, a median of 46 participants was required to demonstrate sufficient power to highlight meaningful differences (typically at a power of 80 per cent). The median number of participants recruited was 60, with a median of 4 participants lost to follow-up. Our findings indicate good agreement between projected numbers required and those verified (median discrepancy: 5.3 per cent), although only a minority of trials (29.5 per cent) could be examined. Although sample size calculations are often reported in trials published as RCTs in orthodontic speciality journals, presentation is suboptimal and in need of significant improvement.


Introduction
Randomized trials are considered the gold standard for assessment of the efficacy and safety of interventions and are established in the orthodontic literature. The precedence of randomized studies relates to their potential to limit bias and confounding effects, which are more likely in observational studies and non-randomized trials. However, randomized trials are expensive, as well as being time and labour intensive; it is imperative, therefore, that studies have adequate power to demonstrate a clinically important treatment difference if such a difference exists, while correctly concluding that no difference exists when that is the case (Machin et al., 1997; Schulz and Grimes, 2005). Notwithstanding this, unjustifiably large studies carry additional expense, risk wasting resources, and may even be unethical, as patients can be unnecessarily exposed to potentially ineffective therapy.
Studies with small sample sizes tend to be less reliable and are more likely to be inconclusive due to inadequate statistical power (Freiman et al., 1978; Altman, 1980; Wooding, 1994; Halpern et al., 2002). There is a close relationship between power and sample size; as sample size increases, study power also rises. Recruitment of an appropriate sample involves a trade-off between power, feasibility of the study, ethics, and credibility of the findings. In view of these issues, it is recommended practice to include a sample size calculation and justification in both research protocols and in reports of randomized trials (Schulz et al., 2010a, b).
Power calculations should be considered during the design stage of clinical trials, being of little value after the trial is conducted. At the end of the trial, the precision of the estimates may instead be assessed by observing the width of the associated confidence intervals. The chief components of the calculation are power (typically 80-90 per cent), type I error or alpha (usually 0.01 or 0.05), assumptions in the control group (including mean response and variance), and the expected treatment effect. Assumptions pertaining to the control group are usually predetermined based on published results or piloting, with the expected treatment effect corresponding to a clinically meaningful difference. However, incorrect assumptions about expected treatment effects and their variance at the design stage may lead to insufficient power (Vickers, 2003; Schulz and Grimes, 2005). A study designed to detect a clinically important difference with a power of 80 per cent assumes an 80 per cent chance of correctly observing a difference if such a difference exists. Conversely, it also assumes a 20 per cent chance of failing to identify a difference (false negative) when such a difference does exist. Allowing a small likelihood (10-20 per cent) of a false-negative outcome (type II error or beta) is unavoidable, because guaranteeing 100 per cent power would necessitate an infinite number of participants. Type I error (or alpha) refers to false-positive results and indicates a willingness to accept, for example, a 5 per cent (P = 0.05) chance of observing a statistically significant difference when no such difference exists between the treatment groups.
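To illustrate how these components combine, the standard normal-approximation formula for a two-arm parallel trial comparing means of a continuous outcome is n per group = 2(z1-alpha/2 + z1-beta)^2 * (sigma/delta)^2. A minimal sketch follows; the effect size of 5 units and standard deviation of 10 are hypothetical illustrative values, not figures taken from the reviewed trials:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-arm parallel trial
    comparing means of a continuous outcome (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided type I error (alpha)
    z_beta = z.inv_cdf(power)           # complement of type II error (beta)
    return ceil(2 * (z_alpha + z_beta) ** 2 * (sd / delta) ** 2)

# Detecting a 5-unit mean difference, assuming SD 10, alpha 0.05, power 80%
print(n_per_group(delta=5, sd=10))              # 63 per arm
# Raising power to 90% increases the required sample
print(n_per_group(delta=5, sd=10, power=0.90))  # 85 per arm
```

Note how halving the assumed effect size would quadruple the required sample, which is why optimistic effect-size assumptions at the design stage so easily produce underpowered trials.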
The aim of this study was to assess the quality of reporting of sample size calculations in trials published as RCTs in eight leading orthodontic speciality journals, to ascertain the number of participants typically recruited to clinical trials in orthodontics, to assess the accuracy of calculations, and to identify factors associated with correct performance of sample size calculations.

Materials and methods
Reports of randomized controlled trials published in eight leading orthodontic journals during a 20-year period were considered. Relevant articles were hand-searched from the chosen journals, and two authors (DK, PSF) screened the titles and abstracts of potentially relevant articles. Randomized trials of a range of designs, including those with two or more arms and parallel group, factorial, cluster, and crossover designs, were included. Follow-up studies of earlier reports were omitted, however.
The full reports and any supplementary material for all selected papers were accessed. A standard data collection form was piloted by two authors (DK, PSF) on 10 selected papers. Details of the a priori sample size calculation, as reported in the Materials and methods section, were recorded. In particular, whether a sample size calculation had been conducted was noted. If one had, the following were recorded: the target sample size, number of participants recruited, number of participants lost to follow-up, type of analysis (intention-to-treat or per-protocol basis), and details of the power calculation, including power, type I error, assumptions in the control group (standard deviation for continuous outcomes, and proportion of events for dichotomous and time-to-event outcomes), and the expected treatment effect (mean difference for continuous outcomes, and difference in the proportion of events in the treatment group for dichotomous and time-to-event outcomes). Additional general characteristics, including journal, continent of publication, number of authors, study design, and number of research centres and arms, were also recorded.

Statistical analysis
Descriptive statistics were obtained for the total number of articles identified in each journal, location, number of researchers, and other characteristics, as well as the conduct of a sample size calculation in individual studies. Chi-square and Fisher's exact tests were used as required to test the association of trial characteristics, such as journal, continent of publication, number of authors, design, number of research centres and arms, significance of results, and use of intention-to-treat (ITT) or other analysis, with sufficient reporting or otherwise of sample size calculation details. No further statistical analyses were undertaken because of the small number of reports with adequate sample size calculation details. All statistical analyses were conducted with statistical software (Stata 12.1, StataCorp, College Station, Texas, USA).

Results
A total of 139 eligible RCTs were identified in eight leading orthodontic speciality journals (Table 1, Figure 1). The highest proportion of RCTs was published in AJODO (n = 61; 44 per cent), with the bulk of the remaining studies published in JO (n = 23; 17 per cent), EJO (n = 21; 15 per cent), and ANGLE (n = 18; 13 per cent). The majority of studies were undertaken in a single centre (n = 99; 71 per cent); most were published by European researchers (n = 97; 70 per cent), with parallel designs predominating (n = 113; 81 per cent). Eighty per cent (n = 111) of studies involved two arms and 16 per cent (n = 22) had three arms, whereas only 4 per cent had more than three groups. Statisticians or methodologists were involved in authorship in 26 studies (19 per cent), with a slight majority of studies reporting statistically significant main outcomes (n = 79; 57 per cent). Data analysis was conducted on an ITT basis in a small minority of studies (n = 18; 13 per cent). In half of the analysed RCTs (n = 70), it was unclear whether ITT or per-protocol analyses were intended. Sufficient information to permit verification of the sample size calculation was provided in only 41 trials (29.5 per cent) (Table 2, Figure 2). A power of 80 per cent was used in most of these studies (n = 21; 51 per cent), whereas 90 per cent power was pre-specified in 15 studies (37 per cent). Continuous outcomes predominated in these studies (n = 34; 83 per cent), with relatively few having either time-to-event (n = 5) or categorical (n = 2) outcomes (Table 2). Based on the complete calculations presented in these 41 RCTs, a median of 46 participants was required to demonstrate sufficient power; however, a median of 60 participants was recruited in each study, suggesting over-recruitment of 30.4 per cent to offset attrition. Overall, the median number of participants lost to follow-up was four (Table 3). The cut-off point for statistical significance (alpha) was set at 0.05 in all of these trials.
Sample size calculations were repeated in the 41 studies where this was possible in order to assess their accuracy (Tables 4 and 5, Figure 3); an overall median discrepancy of 5.3 per cent (per cent standardized difference) between presented and recalculated sample sizes was found. Percentage differences between the presented and recalculated sample sizes ranged from -93.3 to 60.6 per cent. Studies in which data analysis was undertaken explicitly on an intention-to-treat or per-protocol basis showed similar discrepancies of 6.1 and 8.7 per cent, respectively. Sample size calculations were more accurate in studies with 80 per cent power (discrepancy: 4.5 per cent) than in those reported to have either 85 per cent (45 per cent) or 95 per cent (-30.0 per cent) power.

Discussion
The Consolidated Standards of Reporting Trials (CONSORT) group is clear that accurate and transparent reporting of the specifics of sample size estimation for RCTs is essential (Schulz et al., 2010a, b). This study is the first to analyse the execution of sample size calculations in orthodontic journals. The overall results were disappointing, with most studies (70.5 per cent) failing to present a calculation complete enough to be verified. For the minority of trials (29.5 per cent) in which sample size recalculation was feasible, there was good overall agreement between recruited and required samples. However, this leaves considerable room for speculation about the remaining 70.5 per cent of trials. Our results are comparable to those of similar research conducted in six general medical journals, which demonstrated correct performance of sample size calculations in 34 per cent of studies with sufficient data present to allow their replication (Charles et al., 2009). However, a similar study on surgical publications alluded to sample size estimation in just 19 per cent of studies (Ayeni et al., 2012). It is therefore important that sample size calculations are presented more thoroughly, both in dentistry generally and in orthodontics. Failing that, if sample size calculations continue to be unclear or unreliable, it has been suggested that their use should be discontinued (Bacchetti, 2002).
According to the complete sample size calculations obtained, the median number of participants required in orthodontic research studies is 46. This number of subjects is generally realistic and achievable. It is impossible to speculate whether researchers genuinely arrived at these figures based on valid assumptions in the control group and truly important standardized differences. Clearly, however, there is a temptation for researchers to deduce a figure that is realistic, of reasonable cost and feasibility, yet of sufficient credibility to warrant publication of the research. It is known that sample size assumptions can be doctored when planning research studies by retrofitting the effect sizes to the available sample; this technique has been referred to as the 'sample size samba' (Schulz and Grimes, 2005). A limitation of the present study is that outcomes were based on research reports in isolation. No attempts were made to liaise with researchers to ascertain whether assumptions had been manipulated to produce realistic sample sizes. Consequently, the results presented equate to a 'best-case scenario', with manipulation of sample sizes impossible to identify with the design used. Alterations in sample size calculations can also be evaluated following completion of a study by comparing final publications to published protocols. Attempts were made to identify protocols for published studies; however, these were rarely accessible, as is often the case with the orthodontic literature (Benson, 2011). This study confirms that sample size reporting is occurring more frequently, with just 4 per cent of reports alluding to a calculation in 1980 (Meinert et al., 1984). However, the present study demonstrates that calculations typically lack the requisite information to permit replication. The magnitude of miscalculation shown in the complete reports was small.
Although not specifically explored in the current study, common reasons for discrepancies may relate to incorrect statistical handling of nested designs (for example, multiple observations on multiple teeth within patients), which may be susceptible to clustering effects, with outcomes more closely matched within clusters than between them (Kerry and Bland, 1998). Similarity within clusters decreases the amount of unique information compared with observations obtained without clustering; the required sample size in clustered designs increases accordingly (Kerry and Bland, 1998). The increase in sample size required in cluster-randomized designs can be determined using the design effect, which is related to the intra-class correlation coefficient (ICC) according to the following formula: D = 1 + (m − 1)ρ, where m is the number of observations per cluster and ρ is the ICC. Higher ICC values require a larger sample size in a clustered trial to maintain similar levels of power. Accurate sample size calculations for clustered designs require information relating to either the within-cluster correlation (ICC) or the between-cluster variability (coefficient of variation); however, this information is usually lacking.
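The design-effect adjustment above can be sketched in a few lines; the baseline sample size of 126, cluster size of 4 teeth per patient, and ICC of 0.1 below are illustrative assumptions, not values drawn from the reviewed trials:

```python
from math import ceil

def design_effect(m, icc):
    """Design effect D = 1 + (m - 1) * rho for m observations per cluster
    with intra-class correlation coefficient rho."""
    return 1 + (m - 1) * icc

def clustered_n(n_individual, m, icc):
    """Inflate an individually randomized sample size by the design effect."""
    return ceil(n_individual * design_effect(m, icc))

# Suppose 126 observations suffice under individual randomization; with
# 4 teeth observed per patient and an ICC of 0.1, D = 1 + 3 * 0.1 = 1.3,
# so roughly 164 observations are needed to preserve the same power.
print(round(design_effect(4, 0.1), 2))  # 1.3
print(clustered_n(126, 4, 0.1))         # 164
```

An ICC of zero recovers the individually randomized sample size, while an ICC of one means each extra observation within a cluster adds no new information, which is why ignoring clustering in nested orthodontic data overstates the effective sample size.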
ITT analysis is increasingly advocated in randomized trials, allowing the benefits of randomization to be preserved throughout the trial. The alternative approach is per-protocol analysis, in which participants who drop out are excluded from the analysis. The latter approach is likely to represent a best-case scenario and risks attrition bias due to uneven loss to follow-up. Consequently, ITT principles are advocated for tests of the effectiveness of interventions in real-world, pragmatic studies. The present study revealed that ITT analysis was clearly described and undertaken in only a minority of studies (13 per cent). Furthermore, although ITT analysis was referred to in a number of other studies, closer scrutiny of the flow of participants contradicted this assertion. This finding is in keeping with research on medical journals (Gravel et al., 2007), which found that although a much greater percentage of studies (62 per cent) reported using ITT, a significant proportion of these violated ITT principles. Erroneous handling of dropouts in studies labelled as using ITT analyses ranged from 10 (Hollis and Campbell, 1999) to 58 per cent (Kruse et al., 2002). It is, therefore, important that data analysis is pre-specified at the protocol stage and undertaken and reported accordingly, allowing a more informed discussion of the implications of research findings. The present study has exposed a divide between the emphasis placed on sample size calculations by journals, ethical review committees, and funding agencies on the one hand, and actual practice on the other. It appears that sample size calculations are reported less often than should be the case and, when presented, are sometimes inaccurate and often outlined in insufficient detail to permit verification. It is, therefore, important that, if the emphasis on sample size estimation is to continue, peer reviewers and editors be encouraged to scrutinize whether calculations are presented completely.
To facilitate this, publishers could make electronic software available to permit quick and simple replication of calculations to test their veracity. Unless sound practices are encouraged, underpowered trials will continue to be published. Although such studies may be combined in meta-analyses, there remains a body of opinion that underpowered studies are unethical and may lead to incorrect interpretation of results, particularly when clinical importance is confused with statistical significance. Consequently, the importance of improving the current methods of sample size estimation in published research is clear. Failing that, alternative approaches to sample size estimation should be considered.
Finally, it should be noted that conclusions from this study should be viewed with caution, as the proportion of RCTs providing enough data for confirmation of sample size calculations was low. Nevertheless, it should be borne in mind that RCTs are often of better methodological quality than non-randomized trials and other studies; it is therefore reasonable to speculate that a large body of existing evidence may be founded on incorrectly powered studies.

Conclusions
Although sample size calculations are often reported in trials published as RCTs in orthodontic speciality journals, complete presentation is suboptimal. Consequently, reporting of sample size estimation is in need of significant improvement.