Practice of Epidemiology Estimating the Intracluster Correlation Coefficient for the Clinical Sign “Trachomatous Inflammation—Follicular” in Population-Based Trachoma Prevalence Surveys: Results From a Meta-Regression Analysis of 261 Standardized Preintervention Surveys Carried Out in Ethiopia, Moza

Colin K. Macleod∗, Robin L. Bailey, Michael Dejene, Oumer Shafi, Biruck Kebede, Nebiyu Negussu, Caleb Mpyet, Nicholas Olobio, Joel Alada, Mariamo Abdala, Rebecca Willis, Richard Hayes, and Anthony W. Solomon ∗ Correspondence to Dr. Colin Macleod, Department of Clinical Research, London School of Hygiene & Tropical Medicine, Keppel Street, Bloomsbury, London WC1E 7HT, United Kingdom (e-mail: colin.macleod@lshtm.ac.uk).

Trachoma is a blinding disease caused by infection with the bacterium Chlamydia trachomatis. Ocular infection is mostly found in young children, with repeated infections leading to chronic keratoconjunctivitis (1,2). Over a period of years, immunologically mediated scarring of the eyelid occurs, causing permanent changes in eyelid morphology and misdirection of the eyelashes so that they abrade the front surface of the eye, leading to permanent opacification of the cornea. Standardized clinical signs of trachoma, defined according to the World Health Organization's simplified trachoma grading system (3), are used to provide reproducibility in surveys. In this system, "trachomatous inflammation-follicular" (TF) is defined as the presence of 5 or more follicles, each greater than or equal to 0.5 mm in diameter, in the central part of the tarsal conjunctiva of the upper eyelid. Estimates of the prevalence of TF in children aged 1-9 years are used to guide intervention planning and, in particular, to decide where and for how long to implement annual mass distribution of azithromycin, the antibiotic used to treat trachoma.
From 2012 to 2015, standardized baseline prevalence surveys took place throughout Ethiopia, Nigeria, and Mozambique as part of the Global Trachoma Mapping Project (GTMP), with the aim of identifying districts that needed interventions in a push toward global trachoma elimination. These surveys provided data that have been made available for further analysis, to augment existing knowledge of trachoma epidemiology and to refine future survey protocols for greatest efficiency and accuracy.
Trachoma is found in isolated, socioeconomically deprived rural areas. Population-based prevalence surveys are the gold standard for evaluating its prevalence (4). Although ideally one would select individuals to be examined at random from the target population, so that all residents were equally likely to be selected (simple random sampling), survey costs can be reduced by instead selecting clusters of individuals within geographical locales (cluster sampling). This increases fieldwork efficiency at the expense of the statistical independence of each result. To compensate for the relatedness of individuals within a given cluster, and for the increased variance in estimates that results from the cluster-sampled design, sample sizes must be increased. The parameter used to describe the correlation of results from individuals within a given cluster is known as the intracluster correlation coefficient (ρ), defined as the proportion of total variance accounted for by between-cluster variation. In infectious disease epidemiology, this coefficient is associated with transmission patterns and the natural history of infection and may depend on the particulars of survey design. An accurate estimate of ρ is needed to design future surveys and, in particular, to determine an appropriate sample size.
In this paper, we use parametric bootstrapping to estimate ρ with 95% confidence intervals for each of 261 trachoma prevalence surveys from Ethiopia, Nigeria, and Mozambique. These estimates are then used to conduct a meta-regression analysis with survey-level covariates to explore variation across surveys and to investigate the influence of key factors on ρ.

Sampling design
All surveys were carried out using standardized methodology as part of the GTMP (5). A planned sample size of 1,019 children aged 1-9 years was used to estimate an expected TF prevalence of 10% with a precision of ±3% at the 95% confidence level, using a design effect (the ratio of the clustered sampling variance to simple random sampling variance) of 2.65, the latter being derived from surveys carried out prior to the GTMP.
At the first stage of sampling, primary sampling units (PSUs) were identified in each district. The number of households sampled per PSU (h) was set as that which a single survey team could anticipate being able to sample in 1 working day: 25 in Nigeria, 30 in Ethiopia, and 32 in Mozambique. The number of PSUs in each survey was then dependent on the mean number of children aged 1-9 years that were expected to be found in each household, $n_H$, with the number of PSUs equal to $1{,}019/(h \times n_H)$. This meant that 24-26 PSUs were planned per survey. Typically, existing census data were used to define the sampling frame for PSUs, the resolution being limited by the population size of the lowest administrative census units in the country. PSUs were villages, groups of villages, or other administrative areas. PSUs were sampled with a probability-proportional-to-size methodology, giving more weight to larger (more populous) PSUs. This provided self-weighting of samples so that, despite the clustered design, each individual in the evaluation unit had (as far as was practically possible) an equal likelihood of being sampled.
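The sample-size arithmetic described above can be checked with a short sketch. It assumes the usual normal-approximation formula for a proportion, inflated by the design effect, and rounds up to whole children and whole PSUs; the rounding rule and the value of 1.4 children per household are illustrative assumptions, not figures taken from the survey protocol.

```python
import math

def planned_sample_size(p, d, design_effect, z=1.96):
    """Children needed to estimate prevalence p to within +/- d (absolute),
    using n = DE * z^2 * p * (1 - p) / d^2, rounded up (assumed rounding rule)."""
    n_srs = z ** 2 * p * (1 - p) / d ** 2
    return math.ceil(design_effect * n_srs)

def planned_psus(n_children, households_per_psu, children_per_household):
    """Number of PSUs = n / (h * n_H), rounded up to a whole cluster."""
    return math.ceil(n_children / (households_per_psu * children_per_household))

# Values from the text: 10% expected TF, +/-3% precision, design effect 2.65.
n = planned_sample_size(p=0.10, d=0.03, design_effect=2.65)
print(n)

# Hypothetical n_H = 1.4 children aged 1-9 years per household, with h = 30.
print(planned_psus(n, households_per_psu=30, children_per_household=1.4))
```

With these inputs the first function reproduces the planned sample size of 1,019, and the hypothetical household value gives a PSU count inside the 24-26 range quoted above.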
At the second stage of sampling, within the PSU, compact segment sampling (Ethiopia and Mozambique) or random-walk sampling (Nigeria) was used to select households for inclusion. In Ethiopia and Mozambique, each PSU was divided into segments of 30 and 32 contiguous households, respectively, so that each household in the PSU belonged to a segment. One segment was then chosen at random by drawing lots. All individuals resident in the households of the chosen segment were visited by the survey team. In Nigeria, using random-walk sampling, a starting point in the center of the PSU was agreed upon and a pen was spun on the ground at that point to identify, in quasirandom fashion, a heading for the survey team to transect. A total of 25 households in that direction were enrolled.
In sampled households, all residents aged ≥1 year were eligible for inclusion, and all consenting individuals were examined for signs of trachoma using the World Health Organization's simplified trachoma grading system (3). For children under age 18 years, consent was obtained from the parent or guardian, and the children themselves gave assent where possible. Data were collected electronically on Android smartphones (Google, Inc., Mountain View, California) (5).

Ethical clearance
The overall GTMP protocol was approved by the ethics committee of the London School of Hygiene & Tropical Medicine. In Ethiopia, the protocol was approved by the ethics committee of each participating regional state. In Mozambique, the protocol was approved by the National Committee on Bio-Ethics and the Provincial Directorate of Health in each province. In Nigeria, the protocol was approved by the National Health Research Ethics Committee. The secondary analyses of anonymized data that underlie this paper were considered by the Ethics Review Committee of the World Health Organization to be exempt from full formal review.

Estimating ρ
The standard equation for the variance of a proportion achieved through simple random sampling (SRS) of N individuals is given by

$$\mathrm{Var}_{SRS}(p) = \frac{\pi(1 - \pi)}{N},$$

where p is the sample proportion of the outcome, π is the true proportion of the outcome in the whole population, and N is the total number of individuals examined. In cluster sampling, the increased variance arising from the clustered design is represented by the design effect (DE), so that

$$\mathrm{Var}_{Cluster}(p) = DE \times \mathrm{Var}_{SRS}(p) = \left[1 + (m - 1)\rho\right]\frac{\pi(1 - \pi)}{nm}.$$

Here n is the number of clusters in the survey and m is the average number of individuals examined per cluster. Hence, nm = N, the total number of individuals examined. Therefore,

$$\hat{\rho} = \frac{\mathrm{Var}_{Cluster}(p)/\mathrm{Var}_{SRS}(p) - 1}{m - 1},$$

where $\mathrm{Var}_{SRS}(p)$ is approximated from the sample proportion as $p(1 - p)/N$. We therefore need to estimate $\mathrm{Var}_{Cluster}(p)$ to calculate $\hat{\rho}$ for a given survey.

Estimating the between-cluster variance in p
We used parametric resampling with replacement (parametric bootstrapping) to estimate $\mathrm{Var}_{Cluster}(p)$. Parametric resampling makes no assumptions about the underlying distribution of the data (16), but the resampling process should mirror, where possible, the sampling strategy that gave rise to the data (17,18).
The data can be represented as a vector of N independent observations, $y_{obs}$. We wish to estimate the variance of the parameter $p(y_{obs})$ by replicating the highest-level sampling strategy used in the surveys. In this secondary analysis of deidentified data sets, the underlying populations of selected clusters were not known, so equal weighting (rather than weighting proportional to size) was used.
For each $\hat{\rho}$ estimate, the following algorithm was used:
1. Determine the number of unique clusters in the survey, n, and sample n clusters randomly with replacement. All children aged 1-9 years examined in these clusters comprise the bootstrap data set $Y^*$. Let i = 1, 2, ..., n.
6. For each survey, $\hat{\rho}$ is then estimated from the bootstrap estimate of $\mathrm{Var}_{Cluster}(p)$, using the expression for $\hat{\rho}$ given under "Estimating ρ."
7. The variance of $\hat{\rho}$ is estimated by replicating steps 1-6 a total of 4,096 times.
In our analysis, bootstrap distributions approximated normal distributions, so 95% confidence intervals were calculated as the 2.5th and 97.5th percentiles of all ordered estimates for a given survey. The overall estimate for each survey was the mean value of these estimates. Bootstrap estimates were resampled 4,096 ($2^{12}$) times to obtain appropriate precision. A total of $4{,}096^2$ replications were carried out for each $\hat{\rho}$ estimate. Estimation was carried out in RStudio (RStudio, Inc., Boston, Massachusetts).
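The resampling procedure can be illustrated with a minimal sketch. It is not the authors' R code: the intermediate steps of the algorithm are not reproduced in this text, so the way the bootstrap proportion is computed (pooled over all resampled children) and the function names `rho_from_variances` and `bootstrap_rho` are assumptions made for illustration. The outer replication used to obtain confidence intervals is omitted.

```python
import random
from statistics import variance

def rho_from_variances(var_cluster, p, n_total, m):
    """rho = (Var_Cluster(p) / Var_SRS(p) - 1) / (m - 1),
    with Var_SRS(p) approximated as p * (1 - p) / N."""
    var_srs = p * (1 - p) / n_total
    return (var_cluster / var_srs - 1) / (m - 1)

def bootstrap_rho(clusters, n_boot=4096, seed=0):
    """clusters: list of lists of 0/1 TF outcomes, one inner list per cluster.
    Returns a single rho estimate from one set of n_boot cluster resamples."""
    rng = random.Random(seed)
    n = len(clusters)
    n_total = sum(len(c) for c in clusters)
    m = n_total / n  # average number of children examined per cluster
    p_obs = sum(sum(c) for c in clusters) / n_total
    p_star = []
    for _ in range(n_boot):
        # Resample n whole clusters with replacement, equal weight per cluster.
        sample = [rng.choice(clusters) for _ in range(n)]
        # Assumed: bootstrap proportion pooled over all resampled children.
        p_star.append(sum(sum(c) for c in sample) / sum(len(c) for c in sample))
    # Spread of the bootstrap proportions estimates Var_Cluster(p).
    return rho_from_variances(variance(p_star), p_obs, n_total, m)

# Hypothetical toy survey: 4 clusters of 5 children with differing TF counts.
toy = [[1, 1, 1, 0, 0], [0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 1, 1, 1, 0]]
print(bootstrap_rho(toy, n_boot=1000))
```

Repeating `bootstrap_rho` with different seeds and taking the 2.5th and 97.5th percentiles of the results would mirror the percentile intervals described above, at a cost of the number of outer replications times `n_boot` resamples.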

Meta-analysis
Next, we conducted a meta-analysis to obtain pooled estimates of ρ across surveys. Pooled estimates were derived using a random-effects model, with survey weights obtained from the intrasurvey variance of each estimate (19,20). Natural log-transformed estimates were used to limit the effects of heteroscedasticity. Heterogeneity across survey estimates was investigated using the Q statistic, subgroup analysis, and meta-regression analyses (21). Random-effects meta-regression models were fitted to estimates using the "metareg" command in Stata 14 (StataCorp LLC, College Station, Texas). The standard error of each estimate was calculated as the difference between the 97.5th and 2.5th centile estimates divided by 3.92, assuming a normal distribution of bootstrap estimates. Estimates of ρ are reported on the original scale by exponentiating the pooled estimates from the model. Design effect estimates at given covariate values were estimated from pooled ρ estimates as $1 + (m - 1)\rho$, with m set as 30 children per cluster. Forest plots were produced in Stata 14.
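As a rough illustration of the pooling step, the sketch below derives a standard error from a 95% interval and pools log-transformed estimates with the DerSimonian-Laird moment estimator. This estimator is swapped in for simplicity and is not necessarily the one used by the Stata command named above; the function names and the input values are hypothetical.

```python
import math

def se_from_ci(lower, upper):
    """Standard error from a 95% interval: (upper - lower) / 3.92."""
    return (upper - lower) / 3.92

def pooled_random_effects(estimates, ses):
    """DerSimonian-Laird random-effects pooling.
    Returns (pooled estimate, its SE, tau^2, Cochran's Q)."""
    w = [1 / s ** 2 for s in ses]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    k = len(estimates)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-survey variance, floored at 0
    w_re = [1 / (s ** 2 + tau2) for s in ses]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    return pooled, math.sqrt(1 / sum(w_re)), tau2, q

# Hypothetical survey-level rho estimates, pooled on the natural-log scale.
log_rho = [math.log(r) for r in (0.02, 0.05, 0.11)]
log_se = [0.30, 0.25, 0.20]  # hypothetical log-scale standard errors
pooled, se, tau2, q = pooled_random_effects(log_rho, log_se)
print(math.exp(pooled), tau2, q)  # back-transform to the original scale
```

Note that an interval whose lower bound is 0 cannot be log-transformed directly, so how such surveys are handled on the log scale would need a separate decision.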
We excluded surveys in which the TF prevalence estimate was less than 2%, in the belief that below this level the data would be too sparse to reliably estimate ρ. We used univariate and multivariable meta-regression techniques to investigate possible sources of heterogeneity between estimates, using the following covariates: TF prevalence, country, mean distance between clusters, mean number of children examined per household, and mean number of children examined per cluster. Covariates were defined using data collected at the time of the survey. For each survey, the average distance between clusters was estimated as the difference between the respective Global Positioning System (GPS) coordinates of each cluster and the centroid GPS coordinates over all clusters, with estimates adjusted for latitude to convert decimal degrees to kilometers. We included this covariate to test the hypothesis that survey areas that covered larger distances were more likely to show a greater variance in TF estimates. We then conducted secondary analyses using ρ estimates stratified by associated covariates.
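The distance covariate can be sketched as follows, assuming an equirectangular (flat-earth) approximation in which one degree of latitude is about 111.32 km and degrees of longitude are scaled by the cosine of the centroid latitude. The constant, the approximation, and the function name are assumptions; the exact conversion used in the analysis is not stated in the text.

```python
import math
from statistics import mean

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude (assumed)

def distances_from_centroid_km(coords):
    """coords: list of (latitude, longitude) pairs in decimal degrees.
    Returns each cluster's approximate distance in km from the centroid."""
    lat0 = mean(lat for lat, _ in coords)
    lon0 = mean(lon for _, lon in coords)
    # Longitude degrees shrink with latitude, hence the cosine adjustment.
    cos_lat = math.cos(math.radians(lat0))
    out = []
    for lat, lon in coords:
        dy = (lat - lat0) * KM_PER_DEGREE
        dx = (lon - lon0) * KM_PER_DEGREE * cos_lat
        out.append(math.hypot(dx, dy))
    return out

# Hypothetical cluster coordinates for one survey.
clusters = [(9.00, 38.70), (9.10, 38.90), (8.95, 38.80)]
print(mean(distances_from_centroid_km(clusters)))
```

The mean of the returned distances gives one survey-level summary; a standard deviation of the same list could be substituted if a spread-based definition is preferred.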
At the time of data collection, recorders entering data into smartphones were required to submit a unique identity code. This allowed the total number of data recorders to be defined for each survey. Because recorders were paired with graders performing clinical trachoma grading, we included this variable to investigate trachoma grader precision or consistency between graders in a given survey.

RESULTS
A total of 380 surveys from Ethiopia, Nigeria, and Mozambique were made available by the respective health ministries. We excluded 111 surveys because their TF prevalence was below the 2% threshold. We further excluded another 8 surveys because they had an estimated ρ value less than 0.0. Thus, 261 surveys were included in the analysis: 162 from Ethiopia, 44 from Mozambique, and 55 from Nigeria (see Web Table 1, available at https://academic.oup.com/aje). All included surveys used a 2-stage cluster sample survey design. All survey data were baseline trachoma prevalence estimates, with none of the surveyed populations having received previous mass azithromycin administration or other specific interventions deployed to reduce active trachoma prevalence by national elimination programs.
The TF prevalence in children aged 1-9 years was reported in the surveys as the mean of all cluster-level proportions. The median TF prevalence in children aged 1-9 years over all surveys was 16.5% (interquartile range, 4.5-27.5; range, 2.0-50.6). The breakdown of survey-level prevalence by country is shown in Web Table 2.

Number of children examined per cluster
The mean number of children aged 1-9 years examined per cluster was 36.6 in Ethiopia, 39.6 in Mozambique, and 69.1 in Nigeria. Full details of the breakdown of cluster sizes by country are shown in Web Table 3.

Number of children examined per household
The number of children examined per household was considered in the analysis because larger households may have an effect on trachoma transmission either through proximity and interpersonal interaction as a direct risk factor or through common exposures, such as the effect of poor community-level access to sanitation (22). The mean number of children aged 1-9 years examined per household was 2.0 in Ethiopia, 2.0 in Mozambique, and 3.1 in Nigeria (Web Table 4).

Initial meta-analysis
The meta-analysis included 261 estimates of ρ for the clinical sign TF in children aged 1-9 years. The region-level estimates across all surveys are shown in Figure 1. Estimates ranged from 0.0002 (95% confidence interval (CI): 0.0000, 0.0008) in a survey in Kano State, Nigeria, to 0.368 (95% CI: 0.348, 0.388) in a survey in the Southern Nations, Nationalities, and People's Region of Ethiopia. The overall pooled estimate for all surveys was 0.051 (95% CI: 0.047, 0.056), although there was a great deal of heterogeneity in ρ between surveys (heterogeneity χ² = 120,000; P < 0.0001).
In the univariate meta-regression analyses, a large proportion of variability across all 261 ρ estimates could be explained by country, TF prevalence, mean distance between clusters, number of recorders used in the survey, number of children examined per household, and number of children examined per cluster (Table 1). A larger ρ estimate was associated with a higher TF prevalence, a larger distance between clusters, a larger number of recorders used in the survey, a smaller number of children examined per household, and a smaller number of children examined per cluster. Estimates were generally highest in Ethiopia and lowest in Nigeria. The multivariable meta-regression analyses aimed to explain the heterogeneity between surveys, accounting for survey-level differences in associated variables. The country covariate was included in the model a priori. When controlling for all variables in the model, only country, TF prevalence category, and cluster size were associated with ρ (P < 0.001), explaining 69.8% of the variability. Ethiopia was independently associated with higher estimates (β = 2.39 (95% CI: 1.85, 3.07); P < 0.001), with no meaningful difference between Mozambique and Nigeria (P = 0.934). The "number of children examined per household" covariate was not included in the final model because of collinearity with the "number of children examined per cluster" covariate. The "number of recorders used per survey" covariate was not included because of collinearity with the country covariate (the Ethiopia and Nigeria surveys were perfectly collinear with number of recorders <5 and number of recorders ≥20, respectively). The final multivariable model accounted for 69.2% of the variance in estimates (Table 1).

DISCUSSION
In general, the intracluster correlation coefficient or the design effect is poorly represented in the public health literature. Individual survey clustering estimates exist (24-27), but we have found only 1 other paper that covered clustering estimates derived from surveys carried out in multiple countries (28). We believe this to be the first time that estimates of ρ from standardized infectious disease surveys conducted internationally have been published together.
Surveys of a particular infectious disease are not always standardized, and as a result it has not previously been possible to amass large numbers of comparable pooled estimates of ρ in a single analysis. We have therefore had an opportunity to augment existing knowledge in a way that was not possible for trachoma prior to the implementation of the GTMP. We found marked heterogeneity in survey ρ estimates, and we explored possible sources of that heterogeneity which may be of use in planning future work.
Footnotes to Table 1: c Exponentiated meta-regression coefficient. d Full meta-regression model adjusting for country, number of children examined per cluster, and prevalence of TF in children aged 1-9 years (P < 0.0001). 69.2% of the variance in the ICC was explained by the full model. e TF in children aged 1-9 years, the primary clinical sign associated with ocular Chlamydia trachomatis infection used to guide intervention programs under current World Health Organization guidelines (23). f Estimated as the number of unique recorder identification codes used in the survey. g Estimated as the square root of the variance of the distance of survey clusters from the geometric center of the GPS coordinates of all survey clusters, converted to kilometers and accounting for latitude.
In 1996, the World Health Organization targeted trachoma for elimination as a public health problem by the year 2020 (29). This was defined, in part, as an estimated TF prevalence in children aged 1-9 years of less than 5% in each formerly endemic district. An important aspect of validating that this goal has been reached is confidence in the method by which prevalence has been measured. Given the marked effect that the ρ estimate has on sample-size planning, it is crucial to have accurate estimates of its value. We have shown that ρ decreases sharply at low TF prevalences, and so with the same absolute precision, accurate estimates of TF can be made using smaller sample sizes as the anticipated elimination endpoint approaches.
The converse of this statement is that for a given sample size, with increasing TF prevalence, the precision of a given estimate decreases. In trachoma elimination, the crucial TF thresholds are 5%, 10% and 30%: Where TF prevalence is less than 5.0%, azithromycin mass drug administration (MDA) is not indicated; where it is 5.0%-9.9%, a single round of MDA is recommended before resurvey; where it is 10.0%-29.9%, 3 annual rounds of MDA are recommended before resurvey; and where it is 30.0% or more, 5 annual rounds of MDA are recommended before resurvey. The required performance of a survey methodology for providing estimates around these thresholds depends on the implications of erroneous categorization to the population involved. Incorrect categorization may have significant implications around the 10% threshold, for example, where the cost difference between implementing 1 and 3 years of MDA and the political effect of delaying repeat surveys may each be substantial.
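The threshold rules above translate into a simple lookup, sketched here to show how an interval estimate that straddles a threshold leads to ambiguous categorization. The function names are hypothetical and the example interval is invented for illustration.

```python
THRESHOLDS = (5.0, 10.0, 30.0)  # TF prevalence thresholds, in percent

def mda_rounds(tf_prevalence_pct):
    """Annual MDA rounds recommended before resurvey, per the rules in the text."""
    if tf_prevalence_pct < 5.0:
        return 0
    if tf_prevalence_pct < 10.0:
        return 1
    if tf_prevalence_pct < 30.0:
        return 3
    return 5

def thresholds_within_interval(lower_pct, upper_pct):
    """Thresholds lying inside a confidence interval; a non-empty result means
    the survey cannot cleanly separate the adjacent treatment categories."""
    return [t for t in THRESHOLDS if lower_pct < t <= upper_pct]

# Hypothetical survey: TF 9.0% (95% CI: 6.5, 11.5).
print(mda_rounds(9.0), thresholds_within_interval(6.5, 11.5))
```

In this invented example the point estimate implies 1 round of MDA, but the interval includes the 10% threshold, which is the situation in which the cost of misclassification discussed above becomes relevant.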
On univariate analysis, there was a suggestion that using fewer data recorders in a given survey was associated with greater concordance of cluster-level TF estimates, and so decreased ρ. However, this variable was not retained in the full multivariable model with the country variable included. It is possible that there was not enough variability in recorder numbers within countries to obtain accurate estimates independent of the overall country variable. From the data, it can be inferred that local logisticians used different field team deployment strategies for completing large numbers of surveys in a given area. One strategy was to use a single data recorder (and, generally, a single accompanying trachoma grader) for a whole survey, so that the individual worked in all clusters in the evaluation unit: If 26 clusters were required, the survey would take 26 team-days of fieldwork for that recorder and his or her trachoma grader. This strategy was used in the majority of surveys in Nigeria and Mozambique. The strategy at the other extreme would be to send 26 data recorders (and their accompanying graders) to 1 cluster each, so that the survey could in theory be completed in a single calendar day (still incorporating 26 team-days of fieldwork). The strategy used in Ethiopia was closer to this model. Intuitively, the trade-off between these strategies is the trade-off between accuracy and precision. One team might be inaccurate, but if so it might be reliably inaccurate and therefore give precision to estimates (and concordance between results). The mean of the cluster-level TF proportions might not necessarily be close to the true population estimate. On the other hand, multiple teams contributing to a single survey could all be inaccurate, but the mean of the cluster-level proportions derived from many hands might (or might not) be closer to the true population-level estimate of disease prevalence. 
Although the number of recorders was not included in the final model in this analysis, it is possible that this could be considered as a variable in future analyses.
A limitation of this analysis in guiding future surveys is that, in the populations surveyed here, interventions against active trachoma will have been deployed in districts with a TF prevalence of at least 5% before impact surveys are conducted. The degree to which the preintervention epidemiology of trachoma is representative of its postintervention epidemiology is unclear, as the varying interventions may have varying impacts on the epidemiology of the underlying disease. Equally uncertain is whether these data will be externally applicable in countries yet to complete baseline trachoma mapping of suspected trachoma-endemic districts.
Overall, we found large variation in ρ estimates between surveys, and so we recommend that ρ estimates used for planning future surveys be conservative. In other words, overestimating the assumed value of ρ would be epidemiologically prudent.
It is hoped that these data can be used to guide future trachoma programs to aid elimination efforts. However, for programmatic use, the design effect is a more commonly cited parameter than ρ, as it is more intuitively useful for program managers, being the factor by which a simple random-sampling sample size should be multiplied to provide equivalent precision in a cluster random sample. Using equation 1, our analyses suggest that when carrying out surveys with more than 30 children examined per cluster, a design effect greater than 2.6 should be used when a TF prevalence close to 5% is expected, a design effect greater than 3.6 should be used when a TF prevalence close to 10% is expected, and a design effect greater than 5.0 should be used when a TF prevalence close to 30% is expected.
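The relationship between the design effect and ρ can be made concrete with a small sketch. It uses DE = 1 + (m - 1)ρ with m = 30 children per cluster, the value used for design effect estimates in the Methods; the back-calculated ρ values are simple arithmetic on the recommended design effects and are not figures reported by the authors.

```python
def design_effect(rho, m):
    """Design effect for average cluster size m: DE = 1 + (m - 1) * rho."""
    return 1 + (m - 1) * rho

def rho_from_design_effect(de, m):
    """Inverse of the relationship above: rho = (DE - 1) / (m - 1)."""
    return (de - 1) / (m - 1)

M = 30  # children examined per cluster

# Overall pooled rho reported in the Results.
print(round(design_effect(0.051, M), 3))

# rho implied by each recommended design effect at m = 30.
for expected_tf, de in ((5, 2.6), (10, 3.6), (30, 5.0)):
    print(expected_tf, round(rho_from_design_effect(de, M), 3))
```

Because the design effect grows linearly with cluster size for a fixed ρ, surveys that examine substantially more than 30 children per cluster would need a correspondingly larger design effect under this formula.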