Abstract

The neonatal mortality rate in Rwanda remains above the United Nations Sustainable Development Goal 3 target of 12 deaths per 1000 live births. As part of a larger effort to reduce preventable neonatal deaths in the country, we conducted a study to examine risk factors for low birthweight. The data were collected via a cost-efficient cluster-based outcome-dependent sampling (ODS) scheme wherein clusters of individuals (health centers) were selected on the basis of, in part, the outcome rate of the individuals. For a given data set collected via a cluster-based ODS scheme, estimation for a marginal model may proceed via inverse-probability-weighted generalized estimating equations, where the cluster-specific weights are the inverse probability of the health center's inclusion in the sample. In this paper, we provide a detailed treatment of the asymptotic properties of this estimator, together with an explicit expression for the asymptotic variance and a corresponding estimator. Furthermore, motivated by the study we conducted in Rwanda, we propose a number of small-sample bias corrections to both the point estimates and the standard error estimates. Through simulation, we show that applying these corrections when the number of clusters is small generally reduces the bias in the point estimates, and results in closer to nominal coverage. The proposed methods are applied to data from 18 health centers and 1 district hospital in Rwanda.

1 Introduction

Neonatal mortality is defined as the probability of dying within the first 28 days of life. While the majority of neonatal deaths are preventable, due to slower progress in neonatal mortality reduction since 1990 compared to the advances made in child mortality reduction, neonatal deaths represented 47% of all under-five deaths in 2018 (WHO, 2019). The United Nations Sustainable Development Goal 3 aims to reduce neonatal mortality to 12 deaths per 1000 live births by 2030 (Hug et al., 2019). In Rwanda, however, the rate of neonatal deaths remains high at 20 deaths per 1,000 live deaths (Rwanda DHS, 2015).

Low birthweight is associated with a higher risk of neonatal death (Lubchenco et al., 1972), strategies that target reduction of low birthweight births may therefore help to reduce neonatal mortality. This motivated a study we recently conducted in Rwanda, with the goal to investigate risk factors for low birthweight in two northern districts of the country. Due to time and resource constraints, it was feasible to visit only 18 of the 44 health centers for collection of the individual-level data. The 18 health centers were sampled based on an outcome-dependent sampling (ODS) design, and in the summer of 2017, the paper's first author traveled to the sampled health centers and collected data on all live births recorded in the maternity registers between April and June 2017.

ODS is an indispensable tool for conducting research in resource-limited settings. Examples of ODS designs include the case-control, nested case-control, case-cohort, and two-phase designs (Prentice and Pyke, 1979; Breslow and Day, 1980; White, 1982; Langholz and Thomas, 1990; Wacholder, 1991). For the most part, the literature on ODS has focused on the setting where individuals are treated as independent, although methods have been recently proposed for longitudinal and cluster-correlated data settings (Neuhaus et al., 2002, 2006, 2014; Schildcrout and Rathouz, 2010; Schildcrout et al., 2013; Haneuse and Rivera-Rodriguez, 2018; Rivera-Rodriguez et al., 2019). However, in both the independent and correlated data settings, the majority of designs proposed involve sampling at the level of the individual (i.e., the study units within a cluster). In some settings, researchers may opt to use readily-available outcome information (either aggregated or at the level of the individual) to perform cluster-based sampling; that is, to sample clusters rather than individuals (Cai et al., 2001). Such sampling may be preferred when, for example, the costs associated with travel to a clinic are high compared to the cost of collecting information on individuals once at the clinic.

For data collected through a cluster-based ODS design, Cai et al. (2001) proposed to carry out estimation for a marginally specified regression model via inverse-probability-weighted generalized estimating equations (IPW-GEE), with the weights taken to be the inverse of the cluster-specific probabilities of being selected by the scheme. Furthermore, they proposed that inference be based on a sandwich estimator for the asymptotic variance. They did not, however, formally establish the asymptotic properties of the estimator for the regression coefficients nor did they provide explicit expressions for the variance that acknowledge the inherent negative correlation among the cluster-specific sampling indicators. In this paper, we resolve these gaps by using results by Xie and Yang (2003) for GEE in the complete data setting to establish the asymptotic properties of the IPW-GEE estimator for cluster-based ODS designs. Through this, we derive an expression for the asymptotic variance and propose a corresponding consistent estimator.

In contrast to Cai et al. (2001) where the focus was on settings with a large number of small clusters (specifically pairs of eyes within the Baltimore Eye Study), our study of low birthweight risk factors in Rwanda consists of a small number of relatively large clusters. That the number of clusters is small may be of concern, specifically in regard to small-sample bias in point estimates (Paul and Zhang, 2014) and undercoverage of confidence intervals based on the usual sandwich estimator (Fay and Graubard, 2001; Kauermann and Carroll, 2001; Mancl and DeRouen, 2001; Pan and Wall, 2002; Morel et al., 2003). To the best of our knowledge, however, no attempts have been made to investigate the extent to which these issues manifest in the cluster-based ODS designs that are the focus of this paper. Furthermore, while small-sample corrections have been proposed in the complete data setting, these have not been adapted to the ODS setting. Therefore, a final contribution of this paper is that we provide expressions for a bias correction to the point estimates, as well as several bias corrections for the variance estimator.

2 Risk Factors for Low Birthweight in Rwanda

The motivating study for this paper is one that we recently conducted on risk factors for low birthweight (<2500g) among facility-based births in two districts in northern Rwanda (Gakenke and Rulindo) between April and June in 2017. Within these districts, pregnant women may receive care at one of K = 44 health centers. While patient records, that include information on the mother, the pregnancy and the infant, are maintained locally at the health centers, aggregated information on a range of health indicators are tallied on a monthly basis by each of the health centers and entered into the Rwanda Health Management Information System (HMIS), a centralized database maintained by the Rwandan Ministry of Health. Individual-level data, however, is not readily-available through the HMIS and must, therefore, be obtained by traveling to a given health center and abstracting the relevant information.

2.1 Sampling Design and Data Collection

At the design phase of this study, for a number of logistical and financial reasons, an early decision was made that individual-level patient data would only be collected from 18 health centers. Toward selecting which health centers to include, health center-specific information on the total number of births (i.e., formula, formula) as well as the (unadjusted) prevalence of low birthweight (i.e., formula) during the 3-month study period was made available by HMIS. Figure 1 provides a scatterplot; across the 44 health centers formula ranged from 24 to 194 while formula ranged from 0 to 0.13.

Scatterplot of health center-specific number of live births and prevalence of low birthweight births between April and June 2017, at each of K = 44 health centers in two districts in northern Rwanda. Relative sizes of the circles are proportional to the probability of selection into the outcome-dependent scheme, . Black shading indicates which of the 44 health centers were ultimately selected
Figure 1

Scatterplot of health center-specific number of live births and prevalence of low birthweight births between April and June 2017, at each of K = 44 health centers in two districts in northern Rwanda. Relative sizes of the circles are proportional to the probability of selection into the outcome-dependent scheme, formula. Black shading indicates which of the 44 health centers were ultimately selected

The information available on formula and formula was then used to form the following selection strategy: (i) the six health centers with the highest prevalence were sampled with probability formula; (ii) the remaining 12 centers were sampled via Poisson sampling on the basis of selection probabilities determined by the model: formula, where formula is the maximum of clinic formulas standardized rank with respect to outcome prevalence and its rank with respect to size (in this case, 44 is the highest prevalence/largest size, 1 is the lowest prevalence/smallest size). The value of θ0 was set to 15/38 to partially control the number of clusters sampled, and the value of θ1 was set to 0.1.

Note, the relative sizes of the circles in Figure 1 are proportional to the value of formula, while the black shading indicates which of the 44 health centers were ultimately selected. Intuitively, this strategy was adopted to balance inflating the prevalence of low birthweight in the sample (i.e., to artificially increase the prevalence of the outcome, a key feature of ODS schemes) with maximizing the total sample size in the subsample used in the analysis. The particular design we used for sampling the health centers was selected based on the results of a simulation study conducted beforehand to compare the efficiency gains of various designs that seek to achieve these two objectives.

After visiting several health centers, we observed that the high-risk deliveries are referred to the district hospitals. In order to have representation of these births in the sample, yet with the constraint of only being able to visit one more location for data collection, we randomly sampled one of the four district hospitals for data collection. Following data abstraction, patient-level information was available on 1635 live facility-based births. The data were restricted to singleton births with complete data on the variables of interest, yielding 1572 observations. Table 1 describes the maternal and newborn characteristics of these births. The low birthweight prevalence is higher among mothers younger than 20 years of age, among mothers who weigh less than 56 kg, and among women in their first pregnancy. The low birthweight prevalence is also higher among mothers with a history of abortion, c-section, or stillbirth. The low birthweight prevalence is slightly higher among female newborns, and the majority of preterm births are also low birthweight births.

TABLE 1

Maternal and newborn characteristics

LBWNot LBWTotal% LBW
Mother's age
<201813515311.8
20–3574106011346.5
36–49172682856.0
Mother's weight
<56 kg4431135512.4
56–59 kg333443778.8
60–64 kg184274454.0
≥65 kg143813953.5
Birth order
15538243712.6
2–3396536925.6
4+154284433.4
HIV status at admission
Positive124254.0
Negative107143315406.9
Unknown16714.3
Previous abortion
Yes10981089.3
No99136514646.8
Previous C-section
Yes14203441.2
No95144315386.2
Previous stillbirth
Yes773808.8
No102139014926.8
District
Gakenke District737197929.2
Rulindo District367447804.6
Sex of newborn
Female627458077.7
Male477187656.1
Preterm birth
Preterm2532889.3
Not Preterm84146015445.4
Total109146315726.9
LBWNot LBWTotal% LBW
Mother's age
<201813515311.8
20–3574106011346.5
36–49172682856.0
Mother's weight
<56 kg4431135512.4
56–59 kg333443778.8
60–64 kg184274454.0
≥65 kg143813953.5
Birth order
15538243712.6
2–3396536925.6
4+154284433.4
HIV status at admission
Positive124254.0
Negative107143315406.9
Unknown16714.3
Previous abortion
Yes10981089.3
No99136514646.8
Previous C-section
Yes14203441.2
No95144315386.2
Previous stillbirth
Yes773808.8
No102139014926.8
District
Gakenke District737197929.2
Rulindo District367447804.6
Sex of newborn
Female627458077.7
Male477187656.1
Preterm birth
Preterm2532889.3
Not Preterm84146015445.4
Total109146315726.9
TABLE 1

Maternal and newborn characteristics

LBWNot LBWTotal% LBW
Mother's age
<201813515311.8
20–3574106011346.5
36–49172682856.0
Mother's weight
<56 kg4431135512.4
56–59 kg333443778.8
60–64 kg184274454.0
≥65 kg143813953.5
Birth order
15538243712.6
2–3396536925.6
4+154284433.4
HIV status at admission
Positive124254.0
Negative107143315406.9
Unknown16714.3
Previous abortion
Yes10981089.3
No99136514646.8
Previous C-section
Yes14203441.2
No95144315386.2
Previous stillbirth
Yes773808.8
No102139014926.8
District
Gakenke District737197929.2
Rulindo District367447804.6
Sex of newborn
Female627458077.7
Male477187656.1
Preterm birth
Preterm2532889.3
Not Preterm84146015445.4
Total109146315726.9
LBWNot LBWTotal% LBW
Mother's age
<201813515311.8
20–3574106011346.5
36–49172682856.0
Mother's weight
<56 kg4431135512.4
56–59 kg333443778.8
60–64 kg184274454.0
≥65 kg143813953.5
Birth order
15538243712.6
2–3396536925.6
4+154284433.4
HIV status at admission
Positive124254.0
Negative107143315406.9
Unknown16714.3
Previous abortion
Yes10981089.3
No99136514646.8
Previous C-section
Yes14203441.2
No95144315386.2
Previous stillbirth
Yes773808.8
No102139014926.8
District
Gakenke District737197929.2
Rulindo District367447804.6
Sex of newborn
Female627458077.7
Male477187656.1
Preterm birth
Preterm2532889.3
Not Preterm84146015445.4
Total109146315726.9

3 Analysis Based on Complete Data

3.1 Marginal Model Specification

Consider the setting in which the scientific question of interest concerns learning about the relationship between some outcome Y and p-vector of covariates, formula (which may include a 1.0 for the intercept). Furthermore, suppose the population of interest is naturally clustered, such as is the case where patients are clustered within health centers. Let K denote the number of clusters and formula the number of individuals in the kth cluster. In this paper, we assume that the scientific question at hand corresponds to an analysis where estimation and inference will be performed with respect to the following marginal mean model for the outcome of the ith individual in the kth cluster as a function of their covariates, formula:

(1)

where formula is a user-chosen link function and formula a p-vector of regression parameters.

3.2 Complete Data Analysis

Given complete data on (formula) for all formula individuals in the K clusters, Liang and Zeger (1986) proposed that estimation of formula can be carried out by solving the following generalized estimating equations:

(2)

where formula = formula, with formula = formula and formula = formula, and with formula = formula denoting the formula matrix of partial derivatives. Finally, formula, indexed by the unknown formula, is an formula working specification for formula. As will become clear below, it will be useful to write formula = formula where formula = formula is an formula matrix, where formula = formula, formula = formula, formula is an formula block-diagonal matrix, with the formula on the diagonal, and formula is the formula matrix obtained by stacking the Kformula matrices.

4 Cluster-Based ODS

In some settings, analysts may not have access to complete data on all elements of (formula) for all N individuals in the K clusters. They may, however, have access to resources that permit ascertainment of this information in a subsample of, say, formula individuals. Furthermore, they may have access to select components of (formula), as well as other variables/information that are not of direct relevance to the scientific question, denoted here by formula, that can, in principle, be used to make decisions regarding the subsampling. Moving forward we refer to this information as being available at the design stage. How researchers choose to make use of the information available at the design stage will depend, in part, on the precise nature of the information as well as on practical/logistical/financial considerations regarding how the otherwise unavailable data elements will be ascertained.

Motivated by the study we conducted in Rwanda, suppose the readily-available information is of the form formula where formula is a cluster-level summary of the outcomes (i.e., across the formula individuals in the kth cluster), formula is a cluster-level feature or a cluster-level summary of elements of formula that are readily-available at the outset, and formula is a cluster-level summary of formula. For example, if Y is binary, as is low birthweight in the study reported on in Section 2, then formula may be the proportion of babies born at the health center in the last 6 months with low birthweight. Furthermore, formula may be a feature of the cluster, such as whether the health center is in a rural or urban setting, and/or it may be an aggregated summary of individual-level data, such as the percentage of mothers who are less than 18 years of age. Finally, formula may be the prevalence of some other outcome or comorbidity that is routinely collected. Note, this data scenario is common in resource-limited settings where detailed information that is recorded and stored locally at health centers is only reported to a centralized agency/ministry after having been aggregated or otherwise summarized (Maokola et al., 2011; Nisingizwe et al., 2014; Haneuse et al., 2015).

The information represented by formula can, in principle, be used to inform a cluster-based ODS design (Cai et al., 2001). For this type of design, rather than selecting individuals directly, some subsample of formula clusters is initially selected. Then, the otherwise unavailable elements of formula are ascertained on all individuals within the sampled clusters. For the contexts we consider, and presented in Section 2, this would correspond to selecting a certain number of health centers, and then having a member of the research team travel to the health center to extract from the local records all relevant information on patients satisfying the study inclusion/exclusion criteria.

Let formula be a binary indicator of whether the kth cluster is selected and formula the corresponding (known) probability of being selected, where formula is the totality of the information available at the design stage. Toward operationalizing the sampling scheme, researchers may opt to use stratified sampling or Poisson sampling. For the first of these, the clusters are cross-classified on the basis of some variable S that is defined on the basis of one or more of the variables contained in formula and is assumed to take on one of J levels. From this, suppose formula clusters are classified as belonging to the jth stratum. We assume that the stratification scheme is specified such that formula for all formula. Then formula clusters are randomly selected from those in the jth stratum, such that formula = formula. Note, for each of the clusters in the jth stratum, we have that formula = formula. For those that are selected in this way we set formula = 1. Under the second option of Poisson sampling, one first prespecifies each of the formula as a function of elements of formula. For example, one could specify a logistic regression model for formula as a function of the clusters' outcome prevalence. Whether a cluster is selected is then determined by an independent Bernoulli trial with probability formula.

We conclude this section with a number of comments. First, we note that a key difference between the two approaches to selecting the subsample of clusters is that formula is predetermined under stratified sampling (and, therefore, fixed), but is random under Poisson sampling. Second, as we elaborate upon below, when the subsampling is based on stratification the formula indicators for clusters within a given stratum are negatively correlated (although they are independent across strata). This, in turn, has implications for the variance of the estimator of formula. Under Poisson sampling, however, since the trials are taken to be independent, the formula are also independent (i.e., both within and between strata). Finally, the framework described above is sufficiently general that formula need not necessarily be used to inform the stratification scheme or the prespecified model for formula. Moving forward we assume, however, that formula will be used at the design stage.

5 Analysis Based on Data from a Cluster-Based ODS Design

When the sampling of the clusters is based, in part, on information on the outcome, the usual generalized estimating equations given by expression (2) are no longer guaranteed to be unbiased (Qaqish et al., 1997). To resolve this, Cai et al. (2001) proposed that formula be estimated as the solution to the following weighted generalized estimating equations:

(3)

Note, following the development of Section 3.2, one can write formula, where formula is an N× N diagonal matrix with diagonal entries equal to the vector formula, with formula a vector of length formula with each element equal to formula. Letting formula denote the formula × 1 vector with all entries equal to formula, formula is the formula vector obtained by concatenating the formula.

Let formula denote the solution to (3). In the remainder of this section, we extend Cai et al. (2001) by: (i) formally establishing the asymptotic properties of formula; (ii) proposing an explicit formula for its asymptotic variance; and, (iii) proposing a suite of methods to correct for small-sample (i.e., when formula is “small”) bias in the point and standard error estimates.

5.1 Asymptotic Properties

To establish the asymptotic properties of formula, we consider the setting where formula while maxformula is bounded above, and assume missingness at random (MAR), in other words that formula, where formula are the individual-level covariate values for which only a summary is available at the design stage.

Detailed arguments in Web Appendix B, which build on those by Xie and Yang (2003), show that formula is consistent for formula, the true value of formula, and that

where formula = formula, formula = formula, formula = formula, and formula. Furthermore, from this we have that the asymptotic variance of formula is

(4)

Finally, formula can be reexpressed as formula, with formula, where formula represents the variance of the complete data estimating equations, and formula the additional variance that arises due to having only complete data on the individuals in the sampled clusters.

5.2 Inference

Inference can be carried out in practice through use of the consistent plug-in estimator for the asymptotic variance of formula:

(5)

where formula, and formula, and formula.

In the expression for formula, formula is an formula block-diagonal matrix where the entries of the kth formula block are all equal to formula, the inverse of the pairwise selection probability for any pair of individuals i and formula in cluster k: formula. In the expression for formula, formula is an formula matrix with all entries in the kth block along the diagonal equal to formula and all entries in the off-diagonal block corresponding to clusters k and formula equal to formula. Given data from a cluster-stratified design, if clusters k and formula are both in stratum j, then formula = formula, while if they belong to different strata j and formula, it follows that formula = formula. Under a Poisson sampling design, formula = formula for individuals from different clusters. In this case, formula is a block-diagonal matrix, where the kth block is an formula matrix with all values equal to formula.

In addition, we propose a second estimator of the variance motivated (in part) by the form proposed by Cai et al. (2001). Toward this, let formula = formula), where formula is an formula-vector with elements formula. We propose that the asymptotic variance of formula be estimated by:

(6)

where formula, with formula=formula. After some algebra, it can be shown that formula, with

where formula is an formula matrix with all entries formula. Thus, expression (6) differs from expression (5) in that the former ignores the impact of negative correlation between the selection indicators. Note that under Poisson sampling, formula.

5.3 Estimation and Inference in Small Samples

As in the complete data setting, estimation and inference for formula based on data from an outcome-dependent cluster-based design may be subject to small-sample bias. Indeed, it may be the norm for studies based on such a design to have a small number of clusters in the data set available for analysis. Motivated by this, we propose bias corrections to both the point estimates, formula, and to formula. For the former, the bias in the point estimates can be expanded to give formula and, building on the work of Paul and Zhang (2014) and Lunardon and Scharfstein (2017), we derive the form of formula, which can be estimated by formula. Since the resulting expressions are quite involved, we leave the details to Web Appendix D. The bias-corrected point estimate is then formula.

We consider four corrections to formula for small-sample settings, adapted from corrections proposed in the complete data setting (see Web Appendices E–G for details). The first is based on a simple “degrees-of-freedom” adjustment, to give formula. The second follows the approach taken by Mancl and DeRouen (2001), and is given by

(7)

where formula and

where formula and formula. The third is a Kauermann and Carroll (2001)-type correction, formula, which is the same as (7), except with formula. The fourth correction adapts the approach taken by Fay and Graubard (2001), and is given by

(8)

where formula and

where formula is a formula diagonal matrix with the formulath element equal to formula, and formula.

Finally, we make two additional comments. First, if the data arise from a cluster-stratified design then one could operationalize these corrections for the estimator of the variance that ignores the negative correlation in the sampling indicators (i.e., for formula given in Section 5.2), by dropping the formula term in the expressions presented. Second, while the bias-corrected variance estimators formula, formula, and formula are proposed for formula, the same form could be adopted in practice using the bias-corrected point estimates formula instead.

6 Simulation Study

To evaluate the operating characteristics of the methods proposed in Section 5, we performed two sets of simulation studies. The first set was designed using characteristics of the birth data set from Rwanda. The second was designed to evaluate the operating characteristics of the methods in a more general setting.

6.1 Data Generation

For the first simulation study, we suppose that interest lies in the following marginal logistic regression model for a binary response for the ith individual in the kth cluster:

where formula is a binary individual-level variable with a cluster-specific prevalence, formula, formula is a continuous cluster-level variable drawn from a Uniform[−2,2] distribution, and formula = (−1.6, 0.5, 0.3). The value of β0 was chosen so that the prevalence of the outcome is around 21%. Complete data sets with K = 44 clusters were generated with the same size distribution as the 44 health centers in the Rwanda study.

For the second set of simulations, we suppose that interest lies in the model:

with formula a binary individual-level variable with the same distribution as in Simulation 1, formula a continuous cluster-specific variable drawn from a N(0.2, 0.052) distribution, and formula a binary cluster-level variable with prevalence 0.20. Finally, formula is an individual-level variable generated from a Normal distribution with cluster-specific mean formula and a variance of 1.0. We set formula = (−3.1, 0.5, 0.5, 0.5, 1), where β0 was chosen so that the overall prevalence was about 13%. Complete data sets were generated with a total of K = 200 clusters with equal cluster sizes of formula = 40.

In all simulations, the true correlation structure was exchangeable; we induced correlation among the observations using the GenBinaryY() function in the MMLB package for R, which implements a method for marginally specified logistic-Normal models (Heagerty, 1999). Briefly, the latter permits analysts to specify a marginal mean for the response while inducing within-cluster correlation via cluster-specific random intercepts that are taken to arise from a formula distribution. For Simulation 1, we varied formula, and for Simulation 2, we set formula. The correlation parameter, α, was estimated for each of these scenarios by fitting the model of interest to the generated complete data sets using a working exchangeable correlation structure; this resulted in an estimated correlated parameter of formula for Simulation 1 and formula for Simulation 2.

6.2 Sampling Designs

For each of the simulation scenarios we generated 10,000 “complete” data sets. For each of these, we considered three cluster-based designs for selecting formula clusters: (#1) simple random sampling of the clusters; (#2) an outcome-dependent cluster-stratified design, where clusters were stratified on formula, the qth quantile of the distribution of the outcome prevalence across the clusters (with formula in Simulation 1 and formula in Simulation 2); and, (#3) an outcome-dependent cluster-stratified design where in Simulation 1 clusters were stratified on the basis of formula and formula, where formula = I(formula), while in Simulation 2 clusters were stratified on the basis of formula and X4. Both designs (#2) and (#3) were “balanced” with respect to the stratification; an equal number of clusters was sampled from each stratum, with the exception of the cases in which formula could not be evenly divided by the number of strata. In these cases, more clusters were sampled from the strata with higher outcome prevalence (formula = 1). In Simulation 1, we considered designs with formula, and in Simulation 2 we considered formula.

6.3 Analyses

For each data set generated through the processes described in Sections 6.1 and 6.2, we computed three estimates of formula: formula, based on (naïvely) solving the unweighted estimating equations given in Section 3.2 for the formula sampled clusters; formula, based on solving the weighted estimating equations given at the start of Section 5; and, the bias-corrected estimator, formula, defined in Section 5.3. Note, under simple random sampling of the clusters, formula and formula will be numerically the same. Furthermore, we note that the working independence correlation structure was adopted in the specification of formula when computing each of these estimators. The reason for this is that if formula includes at least one covariate that varies across units within a cluster (as will almost always be the case), the complete data estimating equations, given by expression (2), are not guaranteed to be unbiased unless working independence is adopted (Pepe and Anderson, 1994). We ran one scenario (Simulation 1, formula) with working exchangeable as well, in order to investigate the degree to which using a working correlation structure other than working independence affects the results.

To evaluate the proposed approaches to inference we focus attention on formula as a point estimate for formula, as it generally exhibits little-to-no bias even when formula is as small as 12 or 15. We then calculated six estimates of formula, and thus the standard errors, using the methods described in Sections 5.2 and 5.3. The first three were: formula, the unadjusted estimator; the degrees-of-freedom adjusted estimator, formula; and the Mancl and DeRouen-type estimator, formula. Another three were as these but ignoring the negative correlation in the sampling indicators: formula, formula, and formula. In Simulation 1, we also computed the Kauermann and Carroll-type estimator, formula/formula and the Fay and Graubard-type estimator, formula/formula, as well as all of the above variance estimates using the uncorrected point estimates, formula, in place of formula. The final approach toward carrying out inference that we considered was a smoothed bootstrap approach adapted from the work of Li and Wang (2008), with details given in Web Appendix H. Using each of the standard error estimates, we constructed 95% confidence intervals using both the normal distribution and the formula distribution, estimating coverage as the proportion of iterations in which the constructed confidence intervals contained the true parameter values.

6.4 Results: Point Estimation

Figures 2 and 3 summarize the mean absolute bias in the point estimates (across the 10,000 such estimates) for each of the three sampling designs, under the scenarios considered in Simulation 1 and Simulation 2. Based on these results, we make the following observations:

The absolute bias in the mean of the unweighted point estimates , weighted uncorrected point estimates , and the bias-corrected weighted point estimates  in Simulation 1, as a function of , under designs #1, #2, and #3. The degree of correlation was determined by . The true parameter values are given by ) = (–1.6, 0.5, 0.3). Under design #1,  and  are equivalent
Figure 2

The absolute bias in the mean of the unweighted point estimates formula, weighted uncorrected point estimates formula, and the bias-corrected weighted point estimates formula in Simulation 1, as a function of formula, under designs #1, #2, and #3. The degree of correlation was determined by formula. The true parameter values are given by formula) = (–1.6, 0.5, 0.3). Under design #1, formula and formula are equivalent

The absolute bias in the mean of the unweighted point estimates , weighted uncorrected point estimates , and the bias-corrected weighted point estimates  in Simulation 2, as a function of , under designs #1, #2, and #3. The degree of correlation was determined by . The true parameter values are given by ) = (−3.1, 0.5, 0.5, 0.5, 1). Under design #1,  and  are equivalent
Figure 3

The absolute bias in the mean of the unweighted point estimates formula, weighted uncorrected point estimates formula, and the bias-corrected weighted point estimates formula in Simulation 2, as a function of formula, under designs #1, #2, and #3. The degree of correlation was determined by formula. The true parameter values are given by formula) = (−3.1, 0.5, 0.5, 0.5, 1). Under design #1, formula and formula are equivalent

First, in general, formula exhibits bias except, as is to be expected, under simple random sampling when formula is relatively large. Second, for the two outcome-dependent schemes, the weighted estimator formula generally exhibits less bias than formula. While there appear to be some exceptions to this (see Figure 2, and Figures 3f, h, and i), the extent of bias is very small and it is unclear if any meaning can be attributed to the relative ordering.

Third, the bias in formula is generally lower than the bias in formula; the additional use of the bias-correction method proposed in Section 5.3 generally reduces bias, albeit to a small degree. An exception to this arises in Figure 2(b), although the difference is very small. Additional exceptions are in Figures 3(j), (k), (m), and (o), each of which correspond to one of the cluster-level covariates, X3 and X4. This may speak to a fundamental difficulty of estimating associations for cluster-level covariates when formula is small, one that may only be resolved through the collection of additional data (rather than through analytic means). From Figure 4, we also see that the extent of bias may depend on the prevalence of a cluster-level covariate that is used to define S in a cluster-stratified design.

The absolute bias in the mean of the unweighted point estimate , weighted uncorrected point estimate , and the bias-corrected weighted point estimate  in Simulation 2 as a function of the prevalence of , with the degree of correlation determined by .
Figure 4

The absolute bias in the mean of the unweighted point estimate formula, weighted uncorrected point estimate formula, and the bias-corrected weighted point estimate formula in Simulation 2 as a function of the prevalence of formula, with the degree of correlation determined by formula.

Finally, the bias-corrected weighted point estimates, formula, generally exhibit little-to-no bias across all parameters, even when formula is a low as 12 or 15.

6.5 Results: Coverage

Tables 2 and 3 report select results regarding coverage of Wald-based 95% confidence intervals in Simulations 1 and 2, respectively. Comprehensive results, specifically for coverage under a broader range of values for formula and formula, as well as coverage using the t distribution, use of the uncorrected point estimates in standard error estimation, and use of working exchangeable correlation matrix, are given in Web Appendix J.

TABLE 2

Estimated coverage of Wald-based 95% confidence intervals in Simulation 1 using methods proposed in Sections 5.2 and 5.3, using the bias-corrected weighted estimator, formula, as the point estimate

formula/
ParameterDesignformulaformulaformulaformulaformula
β012/ formula0.890.930.940.910.90
formula0.890.920.950.920.91
formula0.850.890.930.890.88
24/ formula0.920.940.950.930.93
formula0.920.940.950.940.93
formula0.910.920.940.920.92
β112/ formula0.900.940.940.920.92
formula0.890.930.940.920.92
formula0.870.910.930.900.90
24/ formula0.920.940.950.940.94
formula0.920.940.940.930.93
formula0.920.940.950.930.93
β212/ formula0.870.910.940.910.90
formula0.860.900.930.900.88
formula0.850.890.940.900.88
24/ formula0.910.930.940.930.92
formula0.910.930.950.930.92
formula0.910.930.950.930.92
formula/
ParameterDesignformulaformulaformulaformulaformula
β012/ formula0.890.930.940.910.90
formula0.890.920.950.920.91
formula0.850.890.930.890.88
24/ formula0.920.940.950.930.93
formula0.920.940.950.940.93
formula0.910.920.940.920.92
β112/ formula0.900.940.940.920.92
formula0.890.930.940.920.92
formula0.870.910.930.900.90
24/ formula0.920.940.950.940.94
formula0.920.940.940.930.93
formula0.920.940.950.930.93
β212/ formula0.870.910.940.910.90
formula0.860.900.930.900.88
formula0.850.890.940.900.88
24/ formula0.910.930.940.930.92
formula0.910.930.950.930.92
formula0.910.930.950.930.92

Note. Shown are results for estimators that acknowledge negative correlation in the selection indicators. See Section 6 for details.

TABLE 2

Estimated coverage of Wald-based 95% confidence intervals in Simulation 1 using methods proposed in Sections 5.2 and 5.3, using the bias-corrected weighted estimator, formula, as the point estimate

formula/
ParameterDesignformulaformulaformulaformulaformula
β012/ formula0.890.930.940.910.90
formula0.890.920.950.920.91
formula0.850.890.930.890.88
24/ formula0.920.940.950.930.93
formula0.920.940.950.940.93
formula0.910.920.940.920.92
β112/ formula0.900.940.940.920.92
formula0.890.930.940.920.92
formula0.870.910.930.900.90
24/ formula0.920.940.950.940.94
formula0.920.940.940.930.93
formula0.920.940.950.930.93
β212/ formula0.870.910.940.910.90
formula0.860.900.930.900.88
formula0.850.890.940.900.88
24/ formula0.910.930.940.930.92
formula0.910.930.950.930.92
formula0.910.930.950.930.92
formula/
ParameterDesignformulaformulaformulaformulaformula
β012/ formula0.890.930.940.910.90
formula0.890.920.950.920.91
formula0.850.890.930.890.88
24/ formula0.920.940.950.930.93
formula0.920.940.950.940.93
formula0.910.920.940.920.92
β112/ formula0.900.940.940.920.92
formula0.890.930.940.920.92
formula0.870.910.930.900.90
24/ formula0.920.940.950.940.94
formula0.920.940.940.930.93
formula0.920.940.950.930.93
β212/ formula0.870.910.940.910.90
formula0.860.900.930.900.88
formula0.850.890.940.900.88
24/ formula0.910.930.940.930.92
formula0.910.930.950.930.92
formula0.910.930.950.930.92

Note. Shown are results for estimators that acknowledge negative correlation in the selection indicators. See Section 6 for details.

TABLE 3

Estimated coverage of Wald-based 95% confidence intervals in Simulation 2 using methods proposed in Sections 5.2 and 5.3, using the bias-corrected weighted estimator, formula, as the point estimate

formula = 15formula = 50
ParameterDesignformulaformulaformulaformulaformulaformula
β0#10.860.920.940.920.930.94
#20.860.920.940.920.940.95
#30.800.870.940.910.930.94
β1#10.910.960.940.940.950.95
#20.910.960.940.940.950.95
#30.840.900.910.930.940.94
β2#10.910.960.940.940.950.95
#20.910.960.940.940.950.95
#30.840.900.910.930.940.94
β3#10.850.910.940.920.930.95
#20.850.910.940.920.940.95
#30.790.860.940.910.920.94
β4#10.780.850.880.920.930.95
#20.870.930.950.910.920.93
#30.840.900.950.930.940.95
formula = 15formula = 50
ParameterDesignformulaformulaformulaformulaformulaformula
β0#10.860.920.940.920.930.94
#20.860.920.940.920.940.95
#30.800.870.940.910.930.94
β1#10.910.960.940.940.950.95
#20.910.960.940.940.950.95
#30.840.900.910.930.940.94
β2#10.910.960.940.940.950.95
#20.910.960.940.940.950.95
#30.840.900.910.930.940.94
β3#10.850.910.940.920.930.95
#20.850.910.940.920.940.95
#30.790.860.940.910.920.94
β4#10.780.850.880.920.930.95
#20.870.930.950.910.920.93
#30.840.900.950.930.940.95

Note. Shown are results for estimators that acknowledge negative correlation in the selection indicators. See Section 6 for details.

TABLE 3

Estimated coverage of Wald-based 95% confidence intervals in Simulation 2 using methods proposed in Sections 5.2 and 5.3, using the bias-corrected weighted estimator, formula, as the point estimate

formula = 15formula = 50
ParameterDesignformulaformulaformulaformulaformulaformula
β0#10.860.920.940.920.930.94
#20.860.920.940.920.940.95
#30.800.870.940.910.930.94
β1#10.910.960.940.940.950.95
#20.910.960.940.940.950.95
#30.840.900.910.930.940.94
β2#10.910.960.940.940.950.95
#20.910.960.940.940.950.95
#30.840.900.910.930.940.94
β3#10.850.910.940.920.930.95
#20.850.910.940.920.940.95
#30.790.860.940.910.920.94
β4#10.780.850.880.920.930.95
#20.870.930.950.910.920.93
#30.840.900.950.930.940.95
formula = 15formula = 50
ParameterDesignformulaformulaformulaformulaformulaformula
β0#10.860.920.940.920.930.94
#20.860.920.940.920.940.95
#30.800.870.940.910.930.94
β1#10.910.960.940.940.950.95
#20.910.960.940.940.950.95
#30.840.900.910.930.940.94
β2#10.910.960.940.940.950.95
#20.910.960.940.940.950.95
#30.840.900.910.930.940.94
β3#10.850.910.940.920.930.95
#20.850.910.940.920.940.95
#30.790.860.940.910.920.94
β4#10.780.850.880.920.930.95
#20.870.930.950.910.920.93
#30.840.900.950.930.940.95

Note. Shown are results for estimators that acknowledge negative correlation in the selection indicators. See Section 6 for details.

From the first column of results in Tables 2 and 3, use of the plug-in estimator, formula, generally leads to undercoverage when formula or 15, respectively. This is especially the case for design #3 where the coverage ranges from 0.85 to 0.87 in Table 2, and is even more pronounced in the more complex setting of Simulation 2 with coverage ranging from 0.79 to 0.84. When formula is increased to 24 or 50 the performance of confidence intervals based on formula is improved, though there is still undercoverage for all parameters.

Use of any of the proposed small-sample corrections yields improved coverage compared to the unadjusted approach. Among these, the Mancl and DeRouen-type estimator, formula, generally exhibits the closest to nominal coverage when the standard normal distribution is used for confidence interval construction. When the formula distribution is used for confidence interval construction (see Web Appendix J), the Mancl and DeRouen-type estimator tends to be conservative, particularly when working exchangeable is used or formula, while the other bias-corrected variance estimators see improved coverage, though generally remain anticonservative. Across all variance estimators, regardless of whether the normal or t distribution is used for confidence interval construction, the coverage resulting from using estimators with formula is generally very similar to the coverage when using formula. Furthermore, the difference in coverage when using the variance estimators ignoring the negative correlation in the selection indicators generally decreases as formula increases, with the directionality of the difference for smaller formula depending on the parameter in question, design used, and the degree of within-cluster correlation. The findings also hold when working exchangeable correlation structure is used as opposed to working independence.

7 Data Application

7.1 Analysis

For the purpose of applying the methods proposed in Section 5, we focus attention on a logistic regression model for the binary outcome of low birthweight as a function of seven covariates: maternal age, in years (categorized as 1: <20; 2: 20–35; 3: 36–49); maternal weight, in kg (categorized as 1: <56; 2: 56–59; 3: 60–64; 4: ≥65); birth order (categorized as 1: first; 2: second or third; 3: fourth or higher); sex of the newborn (1: female; 0: male); whether the mother had a previous abortion (1: yes; 0: no), whether the woman had a previous stillbirth (1: yes; 0: no); and, the district the health center/district hospital is located in (1: Rulindo; 0: Gakenke). For this model, we estimated and report adjusted odds ratios (OR) based on: formula, the solution to the unweighted estimating equations given in Section 3.2; formula, based on solving the weighted estimating equations given at the start of Section 5; and, the bias-corrected estimator, formula, defined in Section 5.3. In addition, we calculated Wald-based 95% confidence intervals (CI) for the bias-corrected estimator of each OR based on formula, the plug-in estimator given by expression (5); the degrees-of-freedom adjusted estimator, formula; and the Mancl and DeRouen-type estimator, formula. Finally, as in the simulation studies, working independence was adopted throughout (see Section 6.3).

7.2 Results

Table 4 provides a summary of the results. We generally see a difference in the unweighted, weighted, and bias-corrected point estimates, though the story (direction) remains the same: among women weighing less than 56 kg, giving birth to a male as their first child in Gakenke district, with no history of abortion or previous stillbirth, younger women (<20) and older women (ages 36–49) have a higher odds of having a low birthweight baby than women ages 25–34. After adjusting for the other covariates in the model, the higher the mother's weight the lower the odds of having a low birthweight baby. Similarly, after adjusting for the other covariates in the model, as birth order increases women have lower odds of having a low birthweight baby than women having their first child, female newborns have a higher odds of being low birthweight compared to male newborns, and women who have had a previous abortion or have had a previous stillbirth have a higher odds of having a low birthweight birth compared to women who have no history of previous abortions or no history of previous stillbirths, respectively.

TABLE 4

Results from the analysis of the Rwandan birth data set

Point estimate95% CI for formula
formulaformulaformulaformulaformulaformula
Maternal age
<20 years1.151.111.12(0.78, 1.60)(0.62, 2.02)(0.78, 1.61)
36–49 years1.902.082.01(1.40, 2.88)(1.11, 3.63)(0.97, 4.18)
Maternal weight
56–59 kg0.700.600.59(0.36, 0.97)(0.26, 1.34)(0.26, 1.34)
60–64 kg0.320.260.26(0.14, 0.49)(0.09, 0.74)(0.10, 0.71)
≥65 kg0.290.230.24(0.11, 0.51)(0.07, 0.84)(0.08, 0.74)
Birth order
2–30.420.440.49(0.32, 0.75)(0.25, 0.99)(0.33, 0.73)
4+0.150.110.13(0.08, 0.22)(0.05, 0.31)(0.05, 0.36)
Female newborn1.401.681.72(1.07, 2.75)(0.73, 3.74)(0.93, 3.19)
Previous abortion1.952.252.17(1.32, 3.57)(0.96, 4.93)(0.97, 4.87)
Previous stillbirth1.982.512.31(1.18, 4.52)(0.76, 6.98)(1.05, 5.09)
Rulindo District0.510.280.26(0.08, 0.92)(0.03, 2.08)(0.05, 1.38)
Point estimate95% CI for formula
formulaformulaformulaformulaformulaformula
Maternal age
<20 years1.151.111.12(0.78, 1.60)(0.62, 2.02)(0.78, 1.61)
36–49 years1.902.082.01(1.40, 2.88)(1.11, 3.63)(0.97, 4.18)
Maternal weight
56–59 kg0.700.600.59(0.36, 0.97)(0.26, 1.34)(0.26, 1.34)
60–64 kg0.320.260.26(0.14, 0.49)(0.09, 0.74)(0.10, 0.71)
≥65 kg0.290.230.24(0.11, 0.51)(0.07, 0.84)(0.08, 0.74)
Birth order
2–30.420.440.49(0.32, 0.75)(0.25, 0.99)(0.33, 0.73)
4+0.150.110.13(0.08, 0.22)(0.05, 0.31)(0.05, 0.36)
Female newborn1.401.681.72(1.07, 2.75)(0.73, 3.74)(0.93, 3.19)
Previous abortion1.952.252.17(1.32, 3.57)(0.96, 4.93)(0.97, 4.87)
Previous stillbirth1.982.512.31(1.18, 4.52)(0.76, 6.98)(1.05, 5.09)
Rulindo District0.510.280.26(0.08, 0.92)(0.03, 2.08)(0.05, 1.38)

Note. See Section 7.1 for details.

TABLE 4

Results from the analysis of the Rwandan birth data set

Point estimate95% CI for formula
formulaformulaformulaformulaformulaformula
Maternal age
<20 years1.151.111.12(0.78, 1.60)(0.62, 2.02)(0.78, 1.61)
36–49 years1.902.082.01(1.40, 2.88)(1.11, 3.63)(0.97, 4.18)
Maternal weight
56–59 kg0.700.600.59(0.36, 0.97)(0.26, 1.34)(0.26, 1.34)
60–64 kg0.320.260.26(0.14, 0.49)(0.09, 0.74)(0.10, 0.71)
≥65 kg0.290.230.24(0.11, 0.51)(0.07, 0.84)(0.08, 0.74)
Birth order
2–30.420.440.49(0.32, 0.75)(0.25, 0.99)(0.33, 0.73)
4+0.150.110.13(0.08, 0.22)(0.05, 0.31)(0.05, 0.36)
Female newborn1.401.681.72(1.07, 2.75)(0.73, 3.74)(0.93, 3.19)
Previous abortion1.952.252.17(1.32, 3.57)(0.96, 4.93)(0.97, 4.87)
Previous stillbirth1.982.512.31(1.18, 4.52)(0.76, 6.98)(1.05, 5.09)
Rulindo District0.510.280.26(0.08, 0.92)(0.03, 2.08)(0.05, 1.38)
Point estimate95% CI for formula
formulaformulaformulaformulaformulaformula
Maternal age
<20 years1.151.111.12(0.78, 1.60)(0.62, 2.02)(0.78, 1.61)
36–49 years1.902.082.01(1.40, 2.88)(1.11, 3.63)(0.97, 4.18)
Maternal weight
56–59 kg0.700.600.59(0.36, 0.97)(0.26, 1.34)(0.26, 1.34)
60–64 kg0.320.260.26(0.14, 0.49)(0.09, 0.74)(0.10, 0.71)
≥65 kg0.290.230.24(0.11, 0.51)(0.07, 0.84)(0.08, 0.74)
Birth order
2–30.420.440.49(0.32, 0.75)(0.25, 0.99)(0.33, 0.73)
4+0.150.110.13(0.08, 0.22)(0.05, 0.31)(0.05, 0.36)
Female newborn1.401.681.72(1.07, 2.75)(0.73, 3.74)(0.93, 3.19)
Previous abortion1.952.252.17(1.32, 3.57)(0.96, 4.93)(0.97, 4.87)
Previous stillbirth1.982.512.31(1.18, 4.52)(0.76, 6.98)(1.05, 5.09)
Rulindo District0.510.280.26(0.08, 0.92)(0.03, 2.08)(0.05, 1.38)

Note. See Section 7.1 for details.

The confidence intervals using both the degrees-of-freedom adjusted standard errors and the Mancl and DeRouen-type adjusted standard errors generally result in wider confidence intervals than those constructed using the unadjusted standard errors. In some instances, one correction maintains significance while the other does not: for the effect of stillbirth, the confidence interval using the degrees-of-freedom adjusted standard errors contains 1, suggesting that the effect is not statistically significant, whereas the confidence interval using the Mancl and DeRouen-type adjustment does not contain 1, suggesting a statistically significant effect; conversely, for the effect of maternal age (36–39), the confidence interval using the degrees-of-freedom adjusted standard errors suggests a statistically significant effect, while that constructed using the Mancl and DeRouen-type adjustment does not. Given the conflicting results, we recommend the confidence intervals using the Mancl and DeRouen-type correction to the variance estimator based on the simulation study results. The simulations, in particular Simulation 2/formula=15, suggest that sometimes formula while at other times formula, but that in general, formula yields the closest to nominal coverage.

8 Discussion

In this paper, we consider cluster-based ODS in resource-limited settings. Within this context, we extend the work of Cai et al. (2001) by formally establishing the asymptotic properties of the IPW-GEE estimator, and provide an explicit expression for the asymptotic variance. In addition, motivated by our own study in Rwanda, we propose several small-sample bias corrections to both the point estimates and estimates of the asymptotic variance. Our simulation results suggest that there is no clear overarching story in terms of one approach consistently outperforming the others.

With regard to the point estimates, our simulations indicate that the bias-corrected point estimates generally reduce the bias in the point estimates, though the improvement is small in most cases. The impact of the bias-corrected standard errors, on the other hand, is much more substantial. The unadjusted standard errors result in severe undercoverage when formula is small, and all of the bias-corrected standard errors yield closer to nominal coverage. Among the bias corrections to the variance estimator that we considered, the Mancl and DeRouen-type correction generally yielded the closest to nominal coverage when the normal distribution was used for confidence interval construction. When the formula distribution was used for confidence interval construction, the Mancl and DeRouen-type correction could be conservative, particularly when the degree of correlation was increased. While the use of the formula distribution improved the coverage of the other bias corrections, there was no method that consistently performed better than the others, and these methods still suffered from some undercoverage and overcoverage, depending on the design and parameter. Other degrees of freedom have been proposed when using the t distribution for confidence interval construction, and may yield improved coverage in this setting as well. Furthermore, once formula was at least 50, the standard errors taking into account the covariance of the selection indicators and the standard errors ignoring the covariance of the selection indicators, generally resulted in similar coverage, indicating that asympototically, even the naïve method does not perform so poorly. This result held for both the unadjusted and the adjusted standard errors. When formula was small, however, the narrative was mixed.

For these reasons, while acknowledging that there is not one superior method, and that the decision regarding which approach to take is in the hands of the researcher, we offer our perspective on how we would proceed: bias-corrected point estimates, and confidence intervals constructed using the normal distribution together with the variance taking into account the covariance of the selection indicators, adjusted with the Mancl and DeRouen-type correction. We stress the importance of using a bias correction to the variance estimator, while using uncorrected (but weighted) point estimates will, based on our results, generally not have a substantial impact on the analysis. The observation that one approach does not consistently outperform all of the others is consistent with results from papers comparing different small-sample bias corrections to the robust sandwich estimator in the complete data setting—no single small-sample correction consistently outperforms other small-sample adjustments. Which correction performs better depends on a variety of factors, including the number of clusters, the size of the clusters, the degree of variability in the cluster sizes, the degree of correlation among the clusters, and the type (cluster-level vs. individual-level) and distribution of the covariates in the model of interest. The theory that is presented relies on asymptotics, though in small samples, we are not aware of theory establishing small-sample behavior. Researchers must make decisions on what to do based on the characteristics of their data, and the objectives of their analysis. Finally, the simulations in this paper solely consider binary outcomes; more research is needed to determine the performance of these methods for continuous or count outcomes.

The focus of this paper is how to carry out estimation and inference for a data set collected through a cluster-based ODS design, and on considerations that a researcher must make when the number of clusters in the subsample is small. An open question, however, is how to identify an optimal sampling design for selecting the clusters at the design stage. As explained in Section 2.1, the specific design we adopted in Rwanda arose through efforts to balance financial/logistic considerations with statistical power/efficiency, with the latter assumed to be driven, in part at least, by the prevalence of the outcome and the sample size. While this informal strategy may intuitively be reasonable, how to make optimal (or at the very least wise) choices regarding a stratification scheme or a choice of model for formula in this context is a topic of our ongoing research. Another approach to increasing efficiency is to use pieces of information known at the design stage to calibrate the weights used in the weighted estimating equations (Breidt et al., 2017; Rivera-Rodriguez et al., 2019); this, too, is an area for future research.

Data Availability Statement

The data that support the findings of this paper are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

Acknowledgments

Ms. Sauer was funded by the Harvard T.H. Chan School of Public Health NIEHS-sponsored Environmental Training Grant T32ES007142, and the Rose Traveling Fellowship Program in Chronic Disease Epidemiology and Biostatistics. Dr. Haneuse was supported by NIH grant R01 HL094786. Dr. Rivera-Rodriguez was supported by University of Auckland-Science/FRDF New Staff - 3716994. The Rwanda example included data collected as part of the All Babies Count program funded by Grand Challenges Canada Saving Lives at Birth, implemented by Partners In Health/Inshuti Mu Buzima and the Ministry of Health. The authors would like to thank Alphonse Nshimyiryo, Catherine Kirk, Ibrahim Hakizimana, and Robert Mutsinzi for their support in coordination and implementation of data collection in Rwanda.

References

Breidt
,
F.J.
and
Opsomer
,
J.D.
(
2017
)
Model-assisted survey estimation with modern prediction techniques
.
Statistical Science
,
32
,
190
205
.

Breslow
,
N.E.
and
Day
,
N.E.
(
1980
)
Statistical Methods in Cancer Research
.
1
.
The Analysis of Case-Control Studies
.
Geneva, Switzerland
:
Distributed for IARC by WHO
.

Cai
,
J.
,
Qaqish
,
B.
and
Zhou
,
H.
(
2001
)
Marginal analysis for cluster-based case-control studies
.
Sankhyā, Series B
,
63
,
326
337
.

Fay
,
M.P.
and
Graubard
,
B.I.
(
2001
)
Small-sample adjustments for Wald-type tests using sandwich estimators
.
Biometrics
,
57
,
1198
1206
.

Haneuse
,
S.
,
Hedt-Gauthier
,
B.
,
Chimbwandira
,
F.
,
Makombe
,
S.
,
Tenthani
,
L.
and
Jahn
,
A.
(
2015
)
Strategies for monitoring and evaluation of resource-limited national antiretroviral therapy programs: the two-phase design
.
BMC Medical Research Methodology
,
15
,
31
.

Haneuse
,
S.
and
Rivera-Rodriguez
,
C.
(
2018
)
On the analysis of case–control studies in cluster-correlated data settings
.
Epidemiology
,
29
,
50
57
.

Heagerty
,
P.J.
(
1999
)
Marginally specified logistic-normal models for longitudinal binary data
.
Biometrics
,
55
,
688
698
.

Hug
,
L.
,
Alexander
,
M.
,
You
,
D.
and
Alkema
,
L.
for
UN Inter-agency Group for Child Mortality Estimation
. (
2019
)
National, regional, and global levels and trends in neonatal mortality between 1990 and 2017, with scenario-based projections to 2030: a systematic analysis
.
Lancet Global Health
,
7
,
e710
e720
.

Kauermann
,
G.
and
Carroll
,
R.J.
(
2001
)
A note on the efficiency of sandwich covariance matrix estimation
.
Journal of the American Statistical Association
,
96
,
1387
1396
.

Langholz
,
B.
and
Thomas
,
D.C.
(
1990
)
Nested case-control and case-cohort methods of sampling from a cohort: a critical comparison
.
American Journal of Epidemilogy
,
131
,
169
176
.

Li
,
Y.
and
Wang
,
Y.-G.
(
2008
)
Smooth bootstrap methods for analysis of longitudinal data
.
Statistics in Medicine
,
27
,
937
953
.

Liang
,
K.-Y.
and
Zeger
,
S.L.
(
1986
)
Longitudinal data analysis using generalized linear models
.
Biometrika
,
73
,
13
22
.

Lubchenco
,
L.O.
,
Searls
,
D.
and
Brazie
,
J.
(
1972
)
Neonatal mortality rate: relationship to birth weight and gestational age
.
Journal of Pediatrics
,
81
,
814
822
.

Lunardon
,
N.
and
Scharfstein
,
D.
(
2017
)
Comment on “small sample GEE estimation of regression parameters for longitudinal data”
.
Statistics in Medicine
,
36
,
3596
3600
.

Mancl
,
L.A.
and
DeRouen
,
T.A.
(
2001
)
A covariance estimator for GEE with improved small-sample properties
.
Biometrics
,
57
,
126
134
.

Maokola
,
W.
,
Willey
,
B.
,
Shirima
,
K.
,
Chemba
,
M.
,
Armstrong Schellenberg
,
J.
,
Mshinda
,
H.
,
et al
. (
2011
)
Enhancing the routine health information system in rural southern Tanzania: successes, challenges and lessons learned
.
Tropical Medicine & International Health
,
16
,
721
730
.

Morel
,
J.G.
,
Bokossa
,
M.
and
Neerchal
,
N.K.
(
2003
)
Small sample correction for the variance of GEE estimators
.
Biometrical Journal
,
45
,
395
409
.

Neuhaus
,
J.
,
Scott
,
A.
and
Wild
,
C.
(
2002
)
The analysis of retrospective family studies
.
Biometrika
,
89
,
23
37
.

Neuhaus
,
J.M.
,
Scott
,
A.J.
and
Wild
,
C.
(
2006
)
Family-specific approaches to the analysis of case–control family data
.
Biometrics
,
62
,
488
494
.

Neuhaus
,
J.M.
,
Scott
,
A.J.
,
Wild
,
C.J.
,
Jiang
,
Y.
,
McCulloch
,
C.E.
and
Boylan
,
R.
(
2014
)
Likelihood-based analysis of longitudinal data from outcome-related sampling designs
.
Biometrics
,
70
,
44
52
.

Nisingizwe
,
M.P.
,
Iyer
,
H.S.
,
Gashayija
,
M.
,
Hirschhorn
,
L.R.
,
Amoroso
,
C.
,
Wilson
,
R.
,
et al
. (
2014
)
Toward utilization of data for program management and evaluation: quality assessment of five years of health management information system data in Rwanda
.
Global Health Action
,
7
, 25829. https://www.tandfonline.com/doi/full/10.3402/gha.v7.25829

Pan
,
W.
and
Wall
,
M.M.
(
2002
)
Small-sample adjustments in using the sandwich variance estimator in generalized estimating equations
.
Statistics in Medicine
,
21
,
1429
1441
.

Paul
,
S.
and
Zhang
,
X.
(
2014
)
Small sample GEE estimation of regression parameters for longitudinal data
.
Statistics in Medicine
,
33
,
3869
3881
.

Pepe
,
M.
and
Anderson
,
G.L.
(
1994
)
A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data
.
Communications in Statistics
,
23
,
939
951
.

Prentice
,
R.L.
and
Pyke
,
R.
(
1979
)
Logistic disease incidence models and case-control studies
.
Biometrika
,
66
,
403
411
.

Qaqish
,
B.F.
,
Zhou
,
H.
and
Cai
,
J.
(
1997
)
On case-control sampling of clustered data
.
Biometrika
,
84
,
983
986
.

Rivera-Rodriguez
,
C.
,
Spiegelman
,
D.
and
Haneuse
,
S.
(
2019
)
On the analysis of two-phase designs in cluster-correlated data settings
.
Statistics in Medicine
,
38
,
4611
4624
.

Rwanda DHS
(
2015
)
National Institute of Statistics of Rwanda (NISR) [Rwanda]
,
Ministry of Health (MOH) [Rwanda], and ICF International. Rwanda Demographic and Health Survey 2014-15.

Schildcrout
,
J.S.
,
Garbett
,
S.P.
and
Heagerty
,
P.J.
(
2013
)
Outcome vector dependent sampling with longitudinal continuous response data: stratified sampling based on summary statistics
.
Biometrics
,
69
,
405
416
.

Schildcrout
,
J.S.
and
Rathouz
,
P.J.
(
2010
)
Longitudinal studies of binary response data following case–control and stratified case–control sampling: design and analysis
.
Biometrics
,
66
,
365
373
.

Wacholder
,
S.
(
1991
)
Practical considerations in choosing between the case-cohort and nested case-control designs
.
Epidemiology
,
2
,
155
158
.

White
,
J.E.
(
1982
)
A two stage design for the study of the relationship between a rare exposure and a rare disease
.
American Journal of Epidemiology
,
115
,
119
128
.

WHO
. (
2019
)
Newborns: reducing mortality. Available
at: https://www.who.int/news-room/fact-sheets/detail/newborns-reducing-mortality  
[Accessed 30 September 2019]
.

Xie
,
M.
and
Yang
,
Y.
(
2003
)
Asymptotics for generalized estimating equations with large cluster sizes
.
Annals of Statistics
,
31
,
310
347
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data