Summary

The Wilcoxon rank-sum test is a popular nonparametric test for comparing two independent populations (groups). In recent years, there have been renewed attempts to extend the Wilcoxon rank-sum test to clustered data, one of which (Datta and Satten, 2005, Journal of the American Statistical Association 100, 908–915) addresses the issue of informative cluster size, i.e., when the outcomes and the cluster size are correlated. We are faced with a situation where the group-specific marginal distribution in a cluster depends on the number of observations in that group (i.e., the intra-cluster group size). We develop a novel extension of the rank-sum test for handling this situation. We compare the performance of our test with the Datta–Satten test, as well as the naive Wilcoxon rank-sum test. Using a naturally occurring simulation model of informative intra-cluster group size, we show that only our test maintains the correct size. We also compare our test with a classical signed rank test based on averages of the outcome values in each group, paired by cluster membership. While this test maintains the size, it has lower power than our test. Extensions to multiple group comparisons and to the case of clusters not having samples from all groups are also discussed. We apply our test to determine whether there are differences in attachment loss between the upper and lower teeth and between mesial and buccal sites of periodontal patients.

1 Introduction

Rank-based tests are very popular nonparametric methods for comparing two groups or populations. They are particularly useful when the underlying distributions are suspected to be non-normal. One widely used test of this kind is the Wilcoxon rank-sum test (Wilcoxon, 1945). An important assumption behind the Wilcoxon rank-sum test is that all observations in the study are independent. This assumption may be violated in practice: in many situations we have clustered data in which the observations within a cluster are correlated. An example is attachment loss measured on different teeth of the same individual. The Wilcoxon rank-sum test may not be a good option for clustered data of this type. Rosner, Glynn, and Lee (2003) proposed a rank-sum test for clustered data for the case where all members of a cluster belong to the same group and the correlation structure within a cluster is common across groups. This approach does not work when the members of a single cluster do not necessarily belong to the same group. It also does not maintain the nominal size when the number of observations in a cluster (the cluster size) is associated with the outcome of interest from that cluster. This is the case of informative cluster size, where the informativeness arises because the cluster size may be affected by a latent (cluster-specific) factor that also affects the outcome variable in that cluster. Datta and Satten (2005) proposed a rank-sum test for clustered data that makes no assumption on the nature of clustering and performs reasonably well with informative cluster sizes.
However, even the test of Datta and Satten (2005) does not seem to perform well when the outcome of interest for a group in a given cluster is correlated with the number of observations having the same group membership within that cluster (i.e., the intra-cluster group size). Following the idea of informative cluster size, this scenario can be described as informative intra-cluster group (ICG) sizes. Informative ICG sizes can occur in a dental study when one is interested in comparing attachment loss between the teeth of the upper and lower jaws, because any difference in attachment loss between the two jaws can be suspected to be associated with the difference between the numbers of teeth present in the upper and lower jaws. Another situation where informative ICG sizes come into play arises in studies of hereditary diseases. In many genetic studies, it has been observed that an inherited disease is often diagnosed at a younger age in a later generation than in an earlier generation. This phenomenon of earlier onset in each successive generation of a family, called anticipation, is prevalent in diseases such as non-Hodgkin's lymphoma, breast and ovarian cancer, and Huntington's disease, among others. When testing for anticipation, or more generally when testing whether the age at onset of a disease differs between two generations of large pedigrees, a useful piece of information might be the number of affected individuals in a certain age interval that are present in each of the two generations under study. “Affected” individuals include subjects who are diseased at the time of the study as well as those who were known to be diseased at some point before the study. If there is a large difference in the number of affected individuals (in that age interval) between the two generations, one may suspect that this difference is associated with the difference in onset age between the two generations. This might therefore be a case of informativeness in the number of subjects (affected individuals in a certain age interval) in a group (generation) within a cluster (a large pedigree). Motivated by these examples, we develop a rank-sum test for comparing the marginal distributions of outcomes from different groups when the ICG sizes are informative. We return to the example of tooth attachment loss in the application section.

Extending the idea of within-cluster resampling (Hoffman, Sen, and Weinberg, 2001; Williamson, Datta, and Satten, 2003; Datta and Satten, 2005; Datta and Satten, 2008), we obtain a rank-sum test for clustered data in which observations from both groups are present in every cluster. Our resampling scheme extends the usual within-cluster resampling: instead of resampling one observation at random from each cluster, we first resample one group membership (out of the two possible groups) for a cluster and then resample an outcome from that group within that cluster. We repeat this resampling for each cluster and obtain a rank-sum statistic based on the resampled observations. Then, following the approach of Datta and Satten (2005), we derive our test statistic by averaging the rank-sum statistic over all possible choices of the resampled observations given the data. After constructing our test, we compare it with three existing tests, including the test of Datta and Satten (2005), under naturally occurring simulation scenarios of informative ICG sizes. We show that our test maintains the correct size under the null hypothesis of marginal symmetry, unlike the test of Datta and Satten (2005). Moreover, our test has better power than the three other tests in this simulation study. We also show that our test has acceptable size and power in simulation settings with informative cluster sizes but noninformative ICG sizes, and in settings where both the cluster sizes and the ICG sizes are noninformative. Additionally, we extend our two-group test statistic to the case where some clusters may have observations from only one of the two groups (i.e., the intra-cluster group structures are incomplete). We present a simulation study showing that our test still maintains the appropriate size and has reasonable power under this scenario of incomplete ICG structures within a cluster. We also discuss an extension of our test to the case where there are observations from more than two groups in every cluster.

The rest of the article is organized as follows. In Section 2, we introduce the necessary notations, formulate our testing problem, and develop a test statistic for comparing the outcomes from two groups when outcomes from both the groups are present in every cluster. This section also contains some variant forms of our test under different clustered data settings including expressions of the test statistic with other required quantities that generalize our new rank-sum test to more than two groups and an extension of our test statistic for two group comparison where some clusters may have observations from only one of the two groups. Section 3 contains simulation studies that evaluate the empirical performance of our test compared to three other tests on the basis of size and power. In Section 4, we return to the dental data discussed before and apply our testing procedure to compare the difference between the tooth attachment loss in the upper and lower jaws. Besides this, we apply our test for a different comparison in dental data where some other rank-sum tests can also be applied. The article ends with a discussion in Section 5. Detailed steps for deriving our test statistic are discussed in the Web Appendix (Supplementary Web Materials).

2 Notations, Formulation of the Problem, and Proposed Test Statistic

Let M denote the number of clusters and let forumla denote the forumla observation in the forumla cluster, forumla  forumla where forumla denotes the number of observations in the forumla cluster. Let forumla be the indicator denoting the binary group membership (0 or 1) of the forumla observation in the forumla cluster. Thus, the entire data set consists of forumlawith forumla  forumla corresponding to the forumla cluster. Also, let forumla and forumlabe the number of observations in the forumla cluster belonging to group 1 and group 0forumla respectively. Thus, we have forumla. We consider the possibility that the cluster size forumlaas well as the group memberships forumlaare random (and thus, so are the forumla). The members in a cluster could have an arbitrary dependence structure; however, members in different clusters are statistically independent and hence the entire forumlaand forumlaare independent. For mathematical convenience, we further assume that forumla, are independent and identically distributed (iid).

The null hypothesis we consider is that the observations from the two groups follow the same marginal distribution. Mathematically, it is written as

However, the empirical analogue of the above “group specific” (e.g., conditional) marginal distributions can be constructed in three possible ways resulting in three different statistical comparisons:

Note that forumlaiforumla represents the (empirical) distribution of group d  forumladata values in the entire sample irrespective of their cluster membership. Calculation (ii) is based on sampling a single paired (e.g., forumla observation from each cluster. In other words, forumla represents the conditional distribution of a typical outcome value forumla for a typical cluster forumla given the corresponding group membership forumla equals d. Here, forumla is a discrete uniform on forumla. Calculation (iii) is based on computing the proportion of outcomes belonging to group d in a typical cluster i which are less than or equal to x and then taking the average of these proportions over all the clusters. Each of quantities in the right-hand sides of (i), (ii), and (iii), can be written as an estimate of forumla, but the difference lies in construction of the estimates of the probabilities. In (i), the probabilities are estimated by pooling all the observations together irrespective of their cluster membership, while in (ii) and (iii), the estimates are constructed by conditioning on forumla and forumla, respectively. Every outcome, belonging to group d and having value less than x, contributes equally in the construction of forumla, but in constructions of forumla and forumla we have different contributions from the different outcomes depending on their cluster memberships.
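The three constructions described above can be sketched in Python. This is only an illustration of the verbal descriptions of (i), (ii), and (iii); the displayed formulas are not reproduced here, so the function names, the data layout (each cluster as a list of (outcome, group) pairs), and the weighting by the reciprocal of the cluster size in (ii) are assumptions of this sketch rather than the authors' notation.

```python
from typing import List, Tuple

# Hypothetical layout: one list per cluster of (outcome, group label) pairs.
Cluster = List[Tuple[float, int]]


def cdf_pooled(clusters: List[Cluster], d: int, x: float) -> float:
    """(i): pool all group-d outcomes, ignoring cluster membership."""
    values = [y for cluster in clusters for (y, g) in cluster if g == d]
    return sum(y <= x for y in values) / len(values)


def cdf_one_per_cluster(clusters: List[Cluster], d: int, x: float) -> float:
    """(ii): one observation drawn uniformly from each cluster, then
    conditioned on its group being d. Each observation is weighted by
    1 / (cluster size) in both numerator and denominator (assumed form)."""
    num = sum(sum(1 for y, g in c if g == d and y <= x) / len(c) for c in clusters)
    den = sum(sum(1 for y, g in c if g == d) / len(c) for c in clusters)
    return num / den


def cdf_cluster_average(clusters: List[Cluster], d: int, x: float) -> float:
    """(iii): proportion of group-d outcomes <= x within each cluster,
    averaged over clusters (every cluster must contain group d)."""
    proportions = []
    for cluster in clusters:
        values = [y for y, g in cluster if g == d]
        proportions.append(sum(y <= x for y in values) / len(values))
    return sum(proportions) / len(proportions)


# Two small clusters with unequal group sizes.
data = [
    [(1.0, 1), (2.0, 1), (3.0, 1), (5.0, 0)],
    [(4.0, 1), (0.0, 0), (6.0, 0)],
]
for estimator in (cdf_pooled, cdf_one_per_cluster, cdf_cluster_average):
    print(estimator.__name__, estimator(data, 1, 2.5))
```

On this toy data the three estimates of the group 1 distribution at x = 2.5 differ, which shows that the choice among (i), (ii), and (iii) matters once group sizes vary across clusters.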

Let forumla, forumla, forumla be the distribution functions which are estimated by forumla, forumla, forumla, respectively. When the cluster sizes as well as the ICG sizes formed by the two groups within each cluster are not suspected to be associated to the outcome variable in any way, then hypotheses involving forumla, forumla, forumla become equivalent and one can test any one of these three hypotheses. If there is some association between the cluster size and the outcome variable in that cluster, one can think of testing hypothesis involving forumla for appropriate comparison. This is a situation of informative cluster sizes. Again, if the ICG sizes formed by the two groups in a cluster appear to be correlated (even after conditioning on the overall cluster size) with the outcomes from the respective groups in that cluster, one may think of testing hypothesis comprising of distribution forumla instead of forumla and forumla to get more meaningful results. We can refer to this as a case of informative ICG sizes. In the absence of this informativeness in the ICG sizes, one can test the null hypotheses of equality of marginal distributions involving any one of the marginal distributions forumla and forumla, possibly leading to similar conclusion in each case.

In this article, we are interested in comparing forumla in the two groups when the ICG sizes are potentially informative. Currently, no rank-based test is available for testing group differences in clustered data that takes into account the informativeness of the ICG sizes formed by the groups under study. We denote the common marginal distribution under the null hypothesis by forumla. It is worth pointing out that estimation of marginal regression parameters via weighted estimating equations in the presence of informative ICG size has been considered by Huang and Leroux (2011).

2.1 Development of the Test Statistic

For the sake of simplicity, let us relabel the observations according to their group membership within each cluster in the following way. In theforumla cluster, let forumla represent the set of observations belonging to the group indexed by 1, while forumla represents the set of observations belonging to the group indexed by 0. We denote these sets as forumlaand forumla, respectively. Thus, forumla and forumla form a partition of forumla= {forumla, the set of all observations in cluster i. The number of observations belonging to the set forumla  forumla is the intra-cluster group size of group d in the forumla cluster. Till the end of the Section 2.1, we would assume that at least one observation from each group is present in every cluster. In this Section, this assumption means that forumla 0 and forumla 0 with probability one for every cluster i. A relaxation of this condition is discussed in Section 2.2.

Our test statistic, for testing the hypothesis involving marginal distributions forumla as estimated in forumla, can be generated from a resampling scheme which is an extension of the within-cluster resampling (WCR). An outline of the resampling scheme is as follows: For each cluster i, let us resample group membership as forumla, where forumla takes value 0 or 1 with equal probability forumla. If forumla = 1, we resample one observation for the forumlacluster from the set of observations forumla and name it forumla If forumla = 0, resample forumlafrom the set forumla.

The fact that the outcomes are resampled from the subsets formed by the two groups in a cluster and not from the whole cluster makes this resampling scheme different from the usual WCR technique. Now, this resampling gives us M pairs of independent observations forumla. If forumlabe the rank of forumla among the set forumla,forumla, i.e.,forumla= 1+forumla, then the Wilcoxon rank sum statistic based on these M pairs of resampled observations forumla would be of the form: forumla. One can use forumla as a valid test statistic and carry out the test based on forumla. But that test would be inefficient as the test statistic would depend too much on one particular observation chosen from each cluster. So to get rid of the imposed randomization due to resampling, we propose a test statistic based on earlier approaches of Williamson et al (2003), Datta and Satten (2005), and Datta and Satten (2008), that corresponds to averaging forumla over all possible choices of forumla, forumla values given the data.
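A minimal Python sketch of this resampling scheme follows. It assumes every cluster contains at least one observation from each group, uses midranks for ties (an assumption, since the rank definition is not reproduced here), and approximates the average over all resamples by Monte Carlo rather than by the closed-form expression derived in the paper. All function and argument names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def resample_once(group1: Sequence[Sequence[float]],
                  group0: Sequence[Sequence[float]],
                  rng: random.Random) -> List[Tuple[float, int]]:
    """Draw one (outcome, group) pair per cluster.

    group1[i] and group0[i] hold the outcomes of groups 1 and 0 in
    cluster i; both are assumed non-empty.
    """
    draws = []
    for g1, g0 in zip(group1, group0):
        d = rng.randint(0, 1)            # group label, probability 1/2 each
        pool = g1 if d == 1 else g0      # resample within the chosen group
        draws.append((rng.choice(pool), d))
    return draws


def rank_sum(draws: Sequence[Tuple[float, int]]) -> float:
    """Wilcoxon rank-sum of the group-1 draws (midranks for ties)."""
    values = [x for x, _ in draws]
    total = 0.0
    for x, d in draws:
        if d == 1:
            less = sum(v < x for v in values)
            equal = sum(v == x for v in values)   # counts x itself
            total += less + (equal + 1) / 2
    return total


def averaged_rank_sum(group1, group0, n_rep: int = 2000, seed: int = 0) -> float:
    """Monte Carlo approximation to the rank-sum averaged over resamples."""
    rng = random.Random(seed)
    return sum(rank_sum(resample_once(group1, group0, rng))
               for _ in range(n_rep)) / n_rep


g1 = [[5.0, 6.0], [7.0], [8.0, 9.0]]
g0 = [[1.0], [2.0, 3.0], [4.0]]
print(averaged_rank_sum(g1, g0))
```

Because each cluster contributes exactly one draw regardless of how many observations its groups contain, large intra-cluster groups do not dominate the statistic.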

Thus, our test statistic is forumla, where forumla  forumla and forumla. We can calculate the theoretical expression of T. After some necessary steps, a convenient expression of T (see Web Appendix A for the detailed steps) turns out to be

where forumla.

Besides T, we need to know its expected value forumla and its variance estimate forumla to properly carry out inference based on forumlaTo get forumla, we note that forumla. The unconditional expectation of forumlacan be calculated easily through conditioning on the vector of group membership indicator forumla. So we get, forumla.

The next step is to find a variance estimate forumla. To get the variance estimate of T, we employ the jackknife technique. Here, the clusters can be thought of as iid units and thus we can use a “delete-1-cluster” jackknife approach to get the necessary results. Mathematically, this can be formulated as follows. Let forumla be the value of the statistic T calculated after deleting the forumla cluster. Let us define, forumla. Then, the estimate of variance of T, which is the jackknife variance estimate, is given by
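The delete-1-cluster idea can be sketched generically in Python. The function below implements the textbook leave-one-out jackknife variance for an arbitrary statistic of the clusters; the exact centering and scaling used for T in the paper are not reproduced in this text, so this is an assumed standard form, and the names are hypothetical. For a statistic whose scale grows with the number of clusters, one would normally apply it to a suitably normalized version.

```python
from typing import Callable, List, Sequence, TypeVar

ClusterT = TypeVar("ClusterT")


def jackknife_variance(clusters: Sequence[ClusterT],
                       statistic: Callable[[List[ClusterT]], float]) -> float:
    """Textbook delete-1-cluster jackknife variance estimate.

    Recomputes `statistic` with each cluster left out in turn and
    returns (M - 1) / M times the sum of squared deviations of the
    leave-one-out values from their mean.
    """
    items = list(clusters)
    m = len(items)
    leave_one_out = [statistic(items[:i] + items[i + 1:]) for i in range(m)]
    center = sum(leave_one_out) / m
    return (m - 1) / m * sum((t - center) ** 2 for t in leave_one_out)


def mean_stat(values: List[float]) -> float:
    """Toy statistic: the plain average of one number per cluster."""
    return sum(values) / len(values)


# For the sample mean, the jackknife reproduces s^2 / n.
print(jackknife_variance([1.0, 2.0, 3.0, 4.0, 5.0], mean_stat))
```

Treating whole clusters as the deleted units is what lets the estimate remain valid under arbitrary within-cluster dependence.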

Now that we have expressions for $T$, $E(T)$, and $\widehat{\mathrm{Var}}(T)$, we can carry out the test using the absolute value of the standardized statistic $Z = \{T - E(T)\}/\sqrt{\widehat{\mathrm{Var}}(T)}$.

The asymptotic distribution of Z is established through the following theorem. An outline of its proof is given in Web Appendix B.

THEOREM 1. (Asymptotic normality). Under $H_0$, $Z \xrightarrow{d} N(0,1)$ as $M \to \infty$, under certain regularity conditions of a Lindeberg Central Limit Theorem.

The p-value for the test is computed as the probability that, under $H_0$, the absolute value of the Z-statistic exceeds its observed value in magnitude. We reject the null hypothesis $H_0$ at the $100\alpha\%$ level of significance if the p-value is less than $\alpha$.
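As a small illustration of this decision rule, the following Python sketch computes the two-sided p-value from an observed value of the standardized statistic, assuming that its limiting null distribution is standard normal. The function names and the default level of 0.05 are choices made for this example.

```python
import math


def two_sided_p_value(z: float) -> float:
    """P(|N(0,1)| > |z|), computed as erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def reject(z: float, alpha: float = 0.05) -> bool:
    """Reject the null hypothesis when the p-value is below alpha."""
    return two_sided_p_value(z) < alpha


for z in (0.5, 1.96, 2.5):
    print(z, round(two_sided_p_value(z), 4), reject(z))
```

Using the complementary error function avoids the loss of precision that comes from subtracting a normal distribution function value close to one.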

Up to this point, we have assumed the existence of only two groups in every cluster. In Web Appendix C, we discuss a more general situation where there are m groups in every cluster, such that forumla.

2.2 Extension to Incomplete Intra-Cluster Group Structure in One or More Clusters

In the case of binary grouping (i.e., forumla or 1), we have assumed that there is at least one observation from each group in every cluster. In practice, one may encounter a few clusters (not all) in which one group of observations is completely missing. In other words, some clusters may have outcomes from only one of the two possible groups. We refer to this as an incomplete intra-cluster group structure within a cluster. The hypothesis of interest remains the same, namely, whether the marginal distributions of outcomes are the same for the two groups. We cannot directly apply the test statistic of Section 2.1 in this setting, because it is only applicable under the assumption that outcomes from both groups are available within each cluster. We therefore extend the approach of Section 2.1 to obtain a valid test statistic in this setting.

Here, we follow the same notations as described in Section 2.1. In cases of incomplete ICG structures within a cluster, the empirical analogue of the “group specific” marginal distributions of our interest can be constructed as a modification of forumla as forumlawhere forumlaor forumla, or forumlaaccording to whether the forumla cluster has observations from both groupsforumlathe dth group only, or not.

We extend the idea of within cluster resampling also to this setting to get a valid test statistic. forumla If both forumla and forumla, group membership is resampled as forumla, where forumla takes value 0 or 1, with equal probability forumla. If forumla= 0, resample forumlafrom forumla; otherwise, if forumla= 1, resample forumla from forumla. forumla If forumla and forumla, we resample forumla from forumla and have forumla= 0. Here, forumla is same as forumla as the set forumla is an empty set. forumla If forumla and forumla, we resample forumlafrom forumla and have forumla= 1. Here, forumla is same as forumla as the set forumla is an empty set.
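The three cases of this modified resampling step can be sketched in Python as follows. The function name and the representation of a cluster as two lists of outcomes (one per group, possibly empty) are assumptions made for the illustration.

```python
import random
from typing import Sequence, Tuple


def resample_cluster(group1: Sequence[float],
                     group0: Sequence[float],
                     rng: random.Random) -> Tuple[float, int]:
    """Resample one (outcome, group) pair from a possibly incomplete cluster.

    Both groups present: pick the group with probability 1/2 each.
    Only one group present: that group is chosen with certainty.
    """
    if group1 and group0:
        d = rng.randint(0, 1)
    elif group1:
        d = 1
    elif group0:
        d = 0
    else:
        raise ValueError("cluster has no observations")
    pool = group1 if d == 1 else group0
    return rng.choice(pool), d


rng = random.Random(0)
print(resample_cluster([4.0, 5.0], [1.0], rng))   # complete cluster
print(resample_cluster([], [2.0, 3.0], rng))      # only group 0 present
print(resample_cluster([4.0], [], rng))           # only group 1 present
```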

To obtain our test statistic T in this case, we proceed in the same way as in Section 2.1. With forumla being the rank of forumla among the set forumla, forumla, we obtain forumla, the Wilcoxon rank sum statistic based on the M pairs of resampled observations forumla. Then, our proposed test statistic T is calculated as forumla. After some algebra (see Web Appendix D), we obtain T as

where

The expected value of the test statistic is estimated to be

To find the estimated variance $\widehat{\mathrm{Var}}(T)$ of $T$, we use the same “delete-1-cluster” jackknife approach described in Section 2.1. Finally, as in Section 2.1, we carry out the test using the standardized statistic $Z = \{T - \widehat{E}(T)\}/\sqrt{\widehat{\mathrm{Var}}(T)}$, which has an asymptotic $N(0,1)$ distribution under $H_0$.

3 Simulation Results

In this section, we present three simulation studies corresponding to the tests discussed in Sections 2.1 and 2.2. In simulation scenario 1, we consider clustered observations such that every cluster has outcomes from both groups. In each cluster, the number of observations belonging to group 1 and the number belonging to group 0, that is, the two ICG sizes, are both influenced by a latent factor that also influences the outcomes in that cluster. The distributions of the two ICG sizes within each cluster also differ from each other. There is therefore some association between the ICG sizes and the outcomes in a given cluster (even after conditioning on the overall cluster size), and we can think of this as informative ICG sizes. Under this scenario, we compare the performance of four tests: (1) our new rank-sum test developed in Section 2.1, (2) the test of Datta and Satten (2005), (3) the naive Wilcoxon rank-sum test, which treats all observations as iid and ignores their cluster membership, and (4) the signed rank test based on cluster averages for each group of observations. Each test was carried out for three choices of the number of clusters (M), namely 30, 50, and 150. In simulation scenario 2, we generate a setting that closely represents the dental setting discussed in Section 1. The idea is to have clustered data with informative ICG sizes, where the number of units belonging to each group in a cluster cannot exceed a certain value. Under this setting, we compare the four tests (1)–(4) for 50 clusters. In scenario 3, we again consider informative ICG sizes, but we do not require observations from both groups to be present in each cluster. In other words, we include cases of incomplete ICG structure within a cluster, for which our test statistic developed in Section 2.2 is appropriate. We investigate the performance of this new test for a simulation model with 30 clusters under scenario 3.

Additionally, in Web Appendix E we consider two more simulation scenarios (Scenarios 4 and 5), where we compare the four tests (1)–(4) under situations such that either the ICG sizes or both the ICG sizes and the cluster sizes are noninformative.

The performance of each test is evaluated on the basis of its size (nominal level 0.05) and power. These are estimated by the proportion of 3000 Monte Carlo iterates in which the null hypothesis is rejected.
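The following Python sketch shows how such an empirical rejection rate and an accompanying 95% confidence interval could be computed. The text does not state which interval is used; a Wald (normal approximation) interval is assumed here, and for an estimated size of 0.060 from 3000 iterates it reproduces the interval (0.052, 0.068) reported in Table 1. Function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def rejection_rate(p_values: Sequence[float], alpha: float = 0.05) -> float:
    """Proportion of Monte Carlo iterates whose p-value is below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


lower, upper = wald_interval(0.060, 3000)
print(round(lower, 3), round(upper, 3))
```

Under the null model the rejection rate estimates the size of a test; under an alternative it estimates the power.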

3.1 Simulation Scenario 1

Let M be the number of clusters (fixed). For a typical cluster i, we define, forumla as the number of observations from group 1 in the forumla cluster, forumla as the number of observations from group 0 in the forumla cluster, forumla as the random cluster effect due to the forumla cluster. In the forumla cluster, we generate forumla from Normal(0,forumla) distribution, forumla from Poisson(10 + 5forumla) distribution where forumla=forumla+ 1, forumla is generated from Poisson(10 + 5forumla) such that forumla=forumla+ 1. Also, we know that forumla  forumla. Let forumlabe the group indicator of the forumla observation in the forumla cluster. We assign forumla for 1forumla, while forumla for forumla. We generate forumla, the forumla outcome in the forumla cluster, through a random effects model as forumla = forumla, such that if forumla, then forumlaNormalforumla, while if forumla, then forumlaNormalforumla. Under the null model, forumla.

Performances of the four tests (1)–(4) are summarized in Table 1 for three choices of M, namely, 30, 50, and 150.

Table 1

Size, along with a 95% confidence interval (CI), and power comparisons of four tests (nominal level 0.05) under Simulation Scenario 1. The empirical calculations are based on 3000 replicates each.

                                    Power (under effect size)
Test       Size (CI)                0.05     0.10     0.15

M = 30
New test   0.060 (0.052, 0.068)     0.319    0.833    1.000
DS         0.132 (0.120, 0.144)     0.050    0.203    0.500
W          0.159 (0.146, 0.172)     0.058    0.263    0.645
CA         0.055 (0.047, 0.063)     0.296    0.814    0.985

M = 50
New test   0.053 (0.045, 0.061)     0.500    0.960    1.000
DS         0.199 (0.185, 0.213)     0.050    0.310    0.730
W          0.215 (0.200, 0.230)     0.061    0.390    0.830
CA         0.051 (0.043, 0.059)     0.460    0.950    1.000

M = 150
New test   0.055 (0.047, 0.063)     0.910    1.000    1.000
DS         0.508 (0.490, 0.526)     0.052    0.699    0.900
W          0.528 (0.510, 0.546)     0.073    0.778    0.993
CA         0.050 (0.042, 0.058)     0.896    1.000    1.000

New test = test developed in Section 2.1; DS = rank-sum test by Datta and Satten; W = Wilcoxon rank-sum test; CA = signed rank test with cluster averages.

Table 1 illustrates a number of points. Our new test closely maintains the nominal size and has good power even under small effect sizes. The rank-sum test proposed by Datta and Satten (2005) and the standard Wilcoxon rank-sum test have grossly inflated size and much lower power than our test for all three choices of the number of clusters. The size of the cluster average signed rank test is close to the nominal size under this simulation scenario. Its power is also close to that of our test, though slightly lower in almost all cases. Although the cluster average signed rank test appears to be a good competitor to our test in this scenario, the distribution of an average of independent and identically distributed random variables is not, in general, the same as that of the individual variables. Thus, the cluster average signed rank test is not expected to be a good choice for testing the hypothesis of interest, and this may become evident when the ICG sizes within each cluster differ widely.

3.2 Simulation Scenario 2

This simulation setting is designed to mimic the dental study mentioned in Section 1, where the number of units (teeth) within a cluster (the mouth of an individual) cannot exceed 32. It can be generalized to any study where the cluster sizes or the ICG sizes are bounded.

This simulation scenario is almost the same as that described in Section 3.1, the only difference being that both ICG sizes within each cluster are less than or equal to 16, so that the cluster size cannot exceed 32. Following the notation of Section 3.1, in the $i$th cluster, we generate forumla from Normal(0,0.25), forumla from Poisson(10 + 5forumla) such that forumla, forumla from Poisson(10 + 5forumla) such that forumla. So, we have forumla  forumla  forumla. Apart from these, forumla, forumla, and the outcome forumla are generated in the same manner as in simulation scenario 1. Table 2 compares the four tests (1)–(4) under this simulation scenario with the number of clusters (M) equal to 50, and the results are similar to those obtained from simulation scenario 1. Table 2 shows that our new test closely maintains the nominal size and has substantial power under a variety of effect sizes. The rank-sum test proposed by Datta and Satten (2005), as well as the standard Wilcoxon rank-sum test, has highly inflated size. The cluster average signed rank test, just as in simulation scenario 1, apparently maintains the nominal size and has substantial power. But, as mentioned in Section 3.1, it is theoretically not a good choice for testing the hypothesis of interest.

Table 2

Size, along with a 95% confidence interval, and power comparisons of four tests (nominal [formula]) under Simulation Scenario 2. The number of clusters, M, equals 50. The empirical calculations are based on 3000 replicates each.

                                  Power (under effect size [formula])
Test       Size (CI)              0.05     0.10     0.15
New test   0.054 (0.046, 0.062)   0.465    0.960    1.000
DS         0.146 (0.133, 0.159)   0.071    0.442    0.845
W          0.136 (0.124, 0.148)   0.073    0.501    0.916
CA         0.047 (0.039, 0.055)   0.445    0.960    1.000

New test = test developed in Section 2.1; DS = rank-sum test by Datta and Satten; W = Wilcoxon rank-sum test; CA = signed rank test with cluster averages.
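The confidence intervals reported for the empirical sizes appear consistent with the usual normal-approximation (Wald) interval for a proportion estimated from 3000 replicates. The source does not state which interval was used, so this is an assumption; the short sketch below, with a hypothetical helper name, simply recomputes the intervals from the reported sizes.

```python
import math


def wald_interval(rate, replicates, z=1.96):
    """Normal-approximation interval for an empirical rejection rate."""
    half_width = z * math.sqrt(rate * (1.0 - rate) / replicates)
    return (round(rate - half_width, 3), round(rate + half_width, 3))


# Empirical sizes as reported in Table 2 (3000 replicates each).
reported_sizes = {"New test": 0.054, "DS": 0.146, "W": 0.136, "CA": 0.047}
intervals = {name: wald_interval(rate, 3000) for name, rate in reported_sizes.items()}
```

Under this assumption, the recomputed intervals agree with the tabulated ones to three decimals, for example (0.046, 0.062) for the new test.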

3.3 Simulation Scenario 3

This simulation scenario is similar to that described in Section 3.1, the only difference being that the ICG sizes within each cluster are not restricted to be strictly positive. Following the notation of Section 3.1, in the ith cluster we generate [formula] from Normal(0, 0.25), [formula] from Poisson(10 + 5[formula]), and [formula] from Poisson(10 + 5[formula]). We have [formula] [formula]. Apart from these, [formula], [formula], and the outcome [formula] are generated in the same manner as in simulation scenario 1. Evidently, the observed value of either of the ICG sizes [formula] and [formula] in the ith cluster can be 0, as long as [formula] 0.

In Table 3, we evaluate the empirical size and power of our test, developed in Section 2.2, with the choice of [formula]. From Table 3, we see that our test closely maintains the nominal size and has moderate to high power under different effect sizes.

Table 3

Size, along with a 95% confidence interval, and power calculations (nominal [formula]) of the new test developed in Section 2.2 under Simulation Scenario 3. Note that the CA test statistic is not computable in this situation. The number of clusters, M, equals 30. The empirical calculations are based on 3000 replicates each.

                    Power (under effect size [formula])
Size (CI)           0.05     0.10     0.15
0.053 [formula]     0.275    0.743    0.964

4 Application to Dental Data

We consider data from the Piedmont 65+ Dental Study by Beck et al. (1990). This study examined two older populations, urban whites and urban and rural blacks. The Piedmont Health Study of the Elderly by Blazer and George (2004), which was the parent study for the Piedmont 65+ Dental Study, was a longitudinal study of the health status of a stratified, clustered, random sample of people aged 65 and over in five contiguous North Carolina counties. The Piedmont 65+ Dental Study used the data available from the parent study while collecting additional information. For the Piedmont 65+ Dental Study, we have the gingival recession and pocket depth measures for all teeth present in the mouth at baseline, 18, 36, and 60 months. Attachment level scores (attachment losses) were computed from the gingival recession and pocket depth measures. All these clinical measures were computed for two sites, buccal and mesial, for every tooth measured. A number of additional covariates were also available, which are ignored for the present marginal analyses. The number of subjects observed varied across the four time points. This may be because, in a study of an elderly population, many subjects who were recorded at the beginning of the study failed to return at later time points. For our illustration, we investigate the baseline and 18-month data cross-sectionally.

Attachment loss is a common problem associated with periodontal diseases in elderly populations, often indicating the severity of certain diseases. Some studies have suggested that the nature of attachment loss varies across the different surfaces of a tooth. It is therefore of interest to determine whether the distributions of attachment loss scores are the same for the buccal and mesial surfaces of teeth. Since the outcomes (attachment level scores) from the units (tooth surfaces) within a cluster (individual) are correlated, while those from units in different clusters are independent, the data fall into the category of clustered data we are interested in. In addition, since the cluster size (number of tooth surfaces an individual has) may indicate overall oral health, the cluster size might be associated with the outcome of interest (attachment loss score). We apply our new test and the test of Datta and Satten to baseline data involving 697 subjects with at least one tooth to investigate possible differences in the distributions of attachment loss at the buccal and mesial sites (the two groups under study). A significant difference was obtained with the novel test (Z = [formula], p-value = [formula]) and with the Datta and Satten test (Z = [formula], p-value = 5.56[formula]). So, our new test and the test by Datta and Satten lead to the same conclusion but with different p-values. We then consider the same testing problem with the data at 18 months (with 496 available subjects), where a significant difference was again obtained using the new test (Z = [formula], p-value = 1.48[formula]) as well as the test by Datta and Satten (Z = [formula], p-value = [formula]). Overall, we conclude that the distribution of the attachment loss of teeth differs between the mesial and buccal sites. Also, we see that our new test gives a consistent result in a situation where the test by Datta and Satten appears to be valid as well.
Plots of the empirical cumulative distribution functions [formula] of attachment scores in the two groups (buccal and mesial) are shown for the baseline data and the 18-month data in Figures 1 and 2, respectively. These figures give some indication of the significant difference in the distributions of attachment scores between buccal and mesial sites. In addition, plots of the empirical mass functions for mesial and buccal attachment loss scores at baseline are given in Web Figure 1 (in Web Appendix F); they show substantial differences between the mesial and buccal attachment loss scores at the low score values of 1 and 2. Incidentally, these two scores together constitute more than half of the observed scores for the population under study. To calculate the effect size, we use the following approach: if [formula] and [formula] denote the sets of mesial and buccal attachment scores, such that the test statistic T = [formula], and [formula] is a real number such that [formula], then the effect size is estimated by the absolute value of [formula], where [formula] sup [formula]. For both the baseline and 18-month data, the unstandardized effect size turns out to be approximately 0.5.

Figure 1

Plot of empirical cumulative distribution functions [formula] of attachment scores in buccal and mesial sites at baseline.

Figure 2

Plot of empirical cumulative distribution functions [formula] of attachment scores in buccal and mesial sites at 18 months.

Another interesting question, as discussed in Section 1, is whether the distributions of attachment loss scores differ between the teeth of the upper and lower jaws. To investigate this using the same data, we consider attachment loss at the mesial site of a tooth, although one can also pose the same question for the buccal site. The null hypothesis is that the distribution of attachment loss at the mesial site of a tooth is the same for the upper and lower jaws. The setting is quite similar to that of the previous problem. The difference is that here the mesial site attachment loss score (outcome) of a tooth (unit) in a particular jaw (group) of an individual (cluster) may be related to the number of teeth present in that jaw of that individual. So, we may have some informativeness in the ICG size (number of teeth present in a jaw of an individual) even after conditioning on the cluster sizes. We consider the 60-month data for this analysis, with 292 subjects available at that point. These data fall under the category of clustered data with some clusters having incomplete ICG structures, as described in Section 2.2, because a few subjects (clusters) have teeth (units) in only one of the two jaws (groups). Our new test, developed in Section 2.2, is the only test that can be used to test the hypothesis under this setting, and it gives a p-value of [formula]. Thus, we conclude that there is a significant difference between the distributions of the attachment loss at the mesial sites of the upper and lower jaws. The effect size, estimated as before, comes out to be around 3.0 units for these data. Web Figure 2 (in Web Appendix F) shows the empirical cumulative distribution functions [formula] for the attachment loss scores of the upper and lower sets of teeth.

5 Discussions

For clustered data with informative cluster sizes, the ordinary rank-sum test assuming independent observations can be biased, as indicated by the simulation study in Section 3. The rank-sum test by Datta and Satten (2005), which compares group-specific marginal distributions [formula], appears to be a valid test under informative cluster sizes. But when an outcome from a group d [formula] in a typical cluster depends on the number of observations from group d in that cluster, we have informativeness in the ICG sizes formed by the two groups. As discussed in Sections 1 and 4, this type of clustered data with informative ICG sizes is common in dental studies. The simulation studies in Section 3 indicate that even the rank-sum test by Datta and Satten (2005) has inflated size under this scenario of informative ICG size. There are no rank-based tests in the current literature that address this issue. Thus, our main focus was to develop a rank-sum test for clustered data that works under informative ICG sizes. This has led us to compare the group-specific marginal distribution [formula], which gives equal weight to each cluster (treating the cluster as the basic sampling unit), but in which the weight given to an outcome from group d in a cluster depends on the number of observations from group d in that cluster. This is in contrast with [formula], where the weight given to an outcome from a typical cluster depends on the number of outcomes in that cluster, ignoring the group membership of that outcome. Thus, the question of importance is which marginal distribution should be considered in testing the hypothesis. It appears that comparing [formula] may be more meaningful under informative ICG sizes, and through a number of simulation settings we have shown that our test maintains the nominal size and has substantial power in clustered data with informative ICG sizes.
Even when the ICG sizes are not informative, the simulation studies in Web Appendix E reveal that our test closely maintains the nominal size and has acceptable power when compared to other rank tests based on [formula] or [formula].
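The contrast between the two weighting schemes can be made concrete with a small sketch of the corresponding empirical distribution functions. This is an illustrative reading of the verbal description above, not the estimators as defined in Section 2: the function names, the data layout (each cluster as a list of (group, outcome) pairs), and the choice to skip clusters with no outcomes from the group in the first estimator are all assumptions.

```python
def icg_weighted_ecdf(clusters, group, x):
    """Equal weight per cluster; within a cluster, each group-`group`
    outcome is weighted by 1 / (number of group-`group` outcomes there)."""
    total, clusters_used = 0.0, 0
    for cluster in clusters:
        values = [y for g, y in cluster if g == group]
        if values:  # ASSUMPTION: clusters without this group are skipped
            total += sum(y <= x for y in values) / len(values)
            clusters_used += 1
    return total / clusters_used


def cluster_size_weighted_ecdf(clusters, group, x):
    """Each outcome is weighted by 1 / (full cluster size), whatever its
    group, and the weights are then normalised over the chosen group."""
    numerator = denominator = 0.0
    for cluster in clusters:
        values = [y for g, y in cluster if g == group]
        numerator += sum(y <= x for y in values) / len(cluster)
        denominator += len(values) / len(cluster)
    return numerator / denominator


# Two toy clusters of size 4 with very different ICG sizes for group 0.
example_clusters = [
    [(0, 1), (0, 2), (0, 3), (1, 5)],
    [(0, 4), (1, 1), (1, 2), (1, 3)],
]
icg_value = icg_weighted_ecdf(example_clusters, group=0, x=3)
size_value = cluster_size_weighted_ecdf(example_clusters, group=0, x=3)
```

In this toy example both clusters have the same total size, so cluster-size weighting treats all group-0 outcomes alike and gives 0.75, whereas ICG-size weighting gives each cluster equal say and yields 0.5. The two marginals therefore need not coincide when ICG sizes vary.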

With clustered data we may, in practice, encounter a few clusters that have outcomes from only one of the two groups under study. There are two possible ways of addressing this issue. One simple way is to ignore the clusters that do not have outcomes from both groups and carry out the test developed in Section 2.1 on the remaining clusters. But it is often suspected that the information on the outcome of interest differs between clusters with incomplete ICG structures (i.e., clusters with observations from only one of the two groups) and clusters having observations from both groups. Keeping this in mind, we extended our test, in Section 2.2, to account for the clusters with incomplete ICG structures, so that we effectively use all the information present in the data. A simulation study showed that our test has the correct size and substantial power for a model accommodating incomplete ICG structures with informative ICG sizes. But one can expect the power of this test to be low compared to that of the test involving only clusters with complete ICG structures. Therefore, in the presence of a few clusters with incomplete ICG structures among a large number of clusters, it might be important to decide beforehand whether to apply the test developed in Section 2.1, ignoring a few clusters, or to use the test from Section 2.2, keeping the full data. For clustered data where the outcomes within the same cluster belong to the same group, our test statistic reduces to that of Datta and Satten (2005) and, thus, will have better size and power performance than the rank-sum test by Rosner et al. (2003) when the correlation structure within a cluster depends on the group membership.
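The first option, restricting attention to clusters with complete ICG structures, amounts to a simple partition of the data. The sketch below is a hypothetical helper under the assumption that each cluster is stored as a list of (group, outcome) pairs with groups coded 0 and 1; it only separates the clusters and does not implement either test.

```python
def split_by_icg_structure(clusters, groups=(0, 1)):
    """Partition clusters into those containing every group (complete ICG
    structure) and those missing at least one group (incomplete)."""
    complete, incomplete = [], []
    for cluster in clusters:
        groups_present = {g for g, _ in cluster}
        if all(g in groups_present for g in groups):
            complete.append(cluster)
        else:
            incomplete.append(cluster)
    return complete, incomplete


toy_clusters = [
    [(0, 2.0), (1, 3.5)],            # both groups present
    [(0, 1.0), (0, 4.0)],            # group 1 missing
    [(1, 2.5)],                      # group 0 missing
    [(0, 3.0), (1, 1.5), (1, 2.0)],  # both groups present
]
complete_clusters, incomplete_clusters = split_by_icg_structure(toy_clusters)
```

Counting the incomplete clusters before the analysis gives a quick sense of how much data would be discarded by the first option.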

Sometimes, when testing for group effect in outcomes from clustered data, one can expect the presence of some additional covariate(s) unrelated to the grouping factor. In such cases, these additional covariates (confounders) may act as nuisance factors in comparing the group-specific marginal distributions of the outcomes. For example, suppose we have a linear regression of the form

Here [formula] is the outcome of the jth observation in the ith cluster, [formula] is the binary indicator variable taking value 1 or 0 according to the group membership, [formula] and [formula] are the confounders (unrelated to the group membership), and [formula] is the random error following some unknown distribution [formula]. To compare the group-specific marginal distributions of the outcomes, one may want to test the null hypothesis [formula] against the alternative hypothesis [formula]. But if the (unknown) distributions of the confounders differ from that of the random error and also among themselves, then rank tests based on the outcome Y can be misleading. This is, in general, true for any regression model involving confounders. To overcome this, one often uses aligned rank tests (see, e.g., Hájek, Šidák, and Sen, 1999, Section 10.1.2). The basic idea involves estimating the (nuisance) parameters relating to the confounders through some appropriate rank statistics, forming aligned observations (residuals) by plugging in the estimates, and then developing a rank test based on the aligned observations. In the presence of informative ICG size, one can extend the resampling technique discussed in this article to formulate suitable rank-based statistics for estimating the nuisance parameters and testing the appropriate (sub)hypothesis under aligned rank tests.
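The alignment idea can be sketched in a deliberately simplified setting. The code below is not the procedure proposed in this article: it assumes independent (unclustered) observations and a single confounder, it swaps in ordinary least squares for the rank-based estimation of the nuisance parameter, and it uses the classical rank-sum null mean and variance without tie correction. All function names are hypothetical.

```python
import math


def least_squares_line(x, y):
    """Intercept and slope of the ordinary least-squares fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def midranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks


def aligned_rank_sum_z(outcome, group, confounder):
    """Fit the confounder under the null (group effect omitted), rank the
    residuals, and standardise the rank sum of group 1."""
    intercept, slope = least_squares_line(confounder, outcome)
    residuals = [y - intercept - slope * c for y, c in zip(outcome, confounder)]
    ranks = midranks(residuals)
    n = len(group)
    n1 = sum(group)
    n0 = n - n1
    rank_sum = sum(r for r, g in zip(ranks, group) if g == 1)
    null_mean = n1 * (n + 1) / 2.0
    null_var = n0 * n1 * (n + 1) / 12.0  # classical variance, no tie correction
    return (rank_sum - null_mean) / math.sqrt(null_var)


# Toy data: outcome = 2 * confounder + 3 * group, with no noise.
confounder = [0, 1, 2, 3, 4, 5, 6, 7]
group = [0, 1, 0, 1, 0, 1, 0, 1]
outcome = [2 * c + 3 * g for c, g in zip(confounder, group)]
z_aligned = aligned_rank_sum_z(outcome, group, confounder)
```

In the toy data the confounder trend partly masks the group shift in the raw ranks; after alignment, all group 1 residuals exceed all group 0 residuals. Whether the classical null variance remains appropriate after alignment, and how to handle clustering, are exactly the issues an extension of the article's method would need to address.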

6 Supplementary Materials

Web Appendices referenced in Sections 2, 3, and 4, and R code for implementing the novel rank-sum test, are available with the paper at the Biometrics website on Wiley Online Library.

Acknowledgements

This research was supported by NIH grants 1R03DE020839 and 1R03DE022538. The authors would like to thank Jim Beck and Kevin Moss in the School of Dentistry at the University of North Carolina for providing the data set on periodontal disease from the Piedmont 65+ Dental study. We also thank the editor, the associate editor, and a referee for their constructive comments.

References

Beck, J. D., Koch, G. G., Rozier, R. G., and Tudor, G. E. (1990). Prevalence and risk indicators for periodontal attachment loss in a population of older community-dwelling blacks and whites. Journal of Periodontology 61, 521–528.

Blazer, D. G. and George, L. K. (2004). Established Populations for Epidemiologic Studies of the Elderly, 1996–1997: Piedmont Health Survey of the Elderly, Fourth In-Person Survey [Durham, Warren, Vance, Granville, and Franklin Counties, North Carolina] [Computer file]. ICPSR02744-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], doi: 10.3886/ICPSR02744.

Datta, S. and Satten, G. A. (2005). Rank-sum tests for clustered data. Journal of the American Statistical Association 100, 908–915.

Datta, S. and Satten, G. A. (2008). A signed-rank test for clustered data. Biometrics 64, 501–507.

Hájek, J., Šidák, Z., and Sen, P. K. (1999). Theory of Rank Tests. San Diego, California: Academic Press.

Hoffman, E. B., Sen, P. K., and Weinberg, C. R. (2001). Within-cluster resampling. Biometrika 88, 1121–1134.

Huang, Y. and Leroux, B. (2011). Informative cluster sizes for subcluster-level covariates and weighted generalized estimating equations. Biometrics 67, 843–851.

Rosner, B., Glynn, R. J., and Ting Lee, M. L. (2003). Incorporation of clustering effects for the Wilcoxon rank sum test: A large-sample approach. Biometrics 59, 1089–1098.

Williamson, J. M., Datta, S., and Satten, G. A. (2003). Marginal analyses of clustered data when cluster size is informative. Biometrics 59, 36–42.

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin 1, 80–83.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)