Sandipan Dutta, Somnath Datta, A Rank-Sum Test for Clustered Data When the Number of Subjects in a Group within a Cluster is Informative, Biometrics, Volume 72, Issue 2, June 2016, Pages 432–440, https://doi.org/10.1111/biom.12447
Summary
The Wilcoxon rank-sum test is a popular nonparametric test for comparing two independent populations (groups). In recent years, there have been renewed attempts in extending the Wilcoxon rank sum test for clustered data, one of which (Datta and Satten, 2005, Journal of the American Statistical Association 100, 908–915) addresses the issue of informative cluster size, i.e., when the outcomes and the cluster size are correlated. We are faced with a situation where the group specific marginal distribution in a cluster depends on the number of observations in that group (i.e., the intra-cluster group size). We develop a novel extension of the rank-sum test for handling this situation. We compare the performance of our test with the Datta–Satten test, as well as the naive Wilcoxon rank sum test. Using a naturally occurring simulation model of informative intra-cluster group size, we show that only our test maintains the correct size. We also compare our test with a classical signed rank test based on averages of the outcome values in each group paired by the cluster membership. While this test maintains the size, it has lower power than our test. Extensions to multiple group comparisons and the case of clusters not having samples from all groups are also discussed. We apply our test to determine whether there are differences in the attachment loss between the upper and lower teeth and between mesial and buccal sites of periodontal patients.
1 Introduction
Rank-based tests are very popular nonparametric methods for comparing two groups or populations. They are particularly useful when the underlying distributions are suspected to be non-normal. One such widely used test for comparing two groups is the Wilcoxon rank-sum test (Wilcoxon, 1945). An important assumption for the applicability of the Wilcoxon rank-sum test is that all the observations in the study are independent. However, this assumption may be violated under certain circumstances. In many practical situations, we have clustered data where the observations within the clusters are correlated. An example of such clustered data is attachment loss measured on different teeth of the same individual. The Wilcoxon rank-sum test may not be a good option for this type of clustered data. Rosner, Glynn, and Lee (2003) proposed a rank-sum test for clustered data for the cases where all the cluster members are from the same group and the correlation structure within a cluster is common across groups. But this approach does not work when the members of a single cluster do not necessarily belong to the same group. Also, it does not maintain the nominal size when the number of observations in a cluster (the cluster size) is associated with the outcome of interest from that cluster in some way. This is a case of informative cluster size, where the informativeness comes from the fact that the cluster size may be affected by some latent (cluster-specific) factor that also affects the outcome variable in that cluster. Datta and Satten (2005) proposed a rank-sum test for clustered data that does not make any assumption on the nature of clustering and performs reasonably well in the case of informative cluster sizes. 
However, even the test by Datta and Satten (2005) does not seem to perform well in situations where the outcome of interest belonging to a group in a given cluster appears to be correlated with the number of observations having the same group membership (i.e., the intra-cluster group size) within that cluster. Following the idea of informative cluster size, this scenario can be thought of as one of informative intra-cluster group (ICG) sizes. Informative ICG sizes can arise in a dental study when one is interested in comparing the attachment losses of teeth between the upper and lower jaws, because any difference in attachment loss between the teeth of the upper and lower jaws can be suspected to be associated with the difference between the numbers of teeth present in the two jaws. Another interesting situation where informative ICG sizes come into play can be found in studies of hereditary diseases. In many genetic studies, it has been observed that an inherited disease is often diagnosed at a younger age in a later generation than in an earlier generation. This phenomenon of earlier onset of a disease in each successive generation of a family, called anticipation, is prevalent in diseases such as non-Hodgkin's lymphoma, breast and ovarian cancer, and Huntington's disease, among others. When testing for anticipation, or more generally when testing whether the age at onset of a disease differs between two generations of large pedigrees, one interesting piece of information might be the number of affected individuals, belonging to a certain age interval, that are present in each of the two generations under study. “Affected” individuals include subjects who are diseased at the time of the study as well as those who were known to be diseased at some point before the study. 
If we find that there is a large difference in the number of affected individuals (belonging to that age interval) between the two generations, then one may suspect that this difference is associated with the difference in the onset age between the two generations. So, this might be a case of informativeness in the number of subjects (affected individuals in a certain age interval) in a group (generation) within a cluster (a large pedigree). Motivated by these examples, we develop a rank-sum test for comparing the marginal distributions of outcomes from different groups when the ICG sizes are informative. We return to the example of tooth attachment loss in the application section.
Extending the idea of within-cluster resampling (Hoffman, Sen, and Weinberg, 2001; Williamson, Datta, and Satten, 2003; Datta and Satten, 2005; Datta and Satten, 2008), we obtain a rank-sum test for clustered data, with observations from both groups being present in every cluster. Our resampling scheme extends the usual within-cluster resampling: instead of resampling one observation at random from each cluster, we first resample one group membership (out of the two possible groups) for a cluster and then resample an outcome from that group within that cluster. We repeat this resampling for each cluster and obtain a rank-sum statistic based on the resampled observations. Then, following the approach of Datta and Satten (2005), we derive our test statistic by averaging the rank-sum statistic over all possible choices of the resampled observations given the data. After constructing our test, we compare it with three other existing tests, including the test by Datta and Satten (2005), under naturally occurring simulation scenarios of informative ICG sizes. We show that our test maintains the correct size under the null hypothesis of marginal symmetry, unlike the test by Datta and Satten (2005). Moreover, our test has better power than the three other tests in this simulation study. We also show that our test has acceptable size and power in simulation settings with informative cluster sizes but noninformative ICG sizes, and in settings where both the cluster sizes and the ICG sizes are noninformative. Additionally, we extend our two-group test statistic to cases where some of the clusters may have observations from only one of the two groups (i.e., the intra-cluster group structures are incomplete). We present a simulation study showing that our test still maintains the appropriate size and has reasonable power under this scenario of incomplete ICG structures within a cluster. 
We also discuss an extension to our test where there are observations from more than two groups in every cluster.
The rest of the article is organized as follows. In Section 2, we introduce the necessary notations, formulate our testing problem, and develop a test statistic for comparing the outcomes from two groups when outcomes from both the groups are present in every cluster. This section also contains some variant forms of our test under different clustered data settings including expressions of the test statistic with other required quantities that generalize our new rank-sum test to more than two groups and an extension of our test statistic for two group comparison where some clusters may have observations from only one of the two groups. Section 3 contains simulation studies that evaluate the empirical performance of our test compared to three other tests on the basis of size and power. In Section 4, we return to the dental data discussed before and apply our testing procedure to compare the difference between the tooth attachment loss in the upper and lower jaws. Besides this, we apply our test for a different comparison in dental data where some other rank-sum tests can also be applied. The article ends with a discussion in Section 5. Detailed steps for deriving our test statistic are discussed in the Web Appendix (Supplementary Web Materials).
2 Notations, Formulation of the Problem, and Proposed Test Statistic
Let $M$ denote the number of clusters and let $X_{ij}$ denote the $j$th observation in the $i$th cluster, $1 \le j \le n_i$, $1 \le i \le M$, where $n_i$ denotes the number of observations in the $i$th cluster. Let $G_{ij}$ be the indicator denoting the binary group membership (0 or 1) of the $j$th observation in the $i$th cluster. Thus, the entire data set consists of $\{(X_i, G_i, n_i) : 1 \le i \le M\}$, with $X_i = (X_{i1}, \ldots, X_{in_i})$ and $G_i = (G_{i1}, \ldots, G_{in_i})$ corresponding to the $i$th cluster. Also, let $n_{i1} = \sum_{j=1}^{n_i} G_{ij}$ and $n_{i0} = n_i - n_{i1}$ be the numbers of observations in the $i$th cluster belonging to group 1 and group 0, respectively. Thus, we have $n_i = n_{i0} + n_{i1}$. We consider the possibility that the cluster size $n_i$ as well as the group memberships $G_{ij}$ are random (and thus, so are the $n_{id}$). The members in a cluster could have an arbitrary dependence structure; however, members in different clusters are statistically independent and hence the entire $(X_i, G_i, n_i)$ and $(X_{i'}, G_{i'}, n_{i'})$ are independent for $i \ne i'$. For mathematical convenience, we further assume that $(X_i, G_i, n_i)$, $1 \le i \le M$, are independent and identically distributed (iid).
The null hypothesis we consider is that the observations from the two groups follow the same marginal distribution. Mathematically, it is written as

However, the empirical analogue of the above “group specific” (e.g., conditional) marginal distributions can be constructed in three possible ways resulting in three different statistical comparisons:

Note that calculation (i) represents the (empirical) distribution of group $d$ data values in the entire sample irrespective of their cluster membership. Calculation (ii) is based on sampling a single pair (e.g., $(X_{iJ}, G_{iJ})$) from each cluster. In other words, the estimate in (ii) represents the conditional distribution of a typical outcome value $X_{iJ}$ for a typical cluster $i$ given that the corresponding group membership $G_{iJ}$ equals $d$. Here, $J$ is discrete uniform on $\{1, \ldots, n_i\}$. Calculation (iii) is based on computing the proportion of outcomes belonging to group $d$ in a typical cluster $i$ which are less than or equal to $x$ and then taking the average of these proportions over all the clusters. Each of the quantities on the right-hand sides of (i), (ii), and (iii) can be written as an estimate of $P(X \le x \mid G = d)$, but the difference lies in the construction of the estimates of the probabilities. In (i), the probabilities are estimated by pooling all the observations together irrespective of their cluster membership, while in (ii) and (iii), the estimates are constructed by conditioning on the cluster size $n_i$ and the ICG size $n_{id}$, respectively. Every outcome belonging to group $d$ and having value at most $x$ contributes equally to the construction in (i), but in the constructions in (ii) and (iii) different outcomes contribute differently depending on their cluster memberships.
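For concreteness, the three constructions described above can be written out as follows. This is a sketch reconstructed from the verbal descriptions, not a quotation of the displayed formulas; the symbols $\hat F_d^{(\cdot)}$, $X_{ij}$ (outcome), $G_{ij}$ (group indicator), $n_i$ (cluster size), and $n_{id}$ (number of group $d$ observations in cluster $i$) are notational assumptions, and (iii) presumes $n_{id} > 0$ in every cluster.

```latex
\begin{align*}
\text{(i)}\quad \hat F_d^{(\mathrm{i})}(x)
  &= \frac{\sum_{i=1}^{M}\sum_{j=1}^{n_i} I(X_{ij}\le x,\ G_{ij}=d)}
          {\sum_{i=1}^{M}\sum_{j=1}^{n_i} I(G_{ij}=d)},\\[4pt]
\text{(ii)}\quad \hat F_d^{(\mathrm{ii})}(x)
  &= \frac{\sum_{i=1}^{M} n_i^{-1}\sum_{j=1}^{n_i} I(X_{ij}\le x,\ G_{ij}=d)}
          {\sum_{i=1}^{M} n_i^{-1}\sum_{j=1}^{n_i} I(G_{ij}=d)},\\[4pt]
\text{(iii)}\quad \hat F_d^{(\mathrm{iii})}(x)
  &= \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_{id}}\sum_{j=1}^{n_i} I(X_{ij}\le x,\ G_{ij}=d).
\end{align*}
```

Written this way, the weighting is visible at a glance: every observation has weight one in (i), weight $1/n_i$ in (ii), and weight $1/n_{id}$ in (iii).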
Let $F_d^{(\mathrm{i})}$, $F_d^{(\mathrm{ii})}$, $F_d^{(\mathrm{iii})}$ be the distribution functions which are estimated by the right-hand sides of (i), (ii), and (iii), respectively. When the cluster sizes as well as the ICG sizes formed by the two groups within each cluster are not suspected to be associated with the outcome variable in any way, the hypotheses involving $F_d^{(\mathrm{i})}$, $F_d^{(\mathrm{ii})}$, $F_d^{(\mathrm{iii})}$ become equivalent and one can test any one of the three. If there is some association between the cluster size and the outcome variable in that cluster, one can test the hypothesis involving $F_d^{(\mathrm{ii})}$ for an appropriate comparison. This is a situation of informative cluster sizes. Again, if the ICG sizes formed by the two groups in a cluster appear to be correlated (even after conditioning on the overall cluster size) with the outcomes from the respective groups in that cluster, one may test the hypothesis involving $F_d^{(\mathrm{iii})}$ instead of $F_d^{(\mathrm{i})}$ and $F_d^{(\mathrm{ii})}$ to get more meaningful results. We refer to this as a case of informative ICG sizes. In the absence of this informativeness in the ICG sizes, one can test the null hypothesis of equality of marginal distributions involving either $F_d^{(\mathrm{ii})}$ or $F_d^{(\mathrm{iii})}$, possibly leading to a similar conclusion in each case.
In this article, we are interested in comparing the marginal distributions estimated in (iii) in the two groups when the ICG sizes are potentially informative. Currently, no rank-based test is available for testing group differences in clustered data that takes into account the informativeness of the ICG sizes formed by the groups under study. We denote the common marginal distribution under the null hypothesis as $F$. It is perhaps worth pointing out that the estimation of the marginal regression parameters via weighted estimating equations in the presence of informative ICG size has been considered by Huang and Leroux (2011).
2.1 Development of the Test Statistic
For the sake of simplicity, let us relabel the observations according to their group membership within each cluster in the following way. In the $i$th cluster, let $X_{i1}^{(1)}, \ldots, X_{in_{i1}}^{(1)}$ represent the set of observations belonging to the group indexed by 1, while $X_{i1}^{(0)}, \ldots, X_{in_{i0}}^{(0)}$ represents the set of observations belonging to the group indexed by 0. We denote these sets as $S_{i1}$ and $S_{i0}$, respectively. Thus, $S_{i1}$ and $S_{i0}$ form a partition of $S_i = \{X_{i1}, \ldots, X_{in_i}\}$, the set of all observations in cluster $i$. The number of observations belonging to the set $S_{id}$ is the intra-cluster group size of group $d$ in the $i$th cluster. Until the end of Section 2.1, we assume that at least one observation from each group is present in every cluster, that is, $n_{i1} > 0$ and $n_{i0} > 0$ with probability one for every cluster $i$. A relaxation of this condition is discussed in Section 2.2.
Our test statistic, for testing the hypothesis involving the marginal distributions as estimated in (iii), can be generated from a resampling scheme which is an extension of within-cluster resampling (WCR). An outline of the resampling scheme is as follows. For each cluster $i$, resample the group membership as $G_i^*$, where $G_i^*$ takes value 0 or 1 with equal probability $1/2$. If $G_i^* = 1$, we resample one observation for the $i$th cluster from the set $S_{i1}$ of group 1 observations in that cluster and name it $X_i^*$. If $G_i^* = 0$, resample $X_i^*$ from the set $S_{i0}$ of group 0 observations in that cluster.
The fact that the outcomes are resampled from the subsets formed by the two groups in a cluster and not from the whole cluster makes this resampling scheme different from the usual WCR technique. Now, this resampling gives us M pairs of independent observations
. If
be the rank of
among the set
,
, i.e.,
= 1+
, then the Wilcoxon rank sum statistic based on these M pairs of resampled observations
would be of the form:
. One can use
as a valid test statistic and carry out the test based on
. But that test would be inefficient as the test statistic would depend too much on one particular observation chosen from each cluster. So to get rid of the imposed randomization due to resampling, we propose a test statistic based on earlier approaches of Williamson et al (2003), Datta and Satten (2005), and Datta and Satten (2008), that corresponds to averaging
over all possible choices of
,
values given the data.
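The resampling scheme described above can be approximated numerically. The following is a minimal sketch, not the closed-form statistic derived in the paper: it repeatedly draws one group and then one outcome per cluster, computes a rank sum for each draw, and averages over draws, which approximates the average over all resampling choices given the data. The function names, the use of midranks for ties, and the unnormalized rank sum (sum of ranks of the resampled observations whose resampled group is 1) are assumptions made for illustration.

```python
import random


def resampled_rank_sum(clusters, rng):
    """One draw of the rank-sum statistic under the two-stage resampling.

    clusters: list of (group0_values, group1_values); both lists must be non-empty.
    """
    groups, values = [], []
    for group0, group1 in clusters:
        g = rng.randint(0, 1)                         # group 0 or 1, each with probability 1/2
        x = rng.choice(group1 if g == 1 else group0)  # one outcome from the chosen group
        groups.append(g)
        values.append(x)

    rank_sum = 0.0
    for g, x in zip(groups, values):
        if g == 1:
            below = sum(v < x for v in values)
            tied = sum(v == x for v in values)        # counts x itself
            rank_sum += below + (tied + 1) / 2        # midrank of x among the M resampled values
    return rank_sum


def averaged_rank_sum(clusters, n_draws=4000, seed=0):
    """Monte Carlo average of the resampled rank sum over repeated draws."""
    rng = random.Random(seed)
    total = sum(resampled_rank_sum(clusters, rng) for _ in range(n_draws))
    return total / n_draws
```

Averaging over draws removes most of the dependence on any single resampled observation; the closed-form expression in the paper achieves this exactly, without simulation error.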
Thus, our test statistic is
, where
and
. We can calculate the theoretical expression of T. After some necessary steps, a convenient expression of T (see Web Appendix A for the detailed steps) turns out to be

where
.
Besides T, we need to know its expected value
and its variance estimate
to properly carry out inference based on
To get
, we note that
. The unconditional expectation of
can be calculated easily through conditioning on the vector of group membership indicator
. So we get,
.
The next step is to find a variance estimate
. To get the variance estimate of T, we employ the jackknife technique. Here, the clusters can be thought of as iid units and thus we can use a “delete-1-cluster” jackknife approach to get the necessary results. Mathematically, this can be formulated as follows. Let
be the value of the statistic T calculated after deleting the
cluster. Let us define,
. Then, the estimate of variance of T, which is the jackknife variance estimate, is given by

Now that we have the expressions for $T$, $E(T)$, and $\widehat{\mathrm{Var}}(T)$, we can carry out the testing using the absolute value of the standardized statistic $Z = \{T - E(T)\}/\{\widehat{\mathrm{Var}}(T)\}^{1/2}$. The asymptotic distribution of $Z$ is established through the following theorem. An outline of its proof is given in Web Appendix B.
THEOREM 1. (Asymptotic normality). Under $H_0$, $Z \to N(0, 1)$ in distribution as $M \to \infty$, under certain regularity conditions of a Lindeberg Central Limit Theorem.
The p-value for the test is computed as the probability that, under $H_0$, the absolute value of the Z-statistic exceeds its observed value in magnitude. We reject the null hypothesis $H_0$ at the $100\alpha$% level of significance if the p-value is less than $\alpha$.
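The delete-1-cluster jackknife and the two-sided p-value described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the variance uses the standard delete-one jackknife form with the factor $(M-1)/M$, which may differ in scaling from the displayed formula; `statistic` is a hypothetical user-supplied function that maps a list of clusters to the value of the test statistic; and the p-value uses the standard normal limit from Theorem 1.

```python
import math


def jackknife_variance(clusters, statistic):
    """Delete-one-cluster jackknife variance estimate of statistic(clusters)."""
    m = len(clusters)
    # Recompute the statistic with each cluster left out in turn.
    leave_one_out = [statistic(clusters[:i] + clusters[i + 1:]) for i in range(m)]
    mean_loo = sum(leave_one_out) / m
    return (m - 1) / m * sum((t - mean_loo) ** 2 for t in leave_one_out)


def two_sided_p_value(t, expected, variance):
    """Standardize t and return (z, p), with p = P(|N(0,1)| > |z|)."""
    z = (t - expected) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p
```

Treating whole clusters as the units that are deleted is what lets the estimate respect arbitrary dependence within a cluster.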
Till this point, we have assumed the existence of only two groups in every cluster. In Web Appendix C, we have discussed a more general situation where there are m groups in every cluster, such that
.
2.2 Extension to Incomplete Intra-Cluster Group Structure in One or More Clusters
In the case of binary grouping (i.e., $G_{ij} = 0$ or 1), we have assumed that there is at least one observation from each group in every cluster. In practice, one may encounter a few clusters (not all) with one group of observations completely missing. In other words, there may be some clusters having outcomes from only one of the two possible groups. We refer to such a case as an incomplete informative intra-cluster group structure within a cluster. The hypothesis of interest remains the same, namely, whether the marginal distributions of outcomes are the same for the two groups. We cannot directly apply the test statistic in the form described in Section 2.1 to this setting, because that statistic is applicable only under the assumption that outcomes from both groups are available within each cluster. We extend the approach described in Section 2.1 to obtain a valid test statistic in this setting.
Here, we follow the same notations as described in Section 2.1. In cases of incomplete ICG structures within a cluster, the empirical analogue of the “group specific” marginal distributions of our interest can be constructed as a modification of
as
where
or
, or
according to whether the
cluster has observations from both groups
the dth group only, or not.
We extend the idea of within cluster resampling also to this setting to get a valid test statistic.
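If both $n_{i0} > 0$ and $n_{i1} > 0$, group membership is resampled as $G_i^*$, where $G_i^*$ takes value 0 or 1 with equal probability $1/2$. If $G_i^* = 0$, resample $X_i^*$ from $S_{i0}$, the set of group 0 observations in cluster $i$; otherwise, if $G_i^* = 1$, resample $X_i^*$ from $S_{i1}$, the set of group 1 observations in cluster $i$.
If $n_{i0} > 0$ and $n_{i1} = 0$, we resample $X_i^*$ from $S_{i0}$ and have $G_i^* = 0$. Here, $S_{i0}$ is the same as $S_i$, the set of all observations in cluster $i$, as the set $S_{i1}$ is an empty set.
If $n_{i0} = 0$ and $n_{i1} > 0$, we resample $X_i^*$ from $S_{i1}$ and have $G_i^* = 1$. Here, $S_{i1}$ is the same as $S_i$, as the set $S_{i0}$ is an empty set.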
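The three cases above translate directly into a per-cluster resampling step. The sketch below is illustrative only; the function name `resample_pair` and the representation of a cluster as two Python lists are assumptions, not part of the original method description.

```python
import random


def resample_pair(group0, group1, rng):
    """Return (g_star, x_star) for one cluster; either group list may be empty."""
    if group0 and group1:
        g_star = rng.randint(0, 1)  # both groups present: pick a group with probability 1/2
    elif group0:
        g_star = 0                  # only group 0 observed in this cluster
    elif group1:
        g_star = 1                  # only group 1 observed in this cluster
    else:
        raise ValueError("cluster has no observations")
    x_star = rng.choice(group1 if g_star == 1 else group0)
    return g_star, x_star
```

A cluster with one group missing therefore always contributes an observation from the group that is present, so no cluster is discarded.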
To obtain our test statistic T in this case, we proceed in the same way as in Section 2.1. With
being the rank of
among the set
,
, we obtain
, the Wilcoxon rank sum statistic based on the M pairs of resampled observations
. Then, our proposed test statistic T is calculated as
. After some algebra (see Web Appendix D), we obtain T as

where

The expected value of the test statistic is estimated to be

Now, to find the estimated variance $\widehat{\mathrm{Var}}(T)$ of $T$, we use the same “delete-1-cluster” jackknife approach described in Section 2.1. Finally, as in Section 2.1, we carry out the testing using the standardized Z-statistic $Z = \{T - \hat{E}(T)\}/\{\widehat{\mathrm{Var}}(T)\}^{1/2}$, which has an asymptotic $N(0, 1)$ distribution under $H_0$.
3 Simulation Results
In this section, we present three simulation studies corresponding to the tests discussed in Sections 2.1 and 2.2. In simulation scenario 1, we consider clustered observations such that every cluster has outcomes from both groups. In each cluster, the number of observations belonging to group 1 and the number of observations belonging to group 0, that is, the two ICG sizes, are both influenced by a latent factor that also influences the outcomes in that cluster. Also, the distributions of the two ICG sizes within each cluster differ from each other. So, there is some association between the ICG sizes and the outcomes in a given cluster (even after conditioning on the overall cluster size), and we can think of this as a case of informative ICG sizes. Under this simulation scenario, we compare the performances of four tests, namely, (1) our new rank-sum test developed in Section 2.1, (2) the test by Datta and Satten (2005), (3) the naive Wilcoxon rank-sum test assuming all the observations are iid and ignoring their cluster membership, and (4) the signed rank test taking cluster averages for each group of observations. Further, each test was carried out under three different choices of the number of clusters (M), namely, 30, 50, and 150. In simulation scenario 2, we generate a setting that closely represents the dental setting discussed in Section 1. The idea is to have clustered data with informative ICG sizes, where the number of units belonging to each group in a cluster cannot exceed a certain value. Under this setting, we compare the four tests (1)–(4) for 50 clusters. In scenario 3, we again consider informativeness in the ICG sizes, but we do not require observations from both groups to be present in each cluster. In other words, we include cases of incomplete ICG structures within a cluster, for which our test statistic developed in Section 2.2 is appropriate. We investigate the performance of this test for a simulation model with 30 clusters under scenario 3.
Additionally, in Web Appendix E we consider two more simulation scenarios (Scenarios 4 and 5), where we compare the four tests (1)–(4) under situations such that either the ICG sizes or both the ICG sizes and the cluster sizes are noninformative.
Performances of all the tests are evaluated on the basis of their sizes (nominal $\alpha = 0.05$) and power values. These are estimated by the proportion of 3000 Monte Carlo iterates in which the null hypothesis is rejected.
3.1 Simulation Scenario 1
Let M be the number of clusters (fixed). For a typical cluster i, we define,
as the number of observations from group 1 in the
cluster,
as the number of observations from group 0 in the
cluster,
as the random cluster effect due to the
cluster. In the
cluster, we generate
from Normal(0,
) distribution,
from Poisson(10 + 5
) distribution where
=
+ 1,
is generated from Poisson(10 + 5
) such that
=
+ 1. Also, we know that
. Let
be the group indicator of the
observation in the
cluster. We assign
for 1
, while
for
. We generate
, the
outcome in the
cluster, through a random effects model as
=
, such that if
, then
Normal
, while if
, then
Normal
. Under the null model,
.
Performances of the four tests (1)–(4) are summarized in Table 1 for three choices of M, namely, 30, 50, and 150.
Table 1. Size, along with a 95% confidence interval, and power comparisons of four tests (nominal $\alpha = 0.05$) under Simulation Scenario 1. The empirical calculations are based on 3000 replicates each.

| M | Test | Size (95% CI) | Power, effect size = 0.05 | Power, effect size = 0.10 | Power, effect size = 0.15 |
|---|---|---|---|---|---|
| 30 | New test | 0.060 (0.052, 0.068) | 0.319 | 0.833 | 1.000 |
| 30 | DS | 0.132 (0.120, 0.144) | 0.050 | 0.203 | 0.500 |
| 30 | W | 0.159 (0.146, 0.172) | 0.058 | 0.263 | 0.645 |
| 30 | CA | 0.055 (0.047, 0.063) | 0.296 | 0.814 | 0.985 |
| 50 | New test | 0.053 (0.045, 0.061) | 0.500 | 0.960 | 1.000 |
| 50 | DS | 0.199 (0.185, 0.213) | 0.050 | 0.310 | 0.730 |
| 50 | W | 0.215 (0.200, 0.230) | 0.061 | 0.390 | 0.830 |
| 50 | CA | 0.051 (0.043, 0.059) | 0.460 | 0.950 | 1.000 |
| 150 | New test | 0.055 (0.047, 0.063) | 0.910 | 1.000 | 1.000 |
| 150 | DS | 0.508 (0.490, 0.526) | 0.052 | 0.699 | 0.900 |
| 150 | W | 0.528 (0.510, 0.546) | 0.073 | 0.778 | 0.993 |
| 150 | CA | 0.050 (0.042, 0.058) | 0.896 | 1.000 | 1.000 |

New test: test developed in Section 2.1; DS: rank-sum test by Datta and Satten; W: Wilcoxon rank-sum test; CA: signed rank test with cluster averages.
Table 1 illustrates a number of points. Our new test closely maintains the nominal size and has substantial power even under small effect sizes. The rank-sum test proposed by Datta and Satten (2005) and the standard Wilcoxon rank-sum test have grossly inflated size and much lower power than our test for all three choices of the number of clusters. The size of the cluster average signed rank test is close to the nominal size under this simulation scenario. Its power is also close to that of our test, though slightly lower in almost all cases. Although the cluster average signed rank test appears to be a good competitor of our test in this simulation scenario, the distribution of the average of independent and identically distributed random variables is not, in general, the same as that of the individual variables. Thus, the cluster average signed rank test is not expected to be a good choice for testing the hypothesis of interest, and this may become evident when the ICG sizes within each cluster differ widely.
3.2 Simulation Scenario 2
This simulation setting is carried out to mimic the setting of dental study mentioned in Section 1, where the number of units (teeth) within a cluster (mouth of an individual) cannot exceed 32. This can be generalized for any study where the cluster sizes or the ICG sizes are bounded.
This simulation scenario is almost the same as that described in Section 3.1, the only difference being that both ICG sizes within each cluster are at most 16, so that the cluster size cannot exceed 32. Following the same notation as in Section 3.1, in the
cluster, we generate
from Normal(0,0.25),
from Poisson(10 + 5
) such that
,
from Poisson(10 + 5
) such that
. So, we have
. Apart from these,
,
, and the outcome
are generated in the same manner as in simulation scenario 1. Table 2 compares the four tests (1)–(4) under this simulation scenario with the number of clusters (M) as 50, and the results are similar to the results obtained from simulation scenario 1. Table 2 shows that our new test closely maintains the nominal size and has substantial power under a variety of effect sizes. The rank-sum test proposed by Datta and Satten (2005), as well as the standard Wilcoxon rank sum test, has highly inflated size. The clustered average signed rank test, just like in simulation scenario 1, apparently maintains the nominal size and has substantial power. But, as mentioned before in Section 3.1, theoretically it is not a good choice for testing the hypothesis of our interest.
Size, along with a 95% confidence interval, and power comparisons of four tests (nominal
) under Simulation Scenario 2. The number of clusters,
equals 50. The empirical calculations are based on 3000 replicates each
| . | . | Power (under effect size )
. | ||
|---|---|---|---|---|
| Test . | Size (CI) . | = 0.05
. | = 0.10
. | = 0.15
. |
| New test | 0.054 (0.046, 0.062) | 0.465 | 0.960 | 1.000 |
| DS | 0.146 (0.133, 0.159) | 0.071 | 0.442 | 0.845 |
| W | 0.136 (0.124, 0.148) | 0.073 | 0.501 | 0.916 |
| CA | 0.047 (0.039, 0.055) | 0.445 | 0.960 | 1.000 |
| . | . | Power (under effect size )
. | ||
|---|---|---|---|---|
| Test . | Size (CI) . | = 0.05
. | = 0.10
. | = 0.15
. |
| New test | 0.054 (0.046, 0.062) | 0.465 | 0.960 | 1.000 |
| DS | 0.146 (0.133, 0.159) | 0.071 | 0.442 | 0.845 |
| W | 0.136 (0.124, 0.148) | 0.073 | 0.501 | 0.916 |
| CA | 0.047 (0.039, 0.055) | 0.445 | 0.960 | 1.000 |
New test: test developed in Section 2.1; DS: rank-sum test by Datta and Satten; W: Wilcoxon rank-sum test; CA: signed rank test with cluster averages.
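The reported confidence intervals for the empirical size appear consistent with a normal-approximation (Wald) interval for a proportion based on 3000 replicates. The short check below is a sketch under that assumption; the helper name `wald_interval` is hypothetical and the interval type is inferred from the numbers, not stated in the text.

```python
import math

def wald_interval(p_hat, n_rep, z=1.96):
    """Normal-approximation 95% interval for a rejection proportion."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n_rep)
    return (p_hat - half_width, p_hat + half_width)

# Empirical sizes from the Scenario 2 table, each based on 3000 replicates.
sizes = {"New test": 0.054, "DS": 0.146, "W": 0.136, "CA": 0.047}
intervals = {name: wald_interval(p, 3000) for name, p in sizes.items()}

for name, (lo, hi) in intervals.items():
    print(f"{name}: ({lo:.3f}, {hi:.3f})")
```

Rounded to three decimals, these agree with the tabulated intervals; for the DS and W tests the whole interval lies well above 0.05.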
3.3 Simulation Scenario 3
This simulation scenario is similar to that described in Section 3.1, the only difference being that the ICG sizes within each cluster are not restricted to be strictly positive. Following the notation of Section 3.1, in each cluster we generate a cluster-level quantity from Normal(0, 0.25) and each of the two ICG sizes from a Poisson distribution with mean of the form 10 + 5 × (the corresponding term from Section 3.1); the cluster size is the sum of the two ICG sizes. Apart from these, the remaining quantities and the outcome are generated in the same manner as in simulation scenario 1. Evidently, the observed value of either ICG size in a cluster can be 0, as long as the cluster size is greater than 0.
In Table 3, we evaluate the empirical size and power of our test, developed in Section 2.2, with the choice of
. From Table 3, we see that our test closely maintains the nominal size and has moderate to high power across the effect sizes considered.
Table 3. Size, along with a 95% confidence interval, and power calculations (nominal size 0.05) of the new test developed in Section 2.2 under Simulation Scenario 3. Note that the CA test statistic is not computable in this situation. The number of clusters, M, equals 30. The empirical calculations are based on 3000 replicates each.

| Size | Power, effect size = 0.05 | Power, effect size = 0.10 | Power, effect size = 0.15 |
|---|---|---|---|
| 0.053 | 0.275 | 0.743 | 0.964 |
4 Application to Dental Data
We consider data from the Piedmont 65+ Dental Study by Beck et al. (1990). This study examined two older populations, urban whites and urban and rural blacks. The Piedmont Health Study of the Elderly by Blazer and George (2004), which was the parent study for the Piedmont 65+ Dental Study, was a longitudinal study of the health status of a stratified, clustered, random sample of people aged 65 and over in five contiguous North Carolina counties. The Piedmont 65+ Dental Study used the data available from the parent study while collecting additional information. For the Piedmont 65+ Dental Study, we have the gingival recession and pocket depth measures for all teeth present in the mouth at baseline, 18, 36, and 60 months. Attachment level scores (attachment losses) were computed from the gingival recession and pocket depth measures. All these clinical measures were recorded at two sites, buccal and mesial, for every tooth measured. A number of additional covariates were also available; these are ignored in the present marginal analyses. The number of subjects observed varied across the four time points. This may be because, in a study involving an elderly population, many subjects who were recorded at the beginning of the study failed to return at later time points. For our illustration, we investigate the baseline and 18-month data cross-sectionally.
Attachment loss is a common problem associated with periodontal diseases in the elderly population, often indicating the severity of certain diseases. Some studies have suggested that the nature of attachment loss varies across the different surfaces of a tooth. It is therefore of interest to determine whether the distributions of attachment loss scores are the same for the buccal and mesial surfaces of teeth. Since the outcomes (attachment level scores) from the units (tooth surfaces) within a cluster (individual) are correlated, while those from units in different clusters are independent, the data are of the clustered type we are interested in. In addition, since the cluster size (number of tooth surfaces an individual has) may indicate overall oral health, the cluster size might be associated with the outcome of interest (attachment loss score). We apply our new test and the test of Datta and Satten to investigate possible differences in the distributions of attachment loss at the buccal and mesial sites (the two groups under study) using the baseline data, which involve 697 subjects with at least one tooth. A significant difference was obtained for the novel test (Z =
, p-value =
) and for the Datta and Satten test (Z =
, p-value = 5.56
). So, our new test and the test by Datta and Satten lead to the same conclusion but with different p-values. We then consider the same testing problem but with the data for 18 months (with 496 available subjects) where, again, significant difference was obtained using the new test (Z =
, p-value = 1.48
) as well as the test by Datta and Satten (Z =
, p-value =
). Overall, we conclude that the distribution of the attachment loss of teeth differs between the mesial and buccal sites. Also, we see that our new test gives a consistent result in a situation where the test by Datta and Satten appears to be valid as well. Plots of the empirical cumulative distribution functions of attachment scores in the two groups (buccal and mesial) are shown for the baseline data and the 18-month data in Figures 1 and 2, respectively. These figures give some indication of the significant difference in the distributions of attachment scores between buccal and mesial sites. In addition, plots of the empirical mass functions for mesial and buccal attachment loss scores at baseline are given in Web Figure 1 (in Web Appendix F); they show substantial differences between the mesial and buccal attachment loss scores at the low score values of 1 and 2. Incidentally, these two scores together constitute more than half of the observed scores for the population under study. To calculate the effect size, we use the following approach: if
and
denote the sets of mesial and buccal attachment scores, such that the test statistic T=
, and
be a real number such that
then the effect size is estimated by the absolute value of
, where
sup
. For both the baseline and 18-month data, the unstandardized effect size turns out to be approximately 0.5.
Figure 1. Plot of empirical cumulative distribution functions of attachment scores in buccal and mesial sites at baseline.
Figure 2. Plot of empirical cumulative distribution functions of attachment scores in buccal and mesial sites at 18 months.
Another interesting question, as discussed previously in Section 1, is whether the distributions of attachment loss scores differ between the teeth of the upper and lower jaws. To investigate this using the same data, we consider attachment loss at the mesial site of a tooth, although one can also pose the same question for the buccal site. The null hypothesis here is that the distribution of attachment loss at the mesial site of a tooth is the same for the upper and lower jaws. The setting for this problem is quite similar to that of the previous problem. The difference is that in this setting the mesial site attachment loss score (outcome) of a tooth (unit) in any particular jaw (group) of an individual (cluster) may be related to the number of teeth present in that jaw of that individual. So, we may have some informativeness in the ICG size (number of teeth present in a jaw of an individual) even after conditioning on the cluster sizes. We consider the 60-month data for this analysis, with 292 subjects available at that point. These data fall under the category of clustered data with some clusters having incomplete ICG structures, as described in Section 2.2, because a few subjects (clusters) have teeth (units) in only one of the two jaws (groups). Our new test, developed in Section 2.2, is the only test that can be used to test the hypothesis under this setting and it gives a p-value of
. Thus, we conclude that there is a significant difference between the distributions of the attachment loss at the mesial sites of the upper and lower jaws. The effect size, estimated as before, comes out to be around 3.0 units for these data. Web Figure 2 (in Web Appendix F) shows the empirical cumulative distribution functions for the attachment loss scores of the upper and lower sets of teeth.
5 Discussions
For clustered data with informative cluster sizes, the ordinary rank-sum test assuming independent observations can be biased, as indicated by the simulation studies in Section 3. The rank-sum test by Datta and Satten (2005), which compares group-specific marginal distributions
, appears to be a valid test under informative cluster sizes. But when an outcome from a group d
in a typical cluster depends on the number of observations from the group d in that cluster, we have informativeness in the ICG sizes formed by the two groups. As discussed earlier in Sections 1 and 4, this type of clustered data with informative ICG sizes is common in dental studies. Simulation studies from Section 3 indicate that even the rank-sum test by Datta and Satten (2005) has inflated size under this scenario of informative ICG size. There are no rank-based tests in the current literature that address this issue of informative ICG sizes. Thus, our main focus was to develop a rank-sum test for clustered data that works under this scenario of informative ICG sizes. This has led us to compare the group-specific marginal distribution
that gives equal weights to each cluster (treating cluster as the basic sampling unit), but the weight given to an outcome from group d in a cluster depends on the number of observations from group d in that cluster. This is in contrast with
where the weight given to an outcome from a typical cluster depends on the number of outcomes in that cluster, ignoring the group membership of that outcome. Thus, the question of importance is which marginal distribution should be considered in hypothesis testing. It appears that comparing
may be more meaningful under informative ICG sizes and, through a number of simulation settings, we have shown that our test maintains the nominal size and has substantial power in clustered data with informative ICG sizes. Even when the ICG sizes are not informative, simulation studies from Web Appendix E reveal that our test closely maintains the nominal size and has acceptable power when compared to other rank tests based on
or
.
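The contrast between the two weighting schemes can be illustrated with a small numerical sketch. It is one reading of the verbal description above, not the authors' implementation: the function name `marginal_cdfs` and its arguments are hypothetical, the first estimate averages within-ICG proportions over clusters (weight depends on the group-d count in the cluster), and the second weights each outcome by the inverse of the total cluster size and then normalizes within group d.

```python
import numpy as np

def marginal_cdfs(x, cluster, group, d, t):
    """Estimate P(outcome <= t) in group d under two weighting schemes."""
    x, cluster, group = map(np.asarray, (x, cluster, group))
    icg_terms, num, den = [], 0.0, 0.0
    for c in np.unique(cluster):
        in_c = cluster == c
        in_cd = in_c & (group == d)
        n_c, n_cd = in_c.sum(), in_cd.sum()
        hits = np.sum(x[in_cd] <= t)
        if n_cd > 0:
            # Weight 1/n_cd: depends on the number of group-d outcomes in the cluster.
            icg_terms.append(hits / n_cd)
        # Weight 1/n_c: depends on the cluster size, ignoring group membership.
        num += hits / n_c
        den += n_cd / n_c
    icg_weighted = float(np.mean(icg_terms))
    cluster_weighted = float(num / den)
    return icg_weighted, cluster_weighted

# Toy data: two clusters of size 4 with unbalanced ICG sizes (1 vs 3 and 3 vs 1).
x = [1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0]
cluster = [1, 1, 1, 1, 2, 2, 2, 2]
group = [0, 1, 1, 1, 0, 0, 0, 1]
result = marginal_cdfs(x, cluster, group, d=0, t=1.5)
```

On this toy data the two estimates are 0.5 and 0.25, so the schemes can disagree noticeably when ICG sizes are unbalanced within clusters even though the cluster sizes are identical.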
As we consider clustered data, we may, in practice, encounter a few clusters that have outcomes from only one of the two groups under study. There are two possible ways of addressing this issue. One simple way is to ignore the clusters that do not have outcomes from both groups and carry out the test developed in Section 2.1 on the remaining clusters. But it is often suspected that the information on the outcome of interest differs between clusters with incomplete ICG structures (i.e., clusters with observations from only one of the two groups) and clusters having both groups of observations. Keeping this in mind, we extended our test, in Section 2.2, to account for the clusters with incomplete ICG structures, so that we effectively use all the information present in the data. A simulation study showed that our test has the correct size and substantial power for a model accommodating incomplete ICG structures with informative ICG sizes. However, one can expect the power of this test to be low compared to that of the test involving only clusters with complete ICG structures. Therefore, in the presence of a few clusters with incomplete ICG structures among a large number of clusters, it might be important to decide beforehand whether to apply the test developed in Section 2.1, ignoring a few clusters, or to use the test from Section 2.2, keeping the full data. In the case of clustered data where the outcomes within the same cluster belong to the same group, our test statistic reduces to that of Datta and Satten (2005) and, thus, will have better size and power performance than the rank-sum test by Rosner et al. (2003) when the correlation structure within a cluster depends on the group membership.
Sometimes, when testing for a group effect in outcomes from clustered data, one can expect the presence of some additional covariate(s) unrelated to the grouping factor. In such cases, these additional covariates (confounders) may act as nuisance factors in comparing the group-specific marginal distributions of the outcomes. For example, suppose we have a linear regression in which the outcome of each observation in each cluster is modeled through a binary indicator variable taking the value 1 or 0 according to the group membership, two confounders (unrelated to the group membership), and a random error following some unknown distribution. To compare the group-specific marginal distributions of the outcomes, one may want to test the null hypothesis of no group effect against the alternative that a group effect is present. But, if the (unknown) distributions of the confounders differ from that of the random error and also among themselves, then rank tests based on the outcome Y can be misleading. This is, in general, true for any regression model involving confounders. To overcome this, one often uses aligned rank tests (see, e.g., Hájek, Šidák, and Sen, 1999, Section 10.1.2). The basic idea involves estimation of the (nuisance) parameters relating to the confounders through some appropriate rank statistics, formation of aligned observations (residuals) by plugging in the estimates, and then developing a rank test based on the aligned observations. In the presence of informative ICG size, one can extend the resampling technique discussed in this article to formulate suitable rank-based statistics for estimating the nuisance parameters and testing the appropriate (sub)hypothesis under aligned rank tests.
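The three steps of an aligned rank test (estimate the nuisance parameters, form residuals, rank the residuals) can be sketched as below. This is a deliberately simplified illustration and not the clustered procedure suggested above: it treats observations as independent, substitutes ordinary least squares for the rank-based estimation of the nuisance parameters, and uses the usual Wilcoxon null moments only as a rough reference. The function name `aligned_rank_sum` and its arguments are hypothetical.

```python
import math
import numpy as np

def aligned_rank_sum(y, g, conf):
    """Standardized rank-sum statistic computed on confounder-adjusted residuals."""
    y = np.asarray(y, dtype=float)
    g = np.asarray(g)
    design = np.column_stack([np.ones(len(y)), np.asarray(conf, dtype=float)])
    # Step 1: estimate nuisance parameters under the null (no group term).
    # ASSUMPTION: least squares stands in for rank-based estimates.
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Step 2: aligned observations (residuals).
    resid = y - design @ coef
    # Step 3: Wilcoxon rank-sum on the residuals (continuous data, no ties assumed).
    ranks = np.empty(len(y))
    ranks[np.argsort(resid)] = np.arange(1, len(y) + 1)
    n, n1 = len(y), int(np.sum(g == 1))
    w = ranks[g == 1].sum()
    mean = n1 * (n + 1) / 2.0
    var = n1 * (n - n1) * (n + 1) / 12.0
    return (w - mean) / math.sqrt(var)

# Toy data: a group shift of 1 plus two confounders unrelated to group membership.
rng = np.random.default_rng(1)
n = 200
g = np.arange(n) % 2
conf = rng.normal(size=(n, 2))
y = 1.0 * g + conf @ np.array([2.0, -1.0]) + rng.normal(size=n)
z = aligned_rank_sum(y, g, conf)
```

Because the confounders here are generated independently of the group indicator, removing their estimated contribution leaves the group shift in the residuals, and the statistic is large and positive for this toy data set.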
6 Supplementary Materials
Web Appendices referenced in Sections 2, 3, and 4, and R code for implementing the novel rank-sum test, are available with the paper at the Biometrics website on Wiley Online Library.
Acknowledgements
This research was supported by NIH grants 1R03DE020839 and 1R03DE022538. The authors would like to thank Jim Beck and Kevin Moss in the School of Dentistry at the University of North Carolina for providing the data set on periodontal disease from the Piedmont 65+ Dental study. We also thank the editor, the associate editor, and a referee for their constructive comments.