Persistent confusion in nutrition and obesity research about the validity of classic nonparametric tests in the presence of heteroscedasticity: evidence of the problem and valid alternatives

The use of classic nonparametric tests (cNPTs), such as the Kruskal– Wallis and Mann–Whitney U tests, in the presence of unequal variance for between-group comparisons of means and medians may lead to marked increases in the rate of falsely rejecting null hypotheses and decreases in statistical power. Yet, this practice remains prevalent in the scientific literature, including nutrition and obesity literature. Some nutrition and obesity studies use a cNPT in the presence of unequal variance (i.e., heteroscedasticity), sometimes because of the mistaken rationale that the test corrects for heteroscedasticity. Herein, we discuss misconceptions of using cNPTs in the presence of heteroscedasticity. We then discuss assumptions, purposes, and limitations of 3 common tests used to test for mean differences between multiple groups, including 2 parametric tests: Fisher’s ANOVA and Welch’s ANOVA; and cNPT: Kruskal–Wallis document the impact of heteroscedasticity on the validity of these tests under conditions similar to those used in nutrition and obesity research, we conducted simple simulations and assessed type I error rates (i.e., false positives, defined as incorrectly rejecting the null hypothesis). We demonstrate that type I error rates for Fisher’s ANOVA, which does not account for heteroscedasticity, and Kruskal–Wallis, which tests for differences in distributions rather than means, deviated from the expected significance level. Greater deviation from the expected type I error rate was observed as the heterogeneity increased, especially in the presence of an imbalanced sample size. We provide brief tutorial guidance for authors, editors, and reviewers to identify appropriate statistical tests when test assumptions are violated, with a particular focus on cNPTs. Am J Clin Nutr 2021;113:517–524.


Introduction
In science (1-3) generally, and nutrition and obesity research specifically (4)(5)(6), concerns have been raised about the extent to which research methods are rigorous, meet modern standards, and can be expected to produce reproducible results. Whether such concerns reflect a growing lapse in rigor or simply a greater recognition of the gaps that have riddled science since its beginning remains uncertain. What is clear, however, is that there is always room for improvement, and scientific standards for increasing rigor, reproducibility, and transparency are developing in a positive direction. In 2015, the NIH updated its instructions for grant applications to refine the standards for biomedical research (7). Various suggestions for improvements have also recently been published within the nutrition and obesity fields (8)(9)(10).
One aspect that merits further attention is the use of statistical procedures in a manner that leads to valid inferential testing. Inappropriate choices of statistical models or incorrect implementation of statistical procedures in nutrition and obesity research have been noted elsewhere (4,11,12). In addition to perpetuating confusion within the scientific process, errors of this type can cultivate mistrust in nutrition science (8). Here, we address one area where improvements in the choice of statistical tests, and in the understanding of their validity, may be warranted: the validity (or lack thereof) of using classic nonparametric tests (cNPTs) to test for differences in central tendency (e.g., means or medians) between groups when heteroscedasticity is present. Heteroscedasticity means a difference in statistical dispersion between groups. Variance is one measure of statistical dispersion and is the measurement of heteroscedasticity we use here. Supplemental Figure 1 depicts a simple example of heteroscedasticity between 2 groups. Note that we do not discuss single-sample or paired-sample tests here. We define cNPTs as tests that do not rely on any assumptions about the data having a particular distribution, other than having a finite mean and a finite variance. The terms nonparametric and distributionfree do not always refer to the same thing. A distributionfree test is a test that does not depend on the data having any particular distributional form (e.g., the data do not need to fit a normal distribution). In contrast, the term nonparametric has historically been used in multiple ways. Herein, we refer to nonparametric in a way synonymous with distribution-free [cf. Hollander et al. (13)]. Further, cNPTs (as opposed to nonparametric tests in general) are statistical procedures that do not involve explicit resampling, thereby excluding bootstrap, randomization, or permutation testing, and are usually rankbased.
The aims of the present article are to 1) describe various misconceptions about cNPTs in the context of heteroscedasticity, 2) simulate the impact of such use on the validity of significance testing, and 3) provide a tutorial on how to conduct cNPTs validly. We focus on tests for ≥3 groups; however, our arguments also apply to tests for 2 groups.

Misconceptions of Using Nonparametric Tests in the Presence of Heteroscedasticity
The main misconceptions about cNPTs are that they are "assumption free" or "less powerful than their parametric counterparts" (4). To be clear, when the assumptions of the parametric test (e.g., normality and equal variance for Student's t test and ANOVA; see Table 1) are met, the parametric test will generally be the most powerful test available. When the assumptions of the parametric test are not met, nonparametric tests will often (but not always) be more powerful than parametric tests, depending on multiple factors including the magnitude of violation of test assumptions of parametric tests and the sample size (14)(15)(16). However, nonparametric tests are generally sensitive to distributional differences beyond mean or median differences between groups. Thus, one needs to ensure that there is no heteroscedasticity when using cNPTs (that is, when there is homoscedasticity). Further, this highlights the difference between cNPTs and other nonparametric tests, such as bootstrap tests and permutation tests, which do not require the homoscedasticity assumption.
A common, yet incorrect, procedure used to test for a difference in means or medians between 2 groups is as follows: 1) A test of normality is conducted for each group, or for the combined residuals from a model including a term for group, with the Kolmogorov-Smirnov or Shapiro-Wilk test (valid). The italicized tests are emphasized in this article.
2) If the normality assumption is supported, an F test is used to test for heteroscedasticity (valid). a) If the homoscedasticity assumption is supported, Student's t test is used (valid). It is important to note that our use of the word "valid" here does not necessarily mean "optimal." The optimal choice of test depends on many factors (17). b) If the homoscedasticity assumption is not supported, the Mann-Whitney U test is used (incorrect). 3) If the normality assumption is not supported, Mann-Whitney U is used (valid, if homoscedasticity is present).
The challenge with this erroneous procedure, particularly step 2b, is that Student's t test assumes that the 2 groups follow a normal distribution with the same variance (i.e., homoscedastic). The Mann-Whitney U test actually tests for differences in distributions, and therefore does not assume any specific distributional shape. Rather, it assumes the 2 groups are drawn from populations with identical distributions under the null hypothesis, including mean, median, variance, and shape. The only way that the Mann-Whitney U test can test for differences in central tendency (e.g., means and medians), then, is for the distributions of the 2 groups to be identical except that the central tendency is different. If the test is used improperly (e.g., in the presence of heteroscedasticity), a Mann-Whitney U test can be statistically significant even with identical central tendency. Thus, cNPTs are not a cure for heteroscedasticity, and generally should not be used in the presence of important heteroscedasticity.
With large sample sizes and higher statistical power, highly statistically significant evidence for a departure from the assumptions (normality and homoscedasticity) does not mean that the magnitude of assumption violation is large or concerning. Therefore, the analyst should attend to the magnitude of the apparent heteroscedasticity and not just its statistical significance.
The use of cNPTs in cases of heteroscedasticity is of particular concern in nutrition and obesity research, where heteroscedasticity is common. Unequal variances between groups can occur for multiple reasons. In a study of diet and exercise, for example, participants in a control group may continue their usual diet and exercise routine, whereas those in the intervention group vary on the spectrum of compliance to noncompliance over time. This could lead to a smaller variance in body weight change scores in the control group than in the intervention group (18). Diet interventions may have the opposite effect when the outcome of interest is dietary diversity. A control group consuming a usual diet may demonstrate higher variance than a more restrictive intervention group.

Assumptions, Purposes, and Limitations of 3 Common Tests
This section describes misconceptions about cNPTs in the presence of heteroscedasticity by comparing them with their parametric counterparts. We focus on 3 tests for differences between ≥3 groups: 1) Fisher's ANOVA, usually referred to as just ANOVA or Fisher's F test, which is a parametric test; 2) the Kruskal-Wallis test, a cNPT; and 3) Welch's ANOVA, a variant of Fisher's ANOVA that addresses heteroscedasticity. Table 1 summarizes assumptions and the hypothesis of each test. The descriptions here are basic overviews; interested readers should refer to more detailed discussions for more information (19,20).

Fisher's ANOVA
Fisher's ANOVA refers to a parametric analysis of variance. Fisher's ANOVA is often what researchers try first with continuous data by default and is the subject of many basic statistical tutorials. It is a powerful and useful test but can have undesirable statistical properties when basic conditions or assumptions are violated. Fisher's ANOVA assumes normality of residuals, independence of observations, and homogeneity of variance. Although an equal sample size is not required, unequal sample sizes among groups can exacerbate the bias in type I and type II errors in the presence of heteroscedasticity, as we discuss later in our simulations.
The fact that unequal group variances and sample sizes occur commonly in practice has led some to suggest that Fisher's ANOVA should not be taught or used at all, instead moving directly to Welch's ANOVA (described later). Despite the challenges with Fisher's ANOVA, 2 aspects allow it to remain useful: 1) the denominator df is larger than Welch's ANOVA, and 2) although Fisher's ANOVA develops serious problems (bias, type I error rate increase) in the presence of heteroscedasticity, data can potentially be transformed to fix variance heterogeneity, for example, by using a Box-Cox transformation (21); common transformations like log, square root, and inverse are special cases of the Box-Cox transformation.
The first point means that, when assumptions are met, Fisher's ANOVA is more powerful for testing hypotheses than other tests like Welch's ANOVA. The second point means that when ≥1 assumption, such as homogeneity of variance, is not met, it may be possible to transform data to be used in this more powerful statistical approach. When these assumptions and conditions do not hold, other tests should be used.

Kruskal-Wallis test
When the assumption of homogeneity of variance fails, some researchers erroneously substitute a nonparametric test, such as the Kruskal-Wallis test, that relies on the ordered ranks of the data. The first step of the Kruskal-Wallis test is to sort the data from smallest to largest, regardless of the group that the observations belong to, and replace the actual observed data with their sorted ranks. The advantage of the Kruskal-Wallis test is that no assumptions about the specific distributions of the data need to be made for the test procedure to be valid, other than that whatever the distributions are, they are the same for all groups being compared. However, as a general rule, both Kruskal-Wallis and Fisher's ANOVA tend to increase the type I error rate in the presence of heteroscedasticity with equal sample size (20).
Kruskal-Wallis can be less powerful than either Fisher's or Welch's ANOVA for 2 reasons. First, the Kruskal-Wallis test procedure loses statistical information in the first step when the actual data are replaced by sample ranks. Second, the Kruskal-Wallis test has a more ambitious goal than either Fisher's or Welch's ANOVA. Whereas Fisher's and Welch's ANOVAs evaluate differences in central tendency (e.g., means), the Kruskal-Wallis test compares differences in distributions between groups, including central tendency, variance, and other distributional characteristics. By definition, if we know that the groups have differences in variances, then we know that the groups have different distributions (although not necessarily differences in central tendency), and an appropriately powered, but possibly inappropriately used, Kruskal-Wallis test is expected to show a statistically significant result, even if the measure of central tendency is equal. Thus, the belief that nonparametric test procedures, such as Kruskal-Wallis or Mann-Whitney U, are not influenced by heterogeneous variances in the data is erroneous. Published articles, however, continue to state that Kruskal-Wallis tests were used when the equal variance assumption was violated. In many cases, utilizing the Kruskal-Wallis test when variance heterogeneity is present will result in an increased type I error rate for testing differences in central tendency, often expressed at the median (10, 12).

Welch's ANOVA
Welch's ANOVA (sometimes called Welch's F test) is designed to be robust in the presence of normality and when the homogeneity of variance assumption does not hold. Welch's ANOVA employs 2 innovations compared to Fisher's ANOVA. First, it weights observations within each group by the inverse of the estimated group variances. That is, observations that come from groups with higher variances have smaller weights than do observations that come from groups with small variances. The second innovation of Welch's ANOVA is that the denominator df employs the Welch-Satterthwaite adjustment to approximate the F-distribution. Because Welch's ANOVA is based upon the general linear model, it can be viewed as a generalization of Fisher's ANOVA that is robust to the heterogeneity of variance. Because Welch's ANOVA places weights on observations from groups that are inversely proportional to the variance in each group, the effect of heterogeneity is mitigated. Hence, the procedure is better able to preserve the type I error rate.
One disadvantage of Welch's ANOVA is that it estimates group variances in addition to group means. Thus, this model spends more df than Fisher's ANOVA, and the denominator df for Fisher's ANOVA will always be larger than for Welch's ANOVA. As a result, Welch's ANOVA tends to have lower power when the homoscedasticity assumption holds. Thus, the gain in robustness (i.e., useful when there is violation of homoscedasticity) comes at the cost of statistical power. Although it is important to recognize that Welch's ANOVA is robust to heterogeneity in variance, it may still be useful to use the Kruskal-Wallis model in some circumstances, such as when the variances are homogeneous, but the residual distribution is not normal.

Simulated Impact of Using Nonparametric Tests in the Presence of Heteroscedasticity
To document the effect of heteroscedasticity on the validity of a cNPT under conditions similar to those used in nutrition and obesity research, we conducted simple simulations and assessed the empirical probability of rejecting the null hypothesis (H 0 ) of no differences in central tendency between groups (i.e., the type I error rate). In the simulations, samples were randomly sampled from normal distributions with the same mean and different variances between groups. Because it is known that imbalanced sample sizes influence the type I error rate in the presence of heteroscedasticity, we separately performed simulations with balanced and imbalanced sample sizes. Supplemental Figure 2 summarizes the overall simulation process.
Furthermore, to investigate the effect of unequal sample sizes among groups on type I error rate when combined with heteroscedasticity, we designed the simulation with 2 groups, 2 different SDs (

Statistical tests
We performed 3 different tests on each of the 10,000 generated data sets: Kruskal-Wallis test, Fisher's ANOVA (i.e., with equal variance assumption) (23), and Welch's ANOVA (i.e., without equal variance assumption) (23). Type I error rate was defined as the number of simulations with P value (p i ; i = 1, 2, . . . , N) below significance level (α = 0.1, 0.05, 0.01):p = N i=1 I(p i < α)/N, where I is the identity function. The 95% CI of empirical type I error rates (p) was computed asp ± 1.96 √p (1 −p)/n. For data generation and statistical analysis, we used the statistical computing software R 3.3.1 (24). The code is available in our study repository (25).

Results
The type I error rate of Fisher's 1-factor ANOVA and Kruskal-Wallis increased as the maximum SD (a) increased (that is, greater heteroscedasticity), whereas that of Welch's 1factor ANOVA was preserved even when a increased (Figure  1). Although the deviation was not clearly observed for the inappropriate tests under an equal sample size, this simply means the error is minimized if the sample size is equal under the same parameter settings. The results do not mean the deviation is negligible even under the same sample size, as we have demonstrated in Figure 1 (equal sample size but different SD). Deviation of type I error rate from the nominal significance level was most profound with unequal sample sizes in different groups (Figure 2). The type I error rate increased when the sample size of the group with larger variance was smaller and inappropriate tests (Kruskal-Wallis and Fisher's ANOVA) were used. It is worth noting that Fisher's ANOVA with balanced sample sizes (i.e., 30:30) and different SDs appears to preserve the type I error rate; however, this is because of the relatively small difference in SE and the small number of groups. √p (1 −p)/n. We set the SD of 4 groups to 1 and that of the other group to ≥ 1, i.e., {σ 1 , σ 2 , σ 3 , σ 4 , σ 5 ,} = {1, 1, 1, 1, a}, where a ≥ 1. Each statistical test was of the overall null hypothesis of no differences in means between groups. Although Kruskal-Wallis tests for differences among distributions, researchers sometimes use it to test for differences in central tendency, which is invalid when there is heteroscedasticity.
We acknowledge 2 limitations of our approach. We used 0.05 as a prespecified significance level because it is frequently used. Although threshold-based inference like "statistical significance" at the 0.05 level has limitations, using type I error permits us to investigate the influence of incorrect analysis on the distribution of P values. If the type I error rate does not match the expected rate, it implies that the distribution of the P values is not what it should be, that the expected value of the P values is also unlikely to be what it should be, and therefore that the tests will be biased in general, regardless of whether one uses specific significance thresholds. The degree and direction of bias may depend on the significance level used. Given that P values and estimates of uncertainty, like CIs, use the same underlying math, incorrect type I error rates also imply biased CIs. Furthermore, we have not discussed the influence of heteroscedasticity on power or type II error (failure to reject null hypotheses that are actually wrong). Given that Welch's ANOVA considers the difference in variance, whereas Fisher's ANOVA does not, Type I error rate was defined as the number of simulations with P value (p i ; i = 1, 2, . . . , n) below significance level (α):p = N i=1 I(p i < α)/n. The 95% CI of type I error rates (p) was computed asp ± 1.96 √p (1 −p)/n. Each statistical test was of the overall null hypothesis of no differences in means between groups. Although Kruskal-Wallis tests for differences among distributions, researchers sometimes use it to test for differences in central tendency, which is invalid when there is heteroscedasticity.

FIGURE 3
Diagnostic checks for an ANOVA model. Key questions, hypothesis tests, and visual plots that could be conducted whenever one considers diagnostics associated with any ANOVA model-based analysis. Note that each of these assessments is dependent on a sufficiently large sample, just like the overall statistical test of interest. the power of Welch's ANOVA could be theoretically lower when heteroscedasticity is not present. Thus, the choice of appropriate tests is dependent on the balance between type I and II error rates, which should be done on a case-by-case basis.

Tutorial of Valid Statistical Procedures in the Context of Heteroscedasticity
Some experts who analyze their data but are not formally or extensively trained in statistics may not be as familiar with the intricacies and assumptions involved with statistical methods. To assist in such situations, we present a flowchart that can provide guidance on which statistical method is best to use. These charts are meant as guides, not as edicts for the user to follow uncritically. The charts can be thought of as a framework for the structured thinking that one should engage in when deciphering which tool is best used and in which circumstance. Moreover, this guidance is not a comprehensive assembly of all procedures that could be applied when comparing groups (26)(27)(28).
In Figure 3, we display guidance on some diagnostics checks that should be used whenever one evaluates whether any ANOVA model is appropriate. We present 4 main diagnostics for the ANOVA model with accompanying tools and visuals to evaluate these checks: Independence, Constant Variance, Normality, and Omission of Other Important Predictor Variables. Note that most of these tools and visuals are readily available in most statistical software but require prior planning by the investigator to implement successfully. The implementation of these guidelines should consider the characteristics of the statistical tests in addition to the magnitude of heteroscedasticity (e.g., the difference in variances between groups). In very large samples, for instance, tests of heteroscedasticity may be statistically significant even if the difference in variances is trivial. However, if the magnitude of the difference in variances between groups is statistically significant but trivial, the influence of the choice of statistical test (Fisher's ANOVA rather than Welch's ANOVA) on the type I error rate may also be trivial, and the choice of statistical test (Welch's ANOVA rather than Fisher's ANOVA) may also include considerations for loss of power. In contrast, if the sample size is small, tests for heteroscedasticity may not be powerful enough to detect even meaningful amounts of heteroscedasticity if it exists. Therefore, investigators should consider adopting statistical methods that allow for heteroscedasticity and nonnormality when sample sizes are small, and consider the magnitude of difference in variance between groups, not only the significance of tests of heteroscedasticity, to decide on tests to use when sample sizes are large. Figure 4 contains a decision tree along with some remedial measures and various models that could be fit to the data to obtain the most valid analysis. In the case that the data neither are normally distributed nor have equal variance, a nonparametric bootstrapping approach could be used to test for significant differences in the means. Nonparametric bootstrapping approaches do not rely on distributional assumptions, because they rely on the empirical distribution (29).

Conclusion
We introduced the assumptions of the statistical tests for the difference in central tendency (e.g., Fisher's ANOVA, Welch's ANOVA, and Kruskal-Wallis) while discussing the misconception of cNPTs (Kruskal-Wallis) among those tests. Our simulations indicated that Kruskal-Wallis and Fisher's 1factor ANOVA tests provide (in many cases positively) biased type I error rates in the presence of heteroscedasticity, whereas Welch's 1-factor ANOVA does not. We further provided a tutorial on selecting statistical tests under heteroscedasticity. More training in these common errors may be needed for nutrition and obesity researchers.