Limited Shared Variance among Measures of Cognitive Performance Used in Nutrition Research: The Need to Prioritize Construct Validity and Biological Mechanisms in Choice of Measures

ABSTRACT Background The literature on correlates of nutrition has seen an increase in studies focused on functional consequences at the levels of neural, perceptual, and cognitive functioning. A range of measurement methodologies have been used in these studies, and investigators and funding agencies have raised the question of whether, and to what extent, these methodologies are comparable. Objective The aim was to determine the extent to which 3 different sets of cognitive measures provide comparable information across 2 subsamples that shared culture and language but differed in terms of socioeconomic status (SES) and academic preparation. Methods A total of 216 participants were recruited at 2 US universities. Each participant completed 3 sets of cognitive measures: 1 custom-designed set based on well-understood laboratory measures of cognition [cognitive task battery (COGTASKS)] and 2 normed batteries [Cambridge Neuropsychological Test Automated Battery (CANTAB), Wechsler Adult Intelligence Scale, fourth edition (WAIS-IV)] designed for assessing general cognitive function. Results The 3 sets differed with respect to the extent to which SES and educational preparation affected the results, with COGTASKS showing no differences due to testing location and WAIS-IV showing substantial differences. There were, at best, weak correlations among tasks sharing the same name or claiming to measure the same construct. Conclusions Comparability of measures of cognition cannot be assumed, even if measures have the same name or claim to assess the same construct. In selecting and evaluating different measures, construct validity and underlying biological mechanisms need to be at least as important as population norms and the ability to connect with existing literatures.


Introduction
The literature on nutritional deficiencies and their amelioration has seen an increase in studies focused on consequences at the levels of neural, perceptual, and cognitive functioning, from both basic science and translational perspectives (1)(2)(3)(4)(5)(6). Consider that in the 5 y between 2016 and 2020 (inclusive), >13,000 papers were published on some aspect of nutrition and cognition, at an average of >2600 papers per year (see Table 1; see the Supplemental Material for details on how these estimates were obtained). A range of measurement methodologies have been used in these studies, and investigators and funding agencies have raised the question of whether, and to what extent, these methodologies are comparable. Any sense of cumulative progress in this domain requires an understanding of the level of comparability across approaches. We present here, to our knowledge, the first controlled within-person comparison of different measurement approaches.
The Bill and Melinda Gates Foundation and Grand Challenges Canada commissioned a review of currently used measures in 4 domains of interest: cognitive abilities, social and behavioral development, motor skills, and home environment. The final report from this work (7) concluded that "there is no 'one size fits all'" approach and that there is no identifiable "gold standard" for measuring functional outcomes in these domains. Although the report allows the range of measures to be grouped in terms of functional domains, appropriate populations, etc., it offers no guidance in terms of determining whether the various tools are in fact assessing comparable abilities and functions in comparable ways.

TABLE 1 Estimated number of papers published per year and cumulatively on nutrition and cognition from 2016 to 2020

Year    Total    Cumulative
2016    2210      2210
2017    2510      4720
2018    2350      7070
2019    2960    10,030
2020    3400    13,430
A review of the measures considered in that report reveals 3 general challenges to any attempt to assess comparability. First, the majority of the measures lack theoretical or biological specificity to the functions they propose to assess. For example, the measures of memory appropriate for adolescents or young adults considered in that report include general measures of intellectual performance as well as scales developed for application to career development. They span up to 6 of what the authors of the report identify as subdomains of general cognitive functioning, which relate only loosely to currently accepted scientific conceptions of memory (8) and have no apparent reference to brain systems and circuits that support memory (9). Second, a large number of the measures (including many that are timed) are administered manually, without appropriate instrumentation, and those that are administered with instrumentation either do not report or do not allow the precision and consistency of their measurements to be assessed. This is critical because differences across display and clock technologies can often lead to large variations in measured response latencies (10), especially when the differences of interest exist on the scale of milliseconds. Third, many applied studies (such as intervention studies) are concerned with the performance of specific populations with specific functional needs (e.g., factory workers, tea pluckers, students). The generality of the majority of the tests considered in that report a priori limits the relevance of what is measured to the functional needs of the population.
The lack of specificity with respect to the cognitive construct of interest and/or the biological underpinnings of that construct, the lack of concern with proper and precise instrumentation, and the lack of functional relevance to the population of interest all suggest that the results from the use of these kinds of measures may lead to muddled outcomes. And indeed, that is the case, as has been noted (11)(12)(13).
The present study was a controlled comparison of 2 general approaches in the form of 3 different test batteries. The first is one that we have used in a set of field studies of interventions designed to address the consequences of iron deficiency (1,4,14), referred to here as COGTASKS. The tasks used to assess cognitive performance in these studies were selected, in part, on the basis of the extent to which they rely on brain areas differentially sensitive to variation in iron (15,16) and, in part, on the extent to which they assess functionally relevant abilities (14). The second approach is represented by 2 frequently used, normed batteries of cognitive functioning: the Cambridge Neuropsychological Test Automated Battery (CANTAB; Cambridge Cognition) (17)(18)(19) and the Wechsler Adult Intelligence Scale, fourth edition (WAIS-IV; Pearson) (20)(21)(22). Critically, all 3 sets of measures included tasks that had the same name [e.g., go/no-go (GNG)] or that claim to measure the same cognitive construct (e.g., working memory). All 3 sets of measures were taken by 2 samples of healthy, nonclinical, college-aged women, one at The University of Oklahoma (OU) and the other at Cornell University (CU). The ability to acquire measurements at these 2 universities allowed us to quantify the patterns of shared and distinct variance using 2 samples that possess a common language and culture but that differ in 2 specific characteristics, socioeconomic status (SES) and general academic achievement, that are known to modulate a range of perceptual and cognitive measures (23). Although the questions of interest are important, the present study is, to our knowledge, the first controlled, within-participants investigation of shared variance.
Our predictions were that 1) the shared variance across the tasks, even though many shared the same name or claim to investigate the same cognitive construct, would be low, and that 2) the shared variance within task sets would be much higher than across task sets.

Methods

Subjects
A total of 216 women were recruited at 2 testing locations: half of the sample was recruited from the Norman, Oklahoma, campus of OU, and half of the sample was recruited from the Ithaca, New York, campus of CU. We restricted consideration to females on the basis of our primary interest in the effects of iron deficiency on cognition. All subjects had (self-reported) normal or corrected-to-normal vision and hearing, were proficient in written and spoken English, and reported unencumbered use of both hands. Subjects were compensated with a $50 gift card at the end of 3 d of participation.

Study design
The study was designed as a 2 (location: OU, CU) × 3 (assessment: COGTASKS, CANTAB, WAIS-IV) factorial with assessment as a within-subjects variable. The ordering of test battery per subject was determined using a balanced Latin square, and the ordering of the tasks within each battery was fixed.
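The paper does not detail its Latin-square construction. For 3 batteries, one standard recipe (Bradley's balanced design; the function below is an illustrative sketch, not the study's actual code) yields 6 orders in which each battery appears in each position, and immediately precedes each other battery, equally often:

```python
def balanced_latin_square(n):
    """Balanced Latin square orders for n conditions (Bradley, 1958).

    For odd n, the mirror-image rows are appended so that every
    condition immediately precedes every other equally often.
    """
    # "Zig-zag" first row: 0, 1, n-1, 2, n-2, ...
    first = [0]
    left, right = 1, n - 1
    while len(first) < n:
        first.append(left)
        left += 1
        if len(first) < n:
            first.append(right)
            right -= 1
    rows = [[(x + i) % n for x in first] for i in range(n)]
    if n % 2 == 1:  # odd n: add the reversed rows to restore balance
        rows += [list(reversed(r)) for r in rows]
    return rows

batteries = ["COGTASKS", "CANTAB", "WAIS-IV"]
orders = [[batteries[i] for i in row] for row in balanced_latin_square(3)]
```

Subjects would then be assigned to the resulting orders in rotation, giving the counterbalancing of battery order described above.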

Cognitive assessments
The 3 sets of cognitive assessments were administered on 3 consecutive days. All tasks were administered by research assistants trained to a common standard by MJW and DMDV, who also performed random periodic checks for consistency of procedure.

COGTASKS.
All of the tasks were developed and programmed by MJW using public-domain software (10) that allowed for highly accurate timing of stimulus displays and behavioral responses (±1 ms); all programs and stimuli are freely available on request. Each of the tasks has a long history in the experimental study of cognition, with some dating back to the 19th century (24). This is to say that, although the tasks do not have associated norms in the traditional sense, there is a long literature that can be consulted for normative patterns. Brief descriptions of the tasks are provided here, with procedural details presented in the Supplemental Material.
The simple reaction time (SRT) task provides an estimate of the speed of the simplest possible behavioral response to a visual stimulus. The GNG task provides an estimate of the efficiency of sustained attention and the speed of attentional capture in the absence of a need to filter competing information. The attentional network task (ANT) (25) provides an estimate of the effectiveness of 3 components of attention: alerting (low-level attentional capture), orienting (midlevel spatial selective attention), and conflict (high-level selection). The Sternberg memory search (SMS) task (26) estimates the speed and accuracy with which immediate visual memory can be searched. The composite face effect (CFE) task (27) estimates the extent to which information from immediate perception and memory can be effectively coordinated. The cued recognition task (CRT) follows a modified (28) version of a classic (24) visual recognition memory task that estimates the speed, accuracy, and efficiency of recognition based on short-duration visual memory.

CANTAB.
The CANTAB measures specific aspects of cognition, including memory and learning. The SRT task provides an estimate of the speed of the simplest possible behavioral response to a visual stimulus. The affective go/no-go task (AGNG) evaluates latency, error, and bias when presented with positive or negative affective words that must be placed into an emotional category. The ability to shift between 2 different spatial aspects (location and direction) is measured in the attentional switching task (AST). The motor screening task (MOT) assesses sensorimotor skill. The Stockings of Cambridge (SOC) task requires spatial planning skills by replicating a visual pattern using the minimum number of moves. The verbal recognition memory (VRM) task assesses the ability to learn, encode, and retrieve new verbal information. The pattern recognition memory (PRM) task measures the speed and accuracy of forced-choice judgments distinguishing novel from previously presented visual stimuli.

WAIS-IV.
The WAIS-IV battery assesses more general measurements of cognition and intelligence. The block design subtest measures visual pattern construction abilities. The similarities subtest measures problem solving and conceptualization of how 2 words are subjectively related by the participants. The digit span subtest evaluates accurate recall for a presented sequence of numbers. Nonverbal and abstract problem-solving skills are measured by the matrix reasoning subtest. In the vocabulary subtest, participants must rely on their memory to identify both visually and verbally presented items. The arithmetic subtest assesses the participants' ability to mentally solve arithmetic problems. The symbol search subtest measures information-processing speed by presenting subjects with target symbols that they must identify when randomized with other symbols. The visual puzzles subtest requires the participant to use nonverbal reasoning and visual perception to reconstruct a visually presented puzzle. Topics of general knowledge are measured in the information subtest. The coding subtest assesses nonverbal learning and nonverbal short-term memory by copying a series of presented symbols.

Ethics
This study was approved by the institutional review boards at both OU and CU.

Statistical analyses
Differences in proportions or frequencies as a function of location were tested using a chi-square test. Differences as a function of location for each of the dependent variables in each of the tasks were tested using 2-tailed t tests; variables that were expressed as proportions or percentages were transformed prior to analysis using an arcsine square-root transformation to deal with heterogeneity of variance (29). Correlations between measures that either had the same name (e.g., GNG) or that putatively measured the same construct (e.g., working memory) were assessed using the Pearson product moment correlation coefficient, r.
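As a concrete sketch of these two steps (the analyses were actually run in SAS; the data below are made up for illustration), the arcsine square-root transform and the Pearson coefficient can be computed as:

```python
import math

def arcsine_sqrt(p):
    """Variance-stabilizing transform for a proportion in [0, 1]."""
    return math.asin(math.sqrt(p))

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up accuracy proportions for the same subjects on two tasks:
acc_a = [0.90, 0.75, 0.88, 0.95, 0.60]
acc_b = [0.85, 0.70, 0.92, 0.90, 0.65]
r = pearson_r([arcsine_sqrt(p) for p in acc_a],
              [arcsine_sqrt(p) for p in acc_b])
shared_variance = r ** 2  # e.g., r = 0.26 implies ~7% shared variance
```

Squaring r gives the shared variance reported throughout the Results.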
A final factor analysis was performed on the correlation matrix of the Z-transformed values using a varimax rotation. All analyses were performed using SAS 9.4 for Linux (2019; SAS Institute).
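For readers who want to reproduce the rotation step outside SAS, a minimal numpy sketch of Kaiser's varimax algorithm (the function, tolerances, and example loadings are illustrative, not the study's code) is:

```python
import numpy as np

def varimax(loadings, tol=1e-8, max_iter=100):
    """Varimax rotation of a factor-loading matrix (Kaiser's algorithm)."""
    L = np.asarray(loadings, dtype=float)
    n, k = L.shape
    R = np.eye(k)  # accumulated orthogonal rotation
    var = 0.0
    for _ in range(max_iter):
        basis = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (basis ** 3 - basis @ np.diag((basis ** 2).sum(axis=0)) / n)
        )
        R = u @ vt
        if s.sum() - var < tol:
            break
        var = s.sum()
    return L @ R

# Example: rotate a small loading matrix; because the rotation is
# orthogonal, each row's communality (sum of squared loadings) is preserved.
loadings = np.array([[0.8, 0.3], [0.7, 0.4], [0.2, 0.9], [0.3, 0.8]])
rotated = varimax(loadings)
```

The input loadings would come from an initial factor extraction (e.g., on the correlation matrix of the Z-transformed scores); the rotation only redistributes variance across factors to sharpen interpretability.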

Results

Demographics
The distribution of subjects by race and ethnicity at each of the 2 locations is presented in Table 2. There were no significant differences in race or ethnicity as a function of testing location (χ2 = 0.06, NS). Figure 1 plots the distribution of SAT (Scholastic Aptitude Test) reading and mathematics scores by testing location and shows that both distributions were shifted toward the high end of the range at CU relative to OU (χ2 = 271.21, P < 0.0001).

Table 3 displays the means, measures of variability, and results of the t tests assessing differences due to testing location for all of the dependent measures from each of the tasks in the COGTASKS set of measures. There were no significant differences for any of the dependent measures in this set. Table 4 presents the same analyses for all of the dependent measures from each of the tasks in the CANTAB set of measures. Significant differences due to testing location were found for the AGNG (accuracy), the MOT, VRM old and new items, and PRM [all reaction time (RT)]. In all cases, performance by CU students was better than that of OU students. Table 5 presents these analyses for all of the dependent measures from the WAIS-IV. Significant differences favoring the CU over the OU subjects were obtained for all but 4 of the variables: similarities, digit span, matrix reasoning, and the working memory composite score (with the difference on this last measure being marginally significant).

Correlations between tasks
We next examined the pairwise correlations between tasks that either have the same name or that claim to assess the same cognitive construct. Table 6 presents the correlation coefficients and shared variances for these pairs of tasks. There were a number of significant correlations, but the majority were weak (all r < 0.27), and, on average, the shared variance was only 5% for the pairs of measures having significant correlations. This shared variance is very low relative to the partial variance that some of the COGTASKS variables have demonstrated as a function of treatment condition in some of our field studies (2).

Factor analysis
Finally, we submitted the data to a factor analysis of the correlation matrix of the Z-transformed scores, using a varimax rotation. The first 3 eigenvalues accounted for 40%, 18%, and 14% of the variance, respectively, cumulatively accounting for 72% of the total variance. The remaining eigenvalues were all <1.0, with each of the remaining factors accounting for <2.5% of the variance. The 3-factor solution segregated the 3 sets of measures, with a single exception: factor 1 comprised all of the COGTASKS measures, factor 2 comprised all but one of the WAIS-IV measures, and factor 3 comprised all of the CANTAB measures. The exception was the processing speed index from the WAIS-IV, which served to relate all 3 sets of measures.

Discussion
Accompanying a sustained and increasing interest in assessing cognitive sequelae of a range of nutritional deficiencies and interventions  (see Table 1) has been an interest in understanding the best practices in choosing and using behavioral measures of cognition. Although there has been acknowledgment that there is no viable "one size fits all" approach to assessment (7), the field has concentrated on characteristics such as population norms and external validity (12). While these characteristics are desirable, we believe that they have tended to be pursued at the expense of construct validity and biological motivation, often with the thought that if 2 measures share a name or are claimed to be assessing the same cognitive construct, then they must be comparable.
To illustrate the problems with this logic, we conducted what, to our knowledge, is the first and only comparison study in which female participants from 2 US universities completed 3 sets of cognitive assessments: a set of tasks we developed for use in studies of iron repletion (COGTASKS) (1,4,14), a widely used commercial neuropsychological battery (CANTAB), and a widely used measure of general intelligence (WAIS-IV). The 2 samples, from 2 US universities, shared (generally) a culture and a language and were very similar in terms of race and ethnicity. However, they were different in terms of SES and academic achievement, 2 factors that are known to affect scores on measures of cognitive performance. All participants completed all 3 sets of assessments.
No significant differences as a function of testing location were found for any of the variables in the COGTASKS, and only a small number of differences were found for the CANTAB variables; however, almost all of the measures from the WAIS-IV had significant differences due to testing location. This suggests that, among these 3 sets of measures, the COGTASKS were the least and WAIS-IV measures were the most sensitive to potentially confounding effects of SES and educational preparation.
Critically, the correlations between tests that either shared the same name or putatively assessed the same construct were uniformly low, with, on average, pairs of measures having only 5% shared variance. Furthermore, a factor analysis on all of the measures simply reproduced their original groupings: a 3-factor solution accounted for 72% of the total variance, with the 3 sets of measures related only by the common factor of processing speed.
As an example of the weak relations across measures, consider the relation between the GNG task (COGTASKS) and the affective GNG task (CANTAB). Researchers would not be faulted for thinking that these 2 tasks were assessing the same cognitive operations. However, the correlation between the RTs in the 2 tasks was 0.26, with only 7% shared variance. The devil is in the details in this comparison. In the COGTASKS version of the task, simple nonverbal visual forms (vertical and horizontal bars) were used as the go and no-go stimuli. In comparison, in the CANTAB version of the task, the stimuli were words that varied in affective valence. Consequently, even though the 2 tasks had very similar names, and putatively were assessing the same cognitive constructs, the internal computations required by the 2 tasks were quite different. In the COGTASKS version, the test stimulus needed to be properly categorized by way of a learned association, and then a response needed to be either withheld or given. In contrast, in the CANTAB version, retrieval of the word's meaning from semantic memory was required, then the word needed to be properly categorized by a learned association as either a go or a no-go stimulus, all while a potentially interfering or facilitating affective response was being computed. These are 2 very different sets of cognitive and affective operations, so it should not be surprising that the relation between the 2 tasks was rather weak.
We believe that the central conclusion from this work is that, in many cases, construct validity and a concern for underlying biological mechanisms need to be at least as important as population norms and the ability to connect with existing literatures. It all comes down to what needs to be measured. If the questions are at the level of general cognitive functioning, independent of any specific biological state, then packages such as CANTAB and WAIS-IV are very useful. However, if it is the case (as is true with variations in iron status) that the biological state needs to be considered, given the nonuniform distribution of iron in the brain (16), then a more sensitive approach would be to select cognitive assessments based on differential involvement of the specific brain regions that are dependent on iron. The advantages of this more nuanced measurement approach were commented on >30 y ago in the nutrition literature (30) and they remain true. Fortunately, the cognitive neuroscience literature is quite rich, containing multiple sources of evidence helpful in selecting measurements.
Beyond selecting tasks, it is important to understand the nuances of experimental design and data analysis. Returning to the GNG task, simply varying the percentage of go vs. no-go trials can dramatically change the pattern of results. In addition, there are generally accepted practices in analyzing cognitive data that are often left unsaid in publications. For example, when analyzing RTs, it is critical that RTs <200 ms or longer than (for example) 2000 ms be removed from the data, as these reflect anticipatory responses and lapses of attention, respectively. Furthermore, since the distribution of RT data is not Gaussian at the individual subject level, the summary statistic for each subject should be the median rather than the mean and should be calculated either only for correct responses or for correct and error responses separately.
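Those cleaning conventions can be sketched as follows (the cutoffs are the example values given above; the trial format and function name are illustrative):

```python
from statistics import median

def summarize_rts(trials, fast_cutoff=200, slow_cutoff=2000):
    """Per-subject RT summary: drop anticipations (< fast_cutoff ms) and
    lapses (> slow_cutoff ms), then take the median of the remaining
    correct trials, since single-subject RT distributions are skewed."""
    kept = [t["rt"] for t in trials
            if t["correct"] and fast_cutoff <= t["rt"] <= slow_cutoff]
    return median(kept) if kept else None

# Hypothetical trial list for one subject (RTs in ms):
trials = [
    {"rt": 150, "correct": True},   # anticipation -> dropped
    {"rt": 420, "correct": True},
    {"rt": 510, "correct": True},
    {"rt": 460, "correct": False},  # error -> excluded from correct-RT summary
    {"rt": 2600, "correct": True},  # lapse -> dropped
    {"rt": 480, "correct": True},
]
subject_rt = summarize_rts(trials)  # median of [420, 510, 480] = 480
```

Error-trial RTs, if analyzed, would be summarized separately by the same logic.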
One significant strength of the present study is that the same participants completed all 3 sets of measures, allowing for much stronger inferences regarding the level of shared variance. An additional strength is that the 2 samples differed primarily in terms of SES and educational preparation, 2 factors that are known to influence scores on cognitive tests. This allowed us to quantify the extent to which each set of tasks would be subject to the potentially confounding influences of differences in SES and educational preparation. One weakness of the present study is that, relative to the general population, the 2 samples of college students can be assumed to be higher performing, which does pose some limits to generalizability. That being noted, the level of performance on the COGTASKS was comparable to what we have obtained with college-aged women in Rwanda (2), male and female adolescents in India (4), and women of reproductive age in India (14). A second potential weakness is that the inferences drawn here are limited to tests of the cognitive constructs represented in the overlap among the 3 sets of tasks. A third weakness is that the results, drawn as they are from a healthy population, do not necessarily generalize to individuals with specific dietary insufficiencies. However, we would expect similar results with other cognitive constructs (e.g., measures of executive function).
Perhaps, then, the last conclusion to be drawn is that this is an area that can benefit immensely from interdisciplinary collaborations. Certainly, it has been our experience that cross-talk between nutritional science and cognitive neuroscience has been quite fruitful. We would further argue that it has allowed for measurement that is more biologically grounded and stronger in terms of construct validity than would have been the case otherwise.