Abstract

Objectives

Measuring cognition in an aging population is a public health priority. A move towards survey measurement via the web (as opposed to phone or in-person) is cost-effective but challenging as it may induce bias in cognitive measures. We examine this possibility using an experiment embedded in the 2018 wave of data collection for the U.S. Health and Retirement Study (HRS).

Methods

We utilize techniques from multiple group item response theory to assess the effect of survey mode on performance on the HRS cognitive measure. We also study the problem of attrition by attempting to predict dropout and via approaches meant to minimize bias in subsequent inferences due to attrition.

Results

We find evidence of an increase in scores for HRS respondents who are randomly assigned to the web-based mode of data collection in 2018. Web-based respondents score higher in 2018 than experimentally matched phone-based respondents, and they show much larger gains relative to 2016 performance and subsequently larger declines in 2020. The differential in favor of web-based responding is observed across all items, but is most pronounced for the Serial 7 task and numeracy items. Due to the relative ease of the web-based mode, we suggest that a cutscore of 12, rather than 11, be used to indicate CIND (cognitively impaired but not demented) status when the web-based version is used.

Discussion

The difference in mode may be nonignorable for many uses of the HRS cognitive measure. In particular, it may require reconsideration of some cutscore-based approaches to identify impairment.

As the population of the United States and the world ages, health issues that arise as a function of this increased longevity are becoming more salient. A key such health condition is age-related decline in cognitive functioning (Salthouse, 2009). As a result, there is an increased focus on measuring cognitive functioning in surveys that aim to understand the health and well-being of older populations (and other populations, Biemer et al., 2022). The Health and Retirement Study (HRS, Sonnega et al., 2014) is designed to provide such information. While the HRS is a US-based cohort of people over age 50 and their spouses, it is also the template for a broader range of global studies focused on aging (Sonnega et al., 2014). The HRS measures cognition among its respondents (Crimmins et al., 2011), but there are numerous challenges associated with this measurement task.

One key challenge has to do with conducting comparable measurements across different measurement modalities (e.g., phone vs face-to-face responding). From a psychometric perspective, such variation would be an example of the more general problem of measurement invariance (Millsap, 2011). The quality of measurement may vary as a function of survey modality; understanding this variation and its implications is essential to proper subsequent use of the resulting measures. Mode effects are a generic problem in survey measurement; for example, there was concern in the period prior to the 2016 election that voters may be less likely to express support for specific candidates in some survey formats (Smalley & Wolf, 2022). In the context of the HRS, previous work has identified mode effects associated with web-based responding for other components (e.g., measures of depression and physical activity) of the survey (Cernat et al., 2016). Mode effects exist due to both changes required given the features of a modality (i.e., visual components would not be possible on a phone-based survey) and to differences on the part of a respondent when they encounter otherwise identical items in one mode versus another (see the review by Dillman & Christian, 2005).

Previous work has focused on the specific problems of mode effects in surveys targeting cognition; more detailed reviews regarding how modality may affect cognition can be found elsewhere (Ofstedal et al., 2021). One consideration pertains to the way that different modalities may engender different cognitive processes (i.e., the “extended mind” hypothesis; Clark & Chalmers, 1998); this hypothesis could help explain why identical items presented in different modalities still lead to different responses. Other considerations include possible differences when survey items are presented visually (as they may be in a web-based survey) versus aurally (as they would be in a phone-based format); some research has suggested that these different modalities lead to related but not identical measures of political knowledge (Munzert & Selb, 2017).

The overarching concern here is that the introduction of a web-based survey mode in the HRS may be “easier” than the phone-based mode given that respondents can quickly utilize computing tools (e.g., text editors, calculators) when they engage with the survey questions via the web (there is some possibility that such aids could be used in phone-based administration as well). Such behavior has previously been a concern in, for example, questions of political knowledge (Clifford & Jerit, 2016). Compounding the challenge is that, by the nature of how HRS categorizes respondent eligibility for the cognition battery, many HRS respondents may not be eligible for the web-based survey.

This paper attempts to assess the possibility of mode effects focusing on the novel deployment of the HRS cognitive module (Ofstedal & Fisher, 2005) via both the web and phone in 2018. To do this, we deploy techniques from item response theory (IRT; van der Linden & Hambleton, 2013) and the differential item functioning (DIF) literature. Our use of IRT builds on earlier work focusing on the effect of web-based administration on this survey (Ofstedal et al., 2021; note, this earlier work was done using data from the 2012 to 2014 waves of HRS data collection) and other recent attempts to use latent variable models for this purpose (Gatz et al., 2023). A key benefit of our IRT-based approach is that it allows us to focus on item difficulty, which is our central concern given the nature of the change in modality. In contrast, other approaches such as factor analysis would be useful if multiple facets of cognition were being addressed in the survey and interest was in relative differences in how items loaded onto these facets across different modalities.

Our aims are fourfold. First, we estimate the difference in cognitive functioning between web- and phone-based respondents in 2018 based on a randomized experiment embedded into the HRS design (including attempts to account for the role of noncompliance as not all respondents assigned to the web-based mode completed the survey in that mode). Second, we attempt to use the longitudinal nature of the HRS (and the traditional usage of both phone- and in-person modalities) to provide additional context for these results regarding the web modality. Third, we examine potential variation across item types in the degree to which their functioning may vary in the web-based modality. Fourth, we attempt to probe the stability of rank-ordering of respondents across various waves in an attempt to further understand the implications of mode-based differences in survey responding, especially for widely used cutpoints (Langa et al., 2022). Collectively, these results are important in guiding future research that utilizes web-based cognitive measures and also suggest possible avenues for improving the web-based measurement of cognitive functioning.

Method

Data

The Health and Retirement Study is a biennial survey of U.S. respondents over age 50 and their spouses. It is designed to provide information about the economic well-being and the health of this population; more information about its design is available elsewhere (Sonnega et al., 2014). We focus here on the cognitive measures collected in this survey. The HRS measure of cognitive functioning (Ofstedal & Fisher, 2005) is based on the Telephone Interview for Cognitive Status survey for assessing cognitive status via the telephone (Bugliari et al., 2016). It includes tasks that tap respondents’ memory, mental status, and vocabulary. Although we utilize a latent variable approach, we consider many of the same data elements that are used to construct the (widely used) “cogtot” measure constructed by RAND (Bugliari et al., 2016).

Prior to 2018, the cognitive measures were collected using phone and face-to-face modalities. Earlier work has examined whether mode effects may challenge measurement when the survey is administered via both phone and face-to-face modalities (Herzog & Rodgers, 1998); results tended to support the exchangeability of scores from those two modes. Respondents alternate between these two modes across consecutive waves of HRS data collection. We leverage the fact that different modalities are used by the HRS in the design of our study; in particular, we randomize eligible respondents to modality. More recent research (Ofstedal et al., 2021) has begun to challenge the view that previously used modalities are interchangeable; the situation is now more complex with the introduction of the web-based mode. We focus here on an experiment embedded into the HRS’s 2018 administration that allows for reasonably straightforward analysis of mode effects; we now turn to details of this experiment.

Experimental Design

The HRS embedded an experiment into the 2018 wave of data collection to study the potential mode effect of introducing the cognitive interview via the web. Respondents that were assigned to the face-to-face mode in 2016 were randomly assigned to either the phone or web-based modes for the 2018 survey. There were a variety of eligibility requirements (see Author Note 1). In particular, eligibility was restricted to households empanelled prior to 2016 in which all respondents identified in their most recent prior interview as internet users. Households in which either respondent last completed an interview in Spanish, by proxy, or resided in a nursing home were excluded, as were households with a pending baseline, exit, or post-exit interview. In total, the experiment included 3,632 respondents, with 2,258 randomly assigned to the web and 1,374 serving as phone-based controls. After further restrictions, we focus analysis on a subset of 2,740 respondents (1,052 of whom were assigned to phone and 1,688 to web; see Author Note 2).

One crucial challenge is that respondents assigned to web-based cognitive surveying could instead opt for phone-based surveying; we denote those who were assigned to web-based responding and completed it in that modality as compliers and those assigned to the web but who used the phone-based modality as noncompliers (n = 371; see Author Note 3). If those respondents are, for example, older or have lower levels of cognitive functioning, then this might lead to bias in our estimates of the mode effect. We thus attempt to model such noncompliance in our analysis of the mode effect. These models are based on a set of covariates pertaining to respondent age and sex, cognitive functioning at the prior wave (see Author Note 4), self-reported health and number of chronic conditions, and whether the respondent was partnered. Descriptive statistics for these variables are shown in Table 1. Compliers and noncompliers are of similar age, but noncompliers are more likely to be female, have a lower level of educational attainment and lower cognitive functioning in 2016, have more health problems, and be unpartnered.

Table 1.

Descriptive Statistics Related to Mode Assignment and Compliance/Noncompliance

Note: The Assigned phone, Compliers, and Noncompliers columns report standardized means; Compliers and Noncompliers are subsets of the group assigned to the web-based mode.

Variable                        n      Mean    Assigned phone   Web compliers   Web noncompliers   p^a
Age                             2,740  67.96    0.01            −0.01            0.01              7.05e−01
Female                          2,740   0.59    0.01            −0.03            0.07              8.41e−02
Highest degree                  2,740   3.10    0.02             0.01           −0.10              8.37e−02
Cognition                       2,664  −0.01   −0.02             0.06           −0.18              1.03e−04
Self-reported health (2018)^b   2,740   2.63    0.00            −0.09            0.30              1.02e−09
# Chronic conditions (2018)     2,740   2.26    0.01            −0.07            0.22              3.00e−06
Partnered (2018)                2,740   0.64   −0.04             0.13           −0.37              5.42e−16

a. p-Value for test of difference in means between compliers and noncompliers.

b. 1 = excellent; 5 = poor.


Statistical Analysis

Prediction of noncompliance

Roughly 22% of respondents assigned to the web selected out of web-based responding in favor of taking the survey via phone. We attempt to model this noncompliance using logistic regression; that is, we model

\[
\Pr(\text{noncompliance}_i = 1) = \sigma(z_i^\top \gamma), \tag{1}
\]

where σ is the logistic sigmoid, σ(x) = 1/(1 + exp(−x)), z_i is a vector of predictors, and γ is the associated coefficient vector. We consider three sets of predictors. The first set is based on respondent demographics. Respondent sex and race are included as well as respondent age; given that age effects may be nonlinear, we first map respondent age onto a b-spline basis (Hastie et al., 2009). The second set is respondent cognitive functioning in 2016 as computed via the IRT approach described later. The third set is a saturated model including the first two sets as well as the additional variables shown in Table 1. Probabilities generated via Equation 1 will be used to adjust subsequent analysis to account for noncompliance as we describe later. We quantify our ability to predict noncompliance using the Area Under the Curve (AUC; Janssens & Martens, 2020) and the Inter-Model Vigorish (IMV; Domingue et al., 2021).
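To make this modeling step concrete, a minimal sketch in R of Equation 1 and the associated AUC computation is given below. The data frame dat, the outcome noncomply, and the predictor names (age, female, race, degree) are hypothetical stand-ins for the HRS analysis file, not the authors' actual variable names.

```r
# Minimal sketch of the noncompliance model (hypothetical variable names).
library(splines)  # bs() provides the b-spline basis for age
library(pROC)     # auc() summarizes discrimination

# dat: one row per respondent assigned to web; noncomply = 1 if the survey was taken by phone
m_demo <- glm(noncomply ~ bs(age, df = 4) + female + factor(race) + degree,
              family = binomial, data = dat)

# Predicted probabilities from Equation 1, later reused as propensity weights
dat$p_noncomply <- predict(m_demo, type = "response")

# Discrimination of the demographic model
auc(dat$noncomply, dat$p_noncomply)
```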

IRT models

We use IRT models (van der Linden & Hambleton, 2013) for the purposes of modeling responses. These models suppose that the cognitive functioning of respondent i is captured by θ_i and that item functioning is captured by some set of parameters. For simplicity, we utilize the Rasch model such that item functioning is completely characterized by difficulty parameters. Given that the cognition data take the form of both dichotomously and polytomously scored items, we use the partial credit model formulation (Muraki, 1992); specifically, see the R function “gpcmIRT” (Chalmers, 2012). For item j, the difficulty of the kth response option is β_jk. If item j is scored 0, 1, ..., K − 1 (i.e., it has scores in K categories), we can write the probability of person i scoring in the kth category as

\[
\Pr(X_{ij} = k \mid \theta_i) = \frac{\omega_{jk}}{\Lambda_j}, \tag{2}
\]

where

\[
\omega_{jk} = \exp\!\left(\sum_{l=0}^{k} (\theta_i - \beta_{jl})\right), \qquad \beta_{j0} \equiv 0, \tag{3}
\]

and Λ_j = Σ_k ω_jk. If K = 2, then Equation 2 simplifies to the standard one-parameter logistic model for dichotomous item responses:

\[
\Pr(X_{ij} = 1 \mid \theta_i) = \frac{\exp(\theta_i - \beta_{j1})}{1 + \exp(\theta_i - \beta_{j1})}.
\]

For those unfamiliar with IRT models, note that this model has a form similar to that of common logistic regression models; the challenge in this setting is that neither θ_i nor β_jk is directly observed.

Here we have two groups: phone- and web-based survey modality. Multiple-group models (Bock & Zimowski, 1997) allow us to assume separate priors on ability for respondents of different groups. We will assume that θ_phone ~ N(μ_phone, σ²_phone) and θ_web ~ N(μ_web, σ²_web). For purposes of identification, we’ll assume that μ_phone = 0 and σ²_phone = 1. We will then estimate, using the R function “multipleGroup” (Chalmers, 2012), μ_web and σ²_web, which will be the basis for our identification of the overall mode effect. Code to replicate our analysis is available (see Author Note 5).
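As an illustration of the multiple-group estimation, a minimal sketch using the mirt package is below. The response matrix resp and the grouping vector mode (with levels “phone” and “web”) are hypothetical placeholders, and the exact call used in the replication code may differ.

```r
# Minimal sketch of multiple-group Rasch/partial credit estimation with mirt.
library(mirt)

# resp: matrix of scored item responses; mode: factor with levels "phone" and "web"
fit <- multipleGroup(resp, model = 1, group = mode,
                     itemtype = "Rasch",
                     invariance = c("free_means", "free_var",  # free focal-group mean/variance
                                    colnames(resp)))           # hold item parameters equal across groups

# Group parameters for the web group (reference group assumed fixed at mean 0, variance 1)
coef(fit)$web$GroupPars
```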

Adjustment of noncompliance

Conventional analysis of experiments would use the propensity weights in, for example, inverse probability weighting schemes to correct for noncompliance (Cole & Hernan, 2008). The IRT-based approach we utilize here for identification of the treatment effect does not readily allow for such weighting, so we consider two alternative approaches using the propensity weights. We first exclude phone-based respondents whose propensity score (calculated via Equation 1) is above the τth quantile of the propensity-score distribution among those who selected out of web-based responding. This approach is primarily intended to be suggestive because it does not clearly lead to groups that can be compared (e.g., it is not simply the web-based respondents who were above a certain quantile on the propensity score who selected into phone-based responding). We next simulated 10 data sets in which inclusion of respondents assigned to phone was randomly decided via the propensity weights (i.e., for a weight of w, a respondent was included in the sample with probability 1 − w); we then separately analyzed these data sets and took the mean of μ_web and σ²_web across the 10 simulated data sets.
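The simulation-based adjustment can be sketched as follows; here p_noncomply is the propensity from Equation 1 applied to the phone-assigned respondents, and fit_multigroup() is a hypothetical wrapper around the multiple-group estimation above that returns the estimated web mean and variance.

```r
# Minimal sketch of the simulation-based noncompliance adjustment (hypothetical helpers).
set.seed(1234)

fit_once <- function(phone_dat, web_dat) {
  # keep each phone-assigned respondent with probability (1 - propensity of noncompliance)
  keep <- runif(nrow(phone_dat)) < (1 - phone_dat$p_noncomply)
  fit_multigroup(rbind(phone_dat[keep, ], web_dat))  # hypothetical; returns c(mu_web, var_web)
}

sims <- replicate(10, fit_once(phone_dat, web_dat))
rowMeans(sims)  # mean of mu_web and sigma^2_web across the 10 simulated data sets
```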

Additional analyses

To assess mode effects across bundles of items (the delayed recall items, the immediate recall items, the serial 7s, and the numeracy items), we conduct basic DIF (Camilli, 2006) analyses, regressing (using ordinary least squares) the standardized bundle score on the overall mean score plus a group indicator.
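A minimal sketch of one such bundle-level regression is below; bundle_score, total_score, and web are hypothetical column names for the bundle score, the overall mean score, and the mode indicator.

```r
# Minimal sketch of a bundle-level DIF regression (hypothetical column names).
# bundle_score: score on one bundle (e.g., the Serial 7s)
# total_score:  overall mean score across all items
# web:          1 = web-based respondent, 0 = phone-based respondent
dif_fit <- lm(scale(bundle_score) ~ scale(total_score) + web, data = dat)
summary(dif_fit)$coefficients["web", ]  # coefficient on the group indicator
```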

To analyze rank-order stability of cognitive scores computed at different time points, we computed Spearman correlations.

To analyze implications for certain cutpoints on the cognitive scale, we computed empirical cumulative density functions (ECDFs). For the web-based respondents, this required imputing scores for the backward counting items, which we did using previously established techniques (McCammon et al., 2023). Note that this analysis relies on two assumptions. First, the use of imputed scores for web-based respondents assumes that the item-specific mode effects (as per Section “Item-Specific Mode Effects”) for the unobserved items are consistent with those observed here. Second, we ignore attrition (i.e., we remove attriters from the analysis and do not attempt to adjust for this fact). The mean of the imputed web scores for backward counting is 1.94, with 96.7% of cases scoring a two (i.e., the highest score). For control group cases, the mean is 1.92, with 95.9% scoring a two. Thus, nearly everyone attains a perfect score in both modalities, which limits the potential for imputation as a source of bias.
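Constructing the ECDFs amounts to the following sketch; sum27_phone and sum27_web are hypothetical vectors holding the (partly imputed) 27-point sum scores by mode.

```r
# Minimal sketch of the ECDF comparison (hypothetical score vectors).
F_phone <- ecdf(sum27_phone)  # empirical CDF of phone-based sum scores
F_web   <- ecdf(sum27_web)    # empirical CDF of web-based sum scores (with imputed items)

# Proportion at or below the interviewer-mode CIND threshold of 11 in each mode
F_phone(11)
F_web(11)
```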

We also imputed small numbers of cases where data were missing for other measures so as to be able to construct sum scores using a previously discussed approach (McCammon et al., 2023). The number of imputed cases is shown in Supplementary Table 3. These imputations effectively embed the mode effect in the imputed data. Even if the mode effect differs among the missing respondents, we argue it would lead to relatively little bias here. Consider the following simple example: suppose that there is a mode effect of ϵ among the observed web-based respondents on one class of items, whereas the effect is 2ϵ among the unobserved. Standard analysis of imputed data would lead to an estimate of ϵ, whereas in the complete data the true pooled effect would be (1,258ϵ + 59 × 2ϵ)/1,317 = 1.04ϵ if we use the sample sizes from the delayed recall measures, which required the most extensive imputations. The resulting bias (1.04ϵ − ϵ = 0.04ϵ) is fairly minimal; we view it as unlikely to affect our key findings.

Results

Understanding Noncompliance

We first focus on the problem of noncompliance (i.e., the fact that people assigned to the web-based modality could opt to take the phone survey). As we may anticipate given the results in Table 1, demographics—that is, age (splined), gender, race (as a factor), and education (highest degree)—are clear predictors of noncompliance (AUC = 0.64). Their predictive strength can be compared to various benchmarks. For example, the AUC is far less than AUCs of above 0.8 for predicting mortality in these data using similar demographic information (Domingue et al., 2017).

We also consider cognition in the 2016 wave as a predictor. As a stand-alone predictor, cognition in the prior wave is a relatively weak predictor of noncompliance (AUC = 0.57). The fact that cognition is a weak predictor is reassuring given that it suggests noncompliance might be largely ignorable for the purposes of estimating the treatment effect. We then consider models that include all variables from Table 1. Crucially, cognition in 2016 does not provide additional information net of these other predictors and also leads to some missingness (the IMV calculated via out-of-sample data suggests that inclusion of the cognition variable results in only very minor changes to prediction quality). Thus, we use predictions from the full model minus cognition (final row in Table 2) in subsequent analyses that utilize propensity weights.

Table 2.

Predicting Noncompliance Among Those Assigned to Web-Based Responding.

Predictors      N      n Noncompliant   AUC     IMV
Demo+Edu        1,688  371              0.642   0.016
Cog (2016)      1,624  358              0.572   0.005
Full^a          1,624  358              0.703   0.035
Full (no Cog)   1,688  371              0.705   0.032

Note: AUC = area under the curve; IMV = Inter-Model Vigorish.

a. All variables in Table 1.


Overall Mode Effects

Using multiple-group IRT estimation, we can assess overall mode effects. We first do so via unadjusted comparison of respondents who completed the cognitive survey according to their assigned modality (i.e., ignoring noncompliance); see the top row of Table 3. This analysis suggests a relatively large mode effect of nearly one third of a standard deviation of respondent ability on the phone-based survey. This contrasts with earlier work on mode effects related to phone/in-person differences that found negligible results (Ofstedal & Fisher, 2005). However, this estimate may be biased given that it does not adjust for noncompliance.

Table 3.

Estimated Group Differences in Cognition.

Adjustment    n Phone   n Web      μ_web   σ²_web
Unadjusted    1,052     1,317      0.32    0.62
τ = 0.99      1,049     1,317      0.32    0.62
τ = 0.95      1,024     1,317      0.30    0.62
τ = 0.9       1,000     1,317      0.28    0.62
τ = 0.8       942       1,317      0.26    0.62
Random        809.80    1,317.00   0.29    0.60

We consider two ways of adjusting for noncompliance (see Section “Adjustment of noncompliance”). We first restrict ourselves to data with respondents who are relatively unlikely to be noncompliant (based on estimates from Equation 1). We then use estimates from Equation 1 to generate multiple data sets in an attempt to simulate the process of compliance. Both approaches—which lead to estimates of approximately 0.29–0.32—suggest that noncompliance (as modeled by the given predictors) leads to only a slight upward bias in the observed mode effect. Given the relatively mild impact of noncompliance, we do not adjust for it in the analyses below; while this might lead to some bias in the resulting estimates, it also allows for more straightforward analysis. We omit those assigned to web but who responded via phone in subsequent analyses.

Longitudinal Analysis

We can use the biennial nature of the HRS to get a longitudinal view of the issue as well. Figure 1A employs the multigroup IRT estimation strategy to estimate abilities of 2,161 individuals assessed between 2012 and 2018. Focusing first on the gray line, two facts are important. First, when we consider same-mode measures, respondents decline over time (scores in 2012 are higher than in 2016, and 2014 scores are higher than 2018 scores). This is reassuring given that such a decline would be expected as cognition generally declines with age (Salthouse, 2009). Second, there seems to be a mode effect such that phone-based scores are higher than face-to-face scores. Turning to the red dot showing the estimated cognitive functioning of web-based respondents, we can see that web-based respondents appear to have much higher levels of cognitive functioning. This is, of course, not plausible given the experimental design. Further, the size of the effect is much larger than the relative advantage that phone-based respondents have over in-person respondents in the previously used modalities.

Figure 1. Cognitive trajectories over the 2012–2020 waves of Health and Retirement Study data collection.

We can also leverage the fact that respondents from this experiment largely took the phone-based cognitive survey in the 2020 wave of data collection. Figure 1B shows that, of the respondents in the experiment who also were assessed in 2020, we see a substantial decrease in functioning for the web-based group in 2018 when they are assessed via phone in 2020. In contrast, the phone-based group performs similarly at both waves.

Item-Specific Mode Effects

We conducted item-specific analysis to test whether mode effects are homogeneous or vary in severity across items. Given the nature of the HRS cognitive interview, we considered four bundles of items: the delayed recall items, the immediate recall items, the Serial 7s, and the numeracy items. Results are shown in Supplementary Table 1. We first note the standardized difference in the bundle sum scores across modes. Web respondents score higher across all bundles; because all bundles show some degree of bias toward web-based respondents, it is challenging to identify differential item functioning for any given bundle in an unbiased manner (Stenhaug et al., 2021).

That said, we can still assess differences in relative magnitudes using DIF techniques. In the DIF columns, we focus on the coefficient associated with web-based responding. Note that the recall items (both delayed and immediate) show negative coefficients. This is due to the problem mentioned earlier (Stenhaug et al., 2021) and should be interpreted as suggesting that bias in favor of web-based responding is less severe on these items as compared with the Serial 7s and numeracy items. In the multigroup columns, we re-estimated group mean differences after dropping each bundle. These analyses confirm that the bias is most severe on the Serial 7 and the numeracy items with the numeracy items showing the most severe bias across all analyses.

Stability of Respondent Rank Ordering

The above results suggest that the web-based version of the cognitive assessment was easier than the phone-based version. This has implications for the use of these data, which is a point we return to in the Discussion. A different question is whether the test was differentially easier for some respondents as compared with others such that it changed the relative ranking of respondents. If the web-based version is just uniformly easier, then it might still be possible to establish cutpoints (e.g., Langa et al., 2022).

To analyze this issue, we consider the rank-order correlations (i.e., Spearman correlations) of 2016 cognitive scores (specifically, the 27-point Langa–Weir score) with 2018 estimates (via the IRT approach emphasized here), computed separately by response modality. Results are shown in Supplementary Figure 1. Overall stability is comparable across modes: the estimates for phone-based respondents have a Spearman correlation of 0.51 with the 2016 cognition scores, whereas the web-based scores are correlated with the 2016 scores at 0.45. We also stratified rank-order correlations by age; see Supplementary Table 2. The results are generally consistent with the findings from Supplementary Figure 1 in that phone-based correlations are somewhat higher than web-based correlations.
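This comparison amounts to computing Spearman correlations by mode; a minimal sketch is below, with hypothetical column names cog2016 (the 2016 Langa–Weir score) and theta2018 (the 2018 IRT estimate).

```r
# Minimal sketch of the rank-order stability analysis (hypothetical column names).
by_mode <- split(dat, dat$mode)  # "phone" vs. "web"
sapply(by_mode, function(d)
  cor(d$cog2016, d$theta2018, method = "spearman", use = "complete.obs"))
```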

Implications for Langa–Weir Cutpoints

We finally consider the implications for the widely used Langa–Weir cutpoints (Langa et al., 2022). These cutpoints are based on the 27-point sum score scale defined by the immediate and delayed recall, backward counting, and serial 7 items (but not the numeracy items). We first discuss the left panel of Figure 2. This panel is based on aggregated θ scores computed via multigroup IRT. For the phone-based respondents, we use their observed score on the 27-point scale underlying the Langa–Weir classifications. The x-axis shows the average θ score for each of the different points. We use imputed scores for missing data. We document differences in missingness rates across the relevant measures as a function of modality (see Supplementary Table 3); as we argue in Section “Additional analyses,” we think that these imputations are unlikely to introduce bias. These results suggest a consistent increase in web-based θ values relative to phone-based θ values for a similar sum score.

Figure 2. Analysis of scores related to Langa–Weir cutpoints (Langa et al., 2022). Left: Scatterplot of mean IRT scale scores for phone (actual) and web (imputed) respondents, with dots scaled to represent the amount of data for each score. Middle: Empirical cumulative density functions (ECDFs) for web- and phone-based sum scores. Right: Cutpoint analysis based on ECDFs.

In the middle panel of Figure 2, we compute ECDFs for the sum scores used in the left panel. Note that the web-based scores are uniformly right-shifted relative to the phone-based scores, which is consistent with the notion of a relatively constant mode effect. Of particular interest are the points on the x-axis shown in blue, which pertain to a key cutpoint in the Langa–Weir framework. In the right panel, we focus on the implications for the threshold separating respondents who are cognitively impaired but not demented (CIND) from those exhibiting normal functioning. On the 27-point scale, this threshold is a value of 11 when assessed using interviewer modalities. Here, 7.2% of the phone-based respondents are at or below that cutscore (note the segments in gray). Given the randomized equivalence of the web- and phone-based groups, we then identify the sum score for the web-based respondents at which a similar proportion of respondents falls at or below. Of the web-based respondents, 5.1% score at or below 11, and 7.9% score at or below 12. Thus, we argue that a cutscore of 12 would be superior to a cutscore of 11 as the maximal score for a person to be classified as CIND based on the web mode.
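Given the randomized equivalence of the two groups, this cutscore translation can be sketched by matching cumulative proportions across the two ECDFs; the sketch below continues the hypothetical F_phone and F_web objects introduced earlier.

```r
# Minimal sketch of the cutscore translation (continues the hypothetical ECDFs above).
target     <- F_phone(11)                 # proportion of phone respondents at or below 11 (about 0.072 here)
candidates <- 0:27                        # possible scores on the 27-point Langa-Weir scale
web_props  <- F_web(candidates)           # cumulative proportion of web respondents at or below each score
candidates[which.min(abs(web_props - target))]  # nearest match; about 12 in these data
```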

Discussion

Mode effects can be a serious threat to the inferences made based on psychological measures. Given the ubiquity of devices and the relative challenge of getting people to respond to surveys via the phone, a transition to web-based surveying is of clear interest to surveys like the HRS. However, that transition also leads to challenges associated with the possibility of mode effects. Using an experiment embedded in the 2018 survey, we find that web-based respondents do much better than expected on the survey; these results align with others (Ofstedal et al., 2022) suggesting some degree of difference in web-based responding on cognitive surveys. While we do not test hypotheses for why this might be, one clear possibility is that respondents are making use of the affordances offered by a computer (e.g., computational power) in a way that they do not when responding via phone. Alternatively, differences between the visual presentation of survey questions in web surveys and the aural communication required in telephone surveys could be affecting the cognitive strategies respondents are deploying to generate responses. Web-based respondents can review questions and the corresponding response categories, while telephone respondents typically hear the information read to them once (Dillman et al., 1995). The cognitive burden required to process the survey question, formulate an answer, and identify the most appropriate answer category may then be greater for telephone respondents. Further, while we focus on comparisons to phone-based assessment, recent reports (Smith et al., 2023; see also Figure 1) suggest that phone-based assessment may itself have a mode effect such that it is easier than in-person assessment. This may further complicate comparisons of web and face-to-face scores.

One potential threat to the study design is that respondents assigned to web-based responding could still opt into phone-based responding. While there was some clear patterning as to what kinds of respondents made that switch (e.g., they were less well-educated and in relatively poorer health), adjustments for this type of noncompliance suggested that it led to minimal bias in our estimates of the mode effect. We also took advantage of the longitudinal structure of the HRS data to put the web effect in context. Web-based respondents show gains on the 2018 survey and the web effect is larger than the relatively modest advantage that accrues to phone-based responding vis à vis in-person responding. Further, there is an appreciable decline in 2020 when the survey is administered via phone for those who took the survey in 2018 via the web.

We acknowledge limitations. A primary one pertains to the issue of generalizability, given that respondents had to meet specific criteria to be included in the experiment. Respondents who are not as familiar (due to, for example, age) with digital devices may not benefit in the same way from web-based responding. Further, our results may not be informative about respondents suffering from serious cognitive impairment as they also were not eligible. In addition, other approaches are available for the analysis of (and adjustment for) mode effects (e.g., Kolenikov & Kennedy, 2014); usage of such approaches may provide additional information about the role of web-based responding in how the HRS measures cognition. Our results related to the Langa–Weir cutpoints also utilize imputed values; while we have argued that the imputations are unlikely to induce substantial bias, this is an important caveat. One restriction imposed by our use of imputed values is that the thresholds identified here are of limited utility outside the context of the HRS studies given their reliance on this imputation. Finally, our results only consider responses; future work could perhaps incorporate response time to further interrogate differences in response behavior as a function of survey modality.

Implications and Recommendations

Our findings have implications for the HRS. For clarity, we list them below:

  • Direct comparisons of scores derived from the web to those from other modalities will likely be misleading due to the existence of mode effects. In particular, web-based scores can be expected to be slightly higher than phone-based scores.

  • Adjustments to make scores directly comparable are perhaps possible (Kolenikov & Kennedy, 2014) but will need to be used with great care and careful attention to the assumptions underlying the relevant models.

  • In studies that focus on classifications of cognitive function, classifications of CIND from the web-based measure may want to use a threshold of 12 and below (in place of the “11 and below” threshold that has been used previously, Langa et al., 2022).

Changes in technology and people's expectations about it will require changes to conventional survey methods. As more of the HRS transitions to web-based assessment, there will be a keen need for attention to the resultant changes in what the HRS is measuring about respondents. Surveys like the HRS will need to use techniques along the lines of those deployed here to help calibrate the impacts of such changes on measures of key quantities.

Author Notes

  1. Official documentation on the matter can be found at: https://hrsdata.isr.umich.edu/sites/default/files/documentation/data-descriptions/1652380440/h18dd.pdf

  2. We excluded 143 respondents who responded via one of our nonfocal modalities (e.g., face-to-face or via the web using a small screen device) and 749 nonrespondents.

  3. Note that the HRS has continued to allow respondents assigned to the web to complete surveys via phone.

  4. Cognitive functioning at the prior wave was calculated using the multigroup IRT approach introduced in Section “IRT models”.

  5. https://github.com/ben-domingue/hrsweb

Funding

Funded by the Jacobs Foundation.

Conflict of Interest

None declared.

References

Biemer, P. P., Harris, K. M., Burke, B. J., Liao, D., & Halpern, C. T. (2022). Transitioning a panel survey from in-person to predominantly web data collection: Results and lessons learned. Journal of the Royal Statistical Society Series A: Statistics in Society, 185(3), 798–821. doi:10.1111/rssa.12750

Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In W. van der Linden (Ed.), Handbook of modern item response theory (pp. 433–448). Springer.

Bugliari, D., Campbell, N., Chan, C., Hayden, O., Hurd, M., Main, R., Mallett, J., McCullough, C., Meijer, E., Moldoff, M., Pantoja, P., Rohwedder, S., & St.Clair, P. (2016). RAND HRS data documentation, version P. RAND Center for the Study of Aging. https://hrsonline.isr.umich.edu/modules/meta/rand/randhrsp/randhrs_P.pdf

Camilli, G. (2006). Test fairness. In R. Brennan (Ed.), Educational measurement (Vol. 4, pp. 221–256). Rowman & Littlefield Publishers.

Cernat, A., Couper, M. P., & Ofstedal, M. B. (2016). Estimation of mode effects in the Health and Retirement Study using measurement models. Journal of Survey Statistics and Methodology, 4(4), 501–524. doi:10.1093/jssam/smw021

Chalmers, R. P. (2012). mirt: A Multidimensional Item Response Theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. doi:10.18637/jss.v048.i06

Clark, A., & Chalmers, D. (1998). The extended mind. Analysis, 58(1), 7–19. doi:10.1093/analys/58.1.7

Clifford, S., & Jerit, J. (2016). Cheating on political knowledge questions in online surveys: An assessment of the problem and solutions. Public Opinion Quarterly, 80(4), 858–887. doi:10.1093/poq/nfw030

Cole, S. R., & Hernan, M. A. (2008). Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology, 168(6), 656–664. doi:10.1093/aje/kwn164

Crimmins, E. M., Kim, J. K., Langa, K. M., & Weir, D. R. (2011). Assessment of cognition using surveys and neuropsychological assessment: The Health and Retirement Study and the Aging, Demographics, and Memory Study. The Journals of Gerontology, Series B: Psychological Sciences and Social Sciences, 66B(Suppl_1), i162–i171. doi:10.1093/geronb/gbr048

Dillman, D. A., Brown, T. L., Carlson, J. E., Carpenter, E. H., Lorenz, F. O., Mason, R., Saltiel, J., & Songster, R. L. (1995). Effects of category order on answers in mail and telephone surveys. Rural Sociology, 60(4), 674–687. doi:10.1111/j.1549-0831.1995.tb00600.x

Dillman, D. A., & Christian, L. M. (2005). Survey mode as a source of instability in responses across surveys. Field Methods, 17(1), 30–52. doi:10.1177/1525822x04269550

Domingue, B., Rahal, C., Faul, J., Freese, J., Kanopka, K., Rigos, A., Stenhaug, B., & Tripathi, A. (2021). The InterModel Vigorish (IMV) as a flexible and portable approach for quantifying predictive accuracy with binary outcomes [Preprint]. SocArXiv, 1–23. doi:10.31235/osf.io/gu3ap

Domingue, B. W., Belsky, D. W., Harrati, A., Conley, D., Weir, D. R., & Boardman, J. D. (2017). Mortality selection in a genetic sample and implications for association studies. International Journal of Epidemiology. doi:10.1093/ije/dyx041

Gatz, M., Schneider, S., Meijer, E., Darling, J. E., Orriens, B., Liu, Y., & Kapteyn, A. (2023). Identifying cognitive impairment among older participants in a nationally representative internet panel. The Journals of Gerontology: Series B, 78(2), 201–209. doi:10.1093/geronb/gbac172

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer.

Herzog, A. R., & Rodgers, W. L. (1998). Cognitive performance measures in survey research on older adults. In N. Schwarz, D. Park, B. Knauper, & S. Sudman (Eds.), Cognition, aging and self-reports. Psychology Press. doi:10.4324/9780203345115

Janssens, A. C. J. W., & Martens, F. K. (2020). Reflection on modern methods: Revisiting the area under the ROC curve. International Journal of Epidemiology, 49(4), 1397–1403. doi:10.1093/ije/dyz274

Kolenikov, S., & Kennedy, C. (2014). Evaluating three approaches to statistically adjust for mode effects. Journal of Survey Statistics and Methodology, 2(2), 126–158. doi:10.1093/jssam/smu004

Langa, K., Weir, D., Kabeto, M., & Sonnega, A. (2022). Langa–Weir classification of cognitive function (1995–2018). Survey Research Center. https://hrsdata.isr.umich.edu/sites/default/files/documentation/data-descriptions/1680034270/Data_Description_Langa_Weir_Classifications2020.pdf

McCammon, R., Fisher, G., Hassan, H., Faul, J. D., Rodgers, W. L., & Weir, D. R. (2023). Cross-wave imputation of cognitive functioning measures 1992–2020. Survey Research Center. https://hrsdata.isr.umich.edu/sites/default/files/documentation/data-descriptions/1676481563/COGIMP9220_dd.pdf

Millsap, R. E. (2011). Statistical approaches to measurement invariance. Routledge.

Munzert, S., & Selb, P. (2017). Measuring political knowledge in web-based surveys: An experimental validation of visual versus verbal instruments. Social Science Computer Review, 35(2), 167–183. doi:10.1177/0894439315616325

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1), i–30. doi:10.1002/j.2333-8504.1992.tb01436.x

Ofstedal, M. B., & Fisher, G. (2005). Documentation of cognitive functioning measures in the Health and Retirement Study. Institute for Social Research, University of Michigan. doi:10.7826/ISR-UM.06.585031.001.05.0010.2005

Ofstedal, M. B., Kézdi, G., & Couper, M. P. (2022). Data quality and response distributions in a mixed-mode survey. Longitudinal and Life Course Studies, 13, 1–26. doi:10.1332/175795921X16494126913909

Ofstedal, M. B., McClain, C. A., & Couper, M. P. (2021). Measuring cognition in a multi-mode context. Advances in Longitudinal Survey Methodology, 250–271. doi:10.1002/9781119376965.ch11

Salthouse, T. A. (2009). When does age-related cognitive decline begin? Neurobiology of Aging, 30(4), 507–514. doi:10.1016/j.neurobiolaging.2008.09.023

Smalley, H. K., & Wolf, C. (2022). Building a framework for mode effect estimation in United States presidential election polls. Statistics, Politics and Policy, 13(1), 41–56. doi:10.1515/spp-2021-0024

Smith, J. R., Gibbons, L. E., Crane, P. K., Mungas, D. M., Glymour, M. M., Manly, J. J., Zahodne, L. B., Rose Mayeda, E., Jones, R. N., & Gross, A. L. (2023). Shifting of cognitive assessments between face-to-face and telephone administration: Measurement considerations. The Journals of Gerontology, Series B: Psychological Sciences and Social Sciences, 78(2), 191–200. doi:10.1093/geronb/gbac135

Sonnega, A., Faul, J. D., Ofstedal, M. B., Langa, K. M., Phillips, J. W., & Weir, D. R. (2014). Cohort profile: The Health and Retirement Study (HRS). International Journal of Epidemiology, 43(2), 576–585. doi:10.1093/ije/dyu067

Stenhaug, B., Frank, M. C., & Domingue, B. (2021). Treading carefully: Agnostic identification as the first step of detecting differential item functioning. doi:10.31234/osf.io/974vw

van der Linden, W. J., & Hambleton, R. K. (2013). Handbook of modern item response theory. Springer Science & Business Media.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights)
Decision Editor: Alyssa Gamaldo, PhD (Psychological Sciences Section)