Timed “Up & Go” Dual-Task Tests: Age- and Sex-Specific Reference Values and Test–Retest Reliability in Cognitively Healthy Controls

Objective. The purpose of the study was to establish reference values for the Uppsala-Dalarna Dementia and Gait (UDDGait) Timed “Up & Go” dual-task (TUGdt) test variables in cognitively healthy adults and to assess these variables’ test–retest reliability. Methods. For reference values, 166 participants were recruited with approximately equal numbers and proportions of women and men in the age groups 50 to 59, 60 to 69, 70 to 79, and 80+ years (mean age = 70 years, age range = 50–91 years, 51% women). For reliability testing, 43 individuals (mean age = 69 years, age range = 50–89 years, 51% women) were recruited. Two dt tests were carried out: TUGdt naming animals and TUGdt months backward, representing 8 test variables: time scores, costs (the relative difference between single-task and dt time scores), “number of animals,” “number of months,” “animals/10 seconds,” and “months/10 seconds.” Reference ranges for the variables were established by quantile regression in age- and sex-specific groups. For reliability, intraclass correlation coefficients (ICCs), standard error of measurement, minimal detectable change, and Bland–Altman plots were used. Results. Reference values for the TUGdt test variables are presented for the 2.5th and 97.5th percentiles. The reliability of TUGdt time scores was excellent (ICCs between 0.85 and 0.86). “Number of animals” and “animals/10 seconds” as well as “months/10 seconds” showed fair to good levels of reliability (ICCs between 0.45 and 0.58), whereas the reliability for both cost measures and “number of months” was poor (ICCs between 0.34 and 0.39). Conclusion. Normative reference values, potentially useful for clinical and research purposes, were presented in 4 age- and sex-specific groups from 50 years and older. Reliability for the test variables varied between poor and excellent, the lower estimates partly explained by some variables being the ratio of 2 other variables. 
In UDDGait, TUGdt tests are intended for diagnostic and predictive purposes, for which these tests are promising and require further investigations. Impact. Normative reference values and test–retest reliability results for the UDDGait TUGdt test variables were presented. These results should be useful for both clinical and research purposes.


Introduction
Dementia is an important and growing global public health concern. 1 Currently available methods for identifying dementia are costly and time-consuming, and new tools have been called for. 2 In an increasing number of studies, dual-task (dt) tests that involve simultaneous performance of a mobility task and an attention-demanding verbal task have been investigated as potential tools for diagnosing or predicting dementia. [3][4][5] A reduced dt ability in old age may be the result of normal age-related changes 6,7 and/or early pathological alterations in cognition, 3,8 in both cases possibly related to declining attentional resources or an executive dysfunction. 9,10 The main aim of our ongoing Uppsala-Dalarna Dementia and Gait study (UDDGait) 11 is to explore the possibility of using dt testing in the early detection of dementia. We use dt tests based on the mobility test Timed "Up & Go" 12 (TUG) and a verbal task: (1) naming animals (TUGdt NA), and (2) reciting months in reverse order (TUGdt MB). Our results have brought about a particular focus on the test variable "animals/10 s" (ie, the number of different animals named per 10 s of TUGdt NA). "Animals/10 s" correlates with concentrations of the Alzheimer disease cerebrospinal fluid biomarkers t-tau and p-tau among participants undergoing memory assessment 13 and has a high discriminative capacity for differentiating between groups of individuals with dementia, mild cognitive impairment, subjective cognitive impairment, and healthy controls. 14 Most importantly, animals/10 s has shown an excellent capacity for predicting dementia incidence in younger participants (<72 years) with subjective or mild cognitive impairment. 15 To enable further interpretation of our results as well as to explore the clinical usefulness of TUGdt tests, normative reference values and investigations of reliability are required.
Normative reference values are useful clinical tools because they allow for an objective detection of deviating test performances. To be used as tools in the interpretation of test results, reference values should be calculated from a sample of individuals whose characteristics match those of the individual in question, and the testing procedures should be alike. 16 Test-retest reliability (that is, the consistency of test results from one time to another) is central both in clinical work and in research. In repeated measurements, variability is to be expected due to biological error (differences in daily features of the test person) and technical error (differences in measurement procedures). 17 Such errors may be systematic (eg, learning or fatigue effects, failing equipment) or random (eg, biological or mechanical variations). The reliability estimated in a test-retest setting indicates whether a change in an individual's test performance represents a real change in status or is caused by random variability.
To our knowledge, only 1 dt study has presented normative reference values, in which TUG was performed as quickly and safely as possible while counting aloud backward in threes from 100. 18 However, those reference values are useful for only that specific test and not for other dt tests with other prerequisites. Regarding reliability, several dt tests have been investigated, where test-retest reliability for dt gait speed or time scores has been found to be fair to excellent among cognitively healthy controls. [18][19][20][21][22][23] Measures of dt cost (ie, the relative difference in time or gait speed between single-task and dt performance) and verbal performances, however, have shown lower levels of reliability. 19,22 The original single-task tests used as components in the UDDGait TUGdt NA and MB have previously been investigated for reliability. The test-retest reliability of the original single-task TUG is high among younger and older cognitively healthy adults. 20,23,24 The test-retest reliability of the verbal fluency test naming animals is moderate to good for number of animals produced during 60 s, 25,26 and the months backward test has excellent reliability for duration and number of errors. 27,28 Our aims in this current study were to establish clinically useful age- and sex-specific reference values for the UDDGait TUGdt NA and TUGdt MB variables among cognitively healthy adults and to assess the test-retest reliability for these variables.

Methods
We conducted an observational study with (1) a cross-sectional design to investigate normative reference values for TUGdt NA and MB variables, and (2) a repeated-measures design to assess these variables' test-retest reliability. When applicable, we referred to the Consensus-Based Standards for the Selection of Health Measurement Instruments (COSMIN) guidelines. 29,30 The Regional Ethical Review Board in Uppsala approved this study, and informed consent was obtained from all participants prior to study commencement.

Participants
Participants were recruited through flyers and advertisements in local papers in Uppsala, Sweden, and Swedish-speaking Åland, Finland. For reference values, 166 individuals were recruited in Uppsala. Inclusion criteria were as follows: 50 years or older, no awareness of cognitive decline, ability to rise from a chair and walk 3 m back and forth without the use of walking aids, no need of an interpreter to communicate in Swedish, and a Mini-Mental State Examination 31 (MMSE) score ≥27 at the time of assessment. One individual was assessed but not included due to having an MMSE score <27. The sampling was purposive, and recruitment continued until the age groups 50 to 59 years, 60 to 69 years, 70 to 79 years, and ≥80 years each comprised approximately 20 women and 20 men.
For test-retest reliability, 43 individuals were recruited. This sample comprised 21 individuals from the reference sample who agreed to participate in a retesting session. The reliability sample was completed in Åland with an additional 22 participants recruited via purposive sampling to achieve 5 or 6 women and 5 or 6 men in the age groups specified above. The inclusion criteria were the same as for the reference sample. In accordance with the COSMIN checklist, 29 the minimum age of 50 years was set to achieve a representative sample of the main target population intended for the current TUGdt testing (ie, individuals who undergo memory assessment). Five participants were assessed but not included: 2 individuals due to an MMSE score below 27, and another 3 due to health-related issues that arose between the 2 test sessions.

Assessments
Testing sessions for reference values as well as the first visit for reliability testing involved the same assessments as previously used in UDDGait, 11,13-15 that is, report of demographic characteristics, clinical cognitive tests [31][32][33][34] (including MMSE and verbal fluency test), screening for depressive symptoms, 35 TUG, TUGdt NA, TUGdt MB, and motor function tests [36][37][38] (including hand grip strength and 10-m gait speed). All assessments were carried out in Swedish. For reference values, the participants were assessed at 1 session. For test-retest reliability, the participants were assessed at 2 sessions. The interval between test and retest was set at 10 (SD = 4) days, which was considered long enough for the participants' memory of the tasks to have diminished but too short for changes in motor, cognitive, or dt abilities to occur. 29 Most participants carried out the test and retest within the set interval. However, for practical reasons, the interval fell outside the set range for 6 participants (ie, 3, 4, 15, 16, 23, and 26 days between test and retest).
The retest session involved MMSE, TUG, TUGdt NA, TUGdt MB, and the motor function tests. The MMSE was carried out before the dt testing, both to enhance the comparability of the test and retest sessions, and again as an inclusion criterion (score of ≥27). Other test conditions (eg, administration, environment, instructions) were held constant as far as possible. Additionally, to ensure that the participants were stable between the test and the retest session, they were asked to report any incidents or changes in health that could affect their performance. 29 The UDDGait TUGdt testing has been described previously. 11 The dt testing was preceded by the single-task TUG. The TUG test involves measuring the time required for a test person to carry out a movement sequence at a comfortable pace: to rise from a chair, walk 3 m, turn at a marking on the floor, return to the chair, and sit down again. 12 After performing the single-task TUG, the participants carried out 2 different TUGdt tests in the following order: TUGdt NA and TUGdt MB (starting from December). The physical therapist who led the testing gave standardized instructions to the participant before each test, including instructions to complete all tests at their own speed concerning both mobility and verbal performance and to complete the mobility sequence if they did not know what to say. The tests were timed with a stopwatch to an accuracy of 0.01 second and video-recorded from frontal and lateral views.
All TUGdt test variables used in the current analyses were calculated based on time scores and/or the number of words recited (different animals or months in correct, reverse order). The variables animals/10 s and months/10 s were calculated as 10 * (TUGdt number of words/TUGdt time score). The cost variables were calculated as 100×(TUGdt time score − TUGst time score)/TUGst time score.
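The two derived variables defined above can be sketched in a few lines of Python. This is a minimal illustration of the stated formulas only; the function names and the example participant are hypothetical and do not come from the study.

```python
def words_per_10_s(n_words: int, tugdt_time_s: float) -> float:
    """Correct words per 10 s of dual-task time: 10 * (words / time score)."""
    if tugdt_time_s <= 0:
        raise ValueError("TUGdt time score must be positive")
    return 10 * n_words / tugdt_time_s


def dual_task_cost(tugdt_time_s: float, tugst_time_s: float) -> float:
    """Dual-task cost in percent: 100 * (TUGdt - TUGst) / TUGst."""
    if tugst_time_s <= 0:
        raise ValueError("TUGst time score must be positive")
    return 100 * (tugdt_time_s - tugst_time_s) / tugst_time_s


# Hypothetical participant: 9 animals named during a 15.0 s TUGdt NA,
# with a single-task TUG of 10.0 s.
animals_per_10_s = words_per_10_s(9, 15.0)
na_cost = dual_task_cost(15.0, 10.0)
```

Note that the cost is negative whenever the dual-task performance is faster than the single-task performance, which is why the cost variables can take negative values.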

Statistical Analyses
Descriptive data were summarized by means and SDs; frequencies and percentages; medians and interquartile ranges; and minimum and maximum values, as appropriate.
In the current study, the reference values denote the percentile values that provide a 95% range for the healthy controls. By quantile regression, 39 the 2.5th and 97.5th percentiles were estimated to define the range of 95% of the healthy controls in age- and sex-specific groups. The quantile regression method increases the power to detect differences in the upper and lower tails by weighting portions of the sample to generate coefficient estimates. 40 Because a high proportion (73%) of the healthy controls had a university education, a sensitivity analysis was carried out where participants with and without university education were analyzed separately for comparison.
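As a rough illustration of a 95% reference range per group, the sketch below computes the 2.5th and 97.5th percentiles within each age-by-sex cell. It rests on an assumption that is not confirmed by the text: that the quantile regression model contains one indicator per cell and no further covariates, in which case minimizing the quantile ("pinball") loss in each cell yields a within-cell sample quantile. The function names, the interpolation rule, and the data layout are all hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def reference_ranges(records):
    """Map (age_group, sex) -> (2.5th percentile, 97.5th percentile).

    `records` is an iterable of (age_group, sex, value) tuples.
    Assumption: a separate sample quantile per cell stands in for the
    fitted quantile regression.
    """
    groups = defaultdict(list)
    for age_group, sex, value in records:
        groups[(age_group, sex)].append(value)

    ranges = {}
    for key, values in groups.items():
        # n=40 gives 39 cut points at 2.5%, 5%, ..., 97.5%.
        cuts = quantiles(values, n=40, method="inclusive")
        ranges[key] = (cuts[0], cuts[-1])
    return ranges


def pinball_loss(values, q, tau):
    """Quantile-regression objective for a constant prediction q at level tau."""
    return sum((tau - (v < q)) * (v - q) for v in values)
```

For a cell whose values are the integers 1 to 41, `reference_ranges` returns limits of 2 and 40, and `pinball_loss` at `tau=0.025` is smaller at `q=2` than at neighbouring candidates such as 1 or 5, which is the sense in which the percentile is the quantile-regression solution for that cell.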
Single-measurement absolute agreement intraclass correlation coefficients (ICCs) estimated from a 2-way mixed effects (with participant as random factor and time as fixed factor) linear model were used for the test-retest reliability analyses. 41 The variables TUGdt NA and MB time scores as well as TUGdt NA and MB cost were non-normally distributed and were therefore log transformed. Due to the cost variables having negative values, a constant, c = 20, was added to the original values before log transformation. Bootstrap estimation was used for the 95% CIs with 100,000 bootstrap samples by the percentile-t method. 42 The standard error of measurement (SEM) was calculated as a measure of absolute reliability:
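As an illustration of the reliability statistics named above, the following sketch computes a single-measurement, absolute-agreement ICC from a 2-way ANOVA decomposition (the form often written ICC(A,1)), together with SEM and MDC. The SEM and MDC expressions used here, SEM = SD × √(1 − ICC) and MDC95 = 1.96 × √2 × SEM, are common conventions and are an assumption; the study's own expressions are not shown in this text. The function names and the helper for the shifted log transformation are hypothetical, and the bootstrap CIs are not reproduced.

```python
from math import log, sqrt
from statistics import mean


def icc_a1(scores):
    """Single-measurement absolute-agreement ICC for an n x k score table.

    `scores` is a list of n participants, each a list of k measurements
    (k = 2 for test and retest).
    """
    n, k = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between participants
    msc = ss_cols / (k - 1)              # between occasions
    mse = ss_err / ((n - 1) * (k - 1))   # residual
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))


def log_shifted(value, c=20.0):
    """Log transform after adding a constant, as described for the cost variables."""
    return log(value + c)


def sem(sd, icc):
    """Standard error of measurement (assumed convention: SD * sqrt(1 - ICC))."""
    return sd * sqrt(1 - icc)


def mdc95(sem_value):
    """Minimal detectable change at 95% confidence (assumed: 1.96 * sqrt(2) * SEM)."""
    return 1.96 * sqrt(2) * sem_value
```

Because the ICC is of the absolute-agreement type, a constant shift between test and retest lowers the estimate: the table `[[1, 2], [2, 3], [3, 4]]` gives 2/3, whereas identical test and retest scores give 1.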

Role of the Funding Source
The funding source had no role in the design or conduct of this study. With regard to the reporting of the study, the funders required Open Access publication.

Results

Reference Sample: Participants
An overview of participants' characteristics and test results for the reference sample is shown in Table 1. A total of 166 participants (age range = 50-91 years, 51% women) were included. The majority of the sample was married or cohabiting (71%), and 73% had a university education. These proportions were higher in the younger age groups. A sensitivity analysis found that education level did not substantially affect the TUGdt reference values. Four participants (3 men, 1 woman), equally distributed across age groups, had depressive symptoms.

Reference Values
In Table 2, reference values are presented for the 2.5th and 97.5th percentiles, where the upper and lower limits provide values with which an individual's performance can be compared. For variables for which high values represent poorer performance (time scores and cost measures), the 97.5th percentile is clinically most relevant for comparisons and is therefore highlighted in the table. Conversely, for variables for which low values represent poorer performance (number of animals or months, and number of animals or months/10 s), the 2.5th percentile is highlighted (Tab. 2). Among both women and men, the reference ranges appeared to vary with age (Tab. 2).

Table 1 (partial)
Married or cohabiting, no. (%): 118 (71) 18 (90) 19 (100) 15 (65) 17 (85) 10 (46) 17 (77) 6 (30) 16 (80)
University education, no. (%): 121 (73) 17 (85) 15 (79) 17 (74) 17 (85) 18 (82) 16 (73) 8 (40) 13 (
a IQR = interquartile range; MB = months backward; NA = naming animals; TUGdt = Timed "Up & Go" dual-task; TUGst = Timed "Up & Go" single-task.
b Depressive symptoms were defined as 2 points or more on the 4-item Geriatric Depression Scale.
c The verbal fluency test involved naming as many different animals as possible in 60 s while in a sitting position.
d Habitual gait speed was measured from a static start in a 10-m corridor.

Reliability Sample: Participants
An overview of the participants' characteristics and test results for the test-retest reliability sample is shown in Table 3. A total of 43 participants (age range = 50-89 years, 51% women) were included, with approximately equal numbers of participants and proportions of men and women in each of the 4 age groups. A majority was married or cohabiting (77%) and had a university education (51%). None of the participants had depressive symptoms.

Test-Retest Reliability
The […] (Figure, plot A). Because the CI covered zero, the difference was not statistically significant. Likewise, for TUGdt MB months/10 s (Figure, plot D), the median difference […] Outliers were identified in all plots (A: n = 2, B: n = 2, C: n = 1, D: n = 2). These participants' video recordings were studied, and explanations of the differences between the test occasions were found. In the TUGdt NA test, the 3 participants behaved differently on 1 test occasion compared with the other, either by starting to laugh at their own verbal performance (n = 2) or by accidentally pushing the chair before sitting down and then correcting its position before finalizing the test (n = 1).
Regarding the TUGdt MB test, the differences between test occasions were not as apparent. Either the participant made a verbal mistake (recited an incorrect month, immediately noticed the mistake and got confused) (n = 2), was generally more hesitant when reciting the months (n = 1), or walked slower (n = 1). Plots concerning the variables TUGdt cost and number of animals and months are not presented here because these did not show any distinct pattern or entail systematic error: TUGdt NA cost (median difference = −1.30, 95% CI = −4.87 to 4.54), TUGdt MB cost (median difference = 1.19, 95% CI = −2.09 to 6.70), TUGdt NA number of animals (median difference = 0.00, 95% CI = −1.00 to 0.00), and TUGdt MB number of months (median difference = 0.00, 95% CI = −1.00 to 1.00).
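The quantities behind a Bland-Altman plot can be sketched as follows. This is a generic illustration, not the study's analysis: the direction of the difference (retest minus test) and the classical limits of agreement (mean difference ± 1.96 SD) are assumptions, the bootstrap CIs for the median difference are not reproduced, and the function name is hypothetical.

```python
from statistics import mean, median, stdev


def bland_altman(test, retest):
    """Return plot coordinates and summary statistics for paired scores.

    Each point is (mean of the pair, retest - test). The median difference
    is included because the results above report medians; the limits of
    agreement follow the classical mean +/- 1.96 SD convention.
    """
    diffs = [b - a for a, b in zip(test, retest)]
    avgs = [(a + b) / 2 for a, b in zip(test, retest)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return {
        "points": list(zip(avgs, diffs)),
        "mean_diff": bias,
        "median_diff": median(diffs),
        "limits_of_agreement": (bias - 1.96 * sd, bias + 1.96 * sd),
    }


# Hypothetical time scores (s) for 4 participants at test and retest.
summary = bland_altman([10.0, 12.0, 14.0, 16.0], [11.0, 12.0, 15.0, 16.0])
```

A systematic difference between occasions shows up as a mean or median difference whose CI excludes zero, whereas outliers appear as points far outside the limits of agreement.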

Discussion
The current study establishes reference values for TUGdt NA and MB test variables in age- and sex-specific groups, where the upper and lower limits indicate the range of variability in cognitively healthy controls. The test-retest ICC estimates showed that the reliability of TUGdt NA and MB time scores was excellent. The variables "number of animals" and "animals/10 s" as well as "months/10 s" showed fair to good levels of reliability, whereas the ICCs of both cost measures and "number of months" were poor. 45 The reference sample's test results of functional mobility and verbal fluency differed from previously reported reference values. The participants in the current study performed single-task TUG slightly slower across all age groups compared with previous research. [46][47][48] By contrast, the median score for the verbal fluency test was 23.5 in our study, whereas a previous Swedish study found mean scores of 17.8 (SD = 5.7) for individuals without tertiary education and 20.6 (SD = 5.7) for individuals with tertiary education. 32 The latter discrepancy may be due to a relatively high educational level in our sample.
The current reference values for TUGdt test variables suggested declining dt ability with age, which is in accordance with previous research. 7 However, the lower limits for the number of animals and months were both unchanged across age groups among women. That is, regardless of age group, 97.5% of women named at least 6 animals and 4 months in correct reverse order during TUGdt, whereas their corresponding time scores increased with age. When using reference values, it is important to consider that they indicate how to interpret TUGdt test results in relation to the performance of cognitively healthy individuals. Reference values should not be seen as cutoffs that define a "normal" performance: by construction, 2.5% of cognitively healthy individuals score beyond each limit, so a result poorer than the reference value may still reflect random variability rather than cognitive impairment. A deviating test result is therefore a reason for further investigation of the individual.
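The comparison of an individual's result with a reference limit can be expressed as a small helper. This is only a sketch of the interpretation described above: the function name is hypothetical, the lower limits of 6 animals and 4 months are those stated for women in the text, and the time-score limit in the last line is an invented placeholder rather than a value from Table 2.

```python
def outside_reference(value, limit, higher_is_worse):
    """True if a result is poorer than the relevant reference limit.

    For time scores and costs (higher is worse), compare with the 97.5th
    percentile; for word counts and words/10 s (lower is worse), compare
    with the 2.5th percentile. A True result flags a deviation that merits
    further investigation; it is not a diagnostic cutoff.
    """
    return value > limit if higher_is_worse else value < limit


# Lower limits for women stated in the text: 6 animals, 4 months.
flag_animals = outside_reference(5, 6, higher_is_worse=False)   # below the limit
flag_months = outside_reference(4, 4, higher_is_worse=False)    # at the limit
# Hypothetical 97.5th percentile of 18.0 s for a time score (placeholder value).
flag_time = outside_reference(20.0, 18.0, higher_is_worse=True)
```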
The excellent reliability indicated by the ICC estimates of TUGdt NA and MB time scores is in line with previous research on time scores or gait speed during different types of dt tests. [18][19][20][21][22] Compared with time scores, the number of animals named as well as the number of months recited during TUGdt had lower reliability, a finding that is also consistent with previous research. 19,22 A possible explanation as to why the time scores are more reliable than the verbal outcomes is that the mobility task is more automatized than the verbal tasks. Moreover, attention is a multifaceted cognitive construct, which may make cognitive performance unstable and more variable. 22,49 This may have affected all variables that involved verbal performance. For the variables animals/10 s and months/10 s, the ICC estimates showed a fair to good level of reliability. These relatively low ICC estimates may be explained by a possible inflation in systematic error due to combining 2 variables. 19 For a variable constructed as a ratio of 2 other variables, the reliability is lower for the ratio than for the numerator and the denominator separately, when these are positively correlated and their measurement errors are not correlated. 50 For example, even though months/10 s is based on original single-task tests with high to excellent test-retest reliability (TUG 20,23,24 and the months backward test 27,28 ), it has relatively low reliability, which may result from both variations in attention and the merging of 2 variables. For the TUGdt cost measures, ICC estimates were poor, which is in accordance with previous research. 22 Because these variables are also ratios, inflation in systematic error will once again be a factor in the estimates. 
However, the poor reliability for the cost measures was confirmed by SEM estimates, which suggested a wide within-participant variability, and by MDC, which showed that large differences between assessments could be expected by chance (24.06 for TUGdt NA cost and 30.66 for TUGdt MB cost).
The estimated ICCs included both random variability and putative systematic differences. The Bland-Altman plots of TUGdt MB time score and TUGdt NA animals/10 s both revealed systematic error. However, because the differences were numerically small and no other TUGdt variables showed significant differences between measurements, these deviations may be regarded as random error. Furthermore, 7 outliers were identified in the Bland-Altman plots of TUGdt NA and MB time scores, animals/10 s, and months/10 s. These participants behaved differently on 1 test occasion compared with the other. For the TUGdt NA test, the reasons for this were either laughter or the chair being accidentally pushed before sitting down. Random errors of this kind during testing could be reduced by redoing the test when they occur. For the clinical use of TUGdt tests, an additional measurement could be considered when there are apparent disruptions of the performance. For TUGdt MB, distinct reasons for the outliers were not easy to identify because the verbal mistakes, hesitations, and slow gait speed producing outlying scores are all possible effects of dt interference.
In our ongoing UDDGait project, the target population consists of individuals with subjective and/or objective cognitive decline. It is not certain that the results of this study can be generalized to that population. Other research has shown excellent levels of reliability for time scores and step parameters during dt testing among individuals with mild cognitive impairment or dementia. [51][52][53] However, the reliability of dt test outcomes may be affected negatively by cognitive impairment. 54 As previously noted, the most robust TUGdt test outcome for predicting dementia incidence among individuals with mild or subjective cognitive impairment found in a previous UDDGait study was animals/10 s. 15 In the current study, the MDC for animals/10 s showed that a change of at least 4 animals/10 s was required to distinguish a true change in performance. This indicates that animals/10 s is not useful for, as an example, evaluating an exercise intervention. However, our main aim with the UDDGait TUGdt tests is prediction of conversion to dementia by a 1-time assessment. Given that animals/10 s has previously demonstrated good predictive capacity, despite the relatively low ICC estimates reported here, there is strong evidence that animals/10 s has the potential to be a useful dt test variable, even with the risk of regression dilution bias. 55 This illustrates that interpretation of ICC estimates should not be considered as absolute evidence of a test's usefulness but should include consideration of the clinical relevance of the results, 56 in this case the variable's predictive capacity.
This study has some limitations. The reference sample was recruited in a university city, and their relatively high educational level could have affected the results. However, a sensitivity analysis found negligible differences related to educational level. Also, generalizing our findings to populations who do not speak Swedish should be done with caution. The number of participants in our reliability sample was less than that recommended by COSMIN, but a sample of 43 is well above the minimal acceptable number of 30 participants. 57 Another possible study limitation was that due to practical reasons, the number of days between test and retest was outside the set interval for some participants. However, our sensitivity analysis found no evidence that such deviations affected the results.
Our study also has several strengths, including the choice of statistical methods. The reference values were calculated by quantile regression, which is useful for understanding an outcome at its various quantiles and enables clinical comparisons. 58 Additionally, for test-retest reliability, ICC, SEM, and MDC were used, supplemented by Bland-Altman plots, which are recommended methods for this purpose. 59,60 Other strengths are related to the reliable TUGdt testing procedures, including standardized instructions and the recording of the verbal performance, which enabled a subsequent validation of the verbal results. 11 Additionally, requirements for reliability studies, 29 such as independent administrations, an appropriate time interval between test and retest, control of participants' physical changes between test and retest, and similar test conditions, were all met.
In summary, we have established age- and sex-specific normative reference values for the UDDGait TUGdt test variables, potentially useful for clinical as well as research purposes. Our reliability analyses indicated excellent reliability for TUGdt time scores, with poor to good reliability for the other test variables. The reliability of the combined variables may have been affected negatively due to being constructed as ratios. Despite the reliability of animals/10 s being only fair to good, previous results regarding the variable's predictive capacity suggest a potential usefulness for this test outcome. Ongoing and future UDDGait studies will examine the potential of TUGdt testing by establishing the test-retest reliability of TUGdt among individuals with cognitive impairment as well as continuing to investigate the predictive capacity of TUGdt testing over time.