Reliability of the Frailty Index Among Community-Dwelling Older Adults

Abstract
Background: Consistent and reproducible estimates of the underlying true level of frailty are essential for risk stratification and monitoring of health changes. The purpose of this study is to examine the reliability of the frailty index (FI).
Methods: A total of 426 community-dwelling older adults from the FRequent health Assessment In Later life (FRAIL70+) study in Austria were interviewed biweekly up to 7 times. Two versions of the FI, one with 49 deficits (baseline) and another with 44 (follow-up), were created. Internal consistency was assessed using confirmatory factor analysis and coefficient omega. Test–retest reliability was assessed with Pearson correlation coefficients and the intraclass correlation coefficient. Measurement error was assessed with the standard error of measurement, limits of agreement, and smallest detectable change.
Results: Participants (64.6% women) were on average 77.2 (±5.4) years old, with a mean FI49 at baseline of 0.19 (±0.14). Internal consistency (coefficient omega) was 0.81. Correlations between biweekly FI44 assessments ranged between 0.86 and 0.94, and reliability (intraclass correlation coefficient) was 0.88. The standard error of measurement was 0.05, and the smallest detectable change and upper limits of agreement were 0.13; the latter is larger than previously reported minimal clinically meaningful changes.
Conclusions: Both internal consistency and reliability of the FI were good, that is, the FI differentiates well between community-dwelling older adults, which is an important requirement for risk stratification for both group-level oriented research and patient-level clinical purposes. Measurement error, however, was large, suggesting that individual health deteriorations or improvements cannot be reliably detected for FI changes smaller than 0.13.

Frailty describes a state of increased vulnerability to stressors resulting from a cumulative decline in multiple physiological systems among older adults (1). Against the background of population aging and increased frailty prevalence in more recent birth cohorts (2,3), the importance of frailty for both public health and clinical practice (4) is expected to increase in the coming years. The frailty index (FI) (5), 1 of the 2 dominant conceptualizations of frailty, is based on the accumulation of a large number of age-related health deficits and consistently predicts negative health outcomes such as mortality among older adults (6). FIs based on routine administrative and health record data have been developed in recent years as low-cost and wide-coverage tools to screen for frailty in order to identify those older adults with the highest risk for adverse outcomes (7-11). In addition to risk stratification based on one-time assessments, the FI is also discussed for monitoring health changes in older adults (12-18).
Both risk stratification based on single assessments and the evaluation of health changes require that the degree of frailty in an older person, a latent quality that is difficult to observe directly, is measured reliably. Reliability can be defined as the extent to which an instrument yields consistent and reproducible estimates of the underlying true score (p135) (19). Multiple systematic reviews (20-23) note that, compared to construct and criterion validity, the reliability of frailty tools has received fairly little attention. However, it is only when we are sure that an instrument measures something in the same way every time we deploy it (=reliability), that we can truly ascertain that it is measuring the right thing (=validity) (24). The COSMIN consensus (25) holds that the domain of reliability consists of 3 different measurement properties: (1) internal consistency, (2) reliability, and (3) measurement error. (1) Internal consistency refers to the degree to which multiple indicators share a common variance due to the underlying construct of frailty, assessed by coefficient alpha or omega (26). (2) Test-retest reliability is the extent to which the relative position of an individual is consistent across multiple time points (24), expressed for example with Pearson's correlation coefficient or the intraclass correlation coefficient (ICC), and is relevant for discrimination between individuals (27), that is, when the FI is used as a tool for risk stratification. (3) Measurement error, finally, is relevant for frailty monitoring, that is, to differentiate "real" frailty changes from error or "noise," and can be assessed with the standard error of measurement (SEM) (27). To date, only 2 studies (28,29) provide estimates of the reliability of the standard clinical FI (30). Based on a large cross-national sample of community-dwelling older adults and confirmatory factor analysis (CFA), Mayerl and colleagues (28) reported internal consistency (omega) of 0.89-0.93. Based on 80 stable hospital patients over 3 months, Feenstra et al. (29) reported a test-retest reliability (ICC) of 0.84 and a measurement error (SEM) of 0.06. Although these first studies suggest the FI to be reliable, more evidence is needed against the background of the current and intended future use of the FI in both research and clinical practice.
The Journals of Gerontology, Series A: Biological Sciences and Medical Sciences, 2024, Vol. 79, No. 2
Here, we use intensive longitudinal data from a nationwide sample of older adults in Austria to provide new evidence on internal consistency, test-retest reliability, and measurement error of the FI among community-dwelling older adults. In this way, we assess the FI's psychometric properties for risk stratification and monitoring in the context of both group-level research questions and individual-level clinical purposes.

Data
Longitudinal data came from the FRequent health Assessment In Later life (FRAIL70+) study. At the behest of the first author, a professional survey agency collected information on health deficits among a nationwide sample of community-dwelling older adults aged 70 years and above in Austria. In total, 971 older adults were contacted based on previous participation in population-representative studies, of which 426 individuals agreed to participate (response rate = 44%; Supplementary Methods 1). Before participation, interviewers described the topic, length, and required information of the study, ensured anonymity of all personal data, and obtained written consent for participation. Between September 2021 and January 2022, participants were interviewed every 2 weeks (mean duration between interviews = 14.7 ± 2.3 days) up to 7 times (mean number of interviews per person = 6.8 ± 0.7), resulting in a total number of 2 892 repeated interviews over a mean period of 84.2 ± 17.0 days (Supplementary Figure 1). The first interview was always an in-person interview conducted in the older adult's home and included physical performance tests. Six shorter follow-up interviews were conducted via telephone, except for a subsample of 40 older adults, with whom all interviews were conducted in person to obtain repeated physical performance measures and to compare survey modes. This study was approved by the Ethics Committee of the Medical University of Graz (EK-number: 33-243 ex 20/21).

Variables
Using baseline data, a frailty index (FI49) was calculated from 49 health deficits including self-reported information as well as physical and cognitive performance tests following standard protocol (30). This FI49 was used to assess internal consistency. Furthermore, a highly similar second index (FI44), based on the subset of 44 health deficits that were measured repeatedly, was created to assess test-retest reliability and measurement error. For both FIs, the selected health deficits reflected multiple physiological systems and included chronic diseases, limitations in basic and instrumental activities of daily living (ADLs, IADLs), mobility restrictions, somatic symptoms, depressed affect, sensory impairments, physical inactivity, self-rated health, and memory problems (Supplementary Table 1). Self-reported health deficits generally referred to problems or difficulties during the last 2 weeks. All health deficits had less than 2% missing values. The FI score was calculated for all participants by dividing the sum of the health deficit scores by the total number of health deficits measured, for example, 10/44 = 0.23. A common cut-off value to differentiate between nonfrail and frail older adults is 0.20 (30).
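The FI calculation described above can be sketched as follows. This is a minimal Python illustration, not the authors' R code: the function names, the use of `None` for an unmeasured deficit, and the strict inequality at the cut-off are assumptions made for the example.

```python
from typing import Optional, Sequence

FRAILTY_CUTOFF = 0.20  # common cut-off cited in the text (30)


def frailty_index(deficit_scores: Sequence[Optional[float]]) -> float:
    """Return the sum of deficit scores divided by the number of deficits measured.

    Each score lies between 0 (deficit absent) and 1 (deficit fully present).
    None marks a deficit that was not measured (an assumption of this sketch);
    such entries are left out of both the sum and the denominator.
    """
    measured = [score for score in deficit_scores if score is not None]
    if not measured:
        raise ValueError("at least one deficit must be measured")
    if any(not 0.0 <= score <= 1.0 for score in measured):
        raise ValueError("deficit scores must lie between 0 and 1")
    return sum(measured) / len(measured)


def is_frail(fi: float, cutoff: float = FRAILTY_CUTOFF) -> bool:
    """Classify as frail when the FI exceeds the cut-off (strictness is assumed)."""
    return fi > cutoff


# Hypothetical participant: 10 of 44 measured deficits fully present.
example_fi = frailty_index([1.0] * 10 + [0.0] * 34)
```

For this hypothetical participant, `example_fi` reproduces the 10/44 example from the text and falls above the 0.20 cut-off.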
Sociodemographic variables included sex (male/female), chronological age (years), and level of completed education (low = compulsory education, medium = vocational training, and high = high school or higher). Time since baseline was measured in days. As a negative health outcome, we included 1-year mortality, which was ascertained by proxy interviews or by contacting the local municipality. Information on vital status 1 year after participation was 99.5% complete.

Statistical Analysis
First, we calculated and plotted descriptive statistics for the baseline FI49 and the longitudinal FI44. Second, we assessed internal consistency. Internal consistency only applies as a measure of reliability if the multi-item construct in question follows a reflective measurement model, which is linked to criteria (31) such as the direction of causality between construct and indicators, and the interchangeability of and covariation between indicators. In Supplementary Table 2, we outline why we consider the FI to follow a reflective rather than a formative model. Next, as detailed in Supplementary Methods 2, we used polychoric correlations and CFA to test the unidimensionality of the FI prior to calculating internal consistency (coefficient omega). Here, we followed the quality criterion that internal consistency should be greater than 0.80 for population-level research aiming at group comparisons, and greater than 0.90 when individual-level decisions are to be made based on the instrument (p265) (32). Third, we assessed test-retest reliability and measurement error based on the repeated measurements 14 days apart (33,34), a period in which we would not expect substantive frailty changes among community-dwelling older adults; at the same time, memory and learning effects should be limited. For test-retest reliability, as detailed in Supplementary Methods 3, we calculated Pearson correlation coefficients and ICC, and for measurement error, we calculated SEM, limits of agreement (LOA), and smallest detectable change (SDC), all based on the 7 repeated FI44 assessments. Here, we followed the quality criterion of an ICC of 0.75-0.90 indicating good reliability, with values above 0.90 being considered excellent (35). For measurement error, clinically meaningful changes (CMC), that is, differences in continuous measures large enough to be considered important, for example, by clinicians or older adults themselves, should be larger than the SDC and lie outside the LOA (34). Previous work has suggested CMCs for the FI among community-dwelling older adults of 0.06/0.08 (36) and 0.04/0.06 (37).
All data preparation, calculations, and statistical tests were done with R (v4.3.0) and are documented in the R-Markdown code file available online: https://osf.io/qvek2/.

Sample Characteristics and Descriptive Statistics
Of the 426 participants at baseline, 64.6% were women, and the mean age was 77.3 (±5.4, range = 70-96) years. Low education was reported by 19.3%, medium by 54.2%, and high by 26.5%. The mean (SD) and median (interquartile range, IQR) of the FI49 were 0.19 (±0.14) and 0.14 (±0.16), respectively. The empirical submaximum (99th percentile) was 0.63. The prevalence of specific health deficits at baseline is shown in Supplementary Table 1. The FI49 exhibited a right-skewed distribution, with higher values among women than men (Figure 1A), and a positive relationship with age, with a steeper slope for women than men (Figure 1B). Older adults with a low level of education had higher mean FI49 values at baseline (0.25 ± 0.16) compared to those who had completed vocational training (0.19 ± 0.14), who in turn were frailer than those who had completed upper secondary or higher education (0.13 ± 0.08; Figure 1C). Participants who died during 1-year follow-up (n = 11, 2.6%) had a substantively higher median FI49 (0.47 ± 0.20) compared to those who survived (0.18 ± 0.14; Figure 1D). Based on logistic regression analysis adjusted for age, the odds of death were 11% higher (OR = 1.11, 95% CI = 1.07-1.17) per 0.01 FI points.
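To make the odds ratio per 0.01 FI points easier to interpret, it can be rescaled to a larger FI difference. The sketch below is an illustration only: it relies on the logistic model's assumption that the log-odds are linear in the FI, the function name is hypothetical, and because the reported OR is rounded, the rescaled value is approximate.

```python
import math


def rescale_odds_ratio(or_per_unit: float, unit: float, new_unit: float) -> float:
    """Convert an odds ratio per `unit` of a predictor into one per `new_unit`.

    Under a logistic model the log-odds are linear in the predictor, so the
    log odds ratio scales proportionally with the size of the change.
    """
    if or_per_unit <= 0 or unit <= 0:
        raise ValueError("odds ratio and unit must be positive")
    return math.exp(math.log(or_per_unit) * new_unit / unit)


# Reported OR of 1.11 per 0.01 FI points, rescaled to a difference of 0.10.
or_per_tenth = rescale_odds_ratio(1.11, unit=0.01, new_unit=0.10)
```

Under these assumptions, a difference of 0.10 FI points corresponds to roughly 2.8 times higher odds of death within 1 year.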
The correlation between the FI49 and FI44 at baseline was 0.99 (95% CI = 0.99-0.99). Descriptive statistics of the longitudinal FI44 for each assessment (Table 1) showed little change in the average frailty level across biweekly assessments. We also found no evidence of a linear change in the overall level of frailty across the 3 months (Figure 1E). There were, however, considerable within-person instabilities or fluctuations, particularly in the higher FI regions; these are readily seen in Figure 2, where repeated FI44 assessments (points) for each person (lines) are ordered by their mean FI44 level. Finally, we found that both mean FI44 change and individual FI44 fluctuations were similar in both interview modes (Supplementary Figure 2).

Internal Consistency
The mean overall polychoric correlation among health deficits was 0.29 (±0.18), which is adequate (38) for a broad construct such as the FI (Supplementary Figure 3). The mean polyserial correlation between FI49 and its health deficits was 0.50 (±0.17), which again meets the criteria for scale construction (p93) (33). The highest correlation coefficients were observed for poor self-reported health as well as ADL, IADL, and mobility impairments including slow gait speed (range = 0.60-0.70), whereas lower associations were found for chronic diseases (range = 0.20-0.30) (Supplementary Table 3).
Next, we tested whether the FI can be assumed to be a unidimensional measure. We compared a unidimensional single-factor model, a multidimensional correlated-factor/first-order model of 3 separate domains (physical, cognitive, and mental health) without a superstructure, and a bifactor model that retains a general factor of frailty as well as remaining subdomain variance. This comparison (Supplementary Table 3) showed the bifactor model to fit best (χ² = 1 409, df = 1 121, p < .001, CFI = 0.97, TLI = 0.97, RMSEA = 0.02, SRMR = 0.107). In addition, the factor loadings of the unidimensional model and of the general factor of the bifactor model were closely correlated (r = 0.96), and 87% of the reliable variance (omega of the general factor in the bifactor model divided by omega of the 1-factor model, 0.81/0.93 = 87%) in the health deficits was due to the general factor, which suggests that the FI is unidimensional enough for practical purposes. Internal consistency reliability for the general factor depicting overall frailty, as measured by coefficient omega (26), was 0.81, which is good. More detailed results from the bifactor CFA model (Supplementary Table 4) also show how well specific health deficits reflected the overall frailty level. The highest factor loadings were observed for ADLs (eg, using the toilet = 0.88), IADLs (eg, preparing a warm meal = 0.87), self-rated health (0.83), and polypharmacy (0.83). More moderate loadings were observed for bedrest (0.69), tiredness (0.64), physical inactivity (0.62), poor appetite (0.54), and attention (0.47) and memory (0.41) problems. Finally, chronic diseases had, except for arthritis (0.44) and dementia (0.47), notably lower loadings between 0.20 and 0.30, the lowest being cancer (0.14).

Reliability
The Pearson correlation coefficients between adjacent FI assessments (Figure 3) showed a strong association, ranging between 0.86 and 0.89 among the first 4 assessments, and reaching 0.94 and 0.91 between the last 3 assessments. Nonetheless, using 0.20 as a cut-off for frailty (dashed lines) showed that 11%-18% of participants would be classified inconsistently, that is, one time as frail and the other time as nonfrail, across assessments only 14 days apart.
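The share of inconsistently classified participants between two adjacent assessments can be computed as sketched below. The FI values are hypothetical and serve only to illustrate the calculation; the function name and the strict inequality at the cut-off are assumptions.

```python
from typing import Sequence


def share_reclassified(
    first: Sequence[float], second: Sequence[float], cutoff: float = 0.20
) -> float:
    """Share of participants whose frail/nonfrail status differs between two assessments."""
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("both assessments need the same, nonzero number of participants")
    # A participant counts as reclassified when exactly one of the two FI
    # values lies above the cut-off.
    flips = sum((a > cutoff) != (b > cutoff) for a, b in zip(first, second))
    return flips / len(first)


# Hypothetical FI44 values for 5 participants at two assessments 14 days apart.
wave_1 = [0.10, 0.18, 0.22, 0.35, 0.19]
wave_2 = [0.12, 0.23, 0.18, 0.33, 0.16]
reclassified = share_reclassified(wave_1, wave_2)
```

In this toy example, the second and third participants cross the cut-off in opposite directions, although their FI values change by only 0.04 to 0.05.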
Results from the linear mixed regression model showed that the largest part of the total FI variance was due to between-person differences (σ²_i = 0.125), followed by the error variance (σ²_residual = 0.05), whereas there was no systematic variation across waves (σ²_j = 0.004). The ICC was 0.88 (95% CI = 0.86-0.90), which can be considered very good.
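The relation between these components and the ICC can be sketched as follows. Two assumptions are made here and are not confirmed by the text: first, that the ICC is the between-person variance divided by the sum of all three variance components; second, that the three reported values are on the standard deviation scale. Under the second reading, and with the residual term taken at 0.046 (the SEM reported in the Measurement Error section), the ratio is about 0.88, which matches the reported ICC; read as variances, the same values would give about 0.70.

```python
def icc_from_components(sd_between: float, sd_wave: float, sd_residual: float) -> float:
    """ICC as between-person variance over total variance.

    Inputs are standard deviations (an assumption of this sketch) and are
    squared to obtain the variance components.
    """
    var_between = sd_between ** 2
    var_wave = sd_wave ** 2
    var_residual = sd_residual ** 2
    return var_between / (var_between + var_wave + var_residual)


# Reported components, read as standard deviations; residual taken as 0.046.
icc = icc_from_components(sd_between=0.125, sd_wave=0.004, sd_residual=0.046)
```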

Measurement Error
The SEM was 0.046 (95% CI = 0.045-0.047) and the SDC was 0.127 (95% CI = 0.125-0.130). The latter value means that an FI change of at least 0.13 needs to occur to be (95%) confident that this change is real and not just due to the measurement error of the instrument. These results were also reflected in the Bland-Altman plots (Figure 4) between adjacent FI assessments. There was no indication of systematic bias, and the larger of the two LOA, which together encompass 95% of the paired observations, ranged between 0.09 and 0.13 across waves. Anchor-/distribution-based CMCs provided in the literature (36,37) for community-dwelling older adults (0.06/0.08 and 0.04/0.06, respectively) were clearly smaller than the SDC and lay within the LOA in our study, which means that such FI changes (0.06, for example, equates to 2.6 deficits) cannot be reliably differentiated from measurement error. Only changes larger than 0.13 (or 5.7 deficits) in individuals can be confidently interpreted as real changes.
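The link between SEM, SDC, LOA, and the number of deficits can be illustrated with the commonly used formulas SDC = 1.96 × √2 × SEM and LOA = mean difference ± 1.96 × SD of the paired differences. Whether these are exactly the formulas applied in Supplementary Methods 3 is an assumption; the function names and the paired FI values are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

Z_95 = 1.96  # standard normal quantile for 95% confidence


def smallest_detectable_change(sem: float, z: float = Z_95) -> float:
    """SDC for an individual: z * sqrt(2) * SEM (two measurements, each with error)."""
    return z * math.sqrt(2) * sem


def limits_of_agreement(
    first: Sequence[float], second: Sequence[float], z: float = Z_95
) -> Tuple[float, float]:
    """Bland-Altman limits: mean paired difference plus/minus z times its SD."""
    diffs = [b - a for a, b in zip(first, second)]
    bias = mean(diffs)
    spread = stdev(diffs)  # sample SD; needs at least 2 pairs
    return bias - z * spread, bias + z * spread


def fi_change_in_deficits(fi_change: float, n_deficits: int = 44) -> float:
    """Translate an FI change into the equivalent number of deficits."""
    return fi_change * n_deficits


sdc = smallest_detectable_change(0.046)  # SEM as reported above
lower, upper = limits_of_agreement([0.10, 0.20, 0.30], [0.12, 0.18, 0.33])
```

With the reported SEM of 0.046, this formula gives an SDC close to the reported 0.127, and `fi_change_in_deficits(0.13)` gives the roughly 5.7 deficits mentioned in the text.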

Discussion
In this study, we found internal consistency and test-retest reliability of the FI to be good and very good, respectively. This means that the standard clinical FI under study was able to differentiate well between groups and individuals of community-dwelling older adults, which is an important requirement for risk stratification. The measurement error, however, was relatively large, so only changes above 0.13 in the FI instrument can be safely interpreted as real improvements or deteriorations among individuals. At higher degrees of frailty, differentiating between older adults was easier due to the larger differences between them, while evaluating their health changes was more difficult, as larger health changes were necessary to differentiate genuine health deterioration or improvement from the noise given their high(er) short-term within-person variability. It should be intuitive that what counts as a meaningful change needs to be standardized relative to where it occurs on the scale, reflecting that, as with many age-related attributes, variability increases with the degree of frailty.
The first measure of reliability we assessed was internal consistency, which assumes a reflective measurement model (31), which in turn depends, among other factors, on the exchangeability of indicators. In contrast to other frailty instruments, particularly phenotypic frailty (39), which is defined by 5 specific indicators (weight loss, exhaustion, weakness, slow gait, and low physical activity), the health deficits of the FI can be seen as manifestations rather than defining characteristics, and are hence in principle exchangeable (30). Another indicator of a reflective measurement model is positive correlations among indicators and between indicators and the overall scale; we found evidence for both. Using CFA, we tested the unidimensionality of the FI before assessing internal consistency. We found the FI to be essentially unidimensional, although future studies should psychometrically vet the choice of health deficits for the construction of clinical FIs more thoroughly, for example using item response models, to ensure that the best set of indicators for overall frailty is put to use in both research and practice (40). Here, we found that health deficits loaded differentially on the single underlying factor frailty: poor self-reported health and restrictions in ADLs and IADLs as well as mobility reflected overall frailty best, a finding that is also supported by network analyses of the FI, where these deficits are found to integrate many systems (41,42). With the exception of dementia and arthritis, many chronic diseases, on the other hand, contributed notably less to overall frailty, particularly cancer.
Despite generally limited evidence on the reliability of frailty instruments (20-23), a few studies offer interesting points for comparison. First, our findings on internal consistency are highly similar to those from Mayerl et al. (28) with regard to the 1-factor model. In the final bifactor model where we adjusted for the multidomain nature of the FI, we still found a good level of internal consistency (0.81). Among other frailty instruments, internal consistency tends to be smaller, for example, 0.62 in the Edmonton Frail Scale (43) or 0.66-0.80 in the Tilburg Frailty Indicator (44). This is likely due to the often fewer indicators considered in these tools, as coefficient alpha and omega are not only a function of the interrelatedness of the indicators, but also their number. Indeed, Nguyen et al. (45) showed in a simulation study comparing various FI configurations that the reliability of the FI is associated with the number of health deficits considered, ranging from ICC = 0.19 with just 5 health deficits up to ICC = 0.84 with 45.
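The dependence of coefficient omega on the number of indicators can be illustrated with the textbook single-factor formula, ω = (Σλ)² / ((Σλ)² + Σ(1 − λ²)), which assumes standardized loadings and uncorrelated residuals. This is a simplified sketch and not the bifactor computation used in this study; the equal loadings of 0.5 are hypothetical.

```python
from typing import Sequence


def omega_single_factor(loadings: Sequence[float]) -> float:
    """Coefficient omega for a single-factor model with standardized loadings.

    Common variance is the squared sum of loadings; each indicator's unique
    variance is 1 minus its squared loading (residuals assumed uncorrelated).
    """
    if not loadings:
        raise ValueError("at least one loading is required")
    common = sum(loadings) ** 2
    unique = sum(1.0 - lam * lam for lam in loadings)
    return common / (common + unique)


# Same hypothetical loading of 0.5 per indicator, different numbers of indicators.
omega_5_items = omega_single_factor([0.5] * 5)
omega_45_items = omega_single_factor([0.5] * 45)
```

With identical loadings, omega rises from about 0.63 with 5 indicators to about 0.94 with 45, which mirrors the argument that instruments with fewer indicators tend to show lower internal consistency.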
Test-retest reliability of the FI over multiple 14-day periods was ICC = 0.88, which is slightly above the results reported for stable hospital patients over 3 months (ICC = 0.84/0.85) (29). Our estimate also compares favorably with the range of test-retest ICCs reported for other frailty instruments, for example, 0.65/0.77 for phenotypic frailty over 3 months (29), 0.71 for the FRAIL scale over 7-15 days (46), and 0.88 for the Tilburg Frailty Indicator over 10-25 days (47). In sum, test-retest reliability as well as internal consistency of the FI were good, and hence the FI can be considered a high-quality instrument for risk stratification among older adults. The good reliability of the FI means that it lends itself well to the assessment of group-level differences in research (for example, to identify risk factors) and in population-health management (for example, to implement prevention programs to halt or decrease health deterioration among particularly vulnerable older adults) (4). Given the high level of test-retest reliability, the FI likely can also be employed as a tool to inform individual-level clinical decision-making (48), that is, tailoring interventions to the frailty level, for example by avoiding aggressive treatments among the most vulnerable patients, and by providing goal-oriented and coordinated care.
For measurement error, we found a SEM of 0.05, and upper LOA and SDC values of 0.13 for the FI, which correspond closely to the results of Feenstra et al. (29). The evaluation of the latter values depends on the magnitude of CMCs for the FI (34). For community-dwelling older adults, CMCs of 0.06/0.08 (36) and 0.04/0.06 (37) have been suggested. Since these CMCs fall within the LOA and are clearly smaller than the SDC in our study, as well as in the work of Feenstra et al. (29), the measurement error of the FI must be considered substantial. The FI as a broad summary measure of an older person's overall health status (5) seems not well suited for accurately monitoring such health changes of 0.04-0.08 in single individuals, that is, health deteriorations of about 2-4 deficits (in our FI44) would not be enough to be clearly differentiated from measurement error or the noise of the short-term fluctuations we found. More conservatively, a change in the FI that signifies a real deterioration or improvement in a presenting individual would need to amount to 0.13 (the SDC), or about 6 health deficits. Among frail older adults (FI > 0.20), it is even more difficult to measure and interpret individual-level health changes reliably. This considerably large measurement error of the FI, however, is unlikely to affect research interested in risk factors for FI trajectories such as sex, socioeconomic status, or BMI categories (49), as the SEM for group differences in FI trajectories will be much smaller than for single individuals. This applies even if reversible fluctuations are more prevalent in some groups than others (50). The relatively large measurement error may, however, limit the FI's potential for accurate individual-level monitoring, for example, based on electronic routine health data (13). It might be helpful to view any single FI score from an individual as just 1 data point in a long string of unmeasured FIs that may fluctuate considerably around the one realized measurement. To reduce the measurement error of the FI, (1) more health deficits could be used, (2) more test-based indicators, which come with less measurement error than self-reports, could be incorporated, and (3) information loss could be reduced by avoiding dichotomization of health deficits if possible (51). Furthermore, future research should systematically assess which health deficits are fueling the observed short-term instability of the FI, and weigh their added value for the FI, for example by assessing the loadings of individual health deficits on the FI, against the instability associated with such indicators. Cooper et al. (48), for example, decided to remove patient-reported low mood in their clinical implementation of the FI due to its short-term variability.
However, the aforementioned within-person FI fluctuations, which have been described earlier (52) and which appear related to the FI level, could also be more than just noise (50). Not only could these FI fluctuations reflect chains of discrete health transitions over weeks and months, for example, from high functioning to acute illness or injury, followed by hospitalization and recovery (53), but they may also be driven by age-related fluctuations inherent in disability (54), somatic symptoms (55), or cognition (56), which also tend to be associated with negative health outcomes. Hence, future studies should not only investigate how these instabilities come about and how to limit their influence, but also examine whether these seemingly stochastic fluctuations could be a relevant characteristic of system failure in their own right.
The current study has several strengths. We used a nationwide cohort study of community-dwelling older adults where the FI was assessed multiple times over 2-week periods, and the sample size was large for a reliability study. Also, this is the first time that information on all 3 properties of reliability (25) (internal consistency, reliability, and measurement error) of the FI was reported within a single study. Noteworthy limitations include that although nationwide data were collected, there were selection effects insofar as women, higher educated, and younger persons were somewhat overrepresented in the FRAIL70+ sample. Such selection effects, however, are common in health and aging survey studies, and we consider it unlikely that these affected the estimation of the reliability measures substantively. Furthermore, the longitudinal FI44 consisted only of self-reported health problems except for 3 cognitive tests, which could influence the extent of short-term FI fluctuations, and in turn, may have affected our reliability estimates. Given the smaller measurement error of physical performance tests compared to self-reports, our results can therefore be interpreted as a conservative, lower-end estimate of the FI's reliability.

Conclusion
Both internal consistency and test-retest reliability were good, that is, the FI differentiates well between community-dwelling older adults, which is an important requirement for risk stratification for both research and clinical purposes. Measurement error was considerable though, which means that smaller FI changes among individuals cannot be identified reliably. Furthermore, we uncovered considerable reversible short-term fluctuations in the FI, which merit further study.

[...] manufacturers (Hollister, INmune, Novartis, Takeda) on individualized outcome measurement, but not on frailty. In 2020, on behalf of Ardea Outcomes, he attended an advisory board meeting with Nutricia on dementia. He is associate director of the Canadian Consortium on Neurodegeneration in Aging, and special advisor to the President of Cape Breton University on frailty and aging. (Both are unpaid positions.) The other authors declare no conflict.

Figure 1. Descriptive statistics of the baseline frailty index (FI49) and the longitudinal frailty index (FI44). FI49 = frailty index at baseline based on 49 health deficits; FI44 = longitudinal frailty index based on the same 44 health deficits in all 7 repeated assessments. Estimated overall trajectory in plot E is the fitted mean trajectory based on a linear mixed model, and the light gray shaded area indicates 95% confidence intervals.

Figure 2. Repeated frailty index (FI44) assessments by participant. FI44 = longitudinal frailty index based on the same 44 health deficits in all 7 assessments. Points show repeated FI44 assessments for each person; each line represents 1 participant. Participants are ordered according to their mean FI44.

Figure 3. Correlations between subsequent frailty index (FI44) measurements. FI = frailty index based on 44 health deficits, n = sample size in paired assessments, r = Pearson's correlation coefficient, values in parentheses are 95% confidence intervals. Dashed lines indicate the cut-off to differentiate between nonfrail and frail older adults.

Figure 4. Limits of agreement between subsequent frailty index (FI44) measurements (Bland-Altman plots). FI = frailty index based on 44 health deficits. Solid lines show the extent of systematic bias between paired assessments, dashed lines indicate the upper and lower limits of agreement, which contain 95% of the paired FI differences, and the shaded area indicates 95% confidence intervals.

Table 1. Descriptive Statistics of the Frailty Index (FI44) by Measurement Occasion