Modeling Risk Factors for Intraindividual Variability: A Mixed-Effects Beta-Binomial Model Applied to Cognitive Function in Older People in the English Longitudinal Study of Ageing

Abstract Cognitive functioning in older age profoundly impacts quality of life and health. While most research on cognition in older age has focused on mean levels, intraindividual variability (IIV) around this may have risk factors and outcomes independent of the mean value. Investigating risk factors associated with IIV has typically involved deriving a summary statistic for each person from residual error around a fitted mean. However, this ignores uncertainty in the estimates, prohibits exploring associations with time-varying factors, and is biased by floor/ceiling effects. To address this, we propose a mixed-effects location scale beta-binomial model for estimating average probability and IIV in a word recall test in the English Longitudinal Study of Ageing. After adjusting for mean performance, an analysis of 9,873 individuals across 7 (mean = 3.4) waves (2002–2015) found IIV to be greater at older ages, with lower education, in females, with more difficulties in activities of daily living, in later birth cohorts, and when interviewers recorded issues potentially affecting test performance. Our study introduces a novel method for identifying groups with greater IIV in bounded discrete outcomes. Our findings have implications for daily functioning and care, and further work is needed to identify the impact for future health outcomes.

Web Table 1.Summary statistics for groups of participants subdivided by whether included in the final cohort (i.e., the analytical data set) or not (all were ELSA core sample with a date of birth ≥65 years prior to wave 7).b Not included in model due to: (a) the response being recorded by proxy, (b) the interviewer recording a memory problem, or (c) having missing data for any of the variables included in the model c Memory problem defined as Alzheimer's disease or dementia, organic brain syndrome, senility or any other serious memory impairment."No" / "Unknown" calculated from waves in which a response was recorded, only.If, instead, they are calculated from all waves in which the participant was eligible to respond, but did not necessarily do so (but was not recorded as dead) -in which case a smaller percentage of observations are made -then for those who recorded a response for an eligible wave, but not included in model, No: n = 166, Unknown: n = 208; and for those included in the model, No: n = 5897; Unknown: 3580.d These are calculated from all waves in which the participant recorded a response, but not all responses were necessarily included in the model, hence the count for "Yes" and "Unknown" can be greater than 0. Surveys in which memory problems recorded were not included in model.When a memory problem was recorded, no subsequent waves for that participant were included in the model.2. Summary statistics for groups of participants subdivided by mortality status, and whether last survey included in model occurred in wave 7 or not.Equation 1 depicts the mathematical formulation for the beta-binomial model used in our final analyses, with memory test score repeatedly-measured on occasion i (i = 1, …,   ), within individual j (j = 1, …, J).N corresponds to the denominator of the test score (i.e. the number of trials; N = 20, in our case).X 1 , … , X  denote the P predictors added as fixed (population) effects for average probability   (with logit link).These include age, as a non-linear function if appropriate, together with the other covariates as discussed in the main manuscript, and their interactions with the age term(s), again as appropriate.The number of ADL (activities of daily living) difficulties and Test Issues (whether the interviewer recorded any factors which may have impaired the participants' performance during the cognitive tests) were fitted both as a personlevel mean, to estimate the between-individual effect, and as a person-level mean centred variable, to estimate the within-individual effect (1) (although we investigated the interaction of age with ALD Difficulties, and of age with Test Issues, in their original, untransformed state, i.e. prior to decomposing them in this manner). 0 and  1 denote the individual-level random effects for the intercept for   , and also for the slope of age (as a linear term) for   , respectively.The dispersion parameter   is allowed to depend on the fixed effects as indicated (i.e.those with coefficients  1 , … ,  7 ), and also on an individuallevel random effect  2 , with a log-link ensuring that it remains positive.The random effects are assumed to come from a multivariate Normal distribution, with unstructured covariance matrix ∑  , and with corresponding standard deviations and correlations denoted σ and ρ, respectively (with subscripts reflecting the random effects they pertain to).

Web Table
Note that there is an alternative parameterisation of the beta-binomial, with the shape of the beta distribution described by parameters a and b.When a = b = 1 then every probability between 0 and 1 is equally likely (describing a uniform distribution), whilst when a + b increases then the beta-binomial converges on the binomial distribution and the variance of the underlying heterogeneity decreases (2).The parameterisation we used is often used in preference due to its more readily-interpretable nature, with p = average probability = a / (a + b) and  = dispersion = a + b.
Each model was fitted using four chains.Convergence to the target distribution was judged via visual diagnostics, the presence (and number and type) of any divergent transitions, and the value of split- ̂ (with ≈ 1 suggesting convergence) (6).The chains were run for such a length that (unless otherwise indicated) the bulk effective sample size (ESS) for each parameter of interest was a minimum of approximately 100 times the number of chains, indicating a reasonable number of independent draws from the posterior distribution.For models where random (individual) effects were saved (for leaveone-out cross validation, for example), thinning was applied so that the resulting chains were of a size more amenable to post-processing.
We used the priors specified by default by brms (4).These included: an improper flat prior over the reals for the population-level (fixed) effects; a half student-t prior with 3 degrees of freedom and a scale parameter that depends on the standard deviation of the response after applying the link function for standard deviation of the random effects; an Lewandowski-Kurowicka-Joe (LKJ) prior with an eta of one for the correlations between the random (individual) effects; and a gamma(0.01,0.01) prior for  (used in some initial analyses for model selection where  was assumed to be constant).
As sensitivity analyses, we cross-checked evidence for the selection of fixed effects against models which assumed Gaussian residual error (but which otherwise had the equivalent fixed and random effects).Such models which assumed constant within-individual error (used when selecting a non-linear effect of age) were fitted using MLwiN (v.3.05),via R2MLwiN (v.0.8-7) in R (v.4.0.2) (7-9).When these were instead mixed-effects location scale Gaussian models (employed to cross-check selection of interaction terms), they were fitted using Bayesian estimation via MCMC methods in Stan (2.21.0) (3), using the brms package (2.16.1) (4), in R (4.1.0)(5).

Comparing model fit
For the Bayesian models, model fit was assessed using Pareto smoothed importance sampling leaveone-out (PSIS-LOO) cross validation (CV) (10).PSIS is a method of approximating the average predictive accuracy of the training dataset (all the data bar one observation) for each observation left out (approximating the average predictive accuracy if we were to do this for each observation (fold), in turn, in the dataset).PSIS-LOO CV calculates the expected log predictive density (ELPD) as an estimate of outof-sample predictive fit, where the model with the largest (most positive) ELPD has more support.
When some values of the estimate of the diagnostic shape parameter k of the generalised Pareto distribution (Pareto- ̂) are > 0.7, then PSIS-LOO becomes less accurate, and K-fold CV is recommended as an alternative approach (10).K-fold differs from PSIS-LOO CV in that (a) the folds typically consist of many more than just one observation and (b) predictive accuracy is estimated from fitting many models (each to a dataset minus a particular fold) and then empirically testing predictive accuracy on the leftout fold (rather than estimating this predictive accuracy, as in PSIS-LOO).For the models in which we found (a small percentage of) values of Pareto- ̂ to be > 0.7, then when possible we checked PSIS-LOO CV against K-fold CV, using K = 10 folds.
For the sensitivity analyses which assumed Gaussian residual error (see Model estimation) and constant within-individual error, we used the Akaike information criterion (AIC), with a reduction in AIC of ≥ 4 indicating the target model had more support (11) (for the sensitivity analyses which assumed Gaussian residual error but had a mixed-effects location scale (MELS) parameterisation, we instead used PSIS-LOO and K-fold CV, as described above).

Choice of non-linear function of age for the location of the outcome
Initial analyses were conducted to investigate the best-fitting function of age for the average probability parameter p.These were multilevel beta-binomial models, with measurement occasion (level 1) nested within participant (at level 2) (12).These models had a random intercept, and also a random slope for age (in decades) which was a linear term across its full range in the random part of the model, centred around its mean.They also included fixed effects for cohort and sex.
To investigate non-linear functions of age, we fitted restricted (aka natural) cubic splines of the covariate in the fixed part of the model for p.These consist of splines (sections of the regression line), between knots (pre-specified values of age where the splines are connected).The splines are cubic polynomials (although see below), with their first and second derivatives continuous at the knots, allowing the splines to smoothly-transition into each other.This property also allows them to be relatively insensitive to the location of knots, although the number of knots remains an important consideration: we used knot placements as recommended by Harrell, for models with k = 3 through to k = 7 knots (13).'Restricted' refers to the ends (before the first knot, and after the last knot) of the fitted line, which are restricted to be linear.In distributions with long tails, this reduces the influence of outliers.
We compared the fit of these models to each other, and also to a model with age as a linear term across its full range in the fixed part of the model for p (i.e. a model with no restricted cubic splines), using model fit statistics (see Model estimation and Comparing model fit, above) together with qualitative assessment of plotted predictions and posterior predictive checks (14).
There was considerably more support for the beta-binomial models which allowed a non-linear function of age (via restricted cubic splines) compared to a model which instead fitted age as a linear term across its full range: for example, the model with k = 4 knots had the largest ELPD (in keeping with having most support), with the model with linear age having an ELPD -247.6 (SE: 20.2) lower.Otherwise, there was relatively little to choose between the restricted cubic spline models, with each having an ELPD slightly lower than the model with k = 4 knots, as follows: k = 6, ELPD -1.5 (SE: 7.8); k = 3 ELPD -7.3 (SE: 7.9); k = 5, ELPD -11.8 (SE: 7.8), with relatively large standard error for these differences.The model with k = 7 knots did not mix well, indicating this relatively elaborate parameterisation was not a practical one (a run of 4 chains, each with a warmup of 2000 iterations, and 2000 post-warmup iterations, took five days to run with chains in parallel, and yielded a minimum bulk ESS of 72, considerably lower than 100 times the number of chains (see Model estimation)).
A similar conclusion was drawn from the sensitivity analysis using the Gaussian models, which found the AIC was similar for k = 4 (AIC 166020.5),k = 5 (AIC 166018.4),k = 6 (AIC 166019.4)and k = 7 (AIC 166019.2) knots (although higher, indicating a poorer fit, for k = 3 (AIC 166055.4)knots, and considerably higher for the model in which age was linear across its full range (AIC 166218.1)).In these sensitivity analyses, the model with k = 4 was narrowly-favoured by a closed-test procedure (15).
There was therefore considerable evidence that the relationship between age and the location of the word recall test score was non-linear.The restricted cubic spline models all described a concave shape, with the rate of deterioration in mean word recall test score tending to increase with age, with close correspondence between the beta-binomial models and the sensitivity analyses using Gaussian models.For the latter (Gaussian) models, this mean pattern was further confirmed using fractional polynomials (16), in which models with the lowest AIC described substantively the same pattern 1 .
When choosing the number of knots in restricted cubic spline models, k = 4 or k = 5 provide a good choice for most datasets (13).Qualitative assessment of plotted predictions, and posterior predictive checks, indicated close agreement between the beta-binomial models with k = 4 and k = 5 knots, with no substantive difference between these and predictions from more elaborate parameterisations.Given that k = 4 tended to come out equally (if not more) favourably than alternative parameterisations in the formal model selection exercises described above, and that it offers a good balance between a relatively parsimonious parameterisation on the one hand (when more complex functions may be more liable to present estimation issues when further elaborating the model) and sufficient flexibility to accommodate a range of patterns on the other ( 13), a restricted cubic spline model with k = 4 knots was judged the best choice as a function for the fixed (population) effect of age for average probability, p.
Once we had decided upon a non-linear function of age in the fixed part of the model for p, we also explored non-linear functions of age in the random part of the mean function, and also in the fixed part of the intraindividual variability function of the mixed-effects location scale models.However, given that (a) these models tended to have convergence problems, and (b) the relationship between the location of the word test score and age differed only modestly from a linear relationship, a more parsimonious model was favoured, in which age was included as a linear term across its full range in both the random part of the model for p, and in the fixed part of the model for intraindividual variability too.

Testing for interactions
To test for differences in the effect of covariates (sex, educational qualification, the number of ADLs with which the respondent reported difficulty, and whether the interviewer reported whether there were any factors which may have impaired the participants' performance during the cognitive tests) on average probability (p) across age, a beta-binomial model with main effects only was compared to models in which the age terms (linear term and two further spline terms) in the fixed part of the function for p were interacted with each of the covariates in turn.
Using PSIS-LOO cross validation (see Comparing model fit, above) this found that the standard error of the ELPD difference was considerably larger than 2 * the absolute value of the point estimate, indicating that there was no reliable evidence that the models were improved by fitting any of the interaction terms (sex interaction: ELPD diff -10.6 (SE diff 11.7); education interaction: ELPD diff 8.9 (SE diff 11.2); 1 In the fractional polynomial models, age was fitted as a transformed covariate.It was transformed by powers from the following set: -2, -1, -0.5, 0, 0.5, 1, 2, 3 (where 0 corresponds to the logarithmic transformation, and 1 corresponds to no transformation).We fitted a series of first-degree (FP1), and a series of second-degree (FP2), fractional polynomial models.For the FP1 models, only one power term is included for age in the model.Each of the powers from the set described above is used to transform age in a separate model, so eight models are fitted.For the FP2 models, two powers of the age are added to the model (with powers p and q: β 0 + β 1 t p + β 2 t q ; NB if p = q, then β 0 + β 1 t p + β 2 t p log(t)).We explored each power combination, so 36 models were fitted (44 models in total).ADL interaction: ELPD diff -7.4 (SE diff 11.7); reported issues with cognitive tests interaction: ELPD diff -0.3 (SE diff 10.9)).Sensitivity analyses were conducted with Gaussian mixed-effects location scale models, in which the predictors were the same as in the beta-binomial case except now modelling the mean outcome on the original scale instead of p, and modelling log(within-individual SD) instead of log(θ).This similarly found no reliable evidence that the models were improved by fitting any of the interaction terms in the function for the mean outcome (either via PSIS-LOO, nor via K-fold crossvalidation using k = 10 folds, stratified on individual).
As indicated above, whilst investigating the effect of a particular (target) interaction, we did not adjust for the effect of other interactions (for reasons of parsimony and interpretability, we limited our exploration to a defined subset of the full factorial model space).

Sensitivity of model results to inclusion of people observed very seldom
To explore the sensitivity of the results presented in the main manuscript to the inclusion of people observed very seldom (once or twice), we conducted a sensitivity analysis on a subset consisting of only those observed three times or more.As Web Table 3 indicates, the substantive results were similar between the two models (see Web Table 4), albeit with the size of the effect of education on IIV (log(θ)) in the model of people with three or more observations only estimated to be smaller, and with relatively more uncertainty.
Web Table 3. Results (untransformed estimates) from beta-binomial mixedeffects location scale model fitted to participants with 3 or more repeated measures only.

Number
Number of individuals in ELSA core sample with a date of birth ≥65 years prior to wave 7 = 11,999 Number of individuals = 10,341 Number of individuals = 10,058 Excluding (n = 24,263) waves in which a nonresponse was recorded (n = 35,756 waves remaining) Excluding (n = 392) waves with a missing covariate or missing outcome data (n = 33,491 waves remaining) Excluding (n = 405) waves where a memory problem was recorded (n = 33,883 waves remaining) Number of individuals = 10,005 Number of individuals = 9,873 Excluding (n = 1,468) waves in which a response was recorded by proxy (n = 34,288 waves remaining) Web Figure 1.Flowchart detailing exclusions from final cohort (modeled data set).

Table 4 .
(1)t of the model for p, and in the fixed part of the model for the dispersion parameter (θ), age was added as a linear term only.Cohort: year turned 65, centred around 1999, in decades.For highest educational qualification, the reference category is no educational qualification.Fitted as a person-level mean, to estimate the between-individual effect, and also as a person-level mean centred variable, to estimates the within-individual effect(1).When coefficients are estimated as negative in the linear model for the dispersion parameter (log(θ)), this implies that dispersion is greater (i.e.there is greater within-individual variability) with higher values of the associated covariate.(When θ = 2, every probability equally likely; if θ > 2, distribution of probabilities grows more concentrated; if θ < 2 distribution is so dispersed that extreme probabilities near zero and 1 more likely than the mean).Results (untransformed estimates) from beta-binomial mixedeffects location scale model (from main article).
random b c e f