Alden L Gross, Alexandra M Kueider-Paisley, Campbell Sullivan, David Schretlen, International Neuropsychological Normative Database Initiative, Comparison of Approaches for Equating Different Versions of the Mini-Mental State Examination Administered in 22 Studies, American Journal of Epidemiology, Volume 188, Issue 12, December 2019, Pages 2202–2212, https://doi.org/10.1093/aje/kwz228
Abstract
The Mini-Mental State Examination (MMSE) is one of the most widely used cognitive screening tests in the world. However, its administration and content differ by country and region, precluding direct comparison of scores across different versions. Our objective was to compare 2 methods of deriving comparable scores across versions of the MMSE. Between 1981 and 2012, investigators in the International Neuropsychological Normative Database Initiative collected MMSE scores on 122,512 persons from 47 studies conducted in 35 countries. We used MMSE data from 80,559 adults aged 41–99 years from 22 studies that provided item-level response data. We first equated 14-point, 15-point, 18-point, 19-point, and 23-point versions of the MMSE to the original 30-point version using coarse equipercentile equating methods that preserved differences across continents, age groups, and durations (years) of education. We then derived more precise item response theory–based scores using item-level responses to MMSE component items. We compared the 2 score-equating approaches using correlation and Bland-Altman plots. Both test-equating approaches were highly correlated with each other (r = 0.73) and with raw MMSE point totals. Bland-Altman plots revealed minimal evidence of systematic differences between the approaches. Our findings support the use of equipercentile equating when item-level data are unavailable to facilitate development of international test norms.
The Mini-Mental State Examination (MMSE) is one of the most widely used cognitive screening tests in the world (1). Current versions of the MMSE consist of a 30-point test that assesses orientation, attention, working memory, word recall, language, and visuoconstruction. While numbers of test items and possible total scores vary across the different versions of the MMSE used around the world, there is epidemiologic and clinical interest in cross-national comparisons of MMSE versions to glean insights on cognitive aging throughout the world (2). Identification of cross-national differences in mental status might help identify correlates of cognitive impairment that are potentially modifiable. Clinically, equating scores across different forms of the MMSE would facilitate the development of global norms to determine whether an individual’s performance was within expectation for age, sex, educational attainment, nationality, language, and MMSE version. Normative scores help researchers and clinicians differentiate between normal and abnormal cognitive functioning, estimate disease severity, and guide treatment. The International Neuropsychological Normative Database Initiative (INNDI) was established to pool global data on widely used cognitive tests with the aim of constructing global norms that account for an examinee’s demographic background, nationality, and language. Consistent with the aims of precision medicine, global regression-based norms can help benchmark expected cognitive performance based on the combination of a person’s demographic characteristics, nationality, and language.
A complication in developing global norms for the MMSE is that translations of the original test frequently entail modifications of the test items to enhance cultural relevance and measurement of global cognitive function. For example, one question used to assess orientation asks for the name of the “county” where the test is being administered, but this is not a meaningful geographical unit in many countries or even in some regions of the United States, where the MMSE originated. This and other questions often require modification or elimination because of cultural or linguistic differences. As another example, the MMSE used in the African Cross-National Study asks examinees to calculate monetary change in lieu of the serial subtractions task in the original MMSE (3). Despite this, the range of possible scores for this version is the same as the original (i.e., 0–30 points). In other countries, translators have found it necessary to exclude test questions that are culturally inappropriate. For example, in the Costa Rican Longevity and Healthy Aging Study, asking examinees to identify the season of the year (fall, winter, spring, or summer) is not appropriate, because season is not recognized as an element of temporal orientation in Costa Rica in the way it is where the original MMSE was developed. Excluding this or other questions reduces the range of possible scores, making comparisons across different versions of the MMSE challenging. Most adaptations maintain a total of 30 possible points, but some versions reduce the maximum possible score to as few as 14 points.
The optimal approach for equating scores across different versions of the MMSE is to use item-level equating based on item response theory (IRT) models (4). In this approach, each MMSE item is weighted differently based on its correlation with other items. On the basis of empirical data, IRT also assigns each item a location along the latent trait (e.g., global cognitive function for the MMSE) that corresponds to its estimated difficulty. Easier items that a high proportion of examinees answer correctly (e.g., repeating 3 words) provide more precision for scores at the lower range of global cognitive function than harder items that fewer people answer correctly, such as recalling the same 3 words later.
While IRT methods are ideal for test equating, IRT can only be used when item-level data are available. In INNDI, researchers in 25 studies representing nearly 40,000 adults did not record item-level scores for the MMSE. When only summary scores are available from studies that used different versions of the MMSE, equipercentile equating is the next most sensible approach. This method equates the percentile ranks of a particular score across different versions of a test (5). This approach is especially appropriate and useful when the scores on different versions of a test are not normally distributed (6), as is the case for the MMSE.
Our objective in the current study was to compare 2 methods of deriving comparable scores for the MMSE across different versions—IRT and equipercentile equating. We used MMSE data on a pooled sample of 80,559 participants for whom both total scores and item-level data were available. We hypothesized that equipercentile equating and IRT approaches would agree, thus providing justification for using equipercentile equating in a larger sample of studies with only summary scores for the MMSE. Thus, this work will ultimately enable future development of norms and international comparisons of mental status.
METHODS
Participants
Data were collected through the INNDI (http://inndi.org/). This initiative presently has cognitive testing data on 307,458 individuals from 64 studies representing 52 countries in North America, South America, Africa, Asia, Australia, and Europe. Institutional review boards at parent study institutions approved the study protocols. Many of the data sets were obtained via the Inter-University Consortium for Political and Social Research (Ann Arbor, Michigan). The goal of INNDI is to pool representative samples of reasonably healthy adults in different countries. Thus, we worked closely with contributors to exclude participants with dementia and other neurological illnesses/conditions (e.g., Parkinson disease, epilepsy, multiple sclerosis, and a history of stroke or severe traumatic brain injury), severe mental illnesses (e.g., schizophrenia or affective psychosis), and alcohol or drug dependence. We also excluded persons who were being treated with antipsychotic, anticonvulsant, or anti-Parkinsonian medications. For the present study, we excluded persons younger than 41 years of age or older than 99 years of age and those who were missing data on education. This left 247,320 adults, from whom we excluded persons who were not administered the MMSE (n = 124,808) and those without item-level response data for the MMSE (n = 41,953). This exclusion left 80,559 individuals from 22 studies carried out in 19 countries for the present analyses. Table 1 lists the included studies by continent, and the Web Appendix (available at https://academic.oup.com/aje) provides details on and applicable references for each study. The only continent with missing data was Australia, for which contributing studies had only retained total scores for the MMSE instead of item-level scores.
Table 1. Numbers of Participants With Mini-Mental State Examination Scores Included in an Analysis of Approaches for Equating Different Versions of the Examination, by Study and Continent (n = 80,559), 1981–2012. Cell entries are numbers of participants by point total of the MMSE version administered.

| Continent and Study^a | 14-Point | 18-Point | 19-Point | 23-Point | 30-Point | Total |
|---|---|---|---|---|---|---|
| **North America** | | | | | | 29,986 |
| Aging, Brain, and Cognition Study | 0 | 0 | 0 | 0 | 227 | |
| ACTIVE | 0 | 0 | 0 | 0 | 2,800 | |
| National Alzheimer’s Coordinating Center | 0 | 0 | 0 | 0 | 8,433 | |
| Epidemiologic Catchment Area Study | 0 | 0 | 0 | 0 | 6,390 | |
| HEPESE | 0 | 0 | 0 | 0 | 1,654 | |
| SABE | 0 | 0 | 4,401 | 0 | 0 | |
| MYHAT | 0 | 0 | 0 | 0 | 1,982 | |
| MoVIES | 0 | 0 | 0 | 0 | 1,668 | |
| CRELES | 2,234 | 0 | 0 | 0 | 0 | |
| Aging, Demographics and Memory Study | 0 | 0 | 0 | 0 | 197 | |
| **Africa** | | | | | | 2,381 |
| Annerine Roos study | 0 | 0 | 0 | 0 | 86 | |
| African Cross-National Study | 0 | 2,295 | 0 | 0 | 0 | |
| **Asia** | | | | | | 22,459 |
| Chinese Longitudinal Healthy Longevity Survey | 0 | 0 | 0 | 5,988 | 0 | |
| Handan Eye Study | 0 | 0 | 0 | 0 | 3,808 | |
| African Cross-National Study | 0 | 1,079 | 0 | 0 | 0 | |
| Korean Longitudinal Study of Aging | 0 | 0 | 0 | 0 | 9,723 | |
| Bangladesh Health Survey | 0 | 0 | 0 | 1,861 | 0 | |
| **Europe** | | | | | | 15,286 |
| Whitehall II Cohort Study | 0 | 0 | 0 | 0 | 5,745 | |
| Cognitive Function and Aging Study | 0 | 0 | 0 | 0 | 9,028 | |
| Hertfordshire Cohort Study | 0 | 0 | 0 | 0 | 334 | |
| Hertfordshire Aging Study | 0 | 0 | 0 | 0 | 179 | |
| **South America** | | | | | | 10,447 |
| PREHCS | 4,468 | 0 | 0 | 0 | 0 | |
| SABE | 0 | 0 | 4,828 | 0 | 0 | |
| Boston Puerto Rican Health Study | 0 | 0 | 0 | 0 | 1,151 | |
| **Total** | 6,702 | 3,374 | 9,229 | 7,849 | 53,405 | 80,559 |
Abbreviations: ACTIVE, Advanced Cognitive Training for Independent and Vital Elderly; CRELES, Costa Rican Longevity and Healthy Aging Study; HEPESE, Hispanic Established Populations for Epidemiologic Studies of the Elderly; MMSE, Mini-Mental State Examination; MoVIES, Monongahela Valley Independent Elders Survey; MYHAT, Monongahela-Youghiogheny Healthy Aging Team; PREHCS, Puerto Rican Elderly Health Conditions Study; SABE, Survey of Health, Well-Being, and Aging in Latin America and the Caribbean.
^a Details on the listed studies are provided in the Web Appendix (https://academic.oup.com/aje).
Variables
The MMSE is a test of global cognitive function. It consists of items that assess orientation to time and place, word repetition and recall, attention and concentration, object naming, sentence writing, phrase repetition, following directions, reading a sentence, and design copying. Scores for items are summed together into a total score. Web Table 1 lists the items by continent and study.
Sex was coded as a binary variable (female, male). Age and education were coded in years. For studies that did not record duration (years) of education, we approximated duration of education for individuals on the basis of the United Nations Educational, Scientific and Cultural Organization’s International Standard Classification of Education manual (7). This manual contains detailed educational information for 29 Organisation for Economic Co-operation and Development countries to allow for comparability of education across countries.
Analysis plan
To account for different versions of the MMSE across studies, we conducted equipercentile and IRT equating. We then compared scores derived from the 2 approaches using Pearson correlations and Bland-Altman plots (8).
Equipercentile equating.
Equipercentile equating provides a nonparametric transformation of versions of the MMSE with maximum scores of less than 30 such that each score for an alternate version is equated to the corresponding score on the 30-point MMSE that has the same percentile rank (5). In this way, the resulting score distribution among people administered a non–30-point version of the MMSE will be the same as that among people administered the 30-point version. It may be considered a coarse approach to test equating that can be applied in studies for which item-level MMSE data are not available. To preserve potentially clinically and scientifically meaningful differences in cognitive status by continent, age, and duration of education, we first derived an equating algorithm in a restricted sample and then applied the equating algorithm in the full sample by world region.
To derive an equating algorithm separately for each continent (Table 1), we selected a subsample of persons from each continent who had common ranges of age and education across the different MMSE score totals. The goal of this step was to define a subset of participants in each world region with similar age and education, such that any observed differences in MMSE performance must be attributable to version differences and not to age, education, world region, or related background characteristics. In this equating sample, we then calculated percentiles of each score for each MMSE version. For example, a score of 23 on a 30-point version of the MMSE approximates the 20th percentile in the data available (Figure 1). This is the same percentile rank as a score of 14 on the 19-point version of the MMSE (Figure 1). We used the “equate” package in R and a log-linear smoother to smooth out the equated score distributions (9). Smoothing is a statistical procedure that preserves the underlying shape of the test-score distribution with few assumptions; resulting distributions without the log-linear smoothing function might appear less smooth, although sample sizes were sufficiently large in this study that the smoothing made little difference in the final results.
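The mapping step described above can be sketched as follows. This is an illustrative, unsmoothed version in Python with simulated scores standing in for real data; the study itself used the "equate" package in R with log-linear smoothing, so function names and the simulated sample here are assumptions for demonstration only:

```python
import numpy as np

def equipercentile_equate(alt_scores, ref_scores, alt_max):
    """Map each possible total on an alternate (shorter) MMSE version to the
    30-point reference score that has the same percentile rank.
    alt_scores and ref_scores are observed totals in the equating sample."""
    alt_scores = np.asarray(alt_scores)
    ref_scores = np.asarray(ref_scores)
    mapping = {}
    for s in range(alt_max + 1):
        # Midpoint percentile rank of score s on the alternate version.
        pr = (np.mean(alt_scores < s) + np.mean(alt_scores <= s)) / 2.0
        # Reference-version score at the same percentile rank.
        mapping[s] = int(round(float(np.quantile(ref_scores, pr))))
    return mapping

# Simulated equating sample: one latent ability drives both versions,
# mimicking subsamples with comparable age and education.
rng = np.random.default_rng(0)
theta = rng.normal(size=5000)
ref = np.clip(np.round(24 + 4 * theta), 0, 30).astype(int)    # 30-point form
alt = np.clip(np.round(15 + 2.5 * theta), 0, 19).astype(int)  # 19-point form
table = equipercentile_equate(alt, ref, 19)
```

Because percentile ranks are monotone in the score, the resulting crosswalk is monotone as well; in practice a smoother (here omitted) keeps the mapping from jumping at sparsely observed scores.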
Figure 1. Scores and percentile ranks for 19- and 30-point versions of the Mini-Mental State Examination (MMSE) (n = 80,559), 1981–2012. As an illustration of equipercentile equating, the figure shows corresponding scores (y-axes) and percentile ranks (x-axis) for the 19-point (right axis) and 30-point (left axis) versions of the MMSE. For example, the hatched line (|||) with arrows shows that approximately 20% of participants had a score of <23 on the 30-point version of the MMSE and a score of <14 on the 19-point version of the MMSE.
After identifying the percentiles corresponding to every possible score separately by continent in the restricted sample, we then applied the equipercentile algorithm to the full sample to equate MMSE scores on a 30-point scale in each region’s total sample. By applying the equipercentile equating algorithm developed in the restricted sample for each world region to the full sample, we preserved differences in the full sample that were attributable to age, education, and related background characteristics while removing MMSE version differences (6).
IRT equating.
Next, using item-level MMSE data, we derived IRT-based MMSE scores using a 2-parameter logistic graded response model (10). The model is equivalent to a unidimensional confirmatory factor analysis with categorical indicators (11, 12). Factor scores were computed using the regression-based method and represent underlying respondent ability based on available items (13). We linearly scaled the factor to a range between 0 and 30 to be comparable to the 30-point equipercentile-equated MMSE. We used Mplus software, version 8.1 (Muthén & Muthén, Los Angeles, California) (14).
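The final rescaling step can be sketched as below. The paper states only that the factor was scaled linearly to the 0–30 range; min-max scaling over the observed scores is an assumption here, and the function name and input values are hypothetical:

```python
def rescale_to_mmse(factor_scores, low=0.0, high=30.0):
    """Linearly map IRT factor scores onto a 0-30 scale (assumed min-max
    scaling over the observed sample; preserves rank order and spacing)."""
    lo, hi = min(factor_scores), max(factor_scores)
    return [low + (high - low) * (f - lo) / (hi - lo) for f in factor_scores]

# Hypothetical factor scores in standard-deviation units.
scores = rescale_to_mmse([-2.1, -0.3, 0.8, 1.4])
```

Any linear rescaling leaves correlations with other variables unchanged, so this step affects only the interpretability of the metric, not the comparisons reported later.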
IRT represents an empirical approach to scoring performance. Questions correctly answered by large proportions of individuals provide more information at the easy end of the latent cognitive ability spectrum. Answering such an “easy” question incorrectly implies considerable impairment. In contrast, performing poorly on an item that most other people also answered incorrectly implies less impairment than missing an easier item. IRT determines the location along the latent trait where each MMSE item provides the most information (item location), as well as how well each item correlates with the underlying trait measured by the entire test (the item’s weight). IRT assumes that a continuous, normally distributed latent variable underlies item responses; however, because IRT does not assume that item locations are distributed evenly over the underlying cognitive ability, a resulting distribution of factor scores representing underlying respondent ability based on available items need not be normal if items are all “easy” for the given sample.
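As a concrete sketch of these ideas (not the fitted graded response model itself, which was estimated in Mplus), the 2-parameter logistic item response function and its Fisher information show why an easy item is most informative at low ability. The discrimination and difficulty values below are hypothetical:

```python
import math

def p_correct(theta, a, b):
    """2-parameter logistic (2PL) item response function: probability of a
    correct response at ability theta, with discrimination a (the item's
    weight) and difficulty b (its location on the latent trait)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item, a^2 * p * (1 - p); it peaks at
    theta = b, so easy items (low b) are most precise at low ability."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical items: an easy one (e.g., repeating 3 words, b = -2) and a
# harder one (e.g., recalling them later, b = 0), equal discrimination.
easy, hard = (1.5, -2.0), (1.5, 0.0)
```

At average ability (theta = 0), the easy item is answered correctly with high probability and contributes little information there; its information is concentrated 2 standard deviations below the mean, which is exactly the screening range the MMSE targets.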
Differential item functioning of items in the IRT model.
IRT assumes that item weights and locations are the same in all subgroups of the population. Just as the metric on a ruler must be invariant across the different objects being measured, performance on a test should depend only on the underlying cognitive ability being estimated, not on extraneous factors. We tested and adjusted for differential item functioning (DIF) by continent using alignment analysis methods (15). Alignment analysis detects invariance in item factor loadings and thresholds across multiple groups—here, continents—based on confirmatory factor analysis. The approach first fits a factor analysis model across continent groups in which all items are assumed to be free of DIF. Next, the procedure systematically tests each factor loading and threshold to identify which differs by continent, controlling for other cognitive indicators in the model. The alignment method prefers solutions that have a few larger noninvariant parameters rather than solutions with many moderately noninvariant parameters (15).
Table 2. Demographic Characteristics of Sample Participants From 5 Continents in an Analysis of Approaches for Equating Different Versions of the Mini-Mental State Examination (n = 80,559), 1981–2012

| Continent | Mean Duration of Education, years (10th–90th Percentiles) | Mean Age, years (5th–95th Percentiles) | Female Sex, No. (%) | Point Total(s) of MMSE Version(s) Available |
|---|---|---|---|---|
| North America | 10.8 (2.0–18.0) | 70.5 (50.0–87.0) | 18,257 (61) | 14, 19, 30 |
| Africa | 2.3 (0.0–11.0) | 68.2 (60.0–81.0) | 1,054 (44) | 18, 30 |
| Asia | 5.5 (0.0–16.0) | 68.1 (46.0–95.0) | 11,748 (52) | 18, 23, 30 |
| Europe | 11.9 (8.0–20.0) | 69.7 (54.0–85.0) | 7,505 (49) | 30 |
| South America | 6.4 (1.0–15.0) | 69.7 (56.0–85.0) | 5,846 (56) | 14, 19, 30 |
Abbreviation: MMSE, Mini-Mental State Examination.
Alignment analysis detects differences, but it does not tell us why such differences may exist. There are many reasons why items may function differently on different continents or in different studies, including language translation issues, cultural differences by sample, and administration differences. We opted to evaluate DIF by continent rather than by study or other ways of grouping respondents, because the ultimate goal of INNDI is to provide region-specific norms for tests, and grouping by continent might better facilitate interpretations of item differences for cultural or regional reasons. Instead of continent, we also considered grouping by country, by language of administration, and by more granular (and more subjectively defined) world regions (e.g., the Middle East on its own). Ultimately, we found that there was insufficient diversity of MMSE point totals within most countries. Subjective classification of studies into noncontinental world regions could expose the research to investigator biases, so we avoided geographical classifications narrower than continents. Further, no comparison between MMSE versions could be made within language groups, because a given study was usually administered in just one language. English was the predominant language (n = 36,739, or 46% of the sample), followed by Spanish (n = 17,540, solely from North and South America). Asia had the most variability in languages (9 languages, none of which were English: Mandarin, Bengali, Korean, Jin, Wu, Min, Gan, Cantonese, and Sichuanese).
Comparison of equipercentile and IRT equating approaches.
We compared equipercentile-equated MMSE scores and IRT-derived MMSE scores with raw MMSE total scores and with each other using correlations and Bland-Altman plots. To assess criterion validity, we compared linear regression coefficients for age and education using each of the scores, with the expectation that scores should be related to age and education similarly.
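The Bland-Altman comparison reduces to two summary quantities, sketched below with toy numbers (not study data); the standard bias and 95% limits-of-agreement formulas are assumed, and the variable names are illustrative:

```python
import statistics

def bland_altman(x, y):
    """Bland-Altman summary for two scorings of the same examinees on a
    common scale: mean difference (bias) and 95% limits of agreement
    (bias +/- 1.96 standard deviations of the differences)."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Toy example: two equated scores for 6 examinees that disagree only by noise.
equip = [22, 25, 28, 18, 30, 24]
irt = [23, 24, 28, 19, 29, 24]
bias, (lo, hi) = bland_altman(equip, irt)
```

A bias near 0 with limits of agreement that bracket 0 symmetrically, and no trend in the differences across the range of means, is the pattern consistent with "minimal evidence of systematic differences" between two scoring methods.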
Figure 2. Scores and percentile ranks for different versions of the Mini-Mental State Examination (MMSE) used for equipercentile equating, by continent (n = 80,559) (North America (A), Africa (B), Asia (C), Europe (D), and South America (E)), 1981–2012. Each panel shows cumulative probability plots for each raw point-total MMSE version. These plots illustrate the lack of overlap of the cumulative distributions of raw point-total version scores, which equipercentile equating is designed to address. All of the continents shown have a 30-point MMSE score (solid line). Asia (panel C) additionally has a 23-point version (hatched line (|||)), North and South America (panels A and E, respectively) have a 19-point version (hatched lines (|||)), Africa and Asia (panels B and C, respectively) have an 18-point version (dashed lines), and North and South America (panels A and E, respectively) have a 14-point version (dashed lines). Europe (panel D) has only a 30-point version.
Figure 3. Scatterplot showing equipercentile-equated Mini-Mental State Examination (MMSE) scores plotted against raw MMSE point total scores (n = 80,559), 1981–2012. Pearson correlations (r) between the 30-point MMSE and equipercentile-equated MMSE scores were 0.93, 0.88, 0.95, and 0.97 for the 14-point (A), 18-point (B), 19-point (C), and 23-point (D) versions of the MMSE, respectively. Panel E shows the 30-point version of the MMSE, for which no equipercentile equating was necessary, and thus the graphed points form a perfect diagonal line. Nonlinear relationships between the raw point total and the equipercentile-equated score indicate different proportions of individuals with certain scores, which would lead to differences in scores associated with certain percentiles. For example, scores of 1–8 on the 14-point version of the MMSE all translated to a score near 0 on the equipercentile-equated 30-point version, suggesting that lower ends of the score range are not as informative as scores at the higher end of the range because too few individuals had such low scores to identify percentiles at the low end of the distribution.
RESULTS
Summary statistics for the distribution of age, sex, and education in the 80,559 participants are shown in Table 2. The mean age of adults on each continent was 68–70 years, and the 5th and 95th percentiles ranged from 46 years to 87 years. North America and Europe had the greatest average durations of education, while Africa had the lowest. At least 1 study on each continent administered a 30-point version of the MMSE. Other studies administered 14-, 18-, 19-, and 23-point versions (Table 2).
Equipercentile equating
Equipercentile equating produced scores on a 30-point scale in all studies; Web Table 2 provides score equivalencies for each equated point-total version of the MMSE. Equipercentile equating was not necessary for the European continent, because all contributing studies administered versions with 30-point totals. Panels A–E of Figure 2 show cumulative probability plots for each raw point-total MMSE version by continent (North America, Africa, Asia, Europe, and South America, respectively). These plots illustrate the lack of overlap of the cumulative distributions of raw point-total version scores, which equipercentile equating addressed. Figure 3 shows scatterplots of equipercentile-equated MMSE scores plotted against each point-score MMSE version, pooled across continents. These plots suggest that equipercentile equating yielded adequate overlap in distributions. The correction was not linear in all cases: For example, scores of 1–8 on the 14-point version all translated to a score of 0 on the equipercentile-equated 30-point version (Figure 3), suggesting that lower ends of the score range are not as informative as scores at the higher end of the range in this sample. Panels for the 18-point and 19-point versions in Figure 3 (panels B and C) illustrate that different equipercentile equating algorithms were derived for different continents, as accommodated by our approach. As expected, the equipercentile-equated MMSE was highly correlated with each point-total MMSE version (all r’s ≥ 0.88) (Figure 3).
Table 3. Pearson Correlations (r) Between Raw Mini-Mental State Examination Scores and Scores Equated Using 2 Different Methods (n = 80,559), 1981–2012a

| Type of Equating | 14 Points (n = 6,702) | 18 Points (n = 3,374) | 19 Points (n = 9,229) | 23 Points (n = 7,849) | 30 Points (n = 53,405) |
|---|---|---|---|---|---|
| Equipercentile equating | 0.93 | 0.88 | 0.95 | 0.97 | 1.00 |
| IRT equating | 0.88 | 0.91 | 0.92 | 0.67 | 0.88 |

Abbreviations: IRT, item response theory; MMSE, Mini-Mental State Examination.

a Columns are point totals for each MMSE version. For each raw point-total version of the MMSE, correlations with equipercentile-equated MMSE scores and IRT-equated MMSE scores were calculated.
IRT equating
We used item-level data on 80,559 participants to fit a 2-parameter logistic IRT model and derived factor scores from that model. Item factor loadings, or weights, were estimated on a latent trait scaled to a normal distribution with mean 0 and variance 1 and were graphed against their thresholds, or relative difficulty (see Web Figure 1). Loadings were high (above 0.5) for most items, indicating high intercorrelations. Item thresholds on the x-axis were distributed across approximately 3 standard deviation units of the lower range of MMSE performance. As is characteristic of screening instruments, this item coverage along the lower range of ability is responsible for the ceiling effects evident on the MMSE, as most community-living participants tend to perform well on the test.
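A minimal sketch of the 2-parameter logistic model and an expected a posteriori (EAP) ability score is shown below. This is not the authors' estimation procedure; the item parameters here are hypothetical (with negative thresholds to mimic easy screening items), and a real analysis would estimate discriminations and thresholds from the data.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct response
    given ability theta, discrimination a, and difficulty (threshold) b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_score(responses, a, b, grid=np.linspace(-4, 4, 161)):
    """Expected a posteriori ability estimate under a standard normal
    prior, given a 0/1 response vector and known item parameters."""
    prior = np.exp(-0.5 * grid**2)
    like = np.ones_like(grid)
    for x, ai, bi in zip(responses, a, b):
        p = p_correct(grid, ai, bi)
        like *= p if x == 1 else (1 - p)
    post = prior * like
    return np.sum(grid * post) / np.sum(post)

# Hypothetical parameters for 5 easy screening items (thresholds < 0)
a = np.array([1.5, 1.2, 1.8, 1.0, 1.4])
b = np.array([-2.5, -1.5, -2.0, -0.5, -1.0])
theta_hat = eap_score([1, 1, 1, 0, 1], a, b)
```

Because the hypothetical thresholds sit below 0, the items discriminate best in the lower range of ability, mirroring the ceiling effect described above.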
DIF testing via the alignment method revealed some evidence of DIF in item thresholds. The differences could have been due to translation issues, sample selection, or other factors. For example, orientation to time proved to be an easier task in Africa than on other continents, while orientation to place was more difficult in Africa. Object naming was easier in Asia and Europe, while repeating a phrase, writing a sentence, and 3-word registration were more difficult in South America. To quantify the combined effect of these sources of detected DIF on IRT scores, we calculated the difference between factor scores from an IRT model assuming measurement invariance and the DIF-adjusted factor score and then calculated the proportion of observations whose difference fell outside 1 standard error of measurement. After we accounted for this DIF, 14% of observations had differences beyond 1 standard error of measurement.
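The salience check described above can be sketched as follows. The data are simulated and the classical-test-theory SEM formula (SD × √(1 − reliability)) is one reasonable choice, not necessarily the one the authors used.

```python
import numpy as np

def prop_salient_dif(score_invariant, score_adjusted, reliability):
    """Proportion of respondents whose factor scores under an
    invariance-assuming model and a DIF-adjusted model differ by more
    than 1 standard error of measurement (SEM).
    SEM = SD * sqrt(1 - reliability), the classical-test-theory form."""
    diff = np.asarray(score_invariant) - np.asarray(score_adjusted)
    sem = np.std(score_invariant) * np.sqrt(1 - reliability)
    return np.mean(np.abs(diff) > sem)

# Hypothetical scores: DIF adjustment shifts scores for a 20% subgroup
rng = np.random.default_rng(1)
naive = rng.normal(0, 1, 1000)
adjusted = naive + np.where(rng.random(1000) < 0.2, 0.4, 0.0)
share = prop_salient_dif(naive, adjusted, reliability=0.90)
```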
Comparison of equipercentile and IRT equating approaches
Both approaches were highly correlated with raw MMSE scores (all r’s ≥ 0.88), except for the correlation of the IRT-equated MMSE with the 23-point version (r = 0.67) (Table 3). The 2 equating approaches were correlated with each other at r = 0.73, indicating acceptable agreement. A Bland-Altman plot (Figure 4) suggested minimal evidence of unbalanced systematic differences across the distribution of MMSE scores. The average difference between IRT-equated and equipercentile-equated scores was 0 points, although individual differences ranged from −2 to 5 points. Web Figure 2 shows a pyramid plot of IRT-equated and equipercentile-equated MMSE scores, illustrating the skewed distribution of the equated scores.
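The Bland-Altman quantities referenced here (bias and the spread of differences) can be computed directly; the toy data below stand in for the two sets of equated scores.

```python
import numpy as np

def bland_altman_stats(x, y):
    """Bland-Altman agreement summary for two measurements of the same
    quantity: per-pair means, differences, bias (mean difference), and
    95% limits of agreement (bias +/- 1.96 SD of the differences)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    diff = x - y
    mean = (x + y) / 2
    spread = 1.96 * diff.std(ddof=1)
    bias = diff.mean()
    return mean, diff, bias, (bias - spread, bias + spread)

# Toy data: two noisy equatings of the same underlying 0-30 score
rng = np.random.default_rng(2)
truth = rng.uniform(0, 30, 500)
m1 = truth + rng.normal(0, 1, 500)
m2 = truth + rng.normal(0, 1, 500)
_, _, bias, (lo, hi) = bland_altman_stats(m1, m2)
```

Plotting `diff` against `mean` (e.g., with matplotlib) reproduces the usual Bland-Altman display; a bias near 0 with no trend across the x-axis corresponds to the pattern reported for the two equating approaches.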

Figure 4. Bland-Altman plot comparing equipercentile-equated Mini-Mental State Examination (MMSE) scores with scores equated using the item response theory (IRT) method (n = 80,559), 1981–2012. The difference between the equipercentile-equated MMSE score and the IRT-equated MMSE score (y-axis) is plotted against the average of the 2 scores (x-axis). r = 0.81; bias = 0.01; acceptance interval = 1.23.
We regressed equipercentile-equated and IRT-equated MMSE scores on age and education. The two sets of equated scores showed associations of similar magnitude with age (both β’s = −0.05 (standard errors, <0.01)) and education (both β’s = 0.47 (standard errors, <0.01)).
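A regression of this form is simple to reproduce; the sketch below uses simulated data with effects of the same sign as those reported (negative for age, positive for education), so the recovered coefficients are illustrative only.

```python
import numpy as np

def ols(y, X):
    """Ordinary least squares: coefficients for an intercept plus each
    column of X, via numpy's least-squares solver."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta  # [intercept, b_age, b_education]

# Simulated equated scores with age and education effects built in
rng = np.random.default_rng(3)
age = rng.uniform(45, 90, 2000)
edu = rng.uniform(0, 18, 2000)
mmse = 28 - 0.05 * age + 0.4 * edu + rng.normal(0, 2, 2000)
beta = ols(mmse, np.column_stack([age, edu]))
```

Regression-based norms of this kind are what allow an examinee's observed score to be compared against the expectation for their demographic profile.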
DISCUSSION
We equated scores on multiple versions of the MMSE administered in 22 studies across 19 countries using coarse equipercentile equating and more precise IRT methods. The 2 approaches yielded fairly consistent results and minimal evidence of systematic differences. We regard the IRT approach as optimal for test equating when item-level data are available. However, this study showed that equipercentile equating may be sufficient for facilitating cross-national comparisons as a prerequisite for the future development of international norms for the MMSE.
Our goal was to pool representative samples of generally healthy adults for purposes of developing robust norms specific to particular countries and continents. Thus, the normative sample for one part of the world may have different characteristics than a normative sample from another part of the world. The equipercentile equating approach we used accommodates these differences by estimating equating algorithms separately by continent. The IRT equating approach did not accommodate such differences, although items might plausibly show different intercorrelations and response patterns in different parts of the world. For example, previous studies have demonstrated differences attributable to language and culture for certain MMSE items, such as spelling the word “world” backwards and following directions (16). Such item-level measurement differences are unlikely to have affected inferences in this study, because equipercentile equating does not use item-level information and is therefore unaffected by them. Furthermore, item-level DIF rarely affects total scores greatly, because items tend to show bias in different directions (e.g., see Jones (16)).
Approaches for equating different versions of a test are particularly important in neuropsychological testing for both clinical practice and epidemiologic research. The gold standard indicator of progressive dementia is cognitive decline. However, identifying cognitive decline with certainty requires longitudinal assessment using cognitive tests that are reliable and valid measures for a population and setting of interest. Absent prior assessment for comparison, normative test scores among healthy persons with similar demographic profiles are used (17). However, detecting cognitive decline in this situation depends on the normative sample used for comparison. Samples used to create current norms rarely capture cultural and linguistic differences, especially across more than 2 or 3 target populations (18). At the global level, increasing the diagnostic precision of neurocognitive testing requires large numbers of reasonably healthy persons from many countries who are tested in their preferred language. The present study has taken an incremental step toward that goal for the MMSE.
A seemingly obvious equating approach would be to rescale non–30-point MMSE scores to 30 points using mean values and standard deviations. This approach is akin to linear equating, which leverages the mean and standard deviation to shift a variable’s distribution. Linear equating is more restrictive than equipercentile equating (which leverages empirical percentiles instead of the mean and standard deviation) and is appropriate only for normally distributed variables, which the MMSE is not (6). Moreover, simply rescaling scores fails to provide a common metric across studies when combining test scores with different numbers of items: Any rescaling of the 19-point MMSE to a 30-point total would not change the granularity of possible point totals.
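The contrast can be made concrete: linear equating matches only the mean and standard deviation of a reference form, and the number of distinct attainable scores does not change. The reference mean and SD below are arbitrary illustrative values.

```python
import numpy as np

def linear_equate(scores, ref_mean, ref_sd, max_score=30):
    """Linear equating: match the short form's mean and SD to the
    reference form's, then clip to the reference score range. Unlike
    equipercentile equating, this preserves only the first two moments
    and cannot add granularity between possible point totals."""
    scores = np.asarray(scores, float)
    z = (scores - scores.mean()) / scores.std(ddof=1)
    return np.clip(ref_mean + ref_sd * z, 0, max_score)

# Toy example: 19-point totals rescaled toward a 30-point reference
rng = np.random.default_rng(4)
short = rng.integers(5, 20, 1000).astype(float)
equated = linear_equate(short, ref_mean=26.0, ref_sd=2.0)
```

The rescaled scores hit the target mean exactly, but the number of distinct score values is unchanged, which is the granularity limitation noted above.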
Strengths of this study include the diverse set of well-characterized studies contributing to the INNDI collaboration, carefully collected item-level data for the MMSE, and comparison of classical and psychometrically sophisticated approaches to test equating. Several caveats should also be mentioned. Most notably, this study excluded many available samples, including all samples from Australia, because those studies provided MMSE total scores but not item-level responses. This limitation underscores the utility of the equipercentile approach, which could have been used to equate the test versions in those studies. A second limitation is that when conducting equipercentile equating, we cannot be certain that point-total MMSE versions across studies all contain the same items. For example, for attention and calculation, a 30-point version of the MMSE might assess serial 7’s or spelling “world” backwards, or administrators might take the best performance on these subtasks. In India, attention and calculation were assessed with a simple currency subtraction task (“What is 20 paise minus 5 paise?”). The 3-step command on the MMSE is known to vary widely across studies, as do registration and recall of 3 objects. The equipercentile algorithm we used does not take these item differences into account. Despite this limitation, the equipercentile approach returned MMSE point totals that correlated highly with the various MMSE total scores and with IRT-equated scores, which do account for item-level differences.
A final limitation of equipercentile equating generally is its coarseness: It has been described as more of a scale alignment procedure than a true equating approach, because it only converts scores to a common scale based on empirical percentiles rather than to a common metric, as IRT does (19). Results from equipercentile equating are therefore somewhat dependent on sample characteristics, insofar as those characteristics affect the score distributions and, in turn, which scores fall at particular percentiles. This is why care was taken at the outset of the INNDI project to restrict samples to a homogeneous group of healthy individuals from which robust norms are meaningful. The large sample size of the pooled cohorts further assures the robustness of the equipercentile solution derived here. Although this study included both population-based cohorts (e.g., the Aging, Demographics and Memory Study) and clinical cohorts (e.g., the National Alzheimer’s Coordinating Center), which are not usually representative of community-living populations, INNDI’s goal of creating robust norms does not require representative samples. Rather, samples of healthy individuals are most useful for deriving robust norms (17).
In conclusion, approaches for equating different versions of the MMSE, based on coarse distribution-based equipercentile equating and finer item-level co-calibration using IRT, produced consistent scores in this study. Because the item-level response data required for IRT are not available from all studies, equipercentile equating may facilitate comparison of MMSE scores across studies in which different versions were administered and item-level data are lacking. This method may thus promote development of international norms for assessment of mental status based on the MMSE.
ACKNOWLEDGMENTS
This work was supported by the National Institute on Aging (grant K01-AG050699 to A.L.G.).
Conflict of interest: none declared.
Abbreviations
- DIF
differential item functioning
- INNDI
International Neuropsychological Normative Database Initiative
- IRT
item response theory
- MMSE
Mini-Mental State Examination