The National Institutes of Health (NIH) Toolbox is a diverse set of brief measures assessing cognitive, motor, emotional, and sensory function in individuals ranging in age from 3 to 85. Recent research suggests that it has promising applications as a new measure of cognition.
Following the NIH plan for developing unifying criteria for clinically based research studies (the NIH Blueprint for Neuroscience), the Toolbox comprises a set of existing standardized neuropsychological instruments (e.g., the Rey AVLT). The goals of the Cognitive portion of the measure are to improve clinical screening for a variety of neurocognitive disorders and to facilitate comparison across empirical designs. The entire Toolbox is computer mediated, although some components (sensory, motor) also employ a more hands-on operating component.
The NIH Toolbox purports to monitor neurological, cognitive, behavioral, and emotional function, following these domain constructs across the lifespan. The stated goals of these measures include evaluating the efficacy of various interventions and treatments. Although the test has a main website (www.nihtoolbox.org), the tests are accessed via an alternate website, Assessment Center (http://www.assessmentcenter.net). Although the tests are free, one must apply for access to the Toolbox by submitting one's credentials and purpose for using the tests.
The Cognitive portion of the Toolbox is designed to measure the following cognitive functions: executive function, attention, episodic memory, language, processing speed, and working memory. It contains two batteries: the Toolbox Cognition Battery and the Early Childhood Cognition Battery. The Toolbox Cognition Battery (appropriate for ages 7+) contains the following tests within the various cognitive domains: Executive Function—Dimensional Change Card Sort Test, Flanker Inhibitory Control and Attention Test; Attention—Flanker; Episodic Memory—Picture Sequence Memory Test (supplementary measure—Rey Auditory Verbal Learning Test); Language—Picture Vocabulary Test, Oral Reading Recognition Test; Processing Speed—Pattern Comparison Processing Speed Test (supplementary measure—Oral Symbol Digit Test); and Working Memory—List Sorting Working Memory Test. The Early Childhood Cognition Battery (recommended for ages 3–6) includes the Dimensional Change Card Sort Test, Flanker, Picture Sequence Memory, and Picture Vocabulary. Each test in the battery comes with its own technical manual, detailing its psychometric properties. Some of the tests are adaptive in nature, meaning that they get harder or easier in response to the respondent's answers.
The current version of the Toolbox runs in a dual-monitor format. Originally, test administration required touchscreens for responding; however, the developers have since switched to keyboard data entry. Although the computerized nature of the test certainly has its advantages, many tasks must be administered with two screens connected to one computer: the respondent views stimuli on one screen, while the examiner uses the other to read instructional prompts and to view respondent data as they are recorded. A VGA monitor cable is also needed. This extra material can be cumbersome to transport and set up off site, and it complicates administration (in our work with the measure, the resolution of the monitors had to be adjusted constantly). An earlier version of Windows (e.g., Windows 7) is also required, and the Toolbox can only be run on a PC. The Cognitive portion of the Toolbox takes ∼30–35 min to administer, with a set-up time of ∼10 min and a take-down time of ∼5 min. After the test is completed, both raw and normative scores are automatically generated in the form of an Excel spreadsheet.
The test developers include a thorough test manual. This 26-page manual can be found on the test's website (www.nihtoolbox.org) and details the test's various validity and reliability estimates; Section 10 contains normative data for all of the included tests. More information on the psychometric characteristics of the test can be found in the article by Dr. Sandra Weintraub and colleagues (2013), published in Neurology. According to the test developers, this article covers such psychometric issues as internal consistency, test–retest reliability, and divergent and convergent validity. Test–retest reliability was found to be strong for the entire sample, as well as for children ages 3–15 years and adults ages 20–85 years; these values are presented in a table in the article. Intraclass correlation coefficients for the entire sample on the NIH Toolbox measures ranged from 0.78 for the Picture Sequence Memory Test to 0.99 on the Oral Reading Recognition Test, with most other values falling above 0.90 (Weintraub et al., 2013). Additionally, the effects of age on the Toolbox were assessed: all NIH-TB Cognition Battery measures showed a robust association between test performance and age in the child group (r = .58–.87), with scores improving with age in this group. Interestingly, in the adult group, age and test scores on the NIH-TB Cognition Battery measures were negatively associated (r = −.46 to −.65), with lower scores at higher age levels; the only exception was the language tests. Convergent and discriminant validity were also assessed. In children from 3 to 6 years of age, all NIH-TB Cognition Battery measures were significantly correlated (r = .54–.74) with a general measure of cognition employed by the researchers.
The researchers termed this general measure of cognition "g"; it was obtained by averaging the z scores of the Wechsler Preschool and Primary Scale of Intelligence—3rd edition Block Design subtest and the Peabody Picture Vocabulary Test—4th edition. For all NIH-TB CB instruments, correlations for convergent validity measures ranged from r = .48 to .93 (all p < .0001), which is very strong. Similarly, correlations for discriminant validity measures ranged from r = .05 to .30, indicating little correlation with measures that capture different constructs.
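The "g" composite described above—an average of two z-scored subtests—can be illustrated with a short sketch. The scores below are hypothetical, not data from the study; the sketch simply shows the standardize-then-average computation:

```python
def zscores(xs):
    """Standardize a list of raw scores against the sample mean and SD."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in xs]

def g_composite(block_design, picture_vocab):
    """Average the two z-scored measures for each participant,
    as in the "g" composite described in the text."""
    z_bd = zscores(block_design)
    z_pv = zscores(picture_vocab)
    return [(a + b) / 2 for a, b in zip(z_bd, z_pv)]

bd = [10, 12, 8, 14, 11]      # hypothetical Block Design scaled scores
pv = [95, 105, 88, 112, 100]  # hypothetical PPVT-4 standard scores
g = g_composite(bd, pv)       # one composite value per participant
```

Because each z-score distribution is centered on zero, the resulting composite is as well, which makes it convenient for correlating against the individual Toolbox measures.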
Additionally, a prominent article is that of Akshoomoff and colleagues (2014), which reports the results of a normative sample of 1,020 participants between the ages of 3.0 and 20.9 years. Unfortunately, test–retest data and validity estimates are not included in this paper.
At this point, the Toolbox has not been validated against other neuropsychological measures, although the authors identify this as a goal for future work.
Expert contributors to this battery include a team of neuropsychologists from Northwestern University (Evanston, IL), as well as notable neuropsychologists from other areas of the country (Dr. Jennifer Manly, Dr. David Tulsky, and Dr. Robert Heaton, among others). In addition to the neuropsychologists, the team includes many other important members, such as Edmond Bejeti, a valuable technical support asset (note: we have spoken with Ed many times and found him quite helpful). Help can be obtained via phone or at firstname.lastname@example.org.
The NIH Toolbox is open source software and is, therefore, free for qualifying neuropsychologists to use. The website does state that there are fees associated with user and technical support (e.g., for NIH sponsored research the cost is $1,500 per year for ≤100 subjects), although we were not assessed a fee.
The NIH Toolbox is not a diagnostic tool, an important point that is made explicit on the measure's website (http://www.assessmentcenter.net). It is unclear, however, to what degree the developers are comfortable with it being used as a standard neuropsychological test and, as such, informing clinical decision-making. An informal poll of practicing neuropsychologists via the npsych listserve revealed that the measure, at present, is not often used as a clinical measure of functioning and/or for diagnostic purposes.
Currently, there is a burgeoning research literature on the use of the Toolbox. The seminal article on the Toolbox, written by the developers of the test (Gershon, Wagster, Hendrie, Fox, Cook, & Nowinski, 2013), was published in 2013 in a special section of Neurology devoted to the test. There has also been additional research, also authored by the developers of the test, that assessed input on the test's inclusion criteria (Nowinski, Victorson, Debb, & Gershon, 2013); that article highlighted the fact that the measure was developed with explicit input from its users. A third article in the special issue of Neurology also highlights the "cutting edge" and multidisciplinary nature of the test (Hodes, Insel, & Landis, 2013). Recently, the multicenter Pediatric Imaging, Neurocognition, and Genetics Study group (Akshoomoff et al., 2014) evaluated the impact of age and sociocultural variables on test performance through administration of the original (touchscreen) Toolbox to 1,020 subjects between the ages of 3.0 and 20.9 years. Using general additive models of nonlinear age functions, while controlling for family SES and genetic ancestry factors, age was found to account for the majority of variance across Toolbox scores, indicating the sensitivity of the Toolbox to developmental effects. Limitations included some ceiling effects among older children and some floor effects on the executive function measures among the youngest children.
According to the test developers (R. Gershon, personal communication, January 21, 2014), recent modifications to the NIH Toolbox include updates to time-dependent tests, to remove the possibility that Internet speed or delays can affect scores. New features under development include: upgrades to keep the tests current with updates to Windows and Internet Explorer; an offline version of all tests that can be installed on a user's network or on an individual computer (eliminating the need for Internet access during test administration while enabling local data storage); iPad versions of most of the tests; and the release of expanded "all-adjusted" norms.
While many of the tests are based upon Computer Adaptive Testing, which enables assessment across the full range of a trait (typically without floor or ceiling effects), as noted above a couple of the NIH Toolbox instruments were found to have floor effects at the very lowest range of human functioning (e.g., for 3-year-olds functioning below age 2 or for older adults with severe cognitive impairment). New versions of these tests, which present even easier content to subjects who fail the practice items (thus removing these floor effects), are scheduled for validation testing later in 2014. It should be noted that the developers of the NIH Toolbox are working with researchers in numerous countries to translate the measure for use abroad (all of the instruments are already available in both English and Spanish).
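The adaptive logic referred to above—items becoming harder or easier in response to the respondent's answers—can be sketched abstractly. The toy staircase below is only an illustration of the idea; the actual Toolbox instruments use psychometrically grounded adaptive algorithms, and the function and parameters here are invented for demonstration:

```python
def adaptive_session(respond, n_items=10, start=0.0, step=1.0):
    """Toy staircase illustration of adaptive testing: raise item
    difficulty after a correct answer, lower it after an incorrect one,
    shrinking the step size so the test homes in on the respondent's level.
    `respond(difficulty)` returns True if the item was answered correctly."""
    difficulty = start
    history = []
    for _ in range(n_items):
        correct = respond(difficulty)
        history.append((difficulty, correct))
        difficulty += step if correct else -step
        step = max(step / 2, 0.125)  # keep a small minimum step
    return difficulty, history

# Simulated respondent whose "true ability" is 2.0: answers correctly
# whenever item difficulty is at or below that level.
final, hist = adaptive_session(lambda d: d <= 2.0)
```

In this sketch the difficulty converges near the simulated ability after only a handful of items, which is the property that lets adaptive tests cover a wide trait range briefly; it also shows why very low-functioning respondents need easier starting content to avoid the floor effects noted above.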
In a personal communication with the test developers, we learned that the NIH Neuroscience Blueprint has committed funding for the next 4 years to ensure instrument availability and continued technology improvements. This grant includes a commitment to develop and implement a plan that ensures the long-term availability of the measure. To further increase availability, several testing companies have asked the developers to enable use of the NIH Toolbox instruments within their existing testing paradigms; these discussions are currently ongoing. At the time of writing of this test review, three studies are listed on the Toolbox website.
In addition, the National Children's Study (NCS) has selected many of the NIH Toolbox instruments for inclusion in the initial Vanguard sample. The NCS is the largest United States epidemiological study of child development and the environment with an ultimate randomized sample size of 100,000 children and their parents, accrued as early as preconception and then followed to age 21. The NIH Toolbox instruments are uniquely qualified to be applicable across most of this age range, and for this reason the NCS is funding the creation of the iPad versions of the assessments in order to enable easy data collection in multiple geographies. In the future, the developers hope to use the results collected from the NCS.
In terms of educational opportunities regarding the Toolbox, there was a recent workshop for potential users of this test in Chicago. There is also a half-day workshop scheduled to take place at the Annual Meeting of the American Congress of Rehabilitation Medicine (ACRM), October 7–11, 2014, in Toronto.
Although the Toolbox is increasingly being used for research purposes, one wonders about the clinical utility of such a measure. As mentioned above, at the time of writing, little information is available about the test's clinical use. It certainly appears to have significant potential as a fairly quick and diverse screening measure of cognitive functioning across diverse populations. We have had experience using this measure in screening college student–athletes for Attention-Deficit/Hyperactivity Disorder (ADHD) and learning disabilities (LD).
In conclusion, the NIH Toolbox: Cognition is a well-developed computerized screening measure of cognitive functioning with significant applications for research and clinical neuropsychologists alike. Future revisions will almost certainly build on this foundation, resulting in greater utility and efficacy.