Abstract

Planning ahead by anticipating the consequences of future actions is a prototypical executive function. In clinical and experimental neuropsychology, disc-transfer tasks like the Tower of London (TOL) are commonly used for the assessment of planning ability. Previous psychometric evaluations have, however, indicated poor reliability in measuring planning performance with the TOL. Based on theory-grounded task analyses and a systematic problem selection, the computerized TOL-Freiburg version (TOL-F) was developed to improve the task's psychometric properties for diagnostic applications. Here, we report reliability estimates for the TOL-F from two large samples collected in Mainz, Germany (n = 3,770; 40–80 years) and in Vienna, Austria (n = 830; 16–84 years). Results show that planning accuracy on the TOL-F possesses adequate internal consistency and split-half reliability (>0.7) that are stable across the adult life span, while the TOL-F covers a broad range of graded difficulty even in healthy adults, making it suitable for both research and clinical application.

Introduction

More than 30 years ago, the Tower of London (TOL) planning paradigm was introduced as a prototypical task for assessing executive functions (EF) and the control of behavior in nonroutine situations (Shallice, 1982). Impairments of EF occur in various clinical populations and significantly interfere with patients' everyday functioning (Chan, Shum, Toulopoulou, & Chen, 2008; Diamond, 2013; Shallice & Burgess, 1991). Hence, reliable assessment of EF is of great importance in clinical neuropsychology. Despite this, the psychometric properties of EF tasks are often unsatisfactory (Rabbitt, 1997; Strauss, Sherman, & Spreen, 2006). To some extent, this also holds true for the TOL, which is administered in a multitude of variants and versions (Berg & Byrd, 2002). For instance, for the original 12-item TOL problem set, Humes, Welsh, and Retzlaff (1997) reported a Cronbach's α = 0.25 and a split-half reliability of r = .19, whereas Kafer and Hunter (1997) questioned its validity more generally, given that a latent one-factor model of planning ability based on several TOL variables could not be fitted. Notably, this original 12-item TOL problem set is implemented in the Stockings of Cambridge (SOC) variant of the TOL in the CANTAB (Cambridge Neuropsychological Test Automated Battery; e.g., Owen, Downes, Sahakian, Polkey, & Robbins, 1990; Owen et al., 1995; Robbins et al., 1998), which is one of the most widely used computer-assisted neuropsychological test batteries for the assessment of cognitive (dys)function in a broad range of neurological and psychiatric populations. In this respect, Lowe and Rabbitt (1998) reported test–retest reliabilities between r = .26 and .60 for different SOC scores in healthy older adults, which concurs with recent findings of Syväoja and colleagues (2015) on a test–retest reliability of r = .23 for SOC accuracy in healthy children.

In strong contrast to the widespread popularity of the TOL and its variants, reflected by the constantly rising number of currently almost 500 PubMed-listed publications (cf. Kaller, Rahm, Köstering, & Unterrainer, 2011; a literature search in PubMed [http://www.ncbi.nlm.nih.gov/pubmed/] using the key term [“Tower of London” OR “Stockings of Cambridge”] returned 494 publications on November 10, 2015), only a remarkably small number of studies have addressed the development and evaluation of alternative problem sets in order to improve the task's psychometric properties (e.g., Culbertson & Zillmer, 1998a, 1998b; Kaller, Unterrainer, & Stahl, 2012; Köstering, Nitschke, Schumacher, Weiller, & Kaller, 2015; Schnirman, Welsh, & Retzlaff, 1998; Tunstall, O'Gorman, & Shum, 2014). A common approach to increase task reliability is to select items from a larger pool based on the item-total correlations and to reevaluate the internal consistency of the resulting set of items in an independent sample (cf. Culbertson & Zillmer, 1998a, 1998b; Schnirman et al., 1998; Tunstall et al., 2014). As a complementary approach, the theory-grounded compilation of a consistent item set builds on knowledge and assumptions from experimental psychology about the cognitive processes underlying successful task performance and the relevant task parameters that impose a systematic variation of item difficulty (see Primi, 2014, for an instructive example on assessing fluid intelligence).

As for the TOL, task difficulty is commonly explained by the minimum number of moves necessary to solve a given problem. Yet, numerous studies have identified several additional task parameters that considerably influence the difficulty of individual TOL problems (e.g., Berg, Byrd, McNamara, & Case, 2010; Carder, Handley, & Perfect, 2004; Kaller, Unterrainer, Rahm, & Halsband, 2004; Newman & Pittman, 2007; Unterrainer, Rahm, Halsband, & Kaller, 2005; Ward & Allport, 1997; for a review, see Kaller, Rahm, Köstering, et al., 2011). To give an illustrative example of this impact of problem structure on planning beyond the minimum number of moves, Fig. 1A and B depict two five-move TOL problems that substantially differ in the imposed demands on planning ability and, as a direct consequence, also differ in their resulting item difficulties. Figure 1C further demonstrates that six-move TOL problems, depending on their underlying structural properties, are not necessarily harder to solve than five-move TOL problems (cf. Kaller, Unterrainer, & Stahl, 2012).

Fig. 1.

(A–C) Three examples of Tower of London problems requiring a minimum of either five (A and B) or six (C) moves for optimal solution. The reader is kindly asked to solve the three problems by transforming the respective start state into the goal state. Despite an equal number of five moves, most people find the problem in (A) much easier to solve than that in (B). Moreover, the five-move problem in (B) appears even more difficult than the six-move problem in (C), although the commonly applied operationalization of task difficulty in terms of the minimum number of moves would predict the opposite (adapted after Kaller, Rahm, Köstering, et al., 2011; Kaller, Unterrainer, & Stahl, 2012). These three example problems hence provide a vivid illustration that the minimum number of moves is not sufficient for specifying the task difficulty of the Tower of London (see Kaller, Rahm, Köstering, et al., 2011, for a review).

Based on comprehensive problem space analyses and empirical data, Kaller, Rahm, Köstering, et al. (2011) have, therefore, suggested a standard TOL problem set consisting of four-, five-, six-, and seven-move TOL problems (eight problems each, 32 items in total) and comprising a systematic variation of several main structural task parameters while further parameters were kept constant. This theory-grounded compilation of a TOL problem set resulted in a satisfactory increase in the split-half reliability (r = .72) and internal consistency (α = 0.69) of TOL accuracy (Kaller, Unterrainer, & Stahl, 2012). The broad range of item difficulty together with an almost perfectly linear decrease in accuracy with increasing level of minimum moves was further shown to discriminate well between high- and low-performing subjects (Kaller, Unterrainer, & Stahl, 2012). However, item-specific analyses also indicated potential for further improvement, which resulted in the development of the computerized TOL-Freiburg Version (TOL-F; Kaller, Unterrainer, Kaiser, Weisbrod, & Aschenbrenner, 2012). The TOL-F constitutes an adaptation of the standard 32-item TOL problem set suggested by Kaller, Rahm, Köstering, et al. (2011) and was specifically improved to be suitable for diagnostic purposes and clinical use. In this regard, the systematic selection of 24 instead of 32 problems for the TOL-F aimed at achieving a reasonable trade-off between a satisfactory reliability, a sufficiently broad range of item difficulties, and an adequate test economy in terms of a relatively short (and clinically practicable) test duration.

Here, we present psychometric data on the TOL-F from two large-scale samples, which were collected in Mainz, Germany and Vienna, Austria. Given the substantial reduction from 32 to 24 items, a main objective was to determine whether planning accuracy measured with the TOL-F achieves a reliability comparable to that of its more extensive precursor (Kaller, Unterrainer, & Stahl, 2012). As previous evaluations with the standard TOL problem set were solely based on student samples with a restricted representativeness, we further aimed to investigate whether the TOL-F's psychometric properties generalize to the general population. More specifically, we assessed whether the TOL-F ensures a reliable assessment of individual planning ability irrespective of a subject's age. Finally, although previous reports have revealed substantial differences in planning ability across the life span (e.g., Albert & Steinberg, 2011; Allamanno, Della Sala, Laiacona, Pasetti, & Spinnler, 1987; Bugg, Zook, DeLosh, Davalos, & Davis, 2006; De Luca et al., 2003; Köstering, Stahl, Leonhart, Weiller, & Kaller, 2014; Krikorian, Bartok, & Gay, 1994; Luciana, Collins, Olson, & Schissel, 2009; Peña-Casanova et al., 2009; Robbins et al., 1998; Zook, Welsh, & Ewing, 2006), results on sex differences are equivocal: Some studies report significant effects (De Luca et al., 2003; Peña-Casanova et al., 2009; Rönnlund, Lövden, & Nilsson, 2001; Unterrainer et al., 2013), whereas most studies have failed to find performance differences between male and female children and adolescents (Albert & Steinberg, 2011; Culbertson & Zillmer, 1998b; Krikorian et al., 1994; Luciana et al., 2009; Tunstall et al., 2014) or adults (Allamanno et al., 1987; Tunstall et al., 2014; Zook et al., 2006). Thus, a further aim was to investigate sex-related differences in planning ability across the life span.

Methods

Samples

Mainz sample

The Mainz sample originated from assessments with the TOL-F that were part of a large-scale prospective and population-based epidemiological study conducted at the University Medical Center Mainz with primary focus on cardiovascular health (Gutenberg Health Study; http://www.gutenberghealthstudy.org/). In the baseline examination between April 2007 and March 2012, the Gutenberg Health Study assessed a representative population sample of approximately 15,000 individuals from the city of Mainz and the district of Mainz-Bingen, Germany, which was equally stratified for sex, residence (urban and rural), and age decades. These individuals are currently being reassessed in a second examination that started in April 2012 and for which the TOL-F version as described subsequently was added in June 2012 as a measure of complex cognition. The present analyses were based on a total of 3,804 subjects who participated in the second run of the Gutenberg Health Study and were tested between June 14, 2012 and December 19, 2013. Subjects' age ranged between 40 and 80 years. Exclusion criteria concerned insufficient knowledge of the German language and physical or psychological inability to participate in the examinations at the study center. The Gutenberg Health Study was approved by local ethics authorities. Written informed consent was obtained from all subjects prior to participation.

Data inspection revealed 34 cases (0.894%) of the Mainz sample with a null performance, indicating that no meaningful data had been collected. These were excluded before the analyses so that the final Mainz sample consisted of n = 3,770 included subjects.

Vienna sample

The Vienna sample was collected by the SCHUHFRIED GmbH as part of various norm and validation studies. The present analyses were based on a total of 830 subjects who were tested between May 30, 2011 and June 3, 2013. Subjects' age ranged between 16 and 84 years. Participants were mainly recruited from the subject database of the SCHUHFRIED Test and Research Center, which constitutes a nonprobability selection. Exclusion criteria concerned reports of severe neurological or psychiatric diseases and/or insufficient familiarity with using a computer mouse (as assessed with a specialized computerized test before the testing was started). Data acquisition complied with local institutional research standards for human research and was completed in accordance with the Helsinki Declaration (http://www.wma.net/en/30publications/10policies/b3/). Written informed consent was obtained from all subjects prior to participation.

Age groups and sample characteristics

Besides individual age and sex, subjects were also characterized by their highest achieved education level, assessed on a five-level scale. An educational level of 1 corresponded to 8 or fewer years of schooling and was typically applied to participants who completed elementary school, but did not obtain higher education. An educational level of 2 was used to classify participants who completed 9 years of schooling, but without vocational training. An educational level of 3 corresponded to 10–12 years of education and the completion of vocational training. An educational level of 4 was used to denote the completion of high school and the qualification for university entrance. An educational level of 5 was assigned if a participant had obtained an academic degree. Information on education level was available for all but one subject from the Mainz sample.

In order to assess effects of age on planning ability, both samples were further subdivided into age groups. The Mainz sample (n = 3,770) was divided into eight 5-year groups between 40 and 80 years of age, covering an age range from mid- to late adulthood. Given the smaller overall sample size of the Vienna sample (n = 830), the respective age cohorts were formed in 10-year intervals ranging from 20 to 70 years of age. Subjects in the Vienna sample younger than 20 or older than 70 years formed two additional age groups that were, however, substantially smaller than the other age groups and that did not exactly cover the intended 10-year intervals (ranging from 16.00 to 19.83 and 70.17 to 84.00 years of age, respectively). An overview of the descriptive information for age, sex, and education level of the two overall samples as well as of the resulting subgroups is provided in Table 1.
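The grouping rule described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the analyses: the function name `age_group` and its parameters are hypothetical, and the handling of an age of exactly 70.00 years in the Vienna scheme is an assumption (no such case is reported).

```python
import math

def age_group(age, lower, upper, width):
    """Return the (start, end) bounds of the age band containing `age`.

    Ages below `lower` or at/above `upper` fall into open-ended bands,
    mirroring the two additional Vienna groups (assumption: an age of
    exactly `upper` counts as the open-ended upper band).
    """
    if age < lower:
        return (None, lower)
    if age >= upper:
        return (upper, None)
    start = lower + width * math.floor((age - lower) / width)
    return (start, start + width)

# Mainz scheme: 5-year bands between 40 and 80 years.
print(age_group(43.01, lower=40, upper=80, width=5))
# Vienna scheme: 10-year bands between 20 and 70 years, open-ended outside.
print(age_group(17.79, lower=20, upper=70, width=10))
```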

Table 1.

Sample descriptives and reliability estimates for the Mainz and Vienna samples on the TOL-F

Column groups: sample descriptives (n, age, sex, Ed. Lvl.) and reliability estimates (λ2, λ3 (α), λ4, ωtot, glb).

Sample | n | Age M ± SD (years) | Sex f, m (n) | Ed. Lvl. 1, 2, 3, 4, 5 (n) | λ2 | λ3 (α) | λ4 | ωtot | glb
Mainz
Overall sample | 3,770 | 59.96 ± 10.66 | 1,837, 1,933 | 21, 574, 1,803, 344, 1,027 | 0.718 | 0.713 | 0.743 | 0.730 | 0.755
40.00–44.99 years | 350 | 43.01 ± 1.25 | 175, 175 | 1, 29, 124, 43, 153 | 0.663 | 0.655 | 0.745 | 0.670 | 0.759
45.00–49.99 years | 466 | 47.50 ± 1.49 | 296, 170 | 2, 42, 208, 74, 140 | 0.612 | 0.601 | 0.700 | 0.607 | 0.708
50.00–54.99 years | 539 | 52.35 ± 1.50 | 205, 334 | 3, 63, 228, 76, 168 | 0.679 | 0.672 | 0.721 | 0.693 | 0.768
55.00–59.99 years | 556 | 57.43 ± 1.49 | 284, 272 | 3, 71, 251, 55, 176 | 0.632 | 0.622 | 0.722 | 0.648 | 0.700
60.00–64.99 years | 525 | 62.47 ± 1.34 | 251, 274 | 3, 100, 260, 24, 138 | 0.670 | 0.663 | 0.707 | 0.684 | 0.763
65.00–69.99 years | 490 | 67.43 ± 1.40 | 243, 247 | 1, 79, 282, 29, 99 | 0.697 | 0.689 | 0.776 | 0.713 | 0.778
70.00–74.99 years | 493 | 72.57 ± 1.32 | 232, 261 | 1, 114, 269, 26, 83 | 0.671 | 0.661 | 0.750 | 0.689 | 0.757
75.00–79.99 years | 351 | 77.16 ± 1.39 | 151, 200 | 7, 76, 181, 17, 70 | 0.725 | 0.716 | 0.807 | 0.741 | 0.814
Vienna
Overall sample | 830 | 43.46 ± 16.91 | 455, 375 | 4, 130, 336, 246, 114 | 0.662 | 0.656 | 0.716 | 0.677 | 0.730
Below 20.00 years | 67 | 17.79 ± 1.15 | 37, 30 | 3, 46, 4, 14, 0 | 0.709 | 0.683 | 0.861 | 0.713 | 0.832
20.00–29.99 years | 148 | 24.94 ± 2.70 | 77, 71 | 0, 26, 25, 80, 17 | 0.559 | 0.533 | 0.755 | 0.563 | 0.716
30.00–39.99 years | 158 | 34.50 ± 3.18 | 89, 69 | 0, 24, 53, 43, 38 | 0.651 | 0.635 | 0.800 | 0.669 | 0.776
40.00–49.99 years | 147 | 44.96 ± 2.75 | 72, 75 | 0, 10, 82, 34, 21 | 0.699 | 0.685 | 0.805 | 0.718 | 0.799
50.00–59.99 years | 136 | 54.57 ± 2.71 | 73, 63 | 0, 11, 73, 34, 18 | 0.674 | 0.655 | 0.833 | 0.663 | 0.804
60.00–69.99 years | 117 | 64.60 ± 2.73 | 68, 49 | 1, 4, 65, 28, 19 | 0.698 | 0.680 | 0.806 | 0.652 | 0.819
Above 70.00 years | 57 | 74.14 ± 3.42 | 39, 18 | 0, 9, 34, 13, 1 | 0.596 | 0.547 | 0.826 | 0.633 | 0.791

Note: Underscores with bold values and solid lines denote the indices with highest and lowest estimates on reliability for the respective (sub)samples. Ed. Lvl., education level.

Tower of London-Freiburg Version

Task description

The TOL-F (Kaller, Unterrainer, Kaiser, et al., 2012) is implemented in the Vienna Test System (http://www.schuhfried.com/vienna-test-system-vts/) as a computerized pseudo-realistic representation of the originally wooden configuration of the TOL (Fig. 2) as it was introduced by Shallice (1982). That is, the tower configurations in the TOL-F consist of three rods of different height, with the left, middle, and right rod being able to accommodate up to three, two, and one ball, respectively. The tower configurations further comprise three balls colored in red, yellow, and blue. The green ball from the original TOL version was replaced by a yellow ball in the TOL-F in order to ensure applicability to subjects with red-green dyschromatopsia.

Fig. 2.

Physical layout of the computerized tower configurations in the TOL-F mimicking a realistic representation of the original wooden tower and balls.

As in most disc-transfer planning tasks, individual problem items consist of a start and a goal state. In the TOL-F, these are presented in the lower and upper halves of the computer screen, respectively. Subjects have to transform the start into the goal state in the minimum number of moves, which, in the TOL-F, is always indicated to the left of the start state. Written instructions indicate that only one ball may be moved at a time, that balls cannot be placed beside the rods, that only the top-most ball can be moved if several balls are stacked on a rod, and that the rods differ in their capacities of accommodating one, two, or three balls at maximum. Note that the computer program does not allow breaking these rules, but records any attempts to do so. Instructions further emphasize that problems have to be solved in the minimum number of moves and that participants should always plan the solution ahead before starting with movement execution.
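To make the notion of the minimum number of moves concrete, the rules above can be expressed as a breadth-first search over tower configurations. The following Python sketch is not part of the TOL-F software. The state encoding (one tuple of ball labels per rod, listed bottom to top) and the function names are assumptions made purely for illustration and do not follow the Berg and Byrd (2002) notation.

```python
from collections import deque

# Rod capacities as described for the TOL-F: left, middle, right.
CAPACITY = (3, 2, 1)

def neighbors(state):
    """Yield every state reachable by moving one top-most ball to another rod."""
    for src, rod in enumerate(state):
        if not rod:
            continue  # nothing to pick up from an empty rod
        ball = rod[-1]  # only the top-most ball may be moved
        for dst, target in enumerate(state):
            if dst != src and len(target) < CAPACITY[dst]:
                nxt = list(state)
                nxt[src] = rod[:-1]
                nxt[dst] = target + (ball,)
                yield tuple(nxt)

def minimum_moves(start, goal):
    """Breadth-first search; returns the minimum number of moves, or None."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return distance[state]
        for nxt in neighbors(state):
            if nxt not in distance:
                distance[nxt] = distance[state] + 1
                queue.append(nxt)
    return None

# Hypothetical example: all balls stacked on the left rod, one ball per rod as goal.
start = (("R", "Y", "B"), (), ())
goal = (("R",), ("Y",), ("B",))
print(minimum_moves(start, goal))
```

Because breadth-first search visits states in order of their distance from the start state, the first time the goal state is reached corresponds to an optimal solution. Structural parameters such as search depth and goal hierarchy are not captured by this sketch.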

In order to transfer the start into the goal state, the TOL-F can be operated by computer mouse (Vienna sample) or by touch screen (Mainz sample). In these two response modes, a ball is picked up either by placing the cursor over the ball and clicking the left button of the computer mouse or simply by tapping the ball with a finger. The selected ball is then encircled by a transparent whitish corona and can be moved to another rod. The respective rod is likewise selected either by computer mouse or by finger touch.

Three different time limits per trial can be applied (1 min, 3 min, unlimited) in order to accommodate differences in cognitive and/or motor speed. For the present investigation, the time limit was set to 1 min, which corresponds to the original suggestion of Shallice (1982) and was found to be sufficient, for instance, for assessing planning ability in healthy children (Unterrainer et al., 2015), young (Kaller, Unterrainer, & Stahl, 2012) and elderly healthy adults (Köstering, Stahl, et al., 2014), child patients with autism spectrum disorders and attention deficit hyperactivity disorder (Unterrainer et al., in press), Parkinson patients (McKinlay et al., 2008), as well as patients suffering from stroke and mild cognitive impairment (Köstering, Schmidt, et al., 2015). In order to avoid unnecessary frustration (and a reduced compliance and/or motivation in subsequent tests, for instance, in a clinical setting), the TOL-F allows for an automatic cancellation of the test if the time limit is exceeded three times in a row. Given a linear increase of item difficulty across the minimum number of moves (cf. Kaller, Unterrainer, & Stahl, 2012), the rationale is that subjects who struggle to cope with a certain level of item difficulty will also do so in even more complex problems. Automatic cancellation was activated for the assessments both in the Mainz and Vienna samples. Further note that for the TOL-F assessments in the Mainz sample, a time limit of 20 min for the test administration (exclusive of instructions and pauses between problem items) was requested by the study board so as to avoid delays in the subjects' schedules (see also Experimental Procedures).
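The automatic cancellation rule can be illustrated with a short Python sketch. It is not the Vienna Test System implementation: the function `administer`, its parameters, and the representation of a subject as a list of hypothetical per-trial solution times are assumptions for illustration only.

```python
def administer(solution_times, time_limit=60.0, max_consecutive_timeouts=3):
    """Return the indices of the trials presented before the test stops.

    solution_times: time in seconds a (hypothetical) subject would need per
    trial; None means the trial is never solved. The test is cancelled once
    the time limit has been exceeded on three trials in a row.
    """
    presented = []
    streak = 0  # number of consecutive trials that exceeded the time limit
    for index, needed in enumerate(solution_times):
        presented.append(index)
        timed_out = needed is None or needed > time_limit
        streak = streak + 1 if timed_out else 0
        if streak >= max_consecutive_timeouts:
            break  # automatic cancellation
    return presented

# Two isolated timeouts do not cancel the test; three in a row do.
print(administer([10, 70, 80, 20, 61, 62, 63, 5]))
```

In this example the last trial is never presented, because the three preceding trials all exceeded the 60 s limit.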

Problem set

The selection of problem items for the TOL-F was based on psychometric evaluations (Kaller, Unterrainer, & Stahl, 2012) of a recently suggested standard problem set of four- to seven-move problems for investigations with the TOL task (Kaller, Rahm, Köstering, et al., 2011). The basic idea behind the suggested composition was to develop a problem set that (i) considerably enhances the poor psychometric properties of the original TOL problem set (Humes et al., 1997; see also Kafer & Hunter, 1997; Lowe & Rabbitt, 1998; Syväoja et al., 2015) and that (ii) yields a broad range of problem difficulty, but with a linear increase across the minimum number of moves, to allow for diagnostic and research applications in a multitude of healthy and clinical samples with inherently different levels of planning ability. To this end, the structural problem parameters search depth and goal hierarchy, which account for most variation in problem difficulty, were systematically varied within the levels of minimum moves, whereas other problem parameters were kept as constant as possible (for more details, see Kaller, Rahm, Köstering, et al., 2011; Kaller, Unterrainer, & Stahl, 2012). As intended, the standard problem set was found to have a satisfactory reliability (split-half r = .72; Cronbach's α = 0.69) and to follow an almost perfectly linear increase of difficulty. However, item-specific analyses indicated several possibilities for further improvement. In consequence, one of the nested families of problem items (see Kaller, Unterrainer, & Stahl, 2012) was exchanged during the development of the TOL-F, the seven-move problems were removed due to their minor contributions to discriminating between high- and low-achieving subjects, and the presentation order of problems within each level of minimum moves was altered. The TOL-F problem set hence constitutes a refinement of this previously suggested standard problem set and consists of four-, five-, and six-move problems (eight problem items each). The final selection of problem items is listed in Table 2.

Table 2.

Item characteristics of the standardized TOL-F problem set

Item | Start state | Goal state | Minimum moves | Search depth | Goal hierarchy
#01 | 54 | 31 | 4 | 1 (high) | Unambiguous
#02 | 42 | 23 | 4 | 0 (low) | Partially ambiguous
#03 | 34 | 13 | 4 | 1 (high) | Partially ambiguous
#04 | 12 | 65 | 4 | 0 (low) | Completely ambiguous
#05 | 55 | 41 | 4 | 1 (high) | Unambiguous
#06 | 16 | 24 | 4 | 0 (low) | Partially ambiguous
#07 | 25 | 32 | 4 | 1 (high) | Partially ambiguous
#08 | 36 | 15 | 4 | 0 (low) | Completely ambiguous
#09 | 33 | 11 | 5 | 2 (high) | Unambiguous
#10 | 53 | 13 | 5 | 1 (low) | Partially ambiguous
#11 | 23 | 43 | 5 | 2 (high) | Partially ambiguous
#12 | 55 | 15 | 5 | 1 (low) | Completely ambiguous
#13 | 42 | 21 | 5 | 2 (high) | Unambiguous
#14 | 54 | 34 | 5 | 1 (low) | Partially ambiguous
#15 | 52 | 12 | 5 | 2 (high) | Partially ambiguous
#16 | 55 | 35 | 5 | 1 (low) | Completely ambiguous
#17 | 22 | 41 | 6 | 3 (high) | Unambiguous
#18 | 34 | 53 | 6 | 2 (low) | Partially ambiguous
#19 | 42 | 63 | 6 | 3 (high) | Partially ambiguous
#20 | 46 | 25 | 6 | 2 (low) | Completely ambiguous
#21 | 23 | 61 | 6 | 3 (high) | Unambiguous
#22 | 43 | 64 | 6 | 2 (low) | Partially ambiguous
#23 | 13 | 32 | 6 | 3 (high) | Partially ambiguous
#24 | 22 | 55 | 6 | 2 (low) | Completely ambiguous

Note: Start and goal states are reported in the notation suggested by Berg and Byrd (2002). For search depth, numbers indicate the number of intermediate moves that have to be accomplished first, whereas descriptions in parentheses refer to the level of search depth relative to the respective minimum number of moves. For further information on structural problem parameters in the TOL task, please refer to the overview provided in Kaller, Rahm, Köstering, et al. (2011).

Before planning ability is examined with these problems, task instructions are explained and verified in the TOL-F in a prelude with simple two- and three-move problems. Planning ability can be assessed with three parallel forms, with problems across parallel forms being structurally identical, but featuring systematic permutations of ball colors, hence resulting in visually dissimilar problems that are, however, completely equivalent in terms of problem difficulty (cf. Berg & Byrd, 2002; Unterrainer et al., 2005). Hence, although all participants of the Mainz and Vienna samples were tested with parallel form A, the derived psychometric properties can nonetheless be applied to parallel forms B and C.
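The idea of parallel forms based on color permutations can be sketched as follows. The state encoding (one tuple of ball labels per rod, bottom to top), the function `permute_colors`, and the specific color mapping are assumptions for illustration; the permutations actually used for forms B and C are not specified here.

```python
def permute_colors(state, mapping):
    """Relabel the balls of a tower configuration according to `mapping`."""
    return tuple(tuple(mapping[ball] for ball in rod) for rod in state)

# Hypothetical cyclic permutation of the three ball colors.
mapping = {"R": "Y", "Y": "B", "B": "R"}

start = (("R", "Y"), ("B",), ())
goal = (("B",), ("R",), ("Y",))

# Applying the same permutation to start and goal state yields a visually
# different problem with identical positions of balls on rods.
parallel_start = permute_colors(start, mapping)
parallel_goal = permute_colors(goal, mapping)
print(parallel_start, parallel_goal)
```

Since the movement rules refer only to rod capacities and stacking order, not to ball colors, relabeling start and goal state with the same permutation leaves the set of legal move sequences, and thus the minimum number of moves, unchanged.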

Experimental procedures

Participants of the Mainz sample were tested individually in a quiet air-conditioned room at the Gutenberg Health Study Center of the University Medical Center Mainz. Each participant was instructed by an experienced examiner who was present during the whole testing session. To overcome resistance against computer usage and to facilitate first-time handling of the computer especially in older and inexperienced participants, the Mainz sample was assessed by using a touch-sensitive screen as input device. The duration of the overall test session was limited to 20 min (exclusive of instructions and pauses between problem items) and a time limit of 60 s was specified for completion of each single problem.

The data collection for the Vienna sample was carried out as part of standardization studies at the SCHUHFRIED Test and Research Center and at the Teaching & Research Lab of the University of Vienna. All data were collected in supervised testing sessions with small groups of two to five subjects at maximum, which typically took 60–90 min to complete (including additional tests). Data for the TOL-F were collected with a test version that used a computer mouse as input device. This test version did not contain an overall time limit, but a time limit of 60 s was imposed for completing each trial.

Taken together, data collections in Mainz and Vienna differed in the following three aspects: Individual testing versus small groups, touch screen versus computer mouse as input device, and the application of a 20 min overall time limit versus no such restrictions, respectively.

Dependent measures

For assessment of individual planning ability with the TOL-F, overall planning accuracy, defined as the percentage of problems that were correctly solved in the minimum number of moves, is regarded as the primary outcome variable of interest. It is one of the most commonly used performance measures in disc-transfer planning tasks such as the TOL (Sullivan, Riccio, & Castillo, 2009; see also Berg & Byrd, 2002) and the one yielding the greatest effect size in a meta-analysis on different outcome measures (Sullivan et al., 2009). In addition to overall planning accuracy, the TOL-F also provides information on planning accuracy separately for the different levels of minimum moves, which was used for the present analyses of variance (ANOVAs) of the effects of age, sex, and task difficulty (in terms of the minimum number of moves) on planning ability (see Analyses of Variance).
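As a minimal sketch of how these accuracy scores follow from item-level data, consider the Python example below. The function `planning_accuracy` and the variable names are hypothetical. The assignment of items #01–#08, #09–#16, and #17–#24 to four-, five-, and six-move problems is an assumption based on the description of the problem set, and the scoring of items that were not administered is left to the caller.

```python
def planning_accuracy(solved_optimally, levels):
    """Percentage of items solved in the minimum number of moves.

    solved_optimally: one boolean per item.
    levels: minimum number of moves per item (same order).
    Returns the overall accuracy and a dict of accuracies per level.
    """
    overall = 100.0 * sum(solved_optimally) / len(solved_optimally)
    by_level = {}
    for level in sorted(set(levels)):
        hits = [s for s, l in zip(solved_optimally, levels) if l == level]
        by_level[level] = 100.0 * sum(hits) / len(hits)
    return overall, by_level

# Assumed item order: eight four-, eight five-, and eight six-move problems.
LEVELS = [4] * 8 + [5] * 8 + [6] * 8

# Hypothetical subject: all four-move, half of the five-move, no six-move items.
solved = [True] * 8 + [True] * 4 + [False] * 4 + [False] * 8
print(planning_accuracy(solved, LEVELS))
```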

Secondary variables of the TOL-F concern the total number of problems solved at all (irrespective of whether they are solved in the minimum number), and the initial thinking and movement execution times, which will be reported and discussed elsewhere. Further note that the TOL-F also provides reports on the number and type of observed rule breaks (e.g., trying to select a blocked ball, to place a ball on a blocked rod or outside the tower configuration) which may be informative for clinical use so as to disentangle a patient's impairments in planning ability. However, given their infrequent occurrence in healthy subjects, these data were not subject to the present data analyses.

Data Analyses

Analyses of variance

Mixed ANOVAs on planning accuracy as dependent variable were conducted using IBM SPSS Statistics for Windows (Version 21.0; IBM Corp., Armonk, NY) to test for main effects of the between-subjects factors “Age Groups” and “Sex” and the within-subject factor “Minimum Moves” and their interactions. ANOVAs were separately run for the Mainz and Vienna samples.

Reliability estimates

In accordance with the revised review model for the description and evaluation of psychological and educational tests (Version 4.2.6; http://www.efpa.eu/professional-development/assessment) recently suggested by the Board of Assessments of the European Federation of Psychologists' Associations (EFPA, 2013), the following estimates of reliability were considered for assessing the internal consistency of overall planning accuracy as the primary TOL-F outcome variable: λ2, λ3 (equivalent to Cronbach's α), ωtot, and the greatest lower bound (glb). In addition, λ4 was computed for comparability with the exhaustive split-half reliability estimates reported by Kaller, Unterrainer, and Stahl (2012). All indices were computed using the psych package (Version 1.3.2; Revelle, 2013) for the R open-source statistical software (Version 3.0.1; R Core Team, 2013). Note that, although it is the most commonly reported index, λ3 (Cronbach's α) is only a lower bound to the reliability and often constitutes a gross underestimate (Cortina, 1993; Sijtsma, 2009). Sijtsma (2009), therefore, advocated using glb as a better alternative (Bentler & Woodward, 1980), whereas Revelle and Zinbarg (2009) put forward ωtot (McDonald, 1999).
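The reported indices were obtained with the psych package in R. Purely as an illustration of how two of them relate, the following Python sketch computes Guttman's λ2 and λ3 (Cronbach's α) from an item score matrix using the textbook covariance-based formulas. The function names and the toy data are assumptions made here; ωtot, glb, and λ4 require factor-analytic or optimization procedures and are not covered by this sketch.

```python
from math import sqrt


def covariance_matrix(scores):
    """Sample covariance matrix (n - 1 denominator) of item scores.

    scores: one row per subject, one column per item.
    """
    n, k = len(scores), len(scores[0])
    means = [sum(row[j] for row in scores) / n for j in range(k)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in scores) / (n - 1)
             for j in range(k)] for i in range(k)]


def guttman_lambdas(scores):
    """Return (lambda2, lambda3); lambda3 is identical to Cronbach's alpha."""
    cov = covariance_matrix(scores)
    k = len(cov)
    total_var = sum(sum(row) for row in cov)      # variance of the sum score
    item_var = sum(cov[i][i] for i in range(k))   # sum of the item variances
    off_sq = sum(cov[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
    lambda1 = 1.0 - item_var / total_var
    lambda2 = lambda1 + sqrt(k / (k - 1) * off_sq) / total_var
    lambda3 = k / (k - 1) * lambda1
    return lambda2, lambda3


# Toy data: six hypothetical subjects, three dichotomously scored items.
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
lambda2, lambda3 = guttman_lambdas(toy)
```

By construction λ2 is never smaller than λ3, which is consistent with α being the most conservative of the lower bounds considered here.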

Reliability indices were separately computed for the two overall samples from Mainz and Vienna as well as for the respective age subgroups (cf. Age Groups and Sample Characteristics). Given that adequate reliability estimations require sufficiently large sample sizes of at least 100 and preferably more than 200 subjects per (sub)group (EFPA, 2013), a further subdivision along sex and education level was omitted. In this respect, even the estimates for the different Vienna age subgroups have to be interpreted with some caution. However, as data collection is continuing for the Mainz sample, comprehensive reliability estimates for age, sex, and education-fair subsamples will be provided once the necessary numbers of observations are available.

Results

Effects of Age, Sex, and Minimum Moves on Planning Accuracy

Mainz sample

A mixed ANOVA with the between-subjects factors Age Groups (cf. Age Groups and Sample Characteristics) and Sex (men vs. women) and the within-subject factor Minimum Moves (four-, five-, and six-move problems) on planning accuracy as dependent variable revealed significant main effects of Age Groups (F(7,3754) = 115.44, p < .001, partial η² = 0.177), Sex (F(1,3754) = 114.31, p < .001, partial η² = 0.030), and Minimum Moves (F(2,7508) = 10610.58, p < .001, partial η² = 0.739). Also the two-way interactions of Age Groups by Minimum Moves (F(14,7508) = 12.33, p < .001, partial η² = 0.022) and Sex by Minimum Moves (F(2,6034) = 15.62, p < .001, partial η² = 0.004) reached significance, whereas Age Groups by Sex (F(7,3754) = 0.67, p = .702, partial η² = 0.001) and the three-way interaction were not found to be significant (F(14,7508) = 0.62, p = .847, partial η² = 0.001). Linear contrasts further confirmed that planning accuracy systematically declined in a linear fashion both with increasing age (F(1,3754) = 736.77, p < .001, partial η² = 0.164) and with increasing task complexity in terms of the minimum number of moves for optimal solution (F(1,3754) = 21383.62, p < .001, partial η² = 0.851). Linear contrasts also revealed that the factor Minimum Moves exerted a linear moderator effect on both the main effects of Age Groups (F(7,3754) = 24.74, p < .001, partial η² = 0.044) and of Sex (F(1,3754) = 31.45, p < .001, partial η² = 0.008). That is, effects of age and sex on planning accuracy linearly increased with increasing TOL-F task difficulty and were hence most pronounced in the six-move problems.
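For readers who wish to relate the reported effect sizes to the test statistics, partial η² can be recovered from an F value and its degrees of freedom, since partial η² = SS_effect / (SS_effect + SS_error) and F = (SS_effect / df_effect) / (SS_error / df_error). The short sketch below applies this identity to the three main effects reported above; the function name is an assumption introduced here and is not part of the original analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return f_value * df_effect / (f_value * df_effect + df_error)


# Main effects reported for the Mainz sample (values rounded to three decimals).
age = round(partial_eta_squared(115.44, 7, 3754), 3)       # 0.177
sex = round(partial_eta_squared(114.31, 1, 3754), 3)       # 0.03
moves = round(partial_eta_squared(10610.58, 2, 7508), 3)   # 0.739
```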

The main results in the Mainz sample are illustrated in Fig. 3A (left panel, upper right panel), which also shows that overall planning accuracy approached a normal distribution (Fig. 3A, upper middle panel; M ± SD, 55.93 ± 15.60%) with both “skewness” (−0.208; SE, 0.040) and “kurtosis” (−0.211; SE, 0.080) approximating a value of zero. Moreover, the intended linear increase of task difficulty in the TOL-F was closely reflected by the individual limits of subjects' planning performance (Fig. 3A, lower right panel). Thus, subjects with a low, medium, and high level of planning ability were likely to attain correct solutions in problems up to a level of four, five, and six minimum moves, respectively, but not above. In other words, this important aspect of construct validity of the theory-driven TOL-F problem selection was demonstrated by the observed empirical task difficulty of the minimum move subscales.

Fig. 3.

Overall planning accuracy in the (A) Mainz and (B) Vienna samples. Within figure parts (A and B), the left panel illustrates planning accuracy as a function of age and sex with age groups corresponding to the subdivisions in Table 1. Accuracy values are plotted at the mean age of the respective subgroups; error bars denote the standard error of the mean (SEM). The upper middle panel illustrates the frequency distribution of planning accuracy values across the total sample. The upper right panel illustrates the amplification of age effects as a function of the problems' minimum number of moves. Age groups, that is, increasing age, are coded by the dots' darkening gray values. The lower right panel illustrates subjects' individual planning accuracy in four-, five-, and six-move problems (color coded in terms of number of correct solutions within minimum moves) as a function of their overall performance (solid line). Individual subjects are ordered along the abscissa in ascending order of overall performance, that is the total number (or percentage) of correctly solved problems.

Vienna sample

The same mixed ANOVA as earlier was also conducted for the Vienna sample revealing significant main effects of Age Groups (F(6,816) = 6.08, p < .001, partial η² = 0.043), Sex (F(1,816) = 6.37, p = .012, partial η² = 0.008), and Minimum Moves (F(2,1632) = 1837.20, p < .001, partial η² = 0.692) as well as a significant two-way interaction of Age Groups by Minimum Moves (F(12,1632) = 1.98, p = .012, partial η² = 0.014). The remaining two-way interactions of Age Groups by Sex (F(6,816) = 1.45, p = .194, partial η² = 0.011) and Sex by Minimum Moves (F(2,1632) = 1.18, p = .309, partial η² = 0.001) and the three-way interaction (F(12,1632) = 0.87, p = .581, partial η² = 0.006) failed to reach significance. Linear contrasts again confirmed that planning accuracy linearly decreased with increasing age (F(1,816) = 25.92, p < .001, partial η² = 0.031) and increasing task difficulty in terms of minimum number of moves to solution (F(1,816) = 3422.52, p < .001, partial η² = 0.807) and that the former effect was linearly moderated by the latter (F(6,816) = 2.49, p = .021, partial η² = 0.018). The main results in the Vienna sample are illustrated in Fig. 3B (left panel, upper right panel) and complemented by showing that overall planning accuracy also followed a normal distribution for these data (Fig. 3B, upper middle panel; M ± SD, 61.25% ± 14.31%) with both skewness (−0.303; SE, 0.085) and kurtosis (−0.086; SE, 0.170) approaching a value of zero. Furthermore, task difficulty in the TOL-F and the assessed planning ability at the level of individual subjects were closely linked also for the Vienna sample (Fig. 3B, lower right panel), thus again suggesting construct validity of the TOL-F.
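The standard errors of skewness and kurtosis reported for both samples depend only on sample size. Assuming they were obtained with the conventional formulas that statistical packages commonly report (an assumption made here, as the text does not state the formulas), they can be reproduced from the sample sizes alone (n = 3,770 for Mainz and n = 830 for Vienna). The function names below are hypothetical.

```python
from math import sqrt


def se_skewness(n):
    """Conventional standard error of sample skewness for sample size n."""
    return sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Conventional standard error of sample excess kurtosis for sample size n."""
    return 2.0 * se_skewness(n) * sqrt((n * n - 1.0) / ((n - 3) * (n + 5)))


# Rounded to three decimals, as in the text.
mainz = (round(se_skewness(3770), 3), round(se_kurtosis(3770), 3))    # (0.04, 0.08)
vienna = (round(se_skewness(830), 3), round(se_kurtosis(830), 3))     # (0.085, 0.17)
```

With samples of this size the standard errors are small, so even modest departures from zero can exceed twice their standard error; the distributional shape in Fig. 3 is therefore the more informative guide.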

Reliability Estimates for Overall Planning Accuracy

A comprehensive overview of the reliability estimates is provided in Table 1. The five different estimates of reliability ranged between 0.713 and 0.755 for the overall Mainz sample, but were slightly lower (∼0.05 units) for the overall Vienna sample (0.656–0.730). In both the overall samples and in the respective age groups, estimates were highest for glb and λ4, whereas λ3 or Cronbach's α yielded the lowest estimate in all cases. Notably, glb converged for the Mainz and the Vienna overall samples to nearly identical estimates of 0.755 and 0.730, respectively. Taken together, these results suggest that the TOL-F features an adequate and satisfactory reliability with estimates based on glb and λ4 always exceeding 0.7 for the two samples as well as for all age-related subsamples (Table 1).

Total Test Durations and Test Cancellations

The 20-min time limit for the TOL-F assessments in the Mainz sample (see Task Description and Experimental Procedures) was exclusive of the instructions and self-determined pauses between problem items so as to ensure a comparable overall problem-related processing time for all subjects. However, for evaluating the TOL-F's practicability in clinical contexts, the total test durations (inclusive of instructions and pauses) are more informative and are therefore provided descriptively in Fig. 4 for the overall samples from Mainz and Vienna as well as for the respective age cohorts. Overall medians (Md) and distributions (percentile ranks, PR) of the total durations were highly comparable in both samples with TOL-F test administrations lasting between 13 min (PR 15) and 20 min (PR 85) in the majority of the Mainz subjects (Md = 17 min) and between 12 and 20 min in the majority of the Vienna subjects (Md = 16 min). Furthermore, as Fig. 4 shows, total test durations of TOL-F administrations increased with age.

Fig. 4.

Total duration of the test administrations (inclusive of instructions and pauses between problem items) in the (A) Mainz and (B) Vienna samples. The upper panels provide a histogram of the total test durations and the respective percentile ranks (PR). The lower panels illustrate the distributions and medians for the individual age groups with darker gray value indicating higher number of subjects. In this respect, black and white reflect proportions of 20% and 0% of a given age group, whereas gray values represent the linear transition between both.

Likewise, the percentage of test cancellations due to the applied criteria of three subsequent timeouts (both samples) and the 20-min overall time limit (Mainz sample only) rose with increasing age, particularly from 40 years onwards (Fig. 5). Interestingly, the pattern for the three subsequent timeouts limit appears to mark two stages (around problem items #13 and #20) with an increased frequency of test cancellations with older age, which may reflect declining capacities for efficiently planning ahead with increasing age (Fig. 5; see also Discussion).

Fig. 5.

Test cancellations and timeouts in the (A) Mainz and (B) Vienna samples. The upper panels provide an overview on the percentage of test cancellations per age group following the applied criteria of three subsequent timeouts (both samples) and the 20-min overall time limit (exclusive of instructions and pauses between problem items, Mainz sample only). The frequency of test cancellations due to the three subsequent timeouts limit separately for the individual problem items is illustrated in the middle panels, whereas the lower panel illustrates the frequency of test cancellations following the 20-min overall time limit (Mainz sample only). Darker gray values indicate a higher percentage of subjects whose testing was cancelled at a given problem due to the earlier criteria for cancellation. In this respect, black and white reflect proportions of 10% and 0% of a given age group, whereas gray values represent the linear transition between both.

Discussion

The main objective of the present study was to evaluate the reliability of assessing planning ability across the lifespan with the TOL (Freiburg version; TOL-F), whereas the second objective was to assess age- and sex-related differences in planning ability.

With regard to the first objective, reliable assessment of EF and related complex cognitive abilities is often difficult to achieve due to the multifaceted nature of EF, their inherent task impurity, and reliance on novel situations that elicit nonroutine behavior (Rabbitt, 1997; Strauss et al., 2006). Therefore, EF tasks need to be sufficiently complex, while at the same time covering a broad range of difficulty, and to comprise a relatively large number of items to yield valid and reliable estimates at the first administration. This, however, is often not achievable in clinical routine, where easy-to-administer, short tests are required. As a result, extant TOL versions for use in clinical settings have mostly yielded insufficient or inconsistent reliability estimates (e.g., Humes et al., 1997; Lowe & Rabbitt, 1998; Syväoja et al., 2015; but see Culbertson & Zillmer, 1998a, 1998b) and provoked criticism of the validity of the TOL (Kafer & Hunter, 1997). In contrast, here it was demonstrated that split-half reliability and internal consistency of TOL-F accuracy were adequate overall as well as across a wide age range: the different reliability estimates ranged between 0.713 and 0.755 for the overall Mainz sample and were only slightly lower (∼0.05 units) for the overall Vienna sample (0.656–0.730). Moreover, despite differences in test administration (touch screen vs. computer mouse; overall time limit of 20 min vs. no time limit; see Methods), reliability estimates were highly comparable between the two samples. For instance, the glb converged to almost identical estimates of 0.755 and 0.730 for the overall Mainz and Vienna samples, respectively. Furthermore, reliability was found to be stable across different age groups, with glb and λ4 exceeding 0.7 for all subgroups. In contrast, Cronbach's α (Cronbach, 1951) yielded the lowest estimate in all cases.
This is in line with suggestions that Cronbach's α, although still the most commonly reported reliability metric, may not be the most appropriate measure and will underestimate the true reliability in many cases, and that ωtot and glb should be used as better lower-bound indices of reliability (Peters, 2014; Revelle & Zinbarg, 2009; Sijtsma, 2009). For reasons of comparability, we nonetheless also report Cronbach's α, which reached 0.713 at least in the Mainz sample and thus indicated sufficiently high reliability of the item set, clearly exceeding previously reported data from nonoptimized problem sets (see Humes et al., 1997; see also Kafer & Hunter, 1997). Moreover, current reliability estimates also exceed those of a recent empirically derived four-disk TOL version, which yielded maximal values of ωtot = 0.64, glb = 0.67, and λ4 = 0.65 in children, but lower estimates in adults (Tunstall et al., 2014). That is, using indices concurring with newly developed research criteria (EFPA, 2013), reliability estimates of accuracy of the TOL-F problem set originally composed on the basis of theoretical problem space analyses were found to be satisfactory and considerably higher than those for TOL versions based on empirical item selection. The TOL-F can, therefore, be regarded as a planning test suitable for adults, to which the four-disk TOL version by Tunstall and colleagues (2014) may represent a valuable complementary test for children.

Furthermore, the TOL-F yielded a broad range of performance even in healthy participants, with mean accuracy ranging approximately between 40% and 70% (see Fig. 3A and B, left panels). More specifically, five- and six-move problems contributed most to performance differences of the adults tested here, whereas four-move problems did not provide much variance. In clinical populations, however, it can be expected that four-move problems will account for additional variance, yielding even broader ranges of performance than found in the present sample (cf. Köstering, Schmidt, et al., 2015). As originally envisaged, the TOL-F was further shown to be a “graded difficulty test” (Shallice, 1982, p. 204), where low-performing subjects gradually fail to solve problems at more demanding levels, whereas high-performing ones reliably solve easier as well as more difficult problems (Fig. 3A and B, lower right panels), which also attests to the TOL-F's construct validity. In close relation, recent analyses of the TOL-F problem set using the framework of Item Response Theory (cf. Hambleton & Swaminathan, 1985) underlined its construct validity for measuring planning ability in terms of a psychometrically unidimensional trait (Debelak, Egle, Köstering, & Kaller, 2015), and the test–retest reliability of TOL-F accuracy, measured as the relative consistency and absolute agreement of individual performance over time, was also found to be adequate (Köstering, Nitschke, et al., 2015). Based on these and the present results, we propose the TOL-F as a clinically feasible, yet reliable test of complex cognition.

It has to be noted, however, that current results for the TOL-F do not necessarily generalize to other TOL versions that do not feature a systematic manipulation of structural determinants of problem difficulty. That is, as the difficulty of individual TOL problems is substantially determined by structural problem parameters beyond the minimum moves such as goal hierarchy, search depth, and the number of optimal paths to solution (e.g., Berg et al., 2010; Kaller et al., 2004; Ward & Allport, 1997; for an extensive discussion, see Kaller, Rahm, Köstering, et al., 2011), it was hence an integral part of the present theoretically grounded approach to control for these influences by keeping them at a constant level within and across the minimum number of moves (cf. Kaller, Rahm, Köstering, et al., 2011). In addition, accounting for structural problem parameters has also enabled examination of their specific impact in neurological patients (Köstering, McKinlay, Stahl, & Kaller, 2012; McKinlay et al., 2008; Rainville, Lepage, Gauthier, Kergoat, & Belleville, 2012; see also Andrews, Halford, Chappell, Maujean, & Shum, 2014) as well as in developmental studies extending the common knowledge about planning trajectories (Kaller, Rahm, Spreer, Mader, & Unterrainer, 2008; Köstering, Stahl, et al., 2014; Unterrainer et al., 2013, 2015, in press), and also provided new insights into the neural foundation of planning ability using functional and structural brain imaging (Kaller et al., 2015; Kaller, Heinze, et al., 2012; Kaller, Rahm, Spreer, Weiller, & Unterrainer, 2011; Newman, Greco, & Lee, 2009; Ruh, Rahm, Unterrainer, Weiller, & Kaller, 2012). Thus, carefully selecting TOL items with regard to their structural problem parameters is both critical for adequate psychometric properties and useful for a more specific characterization of planning ability in healthy and clinical populations.

With respect to the second objective, this is the largest study to date to assess both age- and sex-related differences in TOL performance in healthy adults. This study is, therefore, an important complement to a recent study on TOL performance from childhood to young adult age based on n ∼ 900 participants (Albert & Steinberg, 2011). More specifically, in line with previous studies (e.g., Allamanno et al., 1987; De Luca et al., 2003; Köstering, Stahl, et al., 2014; Peña-Casanova et al., 2009; Robbins et al., 1998), in the Mainz sample, there was an age-related decrease in planning ability, which amounted to a decrement in accuracy of about 20% from age 40 to 80 (Fig. 3A). As is evident in the Vienna sample (Fig. 3B), performance remained at a stable level from ages 16 to 35 with a decrease beginning at about 45 years. Thus, taking together results from both samples, age-related performance was found to be comparably stable from late adolescence/early adulthood until the fifth decade of life and to decrease from then on in an almost linear trajectory. Important for further diagnostic considerations, the age-related decline was strongly tied to problem difficulty, as the ability to correctly solve more difficult five- and in particular six-move problems decreased with increasing age (Fig. 3). This indicates that at older ages, planning per se still worked for easier tasks, but limitations became obvious when longer sequences of moves had to be generated and evaluated. Recently, it has been shown that age-related differences in TOL planning accuracy of adults aged 60 years and older are driven by concomitant differences in fluid reasoning and, to a lesser extent, in working memory capacity, but not by a general slowing in processing speed (Köstering, Leonhart, Stahl, Weiller, & Kaller, 2014).
As fluid reasoning and working memory (capacity) are known to be linked to planning performance (e.g., Bugg et al., 2006; Gilhooly, Wynn, Phillips, Logie, & Della Sala, 2002; Phillips, Wynn, Gilhooly, Della Sala, & Logie, 1999; Unterrainer et al., 2004; Zook, Davalos, DeLosh, & Davis, 2004; Zook et al., 2006), whereas processing speed is an unspecific function known to affect a range of cognitive domains (Salthouse, 1996, 2000), this further argues for an age-related deterioration in the core planning processes from mid-adulthood onward as evidenced here. Nonetheless, one caveat remains: because data acquisition was cross-sectional in both samples, the present age effects may at least partly be driven by cohort and period effects and hence be overestimated to some extent.

Another issue that has to be discussed is the observed sex difference. As depicted in Fig. 3A and B, this phenomenon was observable in both samples independently and indicated that men outperformed women by about 5%. Although the sex by age interaction was not significant in either sample, in the Vienna data, it is apparent on a descriptive level that men performed better than women from about 45 years onward. In the Mainz sample, the sex difference was found across all age groups from 40 to 80 years. Previous reports on sex differences in planning ability using tower tasks have been very inconsistent, with few studies revealing significant effects (De Luca et al., 2003; Peña-Casanova et al., 2009; Rönnlund et al., 2001; Unterrainer et al., 2013). In contrast, present results can be taken to reveal that sex effects on planning performance do indeed exist, albeit only translating into a modest difference in accuracy. Furthermore, the fifth decade of life possibly marks a critical age during adulthood at which these sex differences begin to emerge. Interestingly, this closely parallels the trajectory of age-related decrements in planning accuracy, which also became consistently evident from 40 years of age onward. That is, sex differences may have been overlooked in previous studies because of the often limited sample size and the age of participants. In line with the latter explanation, of the four studies that did find significant differences between adult male and female participants, two studies tested samples in the age range from about 35 to 90 years (Peña-Casanova et al., 2009; Rönnlund et al., 2001), thus roughly concurring with the age range for which sex differences were evident here, whereas the remaining studies reported sex effects for the whole sample aged 8–64 years (De Luca et al., 2003) and for preschoolers (Unterrainer et al., 2013).

One explanation for the sex differences in planning accuracy might be the widely reported finding that men perform better on visuospatial tasks than women (Halpern, 1997; Kimura, 1992). However, this cannot fully account for the interaction of sex and minimum number of moves found for the Mainz sample, which revealed that differences between men and women increased as a function of problem difficulty. Although more moves and problem states have to be visualized for more complex problems, the nature of what has to be visualized does not change. Therefore, increased problem difficulty is likely to result in increased demands on working memory (more information to actively maintain and manipulate) and relational reasoning (more move interdependencies to consider), rather than in a qualitative difference in basic visuospatial processing. Thus, sex differences are more likely to arise from differences in the planning processes per se, that is, in the mental simulation and evaluation of moves and their interdependencies. In this regard, sex differences favoring men are relatively consistently reported for visuospatial working memory (e.g., Cansino et al., 2013; De Luca et al., 2003; Kaufman, 2007; Lejbak, Crossley, & Vrbancic, 2011; Pauls, Petermann, & Lepach, 2013; for a review, see Wang & Carr, 2014). Sex effects on fluid reasoning are more equivocal (Lynn & Kanazawa, 2011; Plaisted, Bell, & Mackintosh, 2011; Savage-McGlynn, 2012; Zook et al., 2006), but in a meta-analysis, a small yet significant advantage for adult men on the Raven matrices was found (Lynn & Irwing, 2004). Hence, sex-related differences in planning ability could also be driven by concomitant differences in visuospatial working memory and relational reasoning in the visuospatial domain.

In close relation to this, steroid hormone levels follow different trajectories over the male versus female adult life span. Whereas testosterone levels decrease gradually in men from young adulthood onward (Mooradian & Korenman, 2015), estrogen levels in women decline markedly during the menopause—typically starting in the fifth or sixth decade of life—after relatively stable levels in the reproductive age range. Animal studies have established that the prefrontal cortex (PFC) of the brain and prefrontally mediated cognitive functions are particularly sensitive to the detrimental effects of estrogen depletion (Kritzer & Kohama, 1998; Lacreuse, Wilson, & Herndon, 2002; Tinkler, Tobin, & Voytko, 2004), especially the dorsolateral PFC (Kritzer & Kohama, 1998), which is critical for planning ability and working memory functions. In line with this, cognitive deficits in working memory and other executive tasks have been found in different stages of menopause (Weber, Mapstone, Staskiewicz, & Maki, 2012; Weber, Rubin, & Maki, 2013) and in menopausal women without versus with hormonal replacement therapy (Keenan, Ezzat, Ginsburg, & Moore, 2001). Furthermore, women's task-related brain activity in working memory tasks is sensitive to manipulations in estrogen levels (Epperson, Amin, Ruparel, Gur, & Loughead, 2012; Jacobs & D'Esposito, 2011; Li et al., 2015). Hence, sex-related differences in planning ability that were especially pronounced from the fifth decade onward (Fig. 3) could further be driven by detrimental effects of menopause-related decreases in hormone levels on visuospatial working memory (and fluid reasoning) processes in women. 
Irrespective of the specific processes underlying the sexually dimorphic pattern of planning ability, here it was demonstrated that sex differences in planning accuracy are consistently evident from the fifth decade of life onward, thus resolving equivocal evidence from previous studies, and should be taken into account when establishing normative data.

Notwithstanding the uniquely large and population-representative data presented here, some limitations have to be considered. Collecting large data sets is highly desirable but almost always comes with certain trade-offs, for example, group testing (Vienna) or overall time limits (Mainz). Because all participants were tested using the identical computer program in quiet and standardized settings, differences between group and single testing should be negligible. Furthermore, despite the predetermined overall time limit of 20 min (exclusive of instructions and self-determined pauses between problem items) in the Mainz sample, only a small number of participants (5.8%) did not complete the test within the 20-min processing time (Fig. 5A), indicating that the given time window was sufficient in the majority of cases. In comprehensive neuropsychological examinations, the time available for single tests is typically severely restricted, and thus many researchers and clinicians might even favor and benefit from reliable data acquisition with subtle time constraints that match a realistic testing situation.

When inspecting the pattern of test cancellations in more detail (Fig. 5), one has to differentiate between cancellations due to three consecutive timeouts (in both samples) and due to the 20-min overall time limit (in the Mainz sample only). With regard to the three consecutive timeouts, there is a marked pattern of “peaks” in the frequency at items #13 and #20 (especially in the Mainz sample; Fig. 5). These two items are the fifth and fourth item presented of the five-move and six-move problems, respectively. That is, when the test is cancelled, subjects have either worked on the first five 5-move-items and consecutively failed to solve them, or worked on the first four 6-move-items and consecutively failed them. Importantly, as there are four combinations of the problem parameters search depth and goal hierarchy that are presented in these first four items of each minimum move length (and repeated in the second set of four items), subjects have worked on each possible combination of minimum moves, search depth, and goal hierarchy (cf. Table 2). Thus, the test cancellations are most likely the result of an overall limit of participants' planning ability, rather than a failure to cope with a specific (or arbitrary) combination of problem parameters.
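The cancellation criterion discussed here can be made concrete with a minimal sketch. It assumes that testing stops immediately after the third consecutive timeout and that items are numbered consecutively in presentation order; the function name and the example sequence are hypothetical and do not reproduce the TOL-F software.

```python
def cancellation_item(timeouts, limit=3):
    """Return the 1-based item number at which testing is cancelled, or None.

    timeouts: sequence of booleans in presentation order (True = item timed out).
    limit: number of consecutive timeouts that triggers cancellation.
    """
    run = 0
    for item_number, timed_out in enumerate(timeouts, start=1):
        run = run + 1 if timed_out else 0   # any non-timeout resets the run
        if run == limit:
            return item_number
    return None


# Hypothetical subject who times out on items 11, 12, and 13.
subject = [False] * 10 + [True, True, True]
stopped_at = cancellation_item(subject)    # 13
```

Under this rule, an isolated timeout followed by a completed item does not count toward cancellation, so peaks at particular items reflect sustained difficulty rather than single lapses.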

In this regard, there seems to be a two-stage pattern of limitations in planning ability, with the first becoming obvious already at the level of five minimum moves, whereas the second becomes obvious at six minimum moves. That is, there is a proportion of subjects with low performance for whom the five-move problems are already a significant challenge. Without the cancellation after three subsequent timeouts, these subjects would most probably be very frustrated at having to work through the rest of the five-move and the even more difficult six-move problems; the test cancellation thus spares them this frustration and avoids unnecessary testing. Importantly, this pattern is not (entirely) age related, but also occurs in the “younger” groups of mid-adulthood from 40 years of age onward (and only begins to rise substantially after 65 years of age), so it cannot be exhaustively explained by an age-related slowing of processing speed, but is rather reflective of lower planning ability across all stages of mid-adulthood. The second stage where a proportion of subjects reaches three consecutive timeouts occurs at the most difficult problems (i.e., with six minimum moves), and this pattern of timeouts seems to be age related in that the frequency of subjects increases for the older age groups, especially from 65 years of age onward. Again, as the test cancellation occurs after the first four items with six minimum moves have been worked on, this also seems to reflect the effect of age-related decrements in planning ability and in dealing with the most complex problems with six minimum moves.

With regard to test cancellations due to the 20-min overall time limit in the Mainz sample, there is, by the very nature of this time limit, no clear pattern regarding the specific items at which the cancellation occurs, but there is a marked age-related increase in test cancellations. Besides reaching the individual limits in planning ability (as discussed earlier), this age-related increase may also be attributed to an unspecific decrease in processing speed with older age that is unrelated to participants’ planning ability (but see Köstering, Leonhart, et al., 2014). The time limit might hence have led to an overestimation of the age-related decrease in planning ability in the Mainz sample, which would also explain the lower accuracy of the older age groups compared with the corresponding age groups in the Vienna sample. Importantly, however, a similar age-related decrease in planning ability was found in the Vienna sample, so the overall time limit in the Mainz sample did not change the pattern of age-related performance differences per se, only its extent.
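To make the two cancellation rules discussed above concrete, the following minimal Python sketch shows how the item and the reason of a cancellation could be derived from a per-item log. It is an illustration only and not the TOL-F implementation: the `Trial` record, the function name `cancellation_point`, the order in which the two rules are checked, and the assumption that only time spent on items counts toward the overall limit are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Trial:
    """Hypothetical per-item log entry (not a TOL-F data format)."""
    item: int        # 1-based position of the item in the problem set
    seconds: float   # time spent on the item
    timed_out: bool  # True if the item ended with a timeout


def cancellation_point(
    trials: Iterable[Trial],
    max_consecutive_timeouts: int = 3,
    overall_limit_s: Optional[float] = None,
) -> Optional[Tuple[int, str]]:
    """Return (item, reason) at which testing stops, or None if it completes.

    Assumption: the timeout rule is checked before the overall time limit,
    and the overall limit counts item time only.
    """
    streak = 0     # current run of consecutive timeouts
    elapsed = 0.0  # cumulative time spent on items so far
    for trial in trials:
        elapsed += trial.seconds
        streak = streak + 1 if trial.timed_out else 0
        if streak >= max_consecutive_timeouts:
            return trial.item, "consecutive_timeouts"
        if overall_limit_s is not None and elapsed >= overall_limit_s:
            return trial.item, "overall_time_limit"
    return None


# Hypothetical run of 24 items with timeouts on items 11, 12, and 13.
run = [Trial(i, 30.0, i in (11, 12, 13)) for i in range(1, 25)]
print(cancellation_point(run))  # (13, 'consecutive_timeouts')
```

Passing `overall_limit_s=1200.0` would emulate a 20-min limit as applied in the Mainz sample, whereas leaving it at `None` corresponds to a setting with the timeout rule only. Because the overall limit depends on cumulative time rather than on any particular problem, cancellations of this kind are not tied to specific items, in line with the absence of a clear item pattern noted above.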

Finally, a recent study comparing manual and computerized administrations of the TOL (using a balanced problem set similar to the present one) did not reveal any significant differences in planning performance in healthy young adults (McKinlay & McLellan, 2011). A similar study on patients with autism spectrum disorder yielded the same result, concluding that computer-administered and experimenter-administered versions can be used interchangeably (Williams & Jarrold, 2013). However, particularly for clinical applications, it might be relevant whether a manual administration leads to a higher number of rule violations than a computerized administration, given that such rule violations could help to better understand and characterize a patient's impairments in planning. Although the TOL-F does not allow rules to be broken, it (i) records any attempts to do so and (ii) provides data on the number and type of observed rule breaks (see also Dependent Measures). Thus, whether the rule breaks of patients differ in quality and/or quantity between manual and computerized task versions remains a subject for future research, but clinically valuable information on this matter can be derived from the TOL-F.

A further, partly related question concerns the extent to which the present findings on the TOL-F's psychometric properties generalize to clinical populations. Initial data on patients with ischemic stroke (n = 60), Parkinson's syndrome (n = 51), and mild cognitive impairment (n = 29) indicate that the TOL-F constitutes a reliable and valid measure of planning ability in these groups of neurologically impaired patients (Köstering, Schmidt, et al., 2015). However, as sample sizes were small and within-sample heterogeneity can generally be expected to be larger in patient studies, these findings have to be consolidated by more comprehensive data collections. In this respect, psychiatric populations with well-known planning impairments, such as patients with schizophrenia, depression, obsessive-compulsive disorder, attention deficit hyperactivity disorder, and autism spectrum disorder, should also be considered in future assessments of the TOL-F's clinical utility.

Conclusion

Using a theoretically grounded problem set that accounts for structural properties beyond the minimum number of moves, planning accuracy on the TOL-F possesses adequate psychometric properties that are stable across the adult life span. At the same time, TOL-F accuracy covers a broad range of graded difficulty even in healthy adults, which makes the task suitable for both research and clinical application. Furthermore, an age-related decrease in planning accuracy as well as significant differences between men and women emerge from the fifth decade onward.

Funding

The Gutenberg Health Study is funded through the government of Rhineland-Palatinate (“Stiftung Rheinland-Pfalz für Innovation,” contract AZ 961-386261/733), the research programs “Wissen schafft Zukunft” and “Center for Translational Vascular Biology (CTVB)” of the Johannes Gutenberg University of Mainz, and its contract with Boehringer Ingelheim and PHILIPS Medical Systems, including an unrestricted grant for the Gutenberg Health Study. This study was also supported by the intramural grant program “MAIFOR” of the University Medical Center of the Johannes Gutenberg University Mainz.

CPK and LK are supported by the BrainLinks-BrainTools Cluster of Excellence funded by the German Research Foundation (DFG; grant # EXC 1086). LK is further supported by scholarship funds from the State Graduate Funding Program of Baden-Württemberg, Germany. PSW is funded by the Federal Ministry of Education and Research (BMBF 01EO1003) and received honoraria for lectures or consulting from Boehringer Ingelheim and Bayer HealthCare, Leverkusen.

Conflict of Interest

CPK and JMU declare that they receive a share of the TOL-F license fees from SCHUHFRIED GmbH for their authorship of the published TOL-F test materials (Kaller, Unterrainer, et al., 2012). RD and JE are employees of SCHUHFRIED GmbH, which maintains and commercially distributes the computerized TOL-F as part of the Vienna Test System. All other authors report no conflicts of interest.

References

Albert D., Steinberg L. (2011). Age differences in strategic planning as indexed by the Tower of London. Child Development, 82, 1501–1517.

Allamanno N., Della Sala S., Laiacona M., Pasetti C., Spinnler H. (1987). Problem solving ability in aging and dementia: Normative data on a non-verbal test. The Italian Journal of Neurological Sciences, 8, 111–119.

Andrews G., Halford G. S., Chappell M., Maujean A., Shum D. H. K. (2014). Planning following stroke: A relational complexity approach using the Tower of London. Frontiers in Human Neuroscience, 8, 1–14.

Bentler P. M., Woodward J. A. (1980). Inequalities among lower bounds to reliability: With applications to test construction and factor analysis. Psychometrika, 45, 249–267.

Berg W. K., Byrd D. L. (2002). The Tower of London spatial problem-solving task: Enhancing clinical and research implementation. Journal of Clinical and Experimental Neuropsychology, 24, 586–604.

Berg W. K., Byrd D. L., McNamara J. P. H., Case K. (2010). Deconstructing the tower: Parameters and predictors of problem difficulty on the Tower of London task. Brain and Cognition, 72, 472–482.

Bugg J. M., Zook N. A., DeLosh E. L., Davalos D. B., Davis H. P. (2006). Age differences in fluid intelligence: Contributions of general slowing and frontal decline. Brain and Cognition, 62, 9–16.

Cansino S., Hernández-Ramos E., Estrada-Manilla C., Torres-Trejo F., Martínez-Galindo J. G., Ayala-Hernández M. et al. (2013). The decline of verbal and visuospatial working memory across the adult life span. Age, 35, 2283–2302.

Carder H., Handley S., Perfect T. (2004). Deconstructing the Tower of London: Alternative moves and conflict resolution as predictors of task performance. The Quarterly Journal of Experimental Psychology. A, Human Experimental Psychology, 57, 1459–1483.

Chan R. C. K., Shum D., Toulopoulou T., Chen E. Y. H. (2008). Assessment of executive functions: Review of instruments and identification of critical issues. Archives of Clinical Neuropsychology, 23, 201–216.

Cortina J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104.

Cronbach L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.

Culbertson W., Zillmer E. (1998a). The construct validity of the Tower of LondonDX as a measure of the executive functioning of ADHD children. Assessment, 5, 215–226.

Culbertson W., Zillmer E. (1998b). The Tower of LondonDX: A standardized approach to assessing executive functioning in children. Archives of Clinical Neuropsychology, 13, 285–301.

Debelak R., Egle J., Köstering L., Kaller C. P. (2015). Assessment of planning ability: Psychometric analyses on the unidimensionality and construct validity of the Tower of London Task (TOL-F). Neuropsychology, Advance online publication.

De Luca C. R., Wood S. J., Anderson V., Buchanan J.-A., Proffitt T. M., Mahony K. et al. (2003). Normative data from the CANTAB. I: Development of executive function over the lifespan. Journal of Clinical and Experimental Neuropsychology, 25, 242–254.

Diamond A. (2013). Executive functions. Annual Review of Psychology, 64, 135–168.

EFPA. (2013). EFPA review model for the description and evaluation of psychological and educational tests. Version 4.2.6. Retrieved from http://www.efpa.eu/professional-development/assessment

Epperson C. N., Amin Z., Ruparel K., Gur R., Loughead J. (2012). Interactive effects of estrogen and serotonin on brain activation during working memory and affective processing in menopausal women. Psychoneuroendocrinology, 37, 372–382.

Gilhooly K. J., Wynn V., Phillips L. H., Logie R. H., Sala S. D. (2002). Visuo-spatial and verbal working memory in the five-disc Tower of London task: An individual differences approach. Thinking & Reasoning, 8, 165–178.

Halpern D. F. (1997). Sex differences in intelligence. Implications for education. The American Psychologist, 52, 1091–1102.

Hambleton R. K., Swaminathan H. (1985). Item response theory: Principles and applications. Boston: Kluwer.

Humes G., Welsh M., Retzlaff P. (1997). Towers of Hanoi and London: Reliability and validity of two executive function tasks. Assessment, 4, 249–257.

Jacobs E., D'Esposito M. (2011). Estrogen shapes dopamine-dependent cognitive processes: Implications for women's health. The Journal of Neuroscience, 31, 5286–5293.

Kafer K. L., Hunter M. (1997). On testing the face validity of planning/problem-solving tasks in a normal population. Journal of the International Neuropsychological Society, 3, 108–119.

Kaller C. P., Heinze K., Mader I., Unterrainer J. M., Rahm B., Weiller C. et al. (2012). Linking planning performance and gray matter density in mid-dorsolateral prefrontal cortex: Moderating effects of age and sex. NeuroImage, 63, 1454–1463.

Kaller C. P., Rahm B., Köstering L., Unterrainer J. M. (2011). Reviewing the impact of problem structure on planning: A software tool for analyzing tower tasks. Behavioural Brain Research, 216, 1–8.

Kaller C. P., Rahm B., Spreer J., Mader I., Unterrainer J. M. (2008). Thinking around the corner: The development of planning abilities. Brain and Cognition, 67, 360–370.

Kaller C. P., Rahm B., Spreer J., Weiller C., Unterrainer J. M. (2011). Dissociable contributions of left and right dorsolateral prefrontal cortex in planning. Cerebral Cortex, 21, 307–317.

Kaller C. P., Reisert M., Katzev M., Umarova R., Mader I., Hennig J. et al. (2015). Predicting planning performance from structural connectivity between left and right mid-dorsolateral prefrontal cortex: Moderating effects of age during postadolescence and midadulthood. Cerebral Cortex, 25, 869–883.

Kaller C. P., Unterrainer J. M., Kaiser S., Weisbrod M., Aschenbrenner S. (2012). Tower of London – Freiburg version. Mödling: Schuhfried.

Kaller C. P., Unterrainer J. M., Rahm B., Halsband U. (2004). The impact of problem structure on planning: Insights from the Tower of London task. Cognitive Brain Research, 20, 462–472.

Kaller C. P., Unterrainer J. M., Stahl C. (2012). Assessing planning ability with the Tower of London task: Psychometric properties of a structurally balanced problem set. Psychological Assessment, 24, 46–53.

Kaufman S. B. (2007). Sex differences in mental rotation and spatial visualization ability: Can they be accounted for by differences in working memory capacity? Intelligence, 35, 211–223.

Keenan P. A., Ezzat W. H., Ginsburg K., Moore G. J. (2001). Prefrontal cortex as the site of estrogen's effect on cognition. Psychoneuroendocrinology, 26, 577–590.

Kimura D. (1992). Sex differences in the brain. Scientific American, 267, 118–125.

Köstering L., Leonhart R., Stahl C., Weiller C., Kaller C. P. (2014). Planning decrements in healthy aging: Mediation effects of fluid reasoning and working memory capacity. The Journals of Gerontology, Series B: Psychological Sciences and Social Sciences, August 26, 2014 [Epub ahead of print].

Köstering L., McKinlay A., Stahl C., Kaller C. P. (2012). Differential patterns of planning impairments in Parkinson's disease and sub-clinical signs of dementia? A latent-class model-based approach. PLoS One, 7, e38855.

Köstering L., Nitschke K., Schumacher F. K., Weiller C., Kaller C. P. (2015). Test-retest reliability of the Tower of London planning task (TOL-F). Psychological Assessment, 27, 925–931.

Köstering L., Schmidt C. S. M., Egger K., Amtage F., Peter J., Klöppel S. et al. (2015). Assessment of planning performance in clinical samples: Reliability and validity of the Tower of London task (TOL-F). Neuropsychologia, 75, 646–655.

Köstering L., Stahl C., Leonhart R., Weiller C., Kaller C. P. (2014). Development of planning abilities in normal aging: Differential effects of specific cognitive demands. Developmental Psychology, 50, 293–303.

Krikorian R., Bartok J., Gay N. (1994). Tower of London procedure: A standard method and developmental data. Journal of Clinical and Experimental Neuropsychology, 16, 840–850.

Kritzer M. F., Kohama S. G. (1998). Ovarian hormones influence the morphology, distribution, and density of tyrosine hydroxylase immunoreactive axons in the dorsolateral prefrontal cortex of adult rhesus monkeys. The Journal of Comparative Neurology, 395, 1–17.

Lacreuse A., Wilson M. E., Herndon J. G. (2002). Estradiol, but not raloxifene, improves aspects of spatial working memory in aged ovariectomized rhesus monkeys. Neurobiology of Aging, 23, 589–600.

Lejbak L., Crossley M., Vrbancic M. (2011). A male advantage for spatial and object but not verbal working memory using the n-back task. Brain and Cognition, 76, 191–196.

Li K., Huang X., Han Y., Zhang J., Lai Y., Yuan L. et al. (2015). Enhanced neuroactivation during working memory task in postmenopausal women receiving hormone therapy: A coordinate-based meta-analysis. Frontiers in Human Neuroscience, 9, 1–9.

Lowe C., Rabbitt P. (1998). Test/re-test reliability of the CANTAB and ISPOCD neuropsychological batteries: Theoretical and practical issues. Neuropsychologia, 36, 915–923.

Luciana M., Collins P. F., Olson E. A., Schissel A. M. (2009). Tower of London performance in healthy adolescents: The development of planning skills and associations with self-reported inattention and impulsivity. Developmental Neuropsychology, 34, 461–475.

Lynn R., Irwing P. (2004). Sex differences on the progressive matrices: A meta-analysis. Intelligence, 32, 481–498.

Lynn R., Kanazawa S. (2011). A longitudinal study of sex differences in intelligence at ages 7, 11 and 16 years. Personality and Individual Differences, 51, 321–324.

McDonald R. P. (1999). Test theory: A unified treatment. Hillsdale: Erlbaum.

McKinlay A., Kaller C. P., Grace R. C., Dalrymple-Alford J. C., Anderson T. J., Fink J. et al. (2008). Planning in Parkinson's disease: A matter of problem structure? Neuropsychologia, 46, 384–389.

McKinlay A., McLellan T. (2011). Does mode of presentation affect performance on the Tower of London task? Clinical Psychologist, 15, 63–68.

Mooradian A. D., Korenman S. G. (2015). Management of the cardinal features of andropause. American Journal of Therapeutics, 13, 145–160.

Newman S. D., Greco J. A., Lee D. (2009). An fMRI study of the Tower of London: A look at problem structure differences. Brain Research, 1286, 123–132.

Newman S. D., Pittman G. (2007). The Tower of London: A study of the effect of problem structure on planning. Journal of Clinical and Experimental Neuropsychology, 29, 333–342.

Owen A. M., Downes J. J., Sahakian B. J., Polkey C. E., Robbins T. W. (1990). Planning and spatial working memory following frontal lobe lesions in man. Neuropsychologia, 28, 1021–1034.

Owen A. M., Sahakian B. J., Hodges J. R., Summers B. A., Polkey C. E., Robbins T. W. (1995). Dopamine-dependent frontostriatal planning deficits in early Parkinson's disease. Neuropsychology, 9, 126–140.

Pauls F., Petermann F., Lepach A. C. (2013). Gender differences in episodic memory and visual working memory including the effects of age. Memory, 21, 857–874.

Peña-Casanova J., Quiñones-Ubeda S., Gramunt-Fombuena N., Quintana M., Aguilar M., Molinuevo J. L. et al. (2009). Spanish Multicenter Normative Studies (NEURONORMA Project): Norms for the Stroop color-word interference test and the Tower of London-Drexel. Archives of Clinical Neuropsychology, 24, 413–429.

Peters G.-J. Y. (2014). The alpha and the omega of scale reliability and validity. The European Health Psychologist, 16, 56–69.

Phillips L. H., Wynn V., Gilhooly K. J., Della Sala S., Logie R. H. (1999). The role of memory in the Tower of London task. Memory, 7, 209–231.

Plaisted K., Bell S., Mackintosh N. J. (2011). The role of mathematical skill in sex differences on Raven's Matrices. Personality and Individual Differences, 51, 562–565.

Primi R. (2014). Developing a fluid intelligence scale through a combination of Rasch modeling and cognitive psychology. Psychological Assessment, 26, 774–788.

Rabbitt P. (1997). Introduction: Methodologies and models in the study of executive functions. In Rabbitt P. (Ed.), Methodology of frontal and executive function (pp. 1–38). East Sussex, UK: Psychology Press.

Rainville C., Lepage E., Gauthier S., Kergoat M., Belleville S. (2012). Executive function deficits in persons with mild cognitive impairment: A study with a Tower of London task. Journal of Clinical and Experimental Neuropsychology, 34, 306–324.

R Core Team. (2013). R: A language and environment for statistical computing. Vienna.

Revelle W. (2013). psych: Procedures for personality and psychological research. R package version 1.3.2. Retrieved from http://personality-project.org/r/psych

Revelle W., Zinbarg R. E. (2009). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74, 145–154.

Robbins T. W., James M., Owen A. M., Sahakian B. J., Lawrence A. D., McInnes L. et al. (1998). A study of performance on tests from the CANTAB battery sensitive to frontal lobe dysfunction in a large sample of normal volunteers: Implications for theories of executive functioning and cognitive aging. Journal of the International Neuropsychological Society, 4, 474–490.

Rönnlund M., Lövdén M., Nilsson L.-G. (2001). Adult age differences in Tower of Hanoi performance: Influence from demographic and cognitive variables. Aging, Neuropsychology and Cognition, 8, 269–283.

Ruh N., Rahm B., Unterrainer J., Weiller C., Kaller C. (2012). Dissociable stages of problem solving (II): First evidence for process-contingent temporal order of activation in dorsolateral prefrontal cortex. Brain and Cognition, 80, 170–176.

Salthouse T. (1996). The processing-speed theory of adult age differences in cognition. Psychological Review, 103, 403–428.

Salthouse T. A. (2000). Aging and measures of processing speed. Biological Psychology, 54, 35–54.

Savage-McGlynn E. (2012). Sex differences in intelligence in younger and older participants of the Raven's Standard Progressive Matrices Plus. Personality and Individual Differences, 53, 137–141.

Schnirman G. M., Welsh M. C., Retzlaff P. D. (1998). Development of the Tower of London-Revised. Assessment, 5, 355–360.

Shallice T. (1982). Specific impairments of planning. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 298, 199–209.

Shallice T., Burgess P. W. (1991). Deficits in strategy application following frontal lobe damage in man. Brain, 114, 727–741.

Sijtsma K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74, 107–120.

Strauss E., Sherman E., Spreen O. (2006). A compendium of neuropsychological tests: Administration, norms, and commentary (3rd ed.). New York: Oxford University Press.

Sullivan J. R., Riccio C. A., Castillo C. L. (2009). Concurrent validity of the tower tasks as measures of executive function in adults: A meta-analysis. Applied Neuropsychology, 16, 62–75.

Syväoja H. J., Tammelin T. H., Ahonen T., Räsänen P., Tolvanen A., Kankaanpää A. et al. (2015). Internal consistency and stability of the CANTAB neuropsychological test battery in children. Psychological Assessment, 27, 698–709.

Tinkler G. P., Tobin J. R., Voytko M. L. (2004). Effects of two years of estrogen loss or replacement on nucleus basalis cholinergic neurons and cholinergic fibers to the dorsolateral prefrontal and inferior parietal cortex of monkeys. The Journal of Comparative Neurology, 469, 507–521.

Tunstall J. R., O'Gorman J. G., Shum D. H. K. (2014). A four-disc version of the Tower of London for clinical use. Journal of Neuropsychology, November 25, 2014 [Epub ahead of print].

Unterrainer J. M., Kaller C. P., Loosli S. V., Heinze K., Ruh N., Paschke-Müller M. et al. (2015). Looking ahead from age 6 to 13: A deeper insight into the development of planning ability. British Journal of Psychology, 106, 46–67.

Unterrainer J. M., Rahm B., Halsband U., Kaller C. P. (2005). What is in a name: Comparing the Tower of London with one of its variants. Cognitive Brain Research, 23, 418–428.

Unterrainer J. M., Rahm B., Kaller C. P., Leonhart R., Quiske K., Hoppe-Seyler K. et al. (2004). Planning abilities and the Tower of London: Is this task measuring a discrete cognitive function? Journal of Clinical and Experimental Neuropsychology, 26, 846–856.

Unterrainer J. M., Rauh R., Rahm B., Hardt J., Kaller C. P., Klein C. et al. (2015). Development of planning in children with high-functioning autism spectrum disorders and/or attention deficit/hyperactivity disorder. Autism Research.

Unterrainer J. M., Ruh N., Loosli S. V., Heinze K., Rahm B., Kaller C. P. (2013). Planning steps forward in development: In girls earlier than in boys. PLoS One, 8, e80772.

Wang L., Carr M. (2014). Working memory and strategy use contribute to gender differences in spatial ability. Educational Psychologist, 49, 261–282.

Ward G., Allport A. (1997). Planning and problem solving using the five disc Tower of London task. The Quarterly Journal of Experimental Psychology, 50A, 49–78.

Weber M. T., Mapstone M., Staskiewicz J., Maki P. M. (2012). Reconciling subjective memory complaints with objective memory performance in the menopausal transition. Menopause, 19, 735–741.

Weber M. T., Rubin L. H., Maki P. M. (2013). Cognition in perimenopause: The effect of transition stage. Menopause, 20, 511–517.

Williams D., Jarrold C. (2013). Assessing planning and set-shifting abilities in autism: Are experimenter-administered and computerised versions of tasks equivalent? Autism Research, 6, 461–467.

Zook N., Welsh M. C., Ewing V. (2006). Performance of healthy, older adults on the Tower of London Revised: Associations with verbal and nonverbal abilities. Aging, Neuropsychology and Cognition, 13, 1–19.

Zook N. A., Davalos D. B., Delosh E. L., Davis H. P. (2004). Working memory, inhibition, and fluid intelligence as predictors of performance on Tower of Hanoi and London tasks. Brain and Cognition, 56, 286–292.