Abstract

Language comprehension is vital to social and educational development but few pediatric tests are available for its assessment. To approach this problem, two versions of the Token Test (TT), “TT short form” (DeRenzi & Faglioni, 1978) and “Revised Token Test” (RTT), were first compared. Using a sample of 88 normally developing Spanish-speaking children, the tests were compared on their: (a) established psychometric development and (b) internal consistency. The RTT was judged to be superior and was selected for additional experimentation. The RTT was compared with a developmental measure of lexical knowledge on a cross-sectional sample of 250 4–12-year-old normally developing Spanish-speaking children. A significant positive and high correlation supports its concurrent validity. Significant differences across the age groups, along with a principal component analysis that yielded a three-factor structure, support its construct validity. Preliminary normative data across the nine age groups are provided.

The assessment of the language function is an important part in an individual's overall neuropsychological evaluation, and the assessment of auditory language comprehension and processing is critical as it forms the foundation of communication and language learning. In spite of this importance, neuropsychology continues to face the challenge of creating better language tests of auditory language processing and comprehension that are culturally and linguistically unbiased. The increasing demands for neuropsychological assessment of persons who do not speak English have motivated the translation and development of tests across cultural and linguistic groups (Lezak, Howieson, & Loring, 2004). Challenges in the translation of language tests are many. Tests that have relatively simple grammatical form and simple and universal lexical content are ideal for translation and development. Tests that are psychometrically well developed that also meet these criteria are rare. Nonetheless, valid, reliable, and sensitive tests are essential for studying both language comprehension and production development and disorders. In the specific case of child neuropsychological assessment of receptive language disorders, it is essential that such tests are able to detect and differentially diagnose impairments as well as plan for treatment and evaluate treatment's effectiveness. A test widely reported in the scientific literature to detect receptive language disorders is the Token Test (TT). The conceptual structure for the TT was originally outlined by DeRenzi and Vignolo (1962). Their intent was to develop a sensitive test for the detection of subtle language deficits that were not caused by memory or general intellectual impairments. 
The TT was originally proposed for assessing adults with aphasia; however, subsequent studies confirmed its clinical utility in assessing various other clinical populations, such as patients with slight receptive disorders (Van Harskamp & Van Dongen, 1977), patients with different types of aphasia and partial or complete hemianopsia (Poeck & Hartje, 1979), children with dysphasia and with learning disabilities (Amorosa, Kleinhans-Lintner, & Von Bender-Fisser, 1980), normal and language-delayed children (Cole & Fewell, 1983), brain-damaged children and adolescents with or without aphasia (Gutbrod & Michel, 1986), and aphasic and nonaphasic patients with left focal hemispheric lesions, right hemispheric brain damage, or dementia (Fontanari, 1989).

Following the conceptual introduction of the TT, a number of tests were developed, all termed “the TT.” The proliferation of “the TT” attests to the clinical value of such a test (Lass, DePaolo, Simcoe, & Samuel, 1975), but also to the lack of specificity in the original publication by DeRenzi and Vignolo. This lack of specification resulted from the publication of a conceptual framework for the test instead of the publication of a test with established validity and reliability. Consequently, there have been many adaptations and attempts to validate various versions of the task, all referred to as “the TT.” These include versions in Dutch (Paquier et al., 2007; Van Harskamp & Van Dongen, 1977), English (DeRenzi & Faglioni, 1978; Spellacy & Spreen, 1969), German (Orgass & Poeck, 1966; Remschmidt, Niebergall, Geyer, & Merschmann, 1977), Kannada (Vena, 1982), Khmer (Keo, 1999), Polish (Kosciesza & Krasowicz, 1995), and Portuguese (Fontanari, 1989), among others. These TT developments in adults were followed by similar efforts in children. As a result, a variety of TTs have been used to study different pediatric populations both for clinical and research purposes. Pediatric versions have been published in English (Aram & Ekelman, 1987; Cole & Fewell, 1983; DiSimoni & Mucha, 1982; Fusilier & Lass, 1984; Gutbrod & Michel, 1986; Kitson, Vance, & Blosser, 1985; Lass et al., 1975; McNeil, Brauer, & Pratt, 1990; Shelton, Arndt, & Johnson, 1977; Silverman, Raskin, Davidson, & Bloom, 1977), Dutch (Wassenberg et al., 2008), German (Amorosa et al., 1980; Remschmidt et al., 1977), Kannada (Vena, 1982), Polish (Kosciesza & Krasowicz, 1995), and in several Spanish-speaking populations (Ardila & Roselli, 1994; Galarza, Padilla, & Bonilla, 2005; Peña-Casanova et al., 2009). The majority of these versions are unstandardized within and between tests. For example, the objects created by the tester, to which the individual responds, vary among test versions: three-dimensional versus two-dimensional objects, different colors and hues of tokens, different sizes of the objects, and real objects versus geometric figures have all been used. The rate, intensity, and prosody of the spoken test commands and the physical placement of tokens are often unspecified. Scoring conventions include complex multidimensional scales, category scales, and plus/minus scores, and they are applied either to individual linguistic elements or to the whole sentence. These and other unspecified test variables within and across test versions make interpretation and generalization among studies, and the sharing of clinical data among facilities, difficult or impossible to achieve.

As stated above, there is relatively little information published on the assessment of receptive language disorders in Spanish-speaking pediatric populations. Differences in the structure of language, culture, socioeconomic level, and education can influence test performance (Ardila, Roselli, & Puente 1994; Campbell, Dollaghan, Needleman, & Janosky, 1997) across languages (e.g., Spanish, English, Italian, and others). Additionally, language-specific normative data are necessary to interpret neuropsychological assessments and to make informed clinical decisions from them. The awareness of this necessity has motivated attempts to acquire pediatric normative data on some TT versions in some Hispanic countries. These studies have collected normative data on 5–12-year-old children using the DeRenzi and Faglioni (1978) version of the TT and on 5–15-year-old children in a translated and adapted version of this test (Ardila & Roselli, 1994; Galarza et al., 2005; Peña-Casanova et al., 2009). The Revised TT (RTT), another version of a TT, also has been used with pediatric populations (McNeil et al., 1990). It has, however, not been systematically evaluated in Spanish-speaking children. The RTT has been demonstrated to be effective in detecting subtle receptive language deficits in persons with aphasia and nonaphasic right-hemisphere-damaged adults (Eberwein, Pratt, McNeil, Szuminsky, & Doyle, 2007; McNeil & Prescott, 1978). It has a demonstrated high internal consistency (McNeil & Prescott, 1978) and a valid and sensitive multidimensional scoring system (Hula, Doyle, McNeil, & Mikolic, 2006; McNeil, Dionigi, Langlois, & Prescott, 1989). It also has been shown to be sensitive as a measure of auditory language processing for different clinical populations associated with certain language and learning disabilities (Aram & Ekelman, 1987; Brauer, 1988) and has been successfully used to document normal language growth in children aged 5–13 (McNeil et al., 1990). 
Moreover, the RTT has been successfully abbreviated for use with adults with aphasia (Arvedson, McNeil, & West, 1986) and this shortened version has been shown to have high test–retest reliability in this adult population (Park, McNeil, & Tompkins, 2000).

In spite of abundant psychometric assessment of several TT versions, their factor structure(s) do not appear to have been investigated. There have, however, been many studies on various aspects of several TTs that inform the understanding of the underlying construct shared by many of them. One early study of the homogeneity of the items within subtests was conducted on a German version of a TT (Willmes, 1981). In this test, there were 10 items per subtest in each of five subtests. The 10 items within each of the first four subtests were constructed with identical syntactic forms and sentence lengths (in word number). Part five was composed of both prepositional (e.g., “Put the white square on the green circle”) and adverbial sentences (e.g., “Take all of the circles except the yellow one”). These sentences were not only syntactically heterogeneous, but also required substantively different responses. Some commands required “picking up” a single token, others required “touching” a single token, and yet others required picking up multiple tokens or picking up and touching tokens within the same imperative sentence. In short, Willmes found that each of the first four subtests fit a single Rasch model and concluded that “….one can assume that all ten items of each part are homogeneous and that they call for the same type of processing” (p. 633). However, subtest V, with sentences that were heterogeneous on many dimensions (e.g., “Take the blue circle or the yellow square”), did not fit the Rasch model. The data from this study led the author to conclude that the processing required for subtests I–IV is derived from a single uniform mode of processing in persons with aphasia. These data also support Willmes' conclusion that the processing requirements for the first four subtests are fundamentally different from those required to perform subtest V. With this finding, a two-factor model might be hypothesized. 
A second Rasch modeling study assessed the validity and sensitivity to change of the shortened (55-item) version of the RTT in adults with aphasia (Hula et al., 2006). This study compared the scores derived from the 15-point multidimensional scoring scale and those derived from the Rasch-based scores in their ability to detect group differences and change over time. Comparisons were made between left-hemisphere-damaged persons with (n = 30) and without (n = 25) aphasia and right-hemisphere-damaged individuals without aphasia (n = 53). The results generally failed to find a group or change measurement difference between the Rasch scores and the RTT scores. Five of the seven test items that showed a poor model fit came from subtest IX, which is equivalent to the misfit items found in subtest V in the Willmes study. This finding also is consistent with the findings of Arvedson and colleagues (1986), who found that subtest IX (adverbial clauses and phrases) could not be shortened because of its within-subtest heterogeneity. These findings are again suggestive of a two-factor structure when adverbial clauses are included in the test battery. In a recent study using a newly developed computerized version of the RTT (CRTT), McNeil and colleagues (Submitted) found that, for persons with aphasia, neither subtest IX nor subtest X correlated highly or significantly with the other subtests or with the overall score, though most of the other subtests did. These subtests contain adverbial clauses and have the syntactic properties identified in the Willmes study as the source of subtest V's differential model fit relative to the other four subtests. Although factor analyses were not conducted within this study, these findings are interpreted as being consistent with a hypothesized two-factor structure for the CRTT. 
Because of its established internal consistency, validity, reliability, sensitive multidimensional scoring system applied to each linguistic element in each command, and well-specified and standardized administration, the RTT appears well suited to enhance the precision of auditory comprehension assessment in the Spanish-speaking pediatric population. For these reasons, the RTT was judged to be a particularly appropriate tool for translation into Spanish and for exploration as a test for assessing language development and disorders. Because of its popularity and frequency of translation, the short version of the TT (DeRenzi & Faglioni, 1978) was chosen for comparison to the RTT for possible translation and development with a Spanish-speaking pediatric population.

The aims of the present study were to (a) analyze the general psychometric characteristics and internal consistency of the TT short version (DeRenzi & Faglioni, 1978) and of the RTT (McNeil & Prescott, 1978), (b) choose the test with the demonstrated highest internal consistency, (c) evaluate the chosen measure across the age ranges assessed, and (d) obtain preliminary normative data for 4–12-year-old Spanish-speaking children on that version.

Method

The study was divided into two phases: a pilot study and the validation study.

Study 1: Pilot Study

The goal of the pilot study was to determine whether the TT short version (DeRenzi & Faglioni, 1978) demonstrated better internal consistency than the RTT (McNeil & Prescott, 1978). This was accomplished by comparing the internal consistency coefficients and discriminability indices between the two tests with the objective of selecting for further development the one with the higher coefficient.

The short version of the TT uses 20 tokens: big and small circles and squares in five different colors (black, red, yellow, green, and white). The participants are required to respond to 36 commands divided into six parts: part A consists of 7 commands, parts B–E consist of 4 commands each, and part F has 13 commands. Parts B, D, and F use big tokens only. The test has an increasing difficulty level, but within each part, the complexity level is designed to be equivalent. An error on any part of speech yields an incorrect response for the entire sentence. One point is awarded for each fully correct sentence, with a maximum score of 36. This test is administered in approximately 15 min.

The RTT uses 20 tokens, including big and little circles and squares of five colors (red, blue, green, white, and black). The test consists of 10 subtests varying in sentence length and complexity, with each subtest having 10 commands of equal length, syntactic complexity, vocabulary level, and response difficulty. In subtests I, III, V, VII, and IX, only large tokens are used. There are five different sentence lengths across subtests that vary among three, four, six, and eight linguistic units to be scored in each command. Each linguistic element is scored in each sentence with a 15-point multidimensional scoring system. The average of all linguistic element scores forms a command mean. The command means within a subtest are averaged to obtain an overall mean subtest score. The maximum average score in a subtest is 15. Likewise, the mean subtest scores for all 10 subtests are averaged to obtain an overall mean score for the entire test. The score can range between 1.00 and 15.00 and is computed in hundredths. The administration time is about 25–30 min. This scoring system provides valuable information about how the patient performed the task, whether it was performed incompletely or incorrectly, and what part of the sentence was wrong (McNeil & Prescott, 1978). Additionally, it offers more specific information to the clinician about the reasons for errors (such as lack of attention, impulsiveness, memory problems, comprehension disorder, and others) and the locus of impairments (e.g., verbs, prepositions, adverbial clauses, size, color, or shape). Besides the number of items, scoring and subtest construction are the most important differences between the TT and the RTT.
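The three-level averaging described above (element scores to command mean, command means to subtest mean, subtest means to overall mean) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the published scoring software: the function names and the toy scores are hypothetical, and the sketch uses two short subtests instead of the real 10 subtests of 10 commands each.

```python
from statistics import mean


def command_mean(element_scores):
    """Average the 1-15 scores given to each linguistic element of one command."""
    if not all(1 <= s <= 15 for s in element_scores):
        raise ValueError("element scores must lie on the 1-15 scale")
    return mean(element_scores)


def subtest_mean(commands):
    """Average the command means of one subtest."""
    return mean(command_mean(c) for c in commands)


def overall_mean(subtests):
    """Average the subtest means and report the result in hundredths."""
    return round(mean(subtest_mean(s) for s in subtests), 2)


# Hypothetical toy data (assumption): two subtests with two commands each.
subtest_a = [[15, 15, 15], [15, 13, 15]]          # three scored elements per command
subtest_b = [[15, 15, 15, 15], [11, 15, 15, 15]]  # four scored elements per command

overall = overall_mean([subtest_a, subtest_b])
print(overall)
```

Because every level is a mean of values on the same 1 to 15 scale, the overall score stays on that scale, which is why a single low element score lowers the total only slightly while remaining visible at the element level.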

First, the RTT was translated into Spanish according to a forward–backward translation procedure. This method included the translation of all items and test instructions by professional translators from English to Spanish. A systematic review of this new version was used to verify that it was equivalent to the original. Next, the Spanish version was translated back into the original English by different translators, who verified its equivalence with 100% fidelity. These procedures were not necessary with the TT short version because a standardized Spanish version (Galarza et al., 2005) was available.

Eighty-eight children participated in this pilot study. All were normal Spanish speakers between the ages of 4 and 12. Fifty percent were male. The TT short version was administered to 42 children and the RTT was administered to 46 children selected with a random quota technique. All children were monolingual and had normal psychomotor and cognitive development, normal sight and hearing, and no history of neurological problems according to teacher reports (the teachers confirmed this information with parents). The participants were recruited from two regular elementary schools and two preschools in Guadalajara, Mexico. The selection of the schools was also random and included different geographic areas to control for socioeconomic variables such as parents' education and private versus public schools. The study was conducted with all parents' authorization, and the children agreed to participate without reward.

Results for the TT short version revealed a Cronbach's α of 0.812, and the RTT reached 0.923. The split-half Cronbach's α procedure for the RTT yielded 0.961 (Half 1) and 0.886 (Half 2). The discriminability indices of all elements, computed against the total scale, were significant, with values >0.60.
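For readers who want to see how such a coefficient is obtained, the following is a minimal sketch of the standard Cronbach's α formula, α = k/(k − 1) × (1 − Σ item variances / variance of total scores). It is not the SPSS procedure used in the study; the function name and the toy score matrix are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one per child) of k item scores."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    if any(len(row) != k for row in scores):
        raise ValueError("every row must contain the same number of items")
    # Sample variance of each item (column) and of the per-child total score.
    item_variance_sum = sum(variance(column) for column in zip(*scores))
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical toy data (assumption): four children, three items.
toy_scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
]
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 3))
```

A split-half variant would apply the same function separately to each half of the items, which is one plausible reading of the "Half 1" and "Half 2" values reported above.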

Finally, a moderately high positive Pearson's correlation of 0.659 was obtained between the TT Spanish version and the new RTT Spanish version. Because the general psychometric characteristics and internal consistency of the RTT were higher and its discriminability rates were acceptable, the Spanish RTT was selected for further development.

Study 2: Validation Study

Participants

The RTT was administered to 250 normal 4–12-year-old Spanish-speaking children. Fifty percent of the subjects were female. The subjects were randomly selected by using a quota sampling procedure according to age and gender, from five different grade schools and five preschools in Guadalajara, Mexico, and its metropolitan area. The schools were also randomly selected and represented diverse socioeconomic levels and different geographical areas of the city. All the participants attended regular classrooms and their school grade placement was appropriate for their age. They were monolingual and had normal psychomotor and cognitive development, normal sight and hearing, and no history of neurological problems according to teacher reports. Teachers obtained this information from the parents, as well as the authorization for the children's participation in the study. All the children agreed to be tested without reward.

Instruments

The Vocabulary Subtest (VS) of the Wechsler Intelligence Scales was administered along with the RTT to obtain a concurrent validity measure of general language performance. This subtest consists of straightforward questions about the meaning of words. The Spanish VS of the Wechsler Preschool and Primary Scale of Intelligence (WPPSI) was administered to 4–6-year-old children who attended preschool. The VS of the Wechsler Intelligence Scale for Children Fourth Edition Spanish (WISC-IV) was administered to the 6–12-year-old children who attended elementary school. One of the main reasons for choosing the VS was that the Wechsler scales are well standardized in the Mexican population, whereas other language tests lacked this important feature.

Procedures

The VS was administered first, followed by the pretest of the RTT, and finally, all the RTT subtests were administered in their numerical order. Counterbalancing the administration of the two measures is one way to control order effects, but a constant order was preferred here so that testing conditions were the same for all children.

Statistical analysis

For the statistical analysis, SPSS 17.0 was used. Descriptive statistical analyses were performed by calculating mean scores, standard deviations, and percentiles. To address the accuracy of the RTT translation, a Pearson correlation was computed with the TT translated by Galarza and colleagues (2005). To assess criterion validity, the RTT was correlated with the VS scores. Age, education, and gender effects on test performance were assessed by comparing means with analysis of variance (ANOVA). Internal consistency was estimated with Cronbach's α. In addition, a factor analysis was conducted to estimate construct validity, and finally, normative data were obtained. All statistical tests were carried out at a significance level of 0.05.

Results

Descriptive Analysis

To determine inter-rater reliability in the validation phase of the RTT, four raters independently scored the same videotaped RTT administration. The Pearson correlation among their scores indicated an inter-rater reliability of 96.42%.

In this study, the total sample was 250 participants, 50% of whom were male. The RTT pretest was administered and all children demonstrated color, shape, and size knowledge, so no subjects were excluded on this basis. Both the RTT and the VS were administered to 28 children in each age group (4–12 years old) except the last group, which included 26 subjects. In this study, 26.4% of the subjects were attending preschool, whereas the remaining 73.6% were in elementary school.

The descriptive statistics for each subtest are shown in Table 1 and for each linguistic element in Table 2. Means and standard deviations were calculated from the observed data and were then estimated again for the entire sample, with missing data filled in through linear interpolation (right side of Tables 1 and 2). The missing values correspond to the youngest children who did not complete all subtests (15.2%), either because they said the test was too complex or because they were judged to be fatigued. Because a relatively small number of participants did not complete all of the subtests, the effect of missing data on the statistical analysis was considered negligible. Table 1 displays the results obtained for the full sample with this estimation technique (linear interpolation; Lipsitz, Herring, & Ibrahim, 2005).

Table 1.

Descriptive measures for each subtest

| Subtest | n | Mean | SD | Maximum score | Mean (SD)* |
|---|---|---|---|---|---|
| I | 250 | 14.6305 | 0.67560 | 15.00 | 14.6305 (0.67560) |
| II | 250 | 14.2357 | 0.80149 | 15.00 | 14.2357 (0.80149) |
| III | 249 | 13.6456 | 1.20198 | 15.00 | 13.6445 (1.19957) |
| IV | 245 | 12.5259 | 1.55391 | 15.00 | 12.5212 (1.53895) |
| V | 231 | 13.4934 | 1.24771 | 15.00 | 13.4894 (1.19938) |
| VI | 225 | 12.6272 | 1.65199 | 15.00 | 12.6178 (1.56769) |
| VII | 218 | 13.1806 | 1.63991 | 15.00 | 13.1763 (1.53102) |
| VIII | 211 | 12.5239 | 1.70295 | 15.00 | 12.5154 (1.56421) |
| IX | 211 | 12.3674 | 1.25389 | 14.80 | 12.3601 (1.15182) |
| X | 211 | 12.1118 | 1.47557 | 15.00 | 12.1058 (1.35527) |
| Total | 250 | 12.1305 | 2.83585 | 14.69 | 13.1298 (0.90028) |

*Estimated for the whole sample with resampling for missing data through linear interpolation (n = 250).

Table 2.

Descriptive measures for each linguistic element

| Linguistic element | n | Mean | SD | Maximum score | Mean (SD)* |
|---|---|---|---|---|---|
| Verb I | 250 | 12.4659 | 2.78880 | 14.84 | 12.4659 (2.78880) |
| Size I | 250 | 11.8686 | 2.98214 | 14.84 | 11.8686 (2.98214) |
| Color I | 250 | 12.2437 | 2.89904 | 14.84 | 12.2437 (2.89904) |
| Shape I | 250 | 11.8994 | 2.90354 | 14.64 | 11.8994 (2.90354) |
| Verb II | 249 | 13.8484 | 1.53267 | 16.85 | 13.3483 (1.52959) |
| Size II | 246 | 11.6539 | 2.99596 | 15.00 | 11.6494 (2.97238) |
| Color II | 249 | 11.9716 | 3.11714 | 15.00 | 11.9712 (3.11089) |
| Shape II | 248 | 11.8479 | 3.02713 | 14.93 | 11.8428 (3.01586) |
| Preposition | 231 | 12.5991 | 1.98253 | 15.00 | 12.5922 (1.90580) |
| Left–right preposition | 217 | 11.8728 | 2.14247 | 15.00 | 11.8551 (1.99687) |
| Adverbial clause | 211 | 12.4507 | 1.27721 | 15.00 | 12.4472 (1.17299) |

*Estimated for the whole sample with resampling for missing data through linear interpolation (n = 250).

The observed values for each of the variables were normally distributed (all variables were analyzed using the Lilliefors and Massey test), and no outliers or gaps were observed.

Psychometric Analysis

To address the accuracy of the RTT translation, a Pearson correlation was computed with the TT translated by Galarza and colleagues (2005). There was a significant positive correlation between the tests, r = .659, p < .01. This correlation is not as high as anticipated, but the forward–backward translation procedure confirms the accuracy of the RTT-translated version.

As expected, there were no significant differences in RTT performance according to gender (Student's t = 0.799–0.804; p = .42 to p = .44). However, the analysis by age group revealed that the mean subtest scores increased significantly with age. As test complexity increased, scores decreased across the subtests within the same age group. The Levene test of homogeneity of variances was computed; it was not significant for any of the RTT subtest or linguistic element scores, and homogeneity was therefore assumed. In addition, all test scores differed significantly according to education level. Table 3 summarizes the significant differences between age groups and education levels. The Scheffé post hoc contrasts revealed two inflexion points between the subtests according to age and education. The 4- and 5-year-old groups demonstrated a similar level of test performance, which differed from that of the other age groups (all comparisons were significant; p < .001). For education level, the results were similar (p < .001) and showed two different clusters, with the poorest results produced by the children from the basic grades and a small improvement demonstrated by the other grade levels. This improvement followed a linear or quadratic function.

Table 3.

ANOVA results for each subtest and linguistic element between age groups and education level

| Variable | Age groups: F | Age groups: df | Age groups: ɛ² | Age groups: 1 − β | Education level: F | Education level: df | Education level: ɛ² | Education level: 1 − β |
|---|---|---|---|---|---|---|---|---|
| Subtest I | 15.51* | 8, 249 | 0.621 | 0.991 | 13.76* | 6, 249 | 0.602 | 0.875 |
| Subtest II | 14.58* | 8, 249 | 0.434 | 0.993 | 15.98* | 6, 249 | 0.484 | 0.891 |
| Subtest III | 12.60* | 8, 248 | 0.411 | 0.902 | 15.01* | 6, 248 | 0.572 | 0.899 |
| Subtest IV | 10.36* | 8, 244 | 0.398 | 0.899 | 12.15* | 6, 244 | 0.441 | 0.906 |
| Subtest V | 21.32* | 8, 230 | 0.723 | 0.903 | 19.51* | 6, 230 | 0.654 | 0.902 |
| Subtest VI | 34.04* | 8, 224 | 0.802 | 0.857 | 35.80* | 6, 224 | 0.811 | 0.891 |
| Subtest VII | 39.44* | 8, 217 | 0.895 | 0.881 | 36.94* | 6, 217 | 0.861 | 0.874 |
| Subtest VIII | 44.26* | 8, 210 | 0.898 | 0.901 | 47.29* | 6, 210 | 0.902 | 0.838 |
| Subtest IX | 39.12* | 8, 210 | 0.711 | 0.899 | 35.49* | 6, 210 | 0.645 | 0.856 |
| Subtest X | 44.85* | 8, 210 | 0.875 | 0.905 | 42.34* | 6, 210 | 0.812 | 0.888 |
| Total | 50.45* | 8, 249 | 0.901 | 0.941 | 47.57* | 6, 249 | 0.856 | 0.901 |
| Verb I | 42.69* | 8, 249 | 0.833 | 0.891 | 40.54* | 6, 249 | 0.786 | 0.931 |
| Size I | 46.49* | 8, 249 | 0.876 | 0.921 | 46.48* | 6, 249 | 0.875 | 0.942 |
| Color I | 49.09* | 8, 249 | 0.911 | 0.915 | 47.07* | 6, 249 | 0.885 | 0.911 |
| Shape I | 56.11* | 8, 249 | 0.845 | 0.897 | 50.87* | 6, 249 | 0.746 | 0.899 |
| Verb II | 6.53* | 8, 248 | 0.231 | 0.922 | 8.62* | 6, 248 | 0.295 | 0.883 |
| Size II | 32.01* | 8, 245 | 0.734 | 0.912 | 35.71* | 6, 245 | 0.759 | 0.924 |
| Color II | 43.84* | 8, 248 | 0.823 | 0.908 | 44.72* | 6, 248 | 0.833 | 0.951 |
| Shape II | 41.73* | 8, 247 | 0.848 | 0.901 | 44.36* | 6, 247 | 0.876 | 0.928 |
| PP | 20.85* | 8, 230 | 0.549 | 0.892 | 22.80* | 6, 230 | 0.612 | 0.891 |
| LRP | 17.28* | 8, 216 | 0.489 | 0.882 | 24.77* | 6, 216 | 0.728 | 0.911 |
| AC | 14.80* | 8, 216 | 0.402 | 0.918 | 10.74* | 6, 216 | 0.372 | 0.915 |

Notes: ANOVA = analysis of variance; df = degrees of freedom; PP = Place Preposition; LRP = Left/Right Preposition; AC = Adverbial Clause. ɛ² is the effect size and (1 − β) is the power of the contrast.

*p < .001.

Internal consistency was recalculated for the total sample. The overall Cronbach's α was 0.911 and the corrected score was 0.918. Cronbach's α for the two halves was 0.829 (Half 1) and 0.866 (Half 2). Criterion validity was evaluated by correlating the RTT and the VS scores using Pearson's correlation coefficients. There was a significant positive correlation between the test scores, r = .540, p < .01; although it is significant at this sample size, the correlation indicates only a moderate relationship between the tests.
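The distinction between statistical significance and strength of association can be made concrete with a short sketch. The code below defines Pearson's r, the usual t statistic for testing r against zero, t = r√(n − 2)/√(1 − r²), and the shared variance r². The function names are hypothetical, and applying n = 250 to the reported r = .540 assumes that all 250 children contributed to the RTT–VS correlation, which the text implies but does not state.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against zero."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


# Reported coefficient; the sample size is an assumption (see lead-in).
r_reported = 0.540
n_assumed = 250

t_value = t_statistic(r_reported, n_assumed)
r_squared = r_reported ** 2  # proportion of variance shared by the two tests

print(f"t = {t_value:.2f}, shared variance = {r_squared:.1%}")
```

Under that assumption, the t statistic is large (about 10), yet the two measures share less than a third of their variance, which is why the coefficient is described as moderate even though it is clearly significant.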

An exploratory factor analysis was performed in order to explore the construct validity of the RTT for this Spanish-speaking pediatric population. The Kaiser–Meyer–Olkin measure of sampling adequacy was moderately high (0.702), and Bartlett's test of sphericity (χ² = 9,486.56; df = 253; p < .001) yielded similar information. The factor extraction procedure established three factors that accounted for 79.63% of the variance, and the rotated estimation (Varimax with five iterations) generated coefficients (λij) that were significant for all iterations at p < .001 (Table 4).

Table 4.

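Rotated factor loadings for each variable and factor

| Variable | Factor 1 | Factor 2 | Factor 3 |
| --- | --- | --- | --- |
| Verb II | 0.881* | | |
| Size II | 0.879* | | |
| Color II | 0.873* | | |
| Subtest VI | 0.838* | | |
| Verb I | 0.838* | | |
| Size I | 0.803* | | |
| Shape II | 0.800* | | |
| Color I | 0.787* | | |
| Total score | 0.784* | | |
| Place preposition | 0.783* | | |
| Subtest VIII | 0.778* | | |
| Subtest IV | 0.766* | | |
| Subtest VII | 0.742* | | |
| Subtest III | 0.719* | | |
| Shape I | 0.701* | | |
| Subtest V | 0.697* | | |
| Adverbial clause | | 0.943* | |
| Subtest IX | | 0.888* | |
| Subtest X | | 0.854* | |
| Subtest I | | | 0.737* |
| Left–right preposition | 0.565* | | 0.617* |
| Subtest II | | | 0.544* |
| Total variance | 46.51% | 17.144% | 15.975% |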

*p < .05.

The first factor corresponds to those elements that are related to auditory comprehension of compound imperative sentences characterized by comprehension of verbs (action of touching or putting), colors, shapes, and sizes. The second factor relates to subtests IX and X that are characterized by compound adverbial phrases, along with the scores for the adverbial clauses. The third factor corresponds to the easiest simple imperative sentence subtests (I and II), in addition to the performance on the left and right prepositions.

Finally, there is a need to represent the children's performance in a standardized manner for ease of interpretation. To do this, the scores were converted to 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentiles for both the subtests and the linguistic elements for each yearly age interval. Because there were no differences between genders, the data for the boys and girls were combined. Supplementary material online, Appendices S1 and S2, show the percentile values calculated from the standardization of the observed distribution.
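The percentile conversion described above can be sketched as follows. This is a minimal illustration that assumes linear interpolation between ranked scores; the paper does not state which percentile definition was used, and the function names and toy records are hypothetical.

```python
def percentile(sorted_scores, p):
    """Percentile p (0-100) of pre-sorted scores, using linear interpolation."""
    if len(sorted_scores) == 1:
        return float(sorted_scores[0])
    rank = (p / 100) * (len(sorted_scores) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(sorted_scores) - 1)
    frac = rank - lo
    return sorted_scores[lo] + frac * (sorted_scores[hi] - sorted_scores[lo])


def norms_by_age(records, points=(5, 10, 25, 50, 75, 90, 95)):
    """Group (age_in_years, score) pairs by age and tabulate the chosen percentiles."""
    groups = {}
    for age, score in records:
        groups.setdefault(age, []).append(score)
    return {
        age: {p: percentile(sorted(scores), p) for p in points}
        for age, scores in sorted(groups.items())
    }


# Hypothetical toy records (age, total score); NOT data from this study.
toy_records = [(4, 8.1), (4, 9.4), (4, 10.2), (5, 10.9), (5, 11.5), (5, 12.3)]
for age, table in norms_by_age(toy_records).items():
    print(age, table)
```

Because boys and girls were pooled, a single table per yearly age interval is sufficient in this sketch; a gender-specific version would only need an extra grouping key.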

Discussion

With regard to the first aim of this study, the RTT showed higher internal consistency than the TT short version. The RTT's high internal reliability indicates that all of the items perform unidirectionally. These results are considered evidence of acceptable internal consistency (Muñiz, 2002) for this population.

RTT performance showed gradually increasing differences with age and, consequently, with education. This is an important characteristic in a test that assesses child language because language development affects performance across the age range investigated in this study and in the domains measured by the RTT. Performance was not affected by gender, as found in previous studies with variants of the TT (Galarza et al., 2005; Paquier et al., 2007; Remschmidt et al., 1977). Although no literature was found to support it, we speculate that this may be due to cultural differences or because the specific version of the TT used in those studies differed in ways that are sensitive to gender differences.

The RTT was compared with the VS of the WPPSI and WISC-IV as a measure of concurrent validity. The correlation between the two scores was significant but only moderate, which underscores a fundamental tenet of the RTT: it is more than an index of vocabulary mastery and is a test of language processing.

The factor structure adds some interesting details to the study. The first factor contains the parts of speech (verbs [touch and put], colors, shapes, sizes, and spatial and left–right prepositions) and the subtest scores associated with them. Because these vocabulary words are already known by the subjects, it can be deduced that there were few demands on lexical semantics and that the demands on syntax primarily involved processing the standard word order. That is, integrating the adjectival and noun information and planning the response were the main processes involved in the auditory comprehension of these imperative sentences and constituted the linguistic demands necessary to respond correctly. In addition to these processes, Subtests V, VI, VII, and VIII require the comprehension of spatial concepts and a movement response instead of a touching response. However, because these rather diverse psycholinguistic and motoric processes loaded on a single factor, their unidimensionality suggests that the lexical integration required across all the sentences may constitute the primary task demand. Although the other two factors also presented eigenvalues >1, the smaller amount of variance they explain relegates them to a secondary role in the factor structure of the test as a whole.

Grammatically, the second factor differs greatly from the rest of the test. The adverbial clause score derived within subtests IX and X requires the participant to comprehend the meaning of the adverbial clause (e.g., “before you touch the red square, touch the green circle” or “touch the green circle, before you touch the red square”) in order to give a correct response. The separation of this factor is consistent with RTT performance in normal adults and in adults with aphasia (McNeil & Prescott, 1978). It is also consistent with previous studies on a German form of the TT (Willmes, 1981) and with the structure of both the RTT results from persons with aphasia (McNeil et al., submitted) and the 55-item RTT (Hula et al., 2006). Processing adverbial clauses reliably yields performance, and requires comprehension processes, that differ fundamentally from the task demands of the rest of the test.

It can be hypothesized that the third factor may be related to the analysis of spatial information. In subtests I and II, the participant may be learning to recognize the tokens' arrangement and becoming accustomed to the task requirements. Although the left and right prepositions also imply spatial notions, it is interesting that they loaded separately from the other prepositions, which supports their inclusion as separate subtests in the battery. Certainly, this factor may have differed from the others because subtests I and II were very easy for the children in this study and therefore yielded little variability in scores. It is interesting to note that this inconsistency adds information about the unidimensionality of the test, but with small distortions, especially in the third factor.

Conclusions

Relatively few studies have used the RTT version of “the TTs” and relatively little research has used it with children. This study is consistent with the McNeil and colleagues (1990) study demonstrating a valid use of the RTT in children. Although the RTT was originally designed to be used with adults, this adaptation with this sample adds support to the previously reported data for English-speaking children that the RTT is a valuable instrument in examining childhood language processing and comprehension. These results extend its use to 4–12-year-old Spanish-speaking children.

Based on the results of this study, the RTT appears to have the psychometric properties and sensitivity to capture developmental changes across the age and education levels examined. The variables of age and education are intrinsically linked and children are expected to improve test performance across these ranges on both independent variables. This essential test characteristic was realized in this investigation.

Sufficient evidence of criterion validity for the RTT was found. The factor analysis revealed that three factors emerged for this population. The first factor bore the main load, showing that the first eight subtests of the RTT are homogeneous, providing evidence of construct validity. The second factor was consistent with a wide range of evidence from the adult aphasia literature identifying adverbial clause processing as fundamentally different from the rest of the test.

Among the strengths of this study is its relatively large sample size. This strength is seldom realized in a study of this kind and provides preliminary normative data for this Mexican pediatric population. This is very important because clinicians require reliable and standardized instruments for use with children. It is also relevant that the test provides a single tool that can be validly used across the age range from 4 years to the oldest geriatric individuals and is appropriate for differentiating pathological sentence comprehension from normal performance across this entire age range.

It is known that any communication difficulty in children limits their learning and social life. Thus, accurate diagnoses and efficient assessments are necessary in order to habilitate and rehabilitate children. Only with accurate and efficient assessment will it be possible to ensure that they have the linguistic knowledge and skill necessary for life-long learning, psychological adjustment, and productive adulthoods. The RTT appears to be an appropriate tool for monitoring the normal development of sentence-level language that is relatively independent of vocabulary knowledge, thus making it a culturally, socioeconomically, and linguistically unbiased assessment tool that is sensitive to age and education changes. Children with brain damage, learning disabilities, developmental delays, and specific language problems are the primary clinical populations in which the RTT may be most useful for identifying impairment.

The results of this study have produced evidence that the RTT is a valid and reliable tool for measuring auditory language processing and comprehension in Spanish-speaking children. With additional normative data and a larger sample in each age group, it can be made available to practitioners. Additional research is also required to determine its sensitivity and specificity for detecting auditory language processing deficits in various pathological pediatric populations. Given its success for these purposes with adult populations, the effort is likely to be rewarded. Finally, following the work currently underway in adults, a computerized version of the RTT would likely be more appealing to children and would make administration more reliable and efficient.

Supplementary material

Supplementary material is available at Archives of Clinical Neuropsychology online.

Funding

Funding was received from the Ministerio de Ciencia e Innovación (grant number PSI2010-21214-C02-01).

Conflict of Interest

None declared.

References

Amorosa A., Kleinhans-Lintner J., Von Bender-Fisser U. An experimental study with the Token Test in dysphasic and other learning-disabled children. Zeitschrift für Kinder- und Jugendpsychiatrie und Psychotherapie, 1980, vol. 8(3) (pg. 288-299)
Aram D., Ekelman B. Unilateral brain lesions in childhood: performance on the Revised Token Test. Brain and Language, 1987, vol. 32 (pg. 137-158)
Ardila A., Roselli M. Development of language, memory, and visuospatial abilities in 5- to 12-year-old children using a neuropsychological battery. Developmental Neuropsychology, 1994, vol. 10(2) (pg. 97-120)
Ardila A., Roselli M., Puente A. Neuropsychological Evaluation of the Spanish Speaker, 1994. New York: Plenum Press
Arvedson J.C., McNeil M.R., West T.L. Prediction of Revised Token Test overall, subtest, and linguistic unit scores by two shortened versions. Clinical Aphasiology, 1986, vol. 16 (pg. 57-63)
Brauer D. The differentiation of learning disabled, aphasic and normal adult language performance on selected aphasia test batteries, 1988. Unpublished Master's Thesis, University of Wisconsin-Madison
Campbell T., Dollaghan C., Needleman H., Janosky J. Reducing bias in language assessment: processing-dependent measures. Journal of Speech, Language, and Hearing Research, 1997, vol. 40 (pg. 519-525)
Cole K., Fewell R. A quick language screening test for young children: The Token Test. Journal of Psychoeducational Assessment, 1983, vol. 1(2) (pg. 149-153)
DeRenzi A., Vignolo L. Token Test: A sensitive test to detect receptive disturbances in aphasics. Brain: A Journal of Neurology, 1962, vol. 85 (pg. 665-678)
DeRenzi E., Faglioni P. Normative data and screening power of a shortened version of the Token Test. Cortex, 1978, vol. 14(1) (pg. 41-49)
DiSimoni F., Mucha R. Use of the Token Test for children to identify language deficits in preschool age children. Journal of Auditory Research, 1982, vol. 22(4) (pg. 265-270)
Eberwein C., Pratt S., McNeil M., Szuminsky N., Doyle P. Auditory performance characteristics of the computerized Revised Token Test (CRTT). Journal of Speech, Language, and Hearing Research, 2007, vol. 50 (pg. 865-877)
Fontanari J. The “Token Test”: Elegance and conciseness in the evaluation of comprehension in aphasic patients: Validation of the reduced version of DeRenzi to the Portuguese. Neurobiologia, 1989, vol. 52(3) (pg. 177-218)
Fusilier F., Lass N. A comparative study of children's performance on the Illinois Test of Psycholinguistic Abilities and the Token Test. Journal of Auditory Research, 1984, vol. 24(1) (pg. 9-16)
Galarza J., Padilla A., Bonilla J. Evaluación neuropsicológica de una muestra de niños de 5 a 12 años con instrucción escolar bilingüe. Interação em Psicologia, 2005, vol. 9(1) (pg. 125-130)
Gutbrod K., Michel M. On the clinical validity of the Token Test with brain damaged children with and without aphasia. Diagnostica, 1986, vol. 32(2) (pg. 118-128)
Hula W.D., Doyle P.J., McNeil M.R., Mikolic J.M. Rasch Modeling of Revised Token Test Performance: Validity and Sensitivity to Change. Journal of Speech, Language, and Hearing Research, 2006, vol. 49 (pg. 27-46)
Keo S. A comparative study of the Khmer token test and the English version of the token test. Dissertation Abstracts International: Section B: The Sciences and Engineering, 1999, vol. 60(4B) (pg. 1858-1865)
Kitson D., Vance B., Blosser J. Comparison of the Token Test of Language Development and the Wechsler Intelligence Scale for Children-Revised. Perceptual and Motor Skills, 1985, vol. 61(2) (pg. 532-534)
Kosciesza M., Krasowicz G. Polish adaptation of the “Token Test” for children and its practical applications. Psychologia Wychowawcza, 1995, vol. 38(4) (pg. 350-358)
Lass N., DePaolo A., Simcoe J., Samuel S. A normative study of children's performance on the short form of the Token Test. Journal of Communication Disorders, 1975, vol. 8 (pg. 193-198)
Lezak M., Howieson D., Loring D. Neuropsychological assessment, 2004 (4th ed.). New York: Oxford University Press
Lipsitz S.R., Herring A.H., Ibrahim J.G. Missing-Data Methods for Generalized Linear Models: A Comparative Review. Journal of the American Statistical Association, 2005, vol. 100(469) (pg. 332-346)
McNeil M.R., Brauer D., Pratt S.R. A test of auditory language processing regression: Adult aphasia versus normal children ages 5–13 years. Australian Journal of Human Communication Disorders, 1990, vol. 18 (pg. 21-39)
McNeil M.R., Dionigi C.M., Langlois A., Prescott T.E. A Measure of Revised Token Test Ordinality and Intervality. Aphasiology, 1989, vol. 3(1) (pg. 31-40)
McNeil M.R., Pratt S.R., Szuminsky N., Sung J.E., Fossett T.R.D., Kim A., et al. Description and psychometric development of the Computerized Revised Token Test (CRTT): Test-retest reliability, concurrent and construct validity with comparison to three experimental Reading versions (CRTT-R) in adult normal controls and persons with aphasia (Submitted). Manuscript submitted for publication
McNeil M.R., Prescott T.E. Revised Token Test, 1978. Austin, Texas: PRO-ED, Inc
Muñiz J. Teoría clásica de los tests, 2002. Madrid: Pirámide
Orgass B., Poeck K. Clinical validation of a new test for aphasia: an experimental study on the token test. Cortex, 1966, vol. 2(2) (pg. 222-243)
Paquier P., Van Mourik M., Van Dongen H., Catsman-Berrevoets C., Creten W., Van Borsel J. Normative data of 300 Dutch-speaking children on the Token Test. Aphasiology, 2007, vol. 23(4) (pg. 427-437)
Park G., McNeil M., Tompkins C. Reliability of the Five-Item Revised Token Test for individuals with aphasia. Aphasiology, 2000, vol. 14(5/6) (pg. 527-535)
Peña-Casanova J., Quiñones-Ubeda S., Gramunt-Fombuena N., Aguilar M., Casas L., Molinuevo J.L., et al. Spanish Multicenter Normative Studies (NEURONORMA Project): Norms for Boston Naming Test and Token Test. Archives of Clinical Neuropsychology, 2009, vol. 24(4) (pg. 343-354)
Poeck K., Hartje W. Token-test Performance of Aphasics with Auditory and Visual Presentation of Instructions. Der Nervenarzt, 1979, vol. 49(11) (pg. 654-657)
Remschmidt H., Niebergall G., Geyer M., Merschmann W. Standardization of the Token Test on school children taking into account intelligence, vocabulary, and hand dominance. Zeitschrift für Kinder- und Jugendpsychiatrie und Psychotherapie, 1977, vol. 5(3) (pg. 222-237)
Shelton R., Arndt W., Johnson A. Psychometric data for the token test derived from 8- and 9-year-old children with articulation disorders. Journal of the American Audiology Society, 1977, vol. 2(6) (pg. 208-212)
Silverman I., Raskin L., Davidson J., Bloom A. Relationships among Token Test, age, and WISC scores for children with learning problems. Journal of Learning Disabilities, 1977, vol. 10(2) (pg. 104-107)
Spellacy F.J., Spreen O. A short form of the Token Test. Cortex, 1969, vol. 5 (pg. 390-397)
Van Harskamp F., Van Dongen H. Construction and validation of different short forms of the token test. Neuropsychologia, 1977, vol. 15(3) (pg. 467-470)
Vena N. Revised token test in Kannada. Journal of the All-India Institute of Speech & Hearing, 1982, vol. 13 (pg. 192-204)
Wassenberg R., Hurks P., Hendriksen J., Feron F., Meijs C., Vles J., et al. Age-related improvement in complex language comprehension: Results of a cross-sectional study with 361 children aged 5–15. Journal of Clinical Experimental Neuropsychology, 2008, vol. 30(4) (pg. 435-448)
Willmes K. A new look at the token test using probabilistic test models. Neuropsychologia, 1981, vol. 19(5) (pg. 631-645)