The Designs subtest allows test-takers to accumulate raw score points by chance alone, creating the potential for artificially inflated performances, especially in older patients. A random number generator was used to simulate the random selection and placement of cards by 100 test-naïve participants, resulting in a mean raw score of 36.26 (SD = 3.86). This mean raw score corresponded to relatively high scaled scores on Designs II in the 45–54, 55–64, and 65–69 age groups. In the 65–69 group, in particular, the mean simulated performance resulted in a scaled score of 7, with scores 1 SD below and above the performance mean translating to scaled scores of 5 and 8, respectively. The findings indicate that clinicians should use caution when interpreting Designs II performance in these age groups, as our simulations demonstrated that scores in the low average to average range occur frequently when patients rely solely on chance performance.

Introduction

The Designs subtest, one of two measures assessing visual memory on the Wechsler Memory Scale—fourth edition (WMS-IV), measures test-takers' ability to select and place cards bearing specific designs on a 4 × 4 grid so as to match a previously viewed page. The test is administered twice: once following the test-taker's initial viewing of the visual stimuli (Designs I), and a second time following a 20–30 min delay (Designs II), contributing to the immediate and delayed memory indices of the WMS-IV, respectively. Designs is a new measure that was added to the WMS-IV to improve upon the Visual Memory Index of the WMS-III, which comprised the Family Pictures and Faces subtests. While little research is available on Family Pictures, Faces suffered from a number of psychometric problems, including pronounced floor effects, a high guess rate (Levy, 2006), and low communality with other visual memory measures (Millis, Malina, Bowers, & Ricker, 1999). As a result, the development of a new measure of visual memory, in this case the Designs subtest, seemed necessary.

Despite the widespread clinical use of the WMS-IV since its publication in 2009, little research has been conducted on the Designs subtest beyond that initially conducted on the normative sample used to validate the test. As noted by Loring and Bauer (2010), while the WMS-IV manual provides information regarding the scales' factor loadings and relationships with other cognitive measures, critical information regarding criterion validity, an issue of great importance in neuropsychological assessment, is absent. A review of the literature yielded only a handful of studies addressing the psychometric properties of the Designs subtest. Previous studies involving the Designs subtests have examined the factor structure of the combined WMS-IV and WAIS-IV subtests (Holdnack, Zhou, Larrabee, Millis, & Salthouse, 2011), relationships between WMS-IV subtests and a measure of activities of daily living (Drozdick & Cullum, 2011), and the prevalence of low scores on WMS-IV subtests in the normative sample (Brooks, Holdnack, & Iverson, 2011). In a principal component analysis of the WMS-IV normative sample by age group, the Designs subtests loaded onto the same dimension as the Visual Reproduction subtests for all age groups except the 65–69 group (Hoelzle, Nelson, & Smith, 2011). Such a finding underscores the need to assess the psychometric properties of neuropsychological measures across normative age groups, as performance and test characteristics may vary.

While described as a measure of memory recall as opposed to memory recognition (Wechsler, Holdnack, & Drozdick, 2009), the Designs subtest is in essence a multiple-choice recognition test. Test-takers are provided a limited number of possibilities, regarding both design type and location, and asked to distinguish what they have seen before from what they have not. On Item 1 of Designs I, for example, test-takers are first shown a grid containing four designs. They are then asked to choose the four designs they saw from a set of eight designs and to place them in the correct locations on a grid with 16 possible locations. The test-taker, therefore, may not always know the correct design or location, but may guess correctly regarding one or both of these criteria on some designs simply because she is offered a limited number of options from which to choose. This format differs from that of many other memory tests (e.g., Visual Reproduction), which typically provide no retrieval cues and are therefore less susceptible to error variance secondary to guessing.
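
To make the chance structure concrete, here is a worked probability for Item 1 as just described (four targets among eight cards); the arithmetic is ours, offered for illustration. A guesser who selects four cards uniformly at random from the eight selects any given target with probability

\[
P(\text{a given target is selected}) = \frac{\binom{7}{3}}{\binom{8}{4}} = \frac{35}{70} = \frac{1}{2}.
\]

That is, even with no memory of the stimuli whatsoever, each target has an even chance of being chosen.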

While guessing correctly is one way in which test-takers may earn points towards their Designs subtest score, they also may amass raw score points by choosing incorrectly. For each Designs trial, test-takers are provided both target designs (designs previously shown to them as part of the test stimuli) and distracter designs (designs that are slight but noticeable variants of the target designs). If the test-taker chooses the correct target card without also choosing the accompanying distracter card she will earn two points; however, if she chooses either only the distracter card without the target or choose both the target and the distracter, she will still earn one point. Therefore, regardless of which cards test-takers choose, they will earn points towards their overall content score.
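
A back-of-envelope expected value (our approximation, treating each card as selected independently with probability 1/2 rather than as drawn from exactly half the deck) illustrates how this 2/1/1/0 point structure guarantees Content points by chance. The four roughly equally likely outcomes for a target-distracter pair are target only (2 points), distracter only (1 point), both (1 point), and neither (0 points), so

\[
E[\text{Content per design}] \approx \tfrac{1}{4}(2) + \tfrac{1}{4}(1) + \tfrac{1}{4}(1) + \tfrac{1}{4}(0) = 1,
\]

roughly half of the two-point maximum per design by guessing alone.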

In addition to earning points for selecting an incorrect design, test-takers can accrue additional points for guessing where any design was previously located. Specifically, test-takers earn one point for each design that is placed in the same location as a design shown in the test stimuli. The design does not need to be the design that originally occupied that location to receive the point; in fact, it does not even need to be one of the original target designs. As long as any card (target or distracter) is placed in a location that was previously occupied by a design on the original grid, a point for location is awarded. Because the number of designs increases across trials while the number of locations remains the same, the probability that a design is placed in a “correct” location also increases across trials. This is most apparent on Trial 4, the final trial of the Designs subtest, on which test-takers are asked to place eight designs in only 16 available locations.
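
A worked example for Trial 4 makes the point explicit: eight of the 16 cells were originally occupied, so each randomly placed card earns a Spatial point with probability 8/16, and

\[
E[\text{Spatial points, Trial 4}] = 8 \times \frac{8}{16} = 4.
\]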

A final way in which Designs test-takers earn points towards their overall raw score is a two-point bonus. The fortunate guesser will sometimes both choose a correct design and place it in its correct location. In these instances, the test-taker receives an extra two points in addition to those already awarded for selection and placement of the design.
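
By the same illustrative logic (assuming a randomly placed card is equally likely to land in any of the 16 cells), the bonus requires two independent guesses to align and is therefore expected to contribute little:

\[
E[\text{Bonus per design}] \approx \frac{1}{2} \times \frac{1}{16} \times 2 = \frac{1}{16} \text{ point},
\]

where 1/2 is the chance the target card is selected and 1/16 the chance it lands in its original cell. This is consistent with the small Bonus contribution reported in the Discussion.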

Taken together, these factors indicate that random guessing has the potential to contribute considerably to a patient's Designs subtest performance. Such an effect would be noteworthy given that, as stated in the WMS-IV Technical and Interpretive Manual (2009), “guessing reduces the reliability of a test and can mask floor issues in some types of tests.” Floor effects, in particular, may significantly limit the usefulness of a measure because they impede its ability to delineate level of impairment, rendering it unsuitable for use with low functioning examinees (Brooks, Strauss, Sherman, Iverson, & Slick, 2009).

Because raw score points on the Designs subtest can be obtained by random guessing, the goal of this study was to determine the expected performance of those approaching the task by chance alone. While the scoring process for Designs I and II raw scores is identical, the standardized scaled score equivalents for Designs II are based on relatively lower raw scores than those for Designs I. For this reason, artificially inflated raw scores obtained by random guessing could be particularly problematic on Designs II, especially in older age groups, given that scaled score equivalents increase as a function of age. Thus, older individuals with impaired recall might achieve higher scaled scores on Designs II than would be expected, even when approaching the test by random guessing. We therefore hypothesized that raw scores achieved by chance selection alone would result in relatively high standardized scores, particularly in older age groups, on Designs II.

Method

A random number generator was used to create two sequences of numbers for each of the four trials of the Designs subtest. The first sequence was used to assign card number and the second to assign card location. One hundred sets of numbers were randomly generated for each sequence to simulate the random selection and placement of cards by 100 test-naïve participants. In accordance with WMS-IV Designs scoring criteria, Content, Spatial, and Bonus scores were tabulated across the four trials for the 100 randomly generated performances. These scores, reflecting accurate design selection, accurate design location, and accurate placement of the correct design, respectively, were summed to create the Designs subtest raw score for each performance.
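
The simulation code itself is not part of the published record, but the procedure is simple enough to reconstruct. The following Python sketch is our illustrative reconstruction under assumed details that the text does not fully specify: four trials with 4, 6, 6, and 8 target designs (only Items 1 and 4 are described explicitly), one distracter card per target, and uniformly random selection and placement on the 16-cell grid. Under these assumptions, the simulated mean and SD fall close to the values reported in the Results.

```python
import random
import statistics

# Assumed trial structure (hypothetical; the exact card counts per trial are
# not given in this paper beyond Items 1 and 4): number of target designs per
# trial, each paired with one distracter card, on a 4 x 4 (16-cell) grid.
TRIAL_TARGETS = [4, 6, 6, 8]
GRID_CELLS = 16


def simulate_performance(rng: random.Random) -> int:
    """Score one simulated test-naive examinee who selects and places cards
    entirely at random, per the WMS-IV Designs scoring rules summarized above."""
    content = spatial = bonus = 0
    for n in TRIAL_TARGETS:
        # Cards 0..n-1 are targets; card t + n is the distracter for target t.
        # By symmetry, assume target t originally occupied grid cell t.
        chosen = rng.sample(range(2 * n), n)       # pick n of the 2n cards
        cells = rng.sample(range(GRID_CELLS), n)   # place them on n distinct cells
        placement = dict(zip(chosen, cells))
        for t in range(n):
            target_in = t in placement
            distracter_in = (t + n) in placement
            if target_in and not distracter_in:
                content += 2        # correct design, distracter avoided
            elif distracter_in:
                content += 1        # distracter alone, or target plus distracter
            if target_in and placement[t] == t:
                bonus += 2          # correct design in its original cell
        # One Spatial point per card placed on any originally occupied cell.
        spatial += sum(1 for cell in placement.values() if cell < n)
    return content + spatial + bonus


rng = random.Random(0)
scores = [simulate_performance(rng) for _ in range(100)]
print(f"mean = {statistics.mean(scores):.2f}, SD = {statistics.stdev(scores):.2f}")
```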

Results

Statistics regarding the distribution of raw scores for the 100 randomly generated performances are provided in Table 1. The mean raw score was 36.26 (SD = 3.86), with the lowest randomly generated performance scoring 27 and the highest scoring 45. As seen in Fig. 1, the randomly generated performances were normally distributed; both skewness and kurtosis were well below recommended cut-off points (West, Finch, & Curran, 1995).

Table 1.

Distribution of randomly generated raw scores

Statistics
 
Mean 36.3 
Median 36.0 
Mode 38.0 
SD 3.86 
Minimum 27 
Maximum 45 
Skewness −0.02 
Kurtosis −0.50

Fig. 1.

Distribution of Designs raw score for 100 randomly generated performances.

The mean raw score, as well as raw scores falling 1 and 2 SDs above and below the randomly generated performance mean, are presented along with their scaled score equivalents for each WMS-IV defined age group for both Designs I (Table 2) and Designs II (Table 3). In tabulating Designs I scaled scores, the mean raw score of 36.3 resulted in scaled scores of 2 or less across all age groups. As expected, the scaled score equivalents for Designs II were markedly higher than those for Designs I across all age groups (Table 3). In the oldest age group, 65–69, the mean randomly simulated performance resulted in a scaled score of 7, with scores 1 SD below and above the performance mean translating to scaled scores of 5 and 8, respectively. As seen in Fig. 2, when raw scores were converted to scaled scores using the norms for the 65–69 age group, the median and mode of the randomly generated performances were also 7. Within this age group, three of the 100 randomly generated simulations reached a scaled score of 9, the maximum attained by any simulation.

Table 2.

Scaled score equivalents on Designs I for randomly generated performances by age group

 Raw 16–17 18–19 20–24 25–29 30–34 35–44 45–54 55–64 65–69 
−2 SD 28.6 
−1 SD 32.4 
Mean 36.3 
+ 1 SD 40.1 
+ 2 SD 44.0

Table 3.

Scaled score equivalents on Designs II for randomly generated performances by age group

 Raw 16–17 18–19 20–24 25–29 30–34 35–44 45–54 55–64 65–69 
−2 SD 28.6 
−1 SD 32.4 
Mean 36.3 
+ 1 SD 40.1 
+ 2 SD 44.0

Fig. 2.

Designs II scaled score equivalents of randomly generated performances for the 65- to 69-year-old age group.

Overall, scaled scores derived from the random simulations increased as a function of age. The results supported the hypothesis that Designs II subtest performance achieved entirely by chance would result in relatively high scaled scores. This was particularly true for the 65- to 69-year-old age group, in which the randomly simulated performances resulted in a scaled score of 7 or greater more than 50% of the time. The WMS-IV Technical and Interpretive Manual (2009) also allows Content and Spatial scores to be converted to scaled scores, and these scores were examined to determine whether one or both of these subscores were influenced by random performance. The mean simulated Content total raw score was 25.55, which translated to a scaled score of 7 in adults aged 65–69. Similarly, the mean simulated Spatial total raw score was 9.44, which translated to a scaled score of 8. Therefore, adults aged 65–69 who approach the test randomly are expected to produce scores in the low average to average range on the Content and Spatial subscores, as well as on the Designs II overall score.

Discussion

The Designs subtest is distinct from most other memory measures in that its construction allows test-takers to earn raw score points by chance alone. In this respect it resembles recognition tests, on which participants are provided the correct response within a set of incorrect alternatives. Whereas it is relatively easy to calculate the score one would be expected to receive by chance alone on most recognition measures, such a score is considerably more difficult to calculate for the Designs subtest because the total is derived from three independent contributors (i.e., Content, Spatial, and Bonus scores). The findings of this study provide important data regarding expected performance on the Designs subtest for individuals who approach the test by relying solely on guessing. This information should be used in a manner similar to knowledge regarding chance performance on recognition tests: as a test-taker's raw score approaches 36.3, the mean raw score of the randomly generated performances, the possibility that the performance reflects chance selection should be seriously considered.

A second important implication of the study concerns the interpretation of the Designs II subtest in adults aged 65–69. Our random simulations show that individuals in this group are expected to perform within the low average to average range even when recollection of the designs is completely absent and responding is driven entirely by chance. For the same reasons that it would be troublesome to conclude that an individual scoring 10 out of 20 on a forced-choice recognition subtest has low average memory recognition abilities, we find it troublesome that individuals aged 65–69 are frequently expected to obtain a scaled score of 7 or higher on Designs II by chance alone. The findings therefore indicate that low average to average performance on Designs II, especially when in conflict with other test data, should be interpreted with great caution. The data also indicate that low scores in adults aged 45–64 should be interpreted with some caution, as 16% of these individuals would be expected to obtain a scaled score of 7 or higher if approaching the task by chance alone.

Levy (2006), in critiquing the Faces subtest of the WMS-III, noted that older adults could routinely obtain a scaled score of 7 or greater on that subtest under chance performance. He reasoned that because such a substantial portion of healthy older adults with no known impairment scored at the level expected by chance, the test suffers from a strong floor effect. Our data indicate that Designs II may not offer an improvement over Faces in this regard. The results of this study show that a scaled score of 7 in adults aged 65–69 is strongly suggestive of chance performance, meaning that a substantial portion of healthy adults with no known memory impairment in the normative sample scored at the same level as chance. The implications of the findings are two-fold: (i) unimpaired older adults find the Designs II subtest difficult enough that many perform at chance levels, suggesting strong floor effects, and (ii) impaired adults aged 65–69 can appear low average to average simply by guessing. Taken together, these findings indicate that the test cannot discriminate between those with and without actual memory impairment, particularly as age increases. Interestingly, it was for similar reasons that the Designs subtest was not included in the WMS-IV Older Adult Battery (Wechsler, Holdnack, & Drozdick, 2009).

While the validity of the Designs II subtest is vulnerable to floor effects and guessing, these factors do not appear to threaten the validity of Designs I. Randomly generated performances on Designs I resulted in mean scaled scores below the first percentile and within the extremely low range (Wechsler, 2009) across all age groups. Such scores are more consistent with what is reasonably expected from chance performance than those scores produced by our randomly generated performances on Designs II.

It should be noted that while the results suggest problems with the procedures and scoring of the Designs II subtest, they do not necessarily indicate that the underlying paradigm of memory assessment is flawed. It is quite possible that modifying the test's procedural and scoring design would improve its sensitivity to impairment in older adults by reducing the ease with which points are accumulated by guessing. As noted in the Results section, both Content and Spatial scores are highly susceptible to random guessing in adults aged 65–69, as the scaled score equivalents of the mean randomly generated performance for these subscores were 7 and 8, respectively. We also examined the extent to which the Content, Spatial, and Bonus scores generated by our analyses contributed to the overall raw score mean of 36.3, in order to determine which contributes most to the inflation of the total raw score under guessing. Points awarded for Content (design selection) accounted for 70% of the points accumulated by the simulated performances, points accrued for Spatial score (design placement) accounted for 26%, and Bonus points accounted for only 4%. The Content score, therefore, appears to contribute disproportionately to the total raw score. This suggests that reducing the points awarded for design content may, in particular, reduce the performance variance related to guessing on Designs II. Since test-takers can earn up to 2 points for correct content and up to 1 point for incorrect content for each design, future test developers utilizing the Designs paradigm are encouraged to assess the validity of awarding only 1 point for each correct response and 0 points for each incorrect response.
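
As a rough illustration of that proposal, under the same assumed trial structure used in our Method sketch (24 target designs in total), a random guesser selects any given target with probability of about one half, so the expected chance-level Content score under a 1/0 rule would be approximately

\[
E[\text{Content}_{1/0}] \approx 24 \times \tfrac{1}{2} = 12,
\]

compared with the mean of 25.55 observed under the current scheme.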

The test's sensitivity to true impairment in older adults could also likely be improved by lowering the floor of the test, that is, by making it easier. As noted in the WMS-IV Technical and Interpretive Manual (Wechsler, Holdnack, & Drozdick, 2009), the Designs subtest is a modification of the Memory for Designs subtest of the Developmental Neuropsychological Assessment, second edition (NEPSY-II). Like Designs, that subtest requires individuals to remember designs placed on a 4 × 4 grid stimulus before correctly selecting and placing those designs on an empty grid. Unlike Designs, however, Memory for Designs does not require the test-taker to learn a new set of designs for each item. Rather, a core set of designs shown on Item 1 is supplemented by new designs on each subsequent item. The delayed recall measure of the subtest then requires individuals to identify and place a set of 10 designs that have been practiced in an additive fashion over multiple trials. Such an approach allows for learning and might make the test easier, which, in turn, could reduce floor effects when assessing adults in older age groups. Future research is needed to determine whether these or other modifications to the procedure and scoring of the Designs subtest can improve its ability to discriminate between older adults with and without impaired memory.

Although a strength of this study is that it examines the Designs subtest in a purely chance-based setting, the absence of human participants is also a limitation. A human test-taker, even absent any recall of the previously viewed designs, might be expected to bring some degree of strategy to design selection and placement. One such strategy, observed clinically, is to choose either the target or the distracter for a given design, but not both, on the reasoning that two very similar designs are unlikely to have been displayed at once. Such a strategy would by itself elevate performance above chance, as choosing only one member of each target-distracter pair yields a greater number of raw score points than choosing both. A guessing performance by an actual test-taker, coupled with some degree of strategy, would therefore be expected to result in an even higher scaled score than indicated by our random number simulations. Additionally, it seems unlikely that many human test-takers, even those with severely impaired memory recall, would derive no benefit from recognition of, or familiarity with, the Designs stimuli. Such familiarity with design content and location could raise scores above those achieved by random guessing alone, even in the absence of spontaneous recall. Thus, given that older individuals approaching the subtest at random are frequently expected to perform within the low average to average range, patients with impaired memory might well score within the average range or above. Future research should, therefore, assess Designs II with human samples to further evaluate the validity of the test in older age groups.

References

Brooks, B. L., Holdnack, J. A., & Iverson, G. L. (2011). Advanced clinical interpretation of the WAIS-IV and WMS-IV: Prevalence of low scores varies by level of intelligence and years of education. Assessment, 18(2), 156–167.

Brooks, B. L., Strauss, E., Sherman, E. M. S., Iverson, G. L., & Slick, D. J. (2009). Developments in neuropsychological assessment: Refining psychometric and clinical interpretive methods. Canadian Psychology, 50(3), 196–209.

Drozdick, L. W., & Cullum, C. M. (2011). Expanding the ecological validity of WAIS-IV and WMS-IV with the Texas Functional Living Scale. Assessment, 18(2), 141–155.

Hoelzle, J. B., Nelson, N. W., & Smith, C. A. (2011). Comparison of Wechsler Memory Scale—fourth edition (WMS-IV) and third edition (WMS-III) dimensional structures: Improved ability to evaluate auditory and visual constructs. Journal of Clinical and Experimental Neuropsychology, 33(3), 283–291.

Holdnack, J. A., Zhou, X., Larrabee, G. J., Millis, S. R., & Salthouse, T. A. (2011). Confirmatory factor analysis of the WAIS-IV/WMS-IV. Assessment, 18(2), 178–191.

Levy, B. (2006). Increasing the power for detecting impairment in older adults with the Faces subtest from Wechsler Memory Scale-III: An empirical trial. Archives of Clinical Neuropsychology, 21, 687–692.

Loring, D. W., & Bauer, R. M. (2010). Testing the limits: Cautions and concerns regarding the new Wechsler IQ and Memory scales. Neurology, 74, 685–690.

Millis, S., Malina, A., Bowers, D., & Ricker, J. (1999). Confirmatory factor analysis of the Wechsler Memory Scale-III. Journal of Clinical and Experimental Neuropsychology, 21, 87–93.

Wechsler, D. (2009). Wechsler Memory Scale—fourth edition. San Antonio, TX: Pearson Assessment.

Wechsler, D., Holdnack, J. A., & Drozdick, L. W. (2009). Wechsler Memory Scale—fourth edition, technical and interpretive manual. San Antonio, TX: NCS Pearson, Inc.

West, S. G., Finch, J. F., & Curran, P. J. (1995). Structural equation models with nonnormal variables: Problems and remedies. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues and applications (pp. 56–75). Newbury Park, CA: Sage.