## Abstract

Neuropsychologists frequently rely on batteries of normally distributed neuropsychological tests to determine impaired functioning. The statistical likelihood of Type I error in clinical decision-making is in part determined by the base rate of normative individuals obtaining atypical performance on neuropsychological tests. Base rates are most accurately obtained from co-normed measures, but this is rarely accomplished in neuropsychological testing. Several statistical methods have been proposed to estimate base rates for tests that are not co-normed. This study compared two statistical approaches (binomial and Monte Carlo models) used to estimate the base rates for flexible test batteries. The two approaches were compared against empirically derived base rates for a multitest co-normed battery of cognitive measures. Estimates were compared across a variety of conditions including age and different *α* levels (*N* = 3,356). Monte Carlo *R*² estimates ranged from .980 to .997 across five different age groups, indicating a good fit. In contrast, the binomial model fit estimates ranged from .387 to .646. Results confirm that the binomial model is insufficient for estimating base rates because it does not take into account correlations among measures in a multitest battery. Although the Monte Carlo model produced more accurate results, minor biases occurred that are likely due to skewness and kurtosis of test variables. Implications for future research and applied practice are discussed.

## Introduction

Neuropsychologists frequently rely on norm-referenced measures as a nomothetic foundation (Rabin, Barr, & Burton, 2005) from which idiographic evaluation of deficits in cognition or other neuropsychological areas of functioning is determined (Hale, Fiorello, & Thompson, 2010). Typically, impaired performance in an individual with otherwise average scores is designated by scores below some threshold (e.g., standard scores <80). Although profile variability is common in individuals with brain injury and other disabilities (Fiorello, Hale, Holdnack, 2007), it is also quite common for individuals without brain injury or neurological disorders to show significant profile variability, with some lower test scores approaching or surpassing this threshold. Type I error occurs when clinicians interpret a low score in a neurologically intact individual as evidence of brain injury or another neurological disorder. Although diagnostic errors are virtually impossible to eliminate entirely, their frequency can be minimized with proper safeguards.

For a norm-referenced measure, the score that is used for determining impaired performance (i.e., *α*) also sets the Type I error rate, which in this context refers to incorrectly rejecting the null hypothesis (i.e., concluding that a patient is impaired when in fact they are not). For example, a standard score of 80, which is at the 9th percentile, would have a Type I error rate of 0.09. However, this error rate only holds for the one measure. The error rate for a battery of tests is influenced by the number of tests included in the battery. Clinical decision-making that applies the error rate for a single test score to a battery of tests may result in diagnostic errors in judgment, which may be reduced by using a hypothesis-driven approach to diagnostic assessment (e.g., Decker, 2008; Hale & Fiorello, 2004). However, with administration of additional measures comes the risk of Type I error associated with each of them. A major unresolved issue in neuropsychological assessment is how to control for the Type I error rate in a multitest battery (Axelrod & Wall, 2007; Binder, Iverson, & Brooks, 2009; Brooks & Iverson, 2010; Crawford, Garthwaite, & Gault, 2007; Ingraham & Aiken, 1996). In essence, as the number of measures increases, so does the likelihood of Type I error in clinical decision-making. Additionally, given the influence of base rates on the reliability and validity of clinical judgment (Garb & Schramke, 1996), accurately estimating Type I error rates for a multitest battery is critical in clinical applications, but these rates are seldom mathematically ascertained. The ideal solution for clinicians is to have access to co-normed measures for which base rates for the number of measures expected to be in the impaired range are empirically derived. However, such base rates are frequently unavailable, especially for flexible test batteries, which has been a primary criticism of the flexible battery approach (Bigler, 2008; Russell, Russell, & Hill, 2005).

Alternatively, the next best option is to have a mathematical estimation to control for Type I error. Two approaches are possible. One approach is to estimate the *α* level given the number of tests administered. The other approach is to estimate the base rates for impaired tests given a particular *α* level. *α* adjustment methods, like the Bonferroni method, control for the Type I error rate as a function of the number of comparisons being made. The Bonferroni adjustment is simply the desired *α* level (*x*) divided by the number of comparisons (*m*). For example, when 14 tests are given and *α* is set at 0.05, the Bonferroni-adjusted level would be 0.05/14 ≈ 0.004, which suggests that, to truly keep the error rate at 0.05 when making 14 comparisons, an *α* level of 0.004 should be used. A *p*-value of .004 corresponds to a *z*-score of −2.652, which corresponds to a standard score of approximately 60. Notwithstanding, *α* adjustment methods have been criticized for being overly conservative (Ingraham & Aiken, 1996), and most clinicians would agree that reserving impairment for standard scores of 60 or below would be highly conservative in practice.
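The Bonferroni arithmetic in the preceding paragraph can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it assumes standard scores with a mean of 100 and a standard deviation of 15 and a one-tailed (lower-tail) cutoff.

```python
from statistics import NormalDist


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test alpha that keeps the familywise error rate at `alpha`."""
    return alpha / n_tests


def alpha_to_standard_score(alpha: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Standard score whose lower-tail probability equals `alpha`.

    Assumes normally distributed scores (mean 100, SD 15 by default).
    """
    z = NormalDist().inv_cdf(alpha)
    return mean + sd * z


adjusted = bonferroni_alpha(0.05, 14)      # 0.05 / 14, about 0.0036
z_rounded = NormalDist().inv_cdf(0.004)    # z for the rounded value used in the text
cutoff = alpha_to_standard_score(0.004)    # corresponding standard score

print(f"adjusted alpha = {adjusted:.4f}")
print(f"z = {z_rounded:.3f}, standard score = {cutoff:.1f}")
```

Using the unrounded value 0.05/14 instead of .004 moves the cutoff only slightly, so the conclusion that the adjusted threshold sits near a standard score of 60 does not depend on the rounding.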

The problems with *α* adjustment methods led to greater emphasis on developing base rate approaches to evaluating individual differences in performance. Ingraham and Aiken (1996) used the binomial model to estimate base rates for a given set of tests at a set *α* level. The binomial model estimates the probability that an event (typically described as a success) occurs a given number of times, given its probability on a single trial and the number of trials in which the event can occur. Viewing each test as a “trial” in a series of trials (the test battery) in which a person may obtain some status (impaired vs. not impaired), they provided expectancy curves for the likelihood of obtaining an impaired test result for a variety of *α* levels and numbers of tests administered.

A few research studies have supported the use of the binomial model. Janssen and colleagues (1989) found the binomial model correctly predicted individuals in more progressive stages of HIV with more severely impaired performance than individuals in earlier stages. Similarly, Axelrod and Wall (2007) generalized Ingraham and Aiken's (1996) use of the binomial model methodology to compare the frequency of impairment on the Halstead–Reitan Battery in a non-clinical sample. The binomial model was found to closely predict the empirical rate of impaired performance across seven tests.

One major problem with the binomial model is that it assumes all tests are uncorrelated (Crawford et al., 2007). By neglecting the correlations of the measures (which is unacceptable since neuropsychological measures are often moderately correlated), the binomial model produces overly stringent criteria, thus reducing power. As such, Crawford and colleagues (2007) developed a method which incorporated the correlations among tests in a Monte Carlo simulation method. Although a more complex approach, the process is simplified through the use of a computer program.
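To illustrate why ignoring correlations matters, the following Python sketch simulates the number of low scores in a 14-test battery with and without correlated tests. It is not the program of Crawford and colleagues (2007): it assumes, purely for illustration, that every pair of tests shares a single correlation `rho` (a one-factor simplification), and the function name, the value `rho = 0.3`, the seed, and the number of simulations are all hypothetical choices.

```python
import random
from collections import Counter
from statistics import NormalDist


def simulate_low_score_counts(n_tests, rho, z_cutoff, n_sims=20000, seed=1):
    """Simulate the share of people with k low scores, for k = 0..n_tests.

    Assumption: all tests are standard normal and every pair of tests has
    the same correlation `rho` (one shared factor), which is a simplification
    of the full correlation matrix a real battery would have.
    """
    rng = random.Random(seed)
    load = rho ** 0.5             # loading on the shared factor
    unique = (1.0 - rho) ** 0.5   # weight of the test-specific part
    counts = Counter()
    for _ in range(n_sims):
        g = rng.gauss(0.0, 1.0)   # shared factor score for this simulated person
        low = sum(
            1
            for _ in range(n_tests)
            if load * g + unique * rng.gauss(0.0, 1.0) < z_cutoff
        )
        counts[low] += 1
    return [counts[k] / n_sims for k in range(n_tests + 1)]


z_cut = (80 - 100) / 15                      # cutoff at a standard score of 80
independent = simulate_low_score_counts(14, 0.0, z_cut)
correlated = simulate_low_score_counts(14, 0.3, z_cut)

p_single = NormalDist().cdf(z_cut)           # chance of a low score on one test
binomial_p0 = (1.0 - p_single) ** 14         # binomial chance of zero low scores

print(f"P(no low scores), binomial formula:     {binomial_p0:.3f}")
print(f"P(no low scores), simulated, rho = 0.0: {independent[0]:.3f}")
print(f"P(no low scores), simulated, rho = 0.3: {correlated[0]:.3f}")
```

With uncorrelated tests the simulation approximates the binomial expectation, whereas positive correlations make low scores cluster within the same people, so more simulated individuals have no low scores at all and more have many.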

Several studies have empirically investigated the utility of the Monte Carlo method. In one study, the Monte Carlo software was used to predict base rates in 394 adults who were administered a variety of neuropsychological measures (Schretlen, Testa, Winicki, Pearlson, & Gordon, 2008). This study found that the Monte Carlo method provided better predictions than the binomial model and reasonably accurate predictions of base rates for this particular sample. This research also noted the unusually large number of healthy subjects who obtained impaired performance on some neuropsychological measures. The findings of Schretlen and colleagues (2008) have been replicated (Binder et al., 2009). These findings underscore the need for additional research in base rate estimation (Brooks & Iverson, 2010). Since there is substantial overlap between low scores in typical individuals and individuals with disabilities (Hale et al., 2008), base rate estimation becomes critical in diagnostic practice. The clinician's goal is to decrease the likelihood of Type I errors (false positives) while simultaneously minimizing the likelihood of Type II errors (false negatives) in diagnostic practice (Hale et al., 2010).

To further this line of research, the current study examined the accuracy of the binomial model and the Monte Carlo approach for estimating base rates of a co-normed test battery when base rates are known through empirical sampling. By using a co-normed test battery for which base rates are empirically available, the accuracy of the mathematical estimation procedures could be empirically tested. We predicted that the Monte Carlo approach would perform better than the binomial approach in establishing base rate estimates. Additionally, we tested the models' accuracy across different *α* parameters and age levels.

## Method

### Participants

Participants for this study included a subset of the Woodcock–Johnson Tests of Cognitive Abilities-Third Edition (WJ III; Woodcock, McGrew, & Mather, 2001) standardization sample. The WJ III standardization sample consisted of 8,782 participants with an age range of 12 months to 90+ years of age. Normative statistics were based on the 2005 U.S. census. Individual subject weighting was applied to obtain data proportional to the U.S. census (McGrew, Schrank, & Woodcock, 2007). Data used in this study were based on the WJ III Normative Update (McGrew et al., 2007), which served as the normative data set for the WJ III. Participants were selected using a stratified sampling design that controlled for 11 variables including census region, sex, race, education, and socioeconomic status (*see* McGrew et al., 2007, for more specific details). Because only participants who had complete data for the 14-subtest WJ III Extended battery (Tests of Cognitive Abilities 1–7 and 11–18) were used in this study, the sample size was reduced to 3,356. Fig. 1 displays the demographic make-up of the sample across five age groups. Children aged 5 and younger were omitted from the analyses because too few of them had complete profiles. The remaining five age groups were chosen to mirror the age groupings found in most analyses and descriptive statistics in the “WJ III Technical Manual” (Woodcock et al., 2001).

As seen in Tables 1–5, the requirement that participants have a complete 14-subtest profile means that this study's subsample selected from the WJ III standardization sample performed slightly better than average on most subtests in the adult age groups. Thus, the generalizability of the findings of this study might be somewhat limited. Although not presented here, we ran analyses that estimated how much the conclusions of this study might have changed if the entire standardization sample had complete profiles. We prorated the number of low scores people with incomplete profiles had and ran the same analyses with the entire standardization sample. In general, these estimates suggested that the overall findings of the current study would not change.

| | VC | VAL | SR | SB | CF | VM | NR | GI | RF | PR | AA | AS | DS | MW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VC | | .52 | .19 | .43 | .49 | .27 | .37 | .66 | .30 | .17 | .27 | .38 | .26 | .39 |
| VAL | | | .23 | .32 | .48 | .29 | .38 | .49 | .24 | .20 | .25 | .35 | .26 | .27 |
| SR | | | | .19 | .21 | .13 | .20 | .23 | .13 | .07 | .12 | .12 | .13 | .19 |
| SB | | | | | .34 | .17 | .22 | .45 | .18 | .14 | .31 | .22 | .13 | .29 |
| CF | | | | | | .32 | .34 | .46 | .28 | .20 | .21 | .53 | .25 | .27 |
| VM | | | | | | | .38 | .22 | .41 | .16 | .28 | .31 | .46 | .23 |
| NR | | | | | | | | .30 | .28 | .15 | .14 | .29 | .26 | .34 |
| GI | | | | | | | | | .28 | .13 | .28 | .34 | .24 | .28 |
| RF | | | | | | | | | | .18 | .15 | .23 | .31 | .19 |
| PR | | | | | | | | | | | .05 | .17 | .15 | .05 |
| AA | | | | | | | | | | | | .21 | .38 | .13 |
| AS | | | | | | | | | | | | | .26 | .23 |
| DS | | | | | | | | | | | | | | .15 |
| MW | | | | | | | | | | | | | | |
| Mean | 100.5 | 102.2 | 100.9 | 103.5 | 101.5 | 99.5 | 100.2 | 101.9 | 101.2 | 101.3 | 95.3 | 100.3 | 100.7 | 101.8 |
| SD | 14.1 | 13.4 | 13.9 | 15.1 | 15.5 | 14.0 | 14.9 | 14.2 | 14.6 | 15.1 | 14.7 | 15.8 | 15.1 | 15.6 |
| Skewness | −0.15 | 0.06 | 0.07 | −0.03 | −0.20 | −0.06 | −0.23 | −0.36 | −0.46 | −0.11 | −0.30 | −0.44 | −0.27 | 0.11 |
| Kurtosis | −0.15 | 0.32 | 0.48 | 1.63 | 0.70 | 1.51 | 0.88 | 1.34 | 1.03 | 0.59 | 1.40 | 1.05 | 0.54 | 0.12 |

*Notes*: WJ III = Woodcock–Johnson Tests of Cognitive Abilities-Third Edition; VC = Verbal Comprehension; VAL = Visual-Auditory Learning; SR = Spatial Relations; SB = Sound Blending; CF = Concept Formation; VM = Visual Matching; NR = Numbers Reversed; GI = General Information; RF = Retrieval Fluency; PR = Picture Recognition; AA = Auditory Attention; AS = Analysis Synthesis; DS = Decision Speed; MW = Memory for Words.

| | VC | VAL | SR | SB | CF | VM | NR | GI | RF | PR | AA | AS | DS | MW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VC | | .52 | .35 | .46 | .61 | .31 | .40 | .80 | .40 | .19 | .30 | .48 | .24 | .40 |
| VAL | | | .29 | .37 | .48 | .28 | .33 | .44 | .29 | .23 | .22 | .43 | .22 | .29 |
| SR | | | | .28 | .38 | .22 | .27 | .32 | .13 | .15 | .18 | .37 | .15 | .19 |
| SB | | | | | .38 | .18 | .25 | .45 | .23 | .14 | .34 | .32 | .21 | .38 |
| CF | | | | | | .32 | .39 | .53 | .34 | .17 | .24 | .57 | .25 | .40 |
| VM | | | | | | | .33 | .27 | .38 | .16 | .33 | .30 | .53 | .25 |
| NR | | | | | | | | .33 | .26 | .12 | .22 | .35 | .17 | .40 |
| GI | | | | | | | | | .39 | .18 | .30 | .42 | .23 | .36 |
| RF | | | | | | | | | | .11 | .26 | .24 | .35 | .27 |
| PR | | | | | | | | | | | .10 | .19 | .19 | .10 |
| AA | | | | | | | | | | | | .22 | .42 | .25 |
| AS | | | | | | | | | | | | | .21 | .33 |
| DS | | | | | | | | | | | | | | .13 |
| MW | | | | | | | | | | | | | | |
| Mean | 100.3 | 99.3 | 100.0 | 101.4 | 100.3 | 99.7 | 99.1 | 101.0 | 101.6 | 100.4 | 97.3 | 99.5 | 100.2 | 101.4 |
| SD | 15.2 | 15.1 | 15.2 | 14.4 | 15.8 | 14.7 | 16.4 | 14.7 | 14.3 | 14.7 | 14.7 | 15.5 | 15.5 | 14.9 |
| Skewness | −0.22 | 0.27 | 0.20 | 0.24 | −0.22 | −0.23 | −0.25 | −0.34 | −0.35 | 0.00 | −0.07 | −0.11 | 0.05 | −0.03 |
| Kurtosis | 0.27 | 0.58 | 0.57 | 0.20 | 0.19 | 1.63 | 0.61 | 0.43 | 0.32 | 0.34 | 1.33 | 0.49 | 0.58 | 0.21 |

*Notes*: WJ III = Woodcock–Johnson Tests of Cognitive Abilities-Third Edition; VC = Verbal Comprehension; VAL = Visual-Auditory Learning; SR = Spatial Relations; SB = Sound Blending; CF = Concept Formation; VM = Visual Matching; NR = Numbers Reversed; GI = General Information; RF = Retrieval Fluency; PR = Picture Recognition; AA = Auditory Attention; AS = Analysis Synthesis; DS = Decision Speed; MW = Memory for Words.

| | VC | VAL | SR | SB | CF | VM | NR | GI | RF | PR | AA | AS | DS | MW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VC | | .54 | .37 | .51 | .61 | .29 | .41 | .82 | .37 | .27 | .34 | .52 | .26 | .42 |
| VAL | | | .33 | .35 | .55 | .30 | .38 | .46 | .27 | .32 | .27 | .46 | .25 | .33 |
| SR | | | | .24 | .37 | .24 | .25 | .31 | .17 | .20 | .15 | .36 | .22 | .26 |
| SB | | | | | .42 | .24 | .38 | .46 | .24 | .17 | .46 | .33 | .28 | .41 |
| CF | | | | | | .33 | .40 | .54 | .32 | .27 | .31 | .58 | .30 | .39 |
| VM | | | | | | | .36 | .27 | .37 | .16 | .27 | .32 | .51 | .25 |
| NR | | | | | | | | .31 | .26 | .17 | .25 | .40 | .23 | .48 |
| GI | | | | | | | | | .38 | .20 | .28 | .47 | .26 | .36 |
| RF | | | | | | | | | | .19 | .27 | .30 | .36 | .23 |
| PR | | | | | | | | | | | .15 | .23 | .22 | .19 |
| AA | | | | | | | | | | | | .21 | .44 | .26 |
| AS | | | | | | | | | | | | | .25 | .30 |
| DS | | | | | | | | | | | | | | .18 |
| MW | | | | | | | | | | | | | | |
| Mean | 101.34 | 99.81 | 100.22 | 99.60 | 100.88 | 100.51 | 100.73 | 101.28 | 100.92 | 101.32 | 101.55 | 101.30 | 99.87 | 103.49 |
| SD | 14.72 | 16.53 | 15.35 | 14.32 | 15.41 | 14.99 | 15.20 | 15.42 | 13.80 | 14.33 | 15.11 | 14.79 | 15.82 | 15.43 |
| Skewness | −0.40 | 0.09 | −0.18 | 0.22 | −0.49 | −0.01 | −0.23 | −0.33 | −0.16 | −0.05 | −0.46 | −0.14 | −0.04 | −0.19 |
| Kurtosis | 0.25 | 0.08 | 0.40 | 0.18 | 0.41 | 1.23 | 0.30 | 1.11 | 0.45 | 0.25 | 2.66 | 0.84 | 0.38 | 0.39 |

*Notes*: WJ III = Woodcock–Johnson Tests of Cognitive Abilities-Third Edition; VC = Verbal Comprehension; VAL = Visual-Auditory Learning; SR = Spatial Relations; SB = Sound Blending; CF = Concept Formation; VM = Visual Matching; NR = Numbers Reversed; GI = General Information; RF = Retrieval Fluency; PR = Picture Recognition; AA = Auditory Attention; AS = Analysis Synthesis; DS = Decision Speed; MW = Memory for Words.

| | VC | VAL | SR | SB | CF | VM | NR | GI | RF | PR | AA | AS | DS | MW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VC | | .46 | .41 | .51 | .55 | .25 | .36 | .80 | .34 | .21 | .30 | .50 | .20 | .37 |
| VAL | | | .46 | .45 | .59 | .35 | .43 | .42 | .21 | .34 | .32 | .49 | .30 | .30 |
| SR | | | | .33 | .48 | .34 | .33 | .38 | .06 | .25 | .30 | .41 | .31 | .27 |
| SB | | | | | .41 | .29 | .32 | .47 | .24 | .19 | .42 | .36 | .22 | .40 |
| CF | | | | | | .38 | .46 | .46 | .17 | .23 | .32 | .57 | .28 | .34 |
| VM | | | | | | | .42 | .22 | .21 | .24 | .35 | .36 | .56 | .28 |
| NR | | | | | | | | .33 | .17 | .17 | .28 | .40 | .21 | .46 |
| GI | | | | | | | | | .33 | .19 | .31 | .44 | .20 | .35 |
| RF | | | | | | | | | | .13 | .22 | .21 | .22 | .12 |
| PR | | | | | | | | | | | .21 | .23 | .23 | .11 |
| AA | | | | | | | | | | | | .31 | .42 | .24 |
| AS | | | | | | | | | | | | | .22 | .30 |
| DS | | | | | | | | | | | | | | .14 |
| MW | | | | | | | | | | | | | | |
| Mean | 104.6 | 103.9 | 101.7 | 102.6 | 104.0 | 104.1 | 103.2 | 105.2 | 103.7 | 101.3 | 103.7 | 103.9 | 102.0 | 103.2 |
| SD | 11.7 | 14.6 | 14.6 | 12.2 | 12.5 | 14.4 | 13.8 | 11.8 | 11.2 | 14.4 | 13.4 | 13.5 | 15.9 | 14.5 |
| Skewness | −0.11 | 0.16 | −0.49 | −0.05 | −0.54 | 0.08 | −0.21 | −0.05 | −0.12 | 0.06 | 0.20 | −0.14 | −0.10 | 0.19 |
| Kurtosis | 0.40 | −0.09 | 1.01 | 0.12 | 0.78 | 0.40 | −0.17 | 0.31 | 0.35 | 0.25 | 1.06 | 1.57 | 1.12 | 0.02 |

*Notes*: WJ III = Woodcock–Johnson Tests of Cognitive Abilities-Third Edition; VC = Verbal Comprehension; VAL = Visual-Auditory Learning; SR = Spatial Relations; SB = Sound Blending; CF = Concept Formation; VM = Visual Matching; NR = Numbers Reversed; GI = General Information; RF = Retrieval Fluency; PR = Picture Recognition; AA = Auditory Attention; AS = Analysis Synthesis; DS = Decision Speed; MW = Memory for Words.

| | VC | VAL | SR | SB | CF | VM | NR | GI | RF | PR | AA | AS | DS | MW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VC | | .51 | .36 | .42 | .49 | .32 | .38 | .79 | .39 | .22 | .32 | .41 | .35 | .33 |
| VAL | | | .34 | .40 | .50 | .36 | .37 | .46 | .28 | .37 | .33 | .42 | .30 | .27 |
| SR | | | | .18 | .32 | .24 | .23 | .32 | .02 | .16 | .16 | .32 | .21 | .17 |
| SB | | | | | .33 | .37 | .35 | .41 | .33 | .21 | .50 | .24 | .32 | .46 |
| CF | | | | | | .35 | .33 | .43 | .22 | .21 | .30 | .50 | .30 | .26 |
| VM | | | | | | | .38 | .33 | .35 | .25 | .35 | .26 | .51 | .30 |
| NR | | | | | | | | .32 | .26 | .21 | .27 | .34 | .24 | .41 |
| GI | | | | | | | | | .35 | .17 | .31 | .38 | .33 | .33 |
| RF | | | | | | | | | | .09 | .23 | .18 | .32 | .26 |
| PR | | | | | | | | | | | .20 | .20 | .21 | .23 |
| AA | | | | | | | | | | | | .20 | .44 | .30 |
| AS | | | | | | | | | | | | | .21 | .24 |
| DS | | | | | | | | | | | | | | .21 |
| MW | | | | | | | | | | | | | | |
| Mean | 106.2 | 104.3 | 102.3 | 102.5 | 104.8 | 104.0 | 102.8 | 105.0 | 104.4 | 102.3 | 104.3 | 105.6 | 103.4 | 103.7 |
| SD | 13.0 | 13.3 | 13.6 | 13.1 | 13.3 | 14.2 | 14.5 | 12.2 | 12.7 | 14.6 | 15.1 | 14.2 | 14.5 | 14.4 |
| Skewness | 0.07 | 0.32 | −0.08 | −0.02 | −0.01 | 0.31 | −0.03 | −0.26 | 0.10 | 0.31 | 0.17 | −0.09 | 0.24 | −0.14 |
| Kurtosis | 0.50 | 0.43 | 0.72 | 0.21 | 0.04 | 0.36 | 0.01 | 0.37 | 0.91 | 0.31 | 1.53 | 1.21 | 0.54 | 0.73 |

*Notes*: WJ III = Woodcock–Johnson Tests of Cognitive Abilities-Third Edition; VC = Verbal Comprehension; VAL = Visual-Auditory Learning; SR = Spatial Relations; SB = Sound Blending; CF = Concept Formation; VM = Visual Matching; NR = Numbers Reversed; GI = General Information; RF = Retrieval Fluency; PR = Picture Recognition; AA = Auditory Attention; AS = Analysis Synthesis; DS = Decision Speed; MW = Memory for Words.

### Measures

The WJ III standardization sample for the extended battery of tests was used to obtain empirical, or “actual,” base rates for atypical subtest performance within the normal population. The convergence of neuropsychological and psychometric theories supports the construct validity of the WJ III factor structure and diagnostic approach, suggesting its utility in neuropsychological evaluations (Fiorello, Hale, Snyder, Forrest, & Teodori, 2008). For instance, fluid reasoning has been found to be related to executive functions (Decker, Hill, & Dean, 2007), whereas visual-spatial and crystallized abilities have been related to posterior right and left hemisphere functioning, respectively (Goldberg, 2001; Hale & Fiorello, 2004). In addition, the WJ III was considered ideal for this study because it is essentially a fixed battery, so its measures are co-normed, yet it has also been advocated for use in a flexible battery approach. Its use by neuropsychologists has been moderate (Rabin et al., 2005), but it has grown in popularity as part of a standard neuropsychological battery (Davis, Finch, Dean, & Woodcock, 2006; Dean, Woodcock, Decker, & Schrank, 2003). Additionally, it is frequently used for hypothesis testing in a flexible battery approach, notably within a cross-battery model (Flanagan, Ortiz, & Alfonso, 2007) or for testing hypotheses derived from other cognitive measures (Fiorello et al., 2008).

The WJ III Extended battery consists of 14 different subtests that match the Cattell–Horn–Carroll (CHC) model of cognitive functioning. The CHC model is a hierarchical model that, as operationalized by the WJ III, measures seven factors of cognitive functioning with two subtests per factor. Reliability estimates for each subtest are reported for broad age groups and generally range from 0.80 to 0.95 (*see* the *WJ III Technical Manual* for more specific information, Woodcock et al., 2001). Additionally, the *WJ III Technical Manual* contains a complete description of each subtest used in this study. The details of each subtest were omitted here because the focus of this study was on statistical patterns of the subtests rather than on an interpretative analysis of each subtest. Further information on WJ III constructs and subtests can be found in Schrank, Miller, Wendling, and Woodcock (2010).

Subtests, rather than factors, were the focus of this investigation for a variety of reasons. First, although test interpretation is typically factor based, each factor is made up of subtests. When a patient's subtest scores within the same factor differ substantially from each other, clinicians frequently shift their interpretive focus to the individual subtest level (Flanagan et al., 2007; Hale & Fiorello, 2004; Kaufman, 1994). Second, this study sought to investigate Type I error in the most extreme cases. Since there are 7 factors but 14 subtests, interpretation of 14 subtests is more controversial and potentially more prone to error. Furthermore, neuropsychologists typically administer numerous tests, far more than 14. Thus, investigating a situation with 14 subtests is more ecologically valid than one with 7. Finally, the utility of subtest interpretation has been debated primarily on statistical grounds (Watkins, 2003), but subtests always account for more variance in meaningful outcomes than do nomothetic factor or IQ scores (Elliott, Hale, Fiorello, Dorvil, & Moldovan, 2010; Hale et al., 2008). In addition, subtest analysis may be useful in clinical diagnosis and is used by many neuropsychologists. As such, controlling for Type I error at the subtest level may be of most relevance to neuropsychologists.

### Analysis

To obtain the empirical base rates of the number of low scores within each of the five age groups in the normative sample of the WJ III, all standard scores were converted into *z*-scores. To investigate the variety of Type I error strategies, an additional matrix was created in which each of the 14 WJ III subtests was coded as either 0 or 1. A 1 was assigned to a subtest if the person's score on the test was below the *α* threshold set to determine abnormality. A variety of *α* thresholds were evaluated. For example, if abnormality was defined as a test score <85, then all test scores below a *z*-score of −1 were converted into 1, and all scores at or above a *z*-score of −1 were converted into 0. The number of low scores was then counted across all 14 subtests for each subject, and these counts were tallied across all subjects. This analysis provided the empirical or actual number of subjects with a specific number of impaired test scores at a particular *α* level.
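The counting procedure described above can be sketched in Python as follows. The function names and the four three-subtest profiles are hypothetical and serve only to illustrate the steps; the sketch assumes standard scores with a mean of 100 and a standard deviation of 15 and treats a score exactly at the cutoff as not low.

```python
from collections import Counter


def count_low_scores(profile, cutoff=85.0, mean=100.0, sd=15.0):
    """Number of subtests in one person's profile that fall below the cutoff.

    Scores are converted to z-scores first, mirroring the description in the
    text; a score exactly at the cutoff is coded 0 (not low).
    """
    z_cutoff = (cutoff - mean) / sd
    return sum(1 for score in profile if (score - mean) / sd < z_cutoff)


def empirical_base_rates(profiles, cutoff=85.0):
    """Proportion of people with exactly k low scores, for k = 0..n_tests."""
    n_tests = len(profiles[0])
    tally = Counter(count_low_scores(p, cutoff) for p in profiles)
    n_people = len(profiles)
    return [tally[k] / n_people for k in range(n_tests + 1)]


# Hypothetical data: four people, three subtests each (not WJ III data).
profiles = [
    [102, 97, 110],   # no low scores
    [84, 100, 91],    # one low score
    [79, 83, 105],    # two low scores
    [85, 120, 99],    # 85 is at the cutoff, so no low scores
]

rates = empirical_base_rates(profiles, cutoff=85)
for k, rate in enumerate(rates):
    print(f"{k} low scores: {rate:.2f}")
```

Running the same function separately on each age group and at each cutoff would yield one empirical frequency distribution per condition.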

Because various *α* levels were used in previous research, this study examined results across four different *α* levels, which corresponded to standard scores of 85, 80, 75, and 70 (i.e., *z*-scores of −1, −1.33, −1.67, and −2, respectively). Additionally, because correlations among subtests are known to change across development, we investigated base rate estimates across five age groups.
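As a quick check on these thresholds, the short Python sketch below converts each standard-score cutoff to its *z*-score and lower-tail probability. The helper name is hypothetical, and the calculation assumes normally distributed scores with a mean of 100 and a standard deviation of 15.

```python
from statistics import NormalDist


def cutoff_to_z_and_p(cutoff, mean=100.0, sd=15.0):
    """Return the z-score of a cutoff and the proportion of scores below it."""
    z = (cutoff - mean) / sd
    return z, NormalDist().cdf(z)


thresholds = {c: cutoff_to_z_and_p(c) for c in (85, 80, 75, 70)}

for cutoff, (z, p) in thresholds.items():
    print(f"standard score {cutoff}: z = {z:.2f}, p = {p:.4f}")
```

Under these assumptions the four cutoffs correspond to single-test probabilities of roughly .16, .09, .05, and .02.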

Previous studies on this topic have used differing methodologies for comparing model estimates. Most studies reported the percentage of cases with a particular number of subtests in the impaired range and applied *χ*^{2} inconsistently, either to each frequency bin or across all frequencies. As such, this study used multiple methods to compare the relative fit of the quantitative models to the empirically based estimates. We compared the frequency distributions of both quantitative models to the empirically based estimates. These results provide evidence for the fit between the quantitative models and the actual estimates in terms of the percentage of overlapping cases correctly classified by the quantitative model for each frequency bin (Kline, 2004).

Additionally, regression analyses were conducted to determine the degree of fit between each quantitative model and the empirical-based estimates for each age group. These analyses provide an overall model fit across each frequency count (0, 1, 2, etc.) for each age group. These regression estimates for the Monte Carlo can then be directly compared with the binomial to determine which model provides a better estimate of the empirically derived frequencies. To compare the relative accuracy of the Monte Carlo method and the binomial method in predicting the empirical estimates, we used a method of comparing correlated correlation coefficients (two correlations that involve a common variable) in which correlations are converted to *z*-scores using Fisher's r-to-z transformation and the difference between the transformed correlations is evaluated with a *z*-test (Meng, Rosenthal, & Rubin, 1992).
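A minimal sketch of this comparison follows, assuming the standard form of the Meng, Rosenthal, and Rubin (1992) statistic. The frequency vectors are made up for illustration, and treating the number of frequency bins as the sample size *n* is an assumption of this sketch rather than a detail reported in the text.

```python
import math
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def meng_z(r1, r2, r12, n):
    """Z-test for two dependent correlations that share one variable.

    r1, r2 : correlation of each model's estimates with the observed rates
    r12    : correlation between the two models' estimates
    n      : number of paired observations
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher r-to-z
    r_sq_bar = (r1 ** 2 + r2 ** 2) / 2
    f = min((1 - r12) / (2 * (1 - r_sq_bar)), 1.0)  # f is capped at 1
    h = (1 - f * r_sq_bar) / (1 - r_sq_bar)
    z = (z1 - z2) * math.sqrt((n - 3) / (2 * (1 - r12) * h))
    p = 2 * (1 - NormalDist().cdf(abs(z)))       # two-tailed p-value
    return z, p

# Hypothetical percentages of examinees with 0..5 low scores
observed = [60, 18, 9, 6, 4, 3]
model_a = [58, 19, 10, 6, 4, 3]    # stands in for Monte Carlo estimates
model_b = [27, 37, 24, 9, 2, 1]    # stands in for binomial estimates

r1 = pearson_r(observed, model_a)
r2 = pearson_r(observed, model_b)
r12 = pearson_r(model_a, model_b)
z, p = meng_z(r1, r2, r12, n=len(observed))
```

Squaring `r1` and `r2` gives the simple-regression *R*^{2} for each model, so the same three vectors supply both the fit indices and the test of their difference.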

#### Binomial model

After empirically deriving the base rates for the WJ III Extended battery of tests, base rate expectations based on the binomial model were calculated. The binomial model has previously been demonstrated to have utility in predicting base rates (Ingraham & Aiken, 1996); however, it has the inherent shortcoming of not taking into account correlations among the subtests. It was included in this study as a basis of comparison as well as an indicator of the influence of correlations on base rate estimation. The binomial model is a probabilistic model for binary outcomes. It gives the probability of obtaining a particular number of outcomes by chance, given the number of trials and the probability of the outcome on each trial. To apply the binomial model to normative scores, the *p*-value for the score designated as the cutoff for determining impairment is derived. In the current study, this was accomplished by converting the standard score corresponding to a given *α* to a *z*-value and then obtaining the *p*-value for the area under the normal curve below that value. For example, if *α* corresponded to a standard score of 80, then a *p*-value of .09 was used in the binomial model. Estimates were obtained by using the binomial function in Microsoft Excel. Because the binomial model does not take into account test correlations, expectations derived from the binomial model were invariant across age ranges and changed only as a result of *α*.
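For readers who prefer a script to a spreadsheet, the same calculation can be sketched as follows. The function name is hypothetical, and the sketch uses the exact normal-curve area below the cutoff (about .091 for a standard score of 80) rather than the rounded value of .09 quoted above, so results may differ slightly from Excel output based on .09.

```python
from math import comb
from statistics import NormalDist

def binomial_base_rates(n_tests, cutoff, mean=100.0, sd=15.0):
    """Probability of exactly k low scores (k = 0..n_tests), assuming the
    tests are independent and each score is normally distributed."""
    p = NormalDist(mean, sd).cdf(cutoff)   # area under the curve below the cutoff
    return [
        comb(n_tests, k) * p ** k * (1 - p) ** (n_tests - k)
        for k in range(n_tests + 1)
    ]

# 14 subtests, impairment defined as a standard score below 80
rates = binomial_base_rates(14, 80)
p_zero_low = rates[0]              # expected proportion with no low scores
p_at_least_one = 1 - rates[0]      # expected proportion with one or more
```

Because the only inputs are the number of tests and the cutoff, the output is identical for every age group, which is the invariance noted above.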

#### Monte Carlo simulation method

The Monte Carlo estimates of base rates were calculated based on specifications outlined by Crawford and colleagues (2007). Low score prevalence was computed using simulated data sets generated in SPSS 19. The results were cross-checked using the computer program provided by Crawford (available at http://www.abdn.ac.uk/~psy086/dept/PercentAbnormKtests.htm). Test intercorrelations were based on the WJ III data set for each age group. As previously mentioned, Crawford developed the simulation method in response to possible shortcomings of the binomial model. This simulation is computationally complex and involves decomposing the correlation matrix and running multiple Monte Carlo simulations (Crawford et al., 2007). The method is simplified by a computer program that is available for download from the previously cited website. The software requires correlation estimates for the measures being used. Correlation matrices used in the Monte Carlo procedure for this study were derived from the WJ III data set for ages 6–8, 9–13, 14–19, 20–39, and 40+. In a similar manner, correlations from the technical manual could also be used. The derived correlations used in this study differed from the correlations published in the technical manual in that we used only participants who had complete data for the WJ III Extended battery. When the Monte Carlo estimates based on the data set correlations were compared with those based on the published correlations, the results were generally equivalent, with some deviations of 1 or 2 percentage points.
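The general idea behind such a simulation can be sketched briefly. The code below is an illustration of the approach (Cholesky decomposition of the correlation matrix followed by repeated draws of correlated normal scores), not a reproduction of Crawford's program or of the SPSS syntax used here. The three-test correlation matrix, the number of simulations, and the function names are assumptions made for the example.

```python
import math
import random

def cholesky(matrix):
    """Lower-triangular L with L * L^T equal to a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def simulate_base_rates(corr, z_cut, n_sims=20000, seed=0):
    """Estimated proportion of simulated examinees with exactly k scores
    below z_cut, for k = 0..number of tests."""
    rng = random.Random(seed)
    n = len(corr)
    lower = cholesky(corr)
    counts = [0] * (n + 1)
    for _ in range(n_sims):
        # Independent standard normal draws ...
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # ... transformed into z-scores with the target correlations
        scores = [sum(lower[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        counts[sum(1 for s in scores if s < z_cut)] += 1
    return [c / n_sims for c in counts]

# Hypothetical battery: three tests, every pair correlated .50
corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
rates = simulate_base_rates(corr, z_cut=-1.0)
```

Passing an identity matrix instead of `corr` makes the simulated tests independent, in which case the output should approximate the binomial expectations; positive correlations shift probability toward the zero-low-scores category.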

## Results

Means, standard deviations, and zero-order correlations for the WJ III Extended battery of tests can be found in Table 1. As would be expected given the source of data, the descriptive data were largely invariant across subtests. Subtest intercorrelations were all positive but generally moderate, ranging from 0.07 to 0.66. There were several cross-factor correlations, suggesting factorial complexity across the battery, but the strongest correlations were generally found for the subtests that comprised the factors (e.g., crystallized abilities, fluid reasoning) they purportedly measure.

Frequency counts were made for the number of subtests below the four *α* thresholds for the empirical data for all subjects and subtests. Additionally, the number of subtests expected to fall below each *α* threshold under the binomial and Monte Carlo models was estimated. Because 14 subtests were used, any particular subject could have between 0 and 14 impaired scores, with 0 indicating no impaired test scores and 14 indicating that all tests were in the impaired range. Frequency counts were tallied across the specified age ranges.

For all age groups and *α* thresholds, the category of zero subtests in the impaired range was the most frequent category. However, as expected, the frequency estimate for this category varied considerably depending on the *α* threshold. It also varied depending on the age group. For example, for the *α* threshold corresponding to a standard score of 80, the frequency estimate for the category of 0 impaired tests ranged between approximately 50% and 70% (Fig. 4). As such, in a battery of 14 tests, there is roughly a 30%–50% chance of obtaining one or more subtests in the impaired range, with a higher probability of obtaining an impaired result for younger participants. Clearly, diagnostic decisions should not be made based on impairment on one test, but such a result should not automatically be dismissed as an anomaly. Instead, clinicians can use this information to develop hypotheses about cognitive strengths and weaknesses, and then use additional measures to confirm or refute potential weaknesses found on cognitive test profiles (Hale & Fiorello, 2004).

Figs 2–5 graphically display the frequency counts for the number of impaired subtests at different age groups and *α* thresholds. As indicated in the graphs, frequency counts are based on the empirical data as well as the estimates for the binomial model and the Monte Carlo procedure. The percentage of overlap between each model and the actual data provides an indicator of model fit. Although actual estimates and Monte Carlo estimates change across age groups, binomial model estimates stay the same because the model takes into account only the number of tests and the probability of a low score, which remains invariant across age groups. In contrast, the Monte Carlo estimates change because the correlations among tests change with age. As indicated by the percentage of overlap of each model with the empirical estimates for each frequency bin, the deviation in model fit primarily occurs for estimates involving fewer than three tests in the impaired range.

An interesting pattern in the data emerges when considering the results for clinical purposes. Suppose that a clinician considers scores below 80 to be in the impaired range. For each age range, ∼80%–90% of the participants can be described as having three or fewer test scores in the impaired range. If such a criterion is used, there is negligible difference between the Monte Carlo estimates and the binomial model estimates for frequency ranges of four impaired subtests or higher. That is, most of the difference between the Monte Carlo and the binomial model occurs in the 0–2 or 0–3 range, which is typically the non-clinical range. This may account for different studies validating each of these procedures, because both models are fairly accurate in forecasting low probability events (4–10 impaired test scores). However, when used to estimate the frequency of test scores across the entire range of frequencies, the Monte Carlo method has a clear advantage.

Regression-based fit indices were estimated to compare how well the binomial model and the Monte Carlo model predicted the empirical or actual number of test scores in the impaired range. Regression analyses were performed for each age group (6–8, 9–13, 14–19, 20–39, and 40+). Subtest impairment was defined at four levels of impairment (*α* thresholds corresponding to standard score cutoffs of 70, 75, 80, and 85). Results from the regression analyses can be found in Table 6. In almost all analyses, the estimates from the binomial method and the Monte Carlo method were both significantly correlated with the observed rates. However, a test of the difference between the two correlations (i.e., the square roots of the *R*^{2} values in Table 6) found that, in each age group and at each *α* threshold, the Monte Carlo method's correlations with the observed rates were significantly larger than those of the binomial estimates.

| Threshold | Age | *R*^{2} (Monte Carlo) | *R*^{2} (Binomial) | *z*_{Difference} |
|---|---|---|---|---|
| 70 | 6–8 | .9999 | .97 | 9.84 |
| 70 | 9–13 | .9991 | .97 | 6.49 |
| 70 | 14–19 | .9999 | .97 | 9.34 |
| 70 | 20–39 | .9999 | .94 | 11.59 |
| 70 | 40+ | .9998 | .94 | 9.44 |
| 75 | 6–8 | .991 | .83 | 5.56 |
| 75 | 9–13 | .997 | .86 | 7.03 |
| 75 | 14–19 | .999 | .86 | 8.14 |
| 75 | 20–39 | .9993 | .77 | 9.80 |
| 75 | 40+ | .9997 | .77 | 11.09 |
| 80 | 6–8 | .98 | .58 | 5.60 |
| 80 | 9–13 | .997 | .65 | 7.80 |
| 80 | 14–19 | .997 | .59 | 8.09 |
| 80 | 20–39 | .993 | .39 | 7.26 |
| 80 | 40+ | .996 | .40 | 8.00 |
| 85 | 6–8 | .97 | .41 | 5.56 |
| 85 | 9–13 | .98 | .40 | 5.74 |
| 85 | 14–19 | .99 | .33 | 6.57 |
| 85 | 20–39 | .99 | .16* | 7.46 |
| 85 | 40+ | .99 | .16* | 7.10 |

*Note*: All *z*-tests significant at the 0.001 level.

\**p* > 0.05

As indicated in Table 6, the Monte Carlo model provided a better fit to the actual data across all age ranges. There was a slight decrease in fit across age. That is, the Monte Carlo model produced greater misfit for older age groups than for younger age groups. This trend can also be observed across Figs 2–5 and appears to be primarily a result of higher frequency counts of zero scores in the impaired range. Nonetheless, the Monte Carlo model still provided a good fit to the data, with *p* < .001 across the five age ranges and four *α* thresholds.

## Discussion

Although multitest batteries are frequently used in neuropsychology, adjustment of thresholds for determining abnormal performance based on the number of tests administered may not be considered. Most clinicians recognize the tradeoff between Type I and Type II errors in diagnostic decision-making, but only in the abstract, because there is a paucity of mathematical analyses available for making more empirically based decisions regarding multiple base rates when conducting neuropsychological assessments, especially when multiple measures with different normative samples are used in a flexible battery approach. In addition to the importance of clinicians’ awareness of Type I error in decision-making (Gigerenzer, 2002; Gigerenzer & Hoffrage, 1995; Hunink et al., 2001; Wedding & Faust, 1989), legal precedents require an understanding of the odds of erroneous decision-making (e.g., Daubert). The current study suggests that mathematical models can be used to examine base rate estimations across measures, and this nomothetic information can be useful in guiding idiographic interpretation of neuropsychological test data (Hale et al., 2010).

This study compared two mathematical base rate estimation procedures, the binomial and Monte Carlo methods, with actual base rates for various *α* and age levels. Confirming previous research (Crawford et al., 2007), the binomial model produced the greatest inaccuracies in estimating Type I error rates when compared with empirical distributions from the WJ III measures. The drastic departure was expected given that this model does not take into account test correlations, which are known to be significant on most cognitive and neuropsychological measures. The basic pattern of the model's bias was to underestimate the number of subtests below an *α* threshold at the low range and overestimate subtests below an *α* threshold at the higher range, which could lead to Type II error in the former and Type I error in the latter situation. Despite its noted applications (Ingraham & Aiken, 1996), the binomial model is likely to produce erroneous results when used in clinical practice. In comparison, the Monte Carlo model was substantially more accurate. On the whole, the results of this study are consistent with other research suggesting that the Monte Carlo method accurately estimates base rates and is more accurate than the binomial model (Brooks & Iverson, 2010; Schretlen et al., 2008).

The Monte Carlo method accurately predicted the trend of the empirical data from the WJ III standardization sample for each age group. However, some deviations were noted. Like the binomial model, the Monte Carlo method underestimated the number of subtests below an *α* threshold for the low-frequency subtest impaired range (0) and overestimated subtests below an *α* threshold at the higher impaired subtest frequency range (>1 or >2). This pattern produces a “cross-over” in the data: the model estimates are lower than the empirical norms for the lowest frequency value (0), about equal for a frequency of 1, and typically higher for frequencies of 2–10. The deviation of the model estimates from the actual rates is likely a result of minor differences in skew and kurtosis in the empirical sample that are not accounted for by the Monte Carlo procedure, which assumes perfect normality. Further evaluating the precise impact of skewness and kurtosis on Monte Carlo estimates may be an important area for future investigations.

The implication for clinical practice is that the Monte Carlo method is recommended over the binomial model for estimating base rates during clinical evaluation. Although both models may work about equally well in predicting low probability events, the Monte Carlo method's general accuracy across all frequency categories makes it the more robust model. As a caution, the validity of the Monte Carlo method is compromised to the degree that the tests included in the multitest battery depart from normality. Clinicians using the Monte Carlo method may consider first checking the normality of each test prior to its use. This is a significant consideration given that clinicians tend to evaluate individuals with disabilities, who have been shown to have different within and between subtest patterns than those in standardization samples (e.g., Elliott et al., 2010; Hale et al., 2008), consistent with functional and structural brain differences found among clinical populations (Hale et al., 2010).

Some additional caveats are warranted. First, as mentioned by Sattler and colleagues (2008), meaningful test interpretation requires the integration of information from various sources, including statistical significance and base rates. The methods used in this article are applicable in situations where clinicians engage in “test-then-interpret” practice. That is, a large number of tests are administered, scores are reviewed, and a rationale is given for low scores. This approach is different from a “hypotheses-test-confirm/disconfirm hypotheses” practice that can be used to substantiate interpretive findings during cognitive and neuropsychological evaluation for both diagnostic and intervention purposes (e.g., Clements, Christner, McLaughlin, & Bolton, 2011; Decker, 2008; Dehn, 2008; Elliott et al., 2010; Feifer & Della Toffalo, 2007; Fiorello, Hale, Decker, & Coleman, 2009; Flanagan, Alfonso, Mascolo, & Hale, 2010; Fletcher-Janzen, 2005; Hale & Fiorello, 2004; Hale, Wycoff, & Fiorello, 2010; Miller, Getz, & Leffard, 2006; Miller & Hale, 2008; Witsken, Stoeckel, & D'Amato, 2008). In such practice, Type I error rates are reduced because only a specific set of hypotheses is tested, and the number of hypotheses is likely to be smaller than the number of tests administered. In other words, clinicians can develop and test specific hypotheses about potential weaknesses rather than administer large fixed batteries of tests, which is an advantage of the flexible battery approach (Hale & Fiorello, 2004). Additionally, determining that low test performance is a result of neurological impairment is a “clinical inference” and not just a result of low scores (Schretlen et al., 2008). Unfortunately, this is another area of neuropsychological practice without formal specifications to guide clinicians and is largely dependent on substantial supervision provided by credentialed neuropsychologists.

Future research may systematically examine the degree to which skewness, kurtosis, and non-linear associations among variables cause inaccuracies in the Monte Carlo procedure. Additionally, subsequent studies may attempt to modify the Monte Carlo equations to include skewness and kurtosis estimates to produce more accurate estimates of empirical base rates. Although tests in co-normed batteries like the WJ III or the Wechsler tests are independent, there may be idiosyncratic test development factors that influence test correlations. Future research may attempt to validate the Monte Carlo method on a flexible test battery where tests were not simultaneously developed and co-normed. Such investigations, as demonstrated in this study, must block samples by age. Finally, future studies may wish to validate the Monte Carlo method not only against the empirically derived base rates for a healthy population but also for a clinical population. These mathematical issues need to be explored within and across measures for standardization samples and for individuals with disabilities whose covariance matrices may be substantially different. Because scores in clinical populations may be more likely to be non-normally distributed, the Monte Carlo method may produce less accurate estimates in clinical populations.

This approach has useful implications for clinical decision-making. Accurate knowledge of the base rates of a phenomenon helps prevent misinterpretations and misdiagnoses. Fig. 6 displays the Monte Carlo estimates of the cumulative percent of examinees with a certain number of subtests below the desired threshold of impairment. If an examinee is in the 6–8 age range and the clinician uses a standard score of 80 as the threshold to determine impairment, it is helpful to know that it is rather common for individuals to have 1, 2, or 3 subtests below 80. It is not until there are 4 (out of 14) subtests below 80 that the prevalence is <10% of the population and not until there are 8 (out of 14) subtests below 80 that the prevalence is <1% of the population. This knowledge should promote a bit of caution and stimulate a bit of thought in the clinician before making a diagnostic decision based on one or two low subtest scores. A clinician under the mistaken impression that it is rare to have two subtests below 80 may be overly aggressive in diagnosing rare conditions. Ultimately, the diagnosis may still turn out to be the same but the process by which it emerged is likely better informed by more accurate base rates.
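The cumulative reading described above can be sketched as a small lookup. The rates below are made up for illustration and are not the values plotted in Fig. 6; the function names are hypothetical, and the sketch assumes that prevalence refers to the proportion of examinees with at least a given number of low scores.

```python
def prevalence_at_least(rates):
    """rates[k] is the proportion with exactly k low scores; return the
    proportion with at least k low scores for each k."""
    return [sum(rates[k:]) for k in range(len(rates))]

def smallest_rare_count(rates, max_prevalence):
    """Smallest number of low scores whose prevalence falls below
    max_prevalence, or None if no count is that rare."""
    for k, p in enumerate(prevalence_at_least(rates)):
        if p < max_prevalence:
            return k
    return None

# Hypothetical proportions with exactly 0, 1, ..., 6 low scores
rates = [0.50, 0.20, 0.12, 0.09, 0.05, 0.03, 0.01]
cumulative = prevalence_at_least(rates)
k_rare = smallest_rare_count(rates, max_prevalence=0.10)
```

With model-based or empirical rates for a given age group and cutoff in place of the made-up values, `k_rare` indicates how many low scores are needed before the pattern is uncommon at the chosen prevalence level.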

Adjusting this procedure to account for demographic factors (Schretlen et al., 2008) and other variables such as the mean subtest score (Brooks & Iverson, 2010) may be a focus of future research. However, this research will come with its own methodological challenges, such as the problems with adjusting diagnostic and empirical findings after accounting for the variance explained by IQ (e.g., Dennis et al., 2009; Hale, Fiorello, Kavanagh, Holdnack, & Aloe, 2007).

As a final note, it is important to make a distinction between atypical test scores and impaired test performance (Schretlen et al., 2008). Neuropsychological assessment requires the detection of a pattern of test scores consistent with cerebral impairment or brain dysfunction, but in many high-incidence cases all we have is the scores, which are based on performance during the testing situation, and not “hard signs” of brain dysfunction (Hale & Fiorello, 2004). Lower than average test scores can be an important sign of cerebral impairment. However, low test scores can result from multiple causes other than cerebral impairment, such as measurement error, inconsistent attention, or variable motivation. This information can also be diagnostic, because poor motivation and variable performance can be signs of the executive dysfunction seen in individuals with psychopathology (Hale et al., 2009). Additionally, as demonstrated by this and other studies, low test scores may be a result of random chance. Cultivation of a statistical method to control for measurement error and variability, while not sufficient for identifying impairment, may improve sensitivity and specificity by ruling out spurious test results arising from random error and other sources of variation. Once sufficiently validated, such a method could be incorporated into clinical practice through easy-to-use computer software or by building the statistical method into test scoring software.

## Conflict of Interest

None declared.

## Acknowledgement

The authors would like to thank the Woodcock-Munoz Foundation for permission to use the WJ-III standardization data.