Abstract

One of the most difficult tasks facing survey researchers is balancing the imperative to keep surveys short with the need to measure important concepts accurately. Not only are long batteries prohibitively expensive, but lengthy surveys can also lead to less informative answers from respondents. Yet, scholars often wish to measure traits that require a multi-item battery. To resolve these competing constraints, we propose the use of adaptive inventories. This approach uses computerized adaptive testing methods to minimize the number of questions each respondent must answer while maximizing the accuracy of the resulting measurement. We provide evidence supporting the utility of adaptive inventories through an empirically informed simulation study, an experimental study, and a detailed case study using data from the 2016 American National Election Study (ANES) Pilot. The simulation and experiment illustrate the superior performance of adaptive inventories relative to fixed-reduced batteries in terms of precision and accuracy. The ANES analysis serves as an illustration of how adaptive inventories can be developed and fielded and also validates an adaptive inventory with a nationally representative sample. Critically, we provide extensive software tools that allow researchers to incorporate adaptive inventories into their own surveys.

1. INTRODUCTION

One of the most difficult tasks facing survey researchers is balancing the imperative to keep surveys short with the need to measure concepts accurately. Surveys with nationally representative samples are expensive; long surveys are extremely expensive. Worse, lengthy surveys increase the burden on respondents and drive up attrition, item nonresponse, unit nonresponse, and satisficing. Yet, scholars often wish to study latent traits or attitudes that can only be measured accurately using large multi-item batteries.

Facing these countervailing pressures, researchers almost universally choose a subset of questions from large batteries to administer to all respondents. However, this approach is inefficient since these reduced batteries inevitably include items that provide little additional information about some respondents’ positions on the latent scale. In other words, while any given reduced battery might perform well on average, it will be poorly chosen for many specific respondents.

In this article, we provide an alternative approach that we refer to as adaptive inventories (AIs), which allows researchers to sidestep this balancing act. The advantage of AIs is that they adjust dynamically by using individuals’ answers to items already administered in the battery to optimally choose subsequent questions. In short, using an AI rather than a fixed subset of questions comes at no additional cost in terms of survey time yet provides survey researchers more accurate estimates of respondents’ positions on a latent trait by asking a reduced battery customized to each respondent.

Intuitively, AIs are founded on the premise that we should not ignore what we have already learned about a respondent when choosing questions. For example, if we are measuring political knowledge, we should not ask a respondent who correctly defined the Byrd Rule whether she knows what position is held by Mike Pence. Any respondent with sufficient political sophistication to answer the former almost certainly knows the answer to the latter. Instead, we should update our beliefs about respondents’ level of knowledge as the survey progresses and choose questions calibrated to our current beliefs about their positions on the trait. For instance, we might ask that respondent to name the Secretary of Homeland Security.

On a more technical level, AIs are an application of computerized adaptive testing (CAT), an extension of item response theory (IRT) originating in educational testing (Weiss 1982). Computerized adaptive testing builds on basic IRT models to allow tests to change dynamically for each respondent (Kingsbury and Weiss 1983). Computerized adaptive testing is widely used in the fields of educational testing and psychology. Despite the vintage of the approach, however, it has rarely been applied in public opinion research and, to the best of our knowledge, has never before been used on a nationally representative sample in social science research.

The paucity of AIs in public opinion surveys is easy to explain. Researchers and survey firms are dissuaded from implementing AIs by the lack of freely available software that they can integrate into their own data collection systems. Further, the literature lacks a comprehensive guide to CAT methods that explicates the technique in an intuitive manner, illustrates its advantages beyond simulations or single batteries, and provides guidance as to how it can be implemented in collaboration with real-world survey firms.

In this article, we build on the limited previous work applying AIs to survey research (e.g., Montgomery and Cutler 2013) in two ways to address these obstacles. To begin, previous presentations of AIs have illustrated their advantages using only one or two latent traits tested in either simulations or convenience samples. Here, we provide simulation and experimental evidence demonstrating the benefits of AIs using ten well-established personality batteries. This includes an application of AIs on the 2016 American National Election Study (ANES) Pilot. To our knowledge, this is the first time an adaptive inventory has been administered to a nationally representative sample in the social sciences.

More critically, we provide an extensive suite of software tools that resolve several technical obstacles. This includes a freely available R package that can execute adaptive algorithms in near real time (less than 0.01 seconds). We also provide a simplified approach to precalculating AIs for easy integration into online, interactive voice response (IVR), or computer-assisted telephone interview (CATI) surveys.

In the next section, we discuss a particularly valuable setting for AIs that we focus on throughout the rest of the article: measuring personality. In section 3, we outline the general AI approach and provide details on the specific implementation we use in our applications. We then provide evidence supporting the utility of AIs through an empirically informed simulation study, an experimental study using large convenience samples, and a case study of the 2016 ANES Pilot. Finally, in the online supplementary material, we provide details about our freely available software and practical guidance to assist scholars.

2. MOTIVATING EXAMPLE: MEASURING PERSONALITY

When would researchers be interested in including an AI in a survey in the first place? The method is most appropriate when the following four criteria are met. First, it assumes that scholars are interested in measuring a latent trait rather than analyzing survey responses to specific questions. This requires that the battery itself conform to the assumption of latent variable models (e.g., factor analysis, latent class analysis) that responses are conditionally independent. Specifically, there should be no order effects within the battery. For instance, answers to Q1 cannot directly affect answers to Q2; any observed covariance in responses must be due to the fact that responses to both are a function of the underlying trait.

Second, it assumes that the underlying concept is unidimensional; in the conclusion, however, we discuss extensions to the multidimensional setting. Third, the survey should have space for at least three question items drawn from a larger battery, although in principle the method can be applied using only two. Fourth, the battery should have many question items from which the algorithm can choose. Ideally, these questions should be varied and contain both extreme and moderate items. However, as we show below, the method works with standard scales as short as eight items.

Survey research tasks often meet these criteria. Potential applications include placing survey respondents into an ideological space using roll-call questions (Bafumi and Herron 2010), estimating respondents’ likelihood of voting (Erikson, Panagopoulos and Wlezien 2004), estimating respondents’ political knowledge (Montgomery and Cutler 2013), measuring citizens’ values (Schwartz 1992), measuring respondents’ risk tolerance (Yook and Everett 2003), measuring cognitive reflection (Toplak, West and Stanovich 2011), and more. For the sake of concreteness, however, we focus on a particularly valuable context for adaptive methods: measuring personality.

Interest is increasing across disciplines regarding the role personality plays in affecting behavior. In public opinion research, the most prominent example is research on the “Big Five” personality traits. However, the Big Five are only one broad form of the “multifaceted, enduring, internal psychological structure[s]” that constitute personality traits (Mondak, Hibbing, Canache, Seligson, and Anderson 2010, p. 86). Other traits affect how individuals process information, including the need for cognition (e.g., Druckman 2004), the need to evaluate (e.g., Bizer, Krosnick, Holbrook, Wheeler, Rucker, et al. 2004; Chong and Druckman 2013), and the need for affect (Arceneaux and Vander Wielen 2013). Still more traits measure individuals’ orientation toward specific social constructs such as the acceptable degree of inequality in society or the appropriate scope of state action, including social dominance orientation (Sidanius, Pratto, Van Laar, and Levin 2004) and right wing authoritarianism (Altemeyer 1988).

However, many measures widely used in psychology have potential implications for public opinion research and yet appear in the public opinion literature rarely or not at all. These include, for instance, narcissism (Raskin and Terry 1988), empathy-systemizing quotients (Baron-Cohen, Richler, Bisarya, Gurunathan, and Wheelwright 2003), and Machiavellianism (Christie, Geis, and Berger 1970).

One reason many personality traits fail to filter into the public opinion literature is surely that many inventories are too long. Standard practices in social and cognitive psychology result in batteries containing dozens or even hundreds of question items (see table 1). Typically, survey researchers avoid these large scales because they are too time consuming for respondents and/or too expensive to administer.

Table 1.

Exemplar Full and Reduced-Form Measures of Personality Traits

                              Original battery                                  Reduced-form battery                         Original length   Reduced length
Example psychology scales with reduced-form scales
  Narcissistic personality    Raskin and Terry (1988)                           Ames et al. (2006)                                 40                16
  Empathy quotient            Baron-Cohen et al. (2003)                         Muncer and Ling (2006)                             40                15
  Systemizing quotient        Baron-Cohen et al. (2003)                         Wakabayashi et al. (2006)                          40                25
  Machiavellian personality   Christie, Geis, and Berger (1970)                 Rauthmann (2013)                                   20                 5
American National Election Studies 2000–present
  Need for cognition          Cacioppo and Petty (1982)                         Bizer et al. (2000)                                40                 2
  Need to evaluate            Jarvis and Petty (1996)                           Bizer et al. (2000)                                16
American National Election Studies 2013 Internet followup
  Right wing authoritarianism Altemeyer (1988)                                  American National Election Studies (2013)         30                 5
  Social dominance            Pratto, Sidanius, Stallworth, and Malle (1994)    American National Election Studies (2013)
  Social equality             Pratto et al. (1994)                              American National Election Studies (2013)
  Need for affect             Maio and Esses (2001)                             American National Election Studies (2013)         26

Note.—The reduced-form batteries contain a strict subset of items in the original batteries. All question wordings are shown in the supplementary material online.

Rather than include lengthy batteries, researchers typically develop a reduced version of a battery by selecting a subset of items from the larger scale to administer to respondents. For example, one reason the Big Five became so prominent in recent years is the advent of the Ten Item Personality Inventory (TIPI) (Gosling, Rentfrow, and Swann 2003). Prior to TIPI, the most common battery measuring the Big Five was the forty-four–item Big Five Inventory (John, Donahue, and Kentle 1991), itself a shorter alternative to the monstrous 240-item NEO Personality Inventory-Revised (McCrae and John 1992). In fact, developing reduced-form scales of larger batteries constitutes a considerable body of scholarship (see table 1 for examples).

Broadly speaking, researchers develop reduced scales in one of three ways. First, scholars may examine the properties of the scale to make theoretically motivated decisions about which items to preserve. Thus, Ames, Rose, and Anderson (2006, p. 441) developed the reduced-form narcissistic personality inventory (NPI-16) by choosing items with strong “face validity” that also ensured coverage of theorized subdomains.

Second, researchers may choose items based on factor loadings in the original publication. For instance, in designing a two-item battery measuring need for cognition for the American National Election Study (ANES), Bizer, Krosnick, Petty, Rucker, and Wheeler (2000, p. 13) chose “the two items that loaded most strongly on the latent construct in Cacioppo and Petty’s (1982) factor analysis.”

Finally, researchers may administer the original battery to one or more convenience samples and use these responses to select items. Thus, Muncer and Ling (2006) developed a fifteen-item reduced-form variant of the forty-item Empathy Quotient (Baron-Cohen et al. 2003) by analyzing responses from 362 students and parents at universities in North England.

In each approach, scholars rely on parameter estimates from calibration samples to develop a reduced scale. Once chosen, however, the same set of questions is administered to all respondents. Adaptive inventories also rely on parameters estimated from calibration samples. (We investigate how large these calibration samples ought to be in the supplementary material online.) However, AIs differ in that the goal is not to use this prior information to choose a single battery for all respondents but rather to tailor a reduced battery to each respondent in a manner designed to maximize precision. The result is improved measurement relative to any fixed battery of the same length.

3. ADAPTIVE INVENTORIES

In this section, we briefly provide the details of one implementation of AIs that we use below. For this example, we follow Chen, Hou, and Dodd (1998), van der Linden (1998), Segall (2005), and Choi and Swartz (2009). Adaptive inventories take a set of potential items and choose the questions that best place each respondent on the latent scale. Broadly speaking, they favor items that are highly discriminating and items for which, given our current estimate, the respondent has an appreciable probability of answering in more than one response category.

3.1 Overview

Figure 1 shows the basic elements of an AI. For some respondent $j \in \{1, \ldots, J\}$, our goal is to estimate her true position on the latent scale, denoted $\theta_j$. The first stage of the algorithm estimates her position, $\hat{\theta}_j$. If no questions from the battery have been administered, we estimate $\hat{\theta}_j$ using only a common prior, $\pi(\theta)$. After a respondent has answered at least one item, we estimate $\hat{\theta}_j$ based on both the prior and the observed responses to previous questions ($\mathbf{y}_j$).

Figure 1.

Basic Elements of Adaptive Inventories.

Second, the algorithm selects the next question item based on a predetermined criterion, discussed below. Third, we administer the chosen item and record the response. Fourth, the algorithm checks a stopping rule. In our examples, this rule is that the number of items asked has reached a maximum value. If the stopping criterion is not met, the process repeats. Otherwise, the algorithm calculates a final estimate of $\hat{\theta}_j$ and terminates.

3.2 The General Model for Ordered Categorical Responses

Personality inventories typically include questions with multiple ordered response options (e.g., Likert scales). To handle ordered categorical responses, we use a graded response model (GRM) (Samejima 1969; Baker and Kim 2004). For each item $i$, we assume that there are $C_i$ response options and a vector of threshold parameters defined as $\kappa_i = (\kappa_{i0}, \kappa_{i1}, \ldots, \kappa_{iC_i})$, with $\kappa_{i0} < \kappa_{i1} < \cdots < \kappa_{iC_i}$, $\kappa_{i0} = -\infty$, and $\kappa_{iC_i} = \infty$. In addition, each item is associated with a discrimination parameter $a_i$, which indicates how well item $i$ corresponds to the underlying trait. Note that these parameters must be precalculated based on a calibration sample.

To calculate the likelihood function, we need $P_{ijk} = \Pr(y_{ij} = k \mid \theta_j)$, the probability that respondent $j$ answers in the $k$th category of item $i$ given her ability parameter. This quantity cannot be calculated directly. Instead, we define the cumulative probability $P^{*}_{ijk} = \Pr(y_{ij} \leq k \mid \theta_j)$. Assuming a logistic response function, this is
$$P^{*}_{ijk} = \frac{\exp(\kappa_{ik} - a_i \theta_j)}{1 + \exp(\kappa_{ik} - a_i \theta_j)}.$$
Note that this implies that $P^{*}_{ij0} = 0$, $P^{*}_{ijC_i} = 1$, and $P_{ijk} = P^{*}_{ijk} - P^{*}_{ij,k-1}$. The likelihood function is then
$$L(\theta_j) = \prod_{i=1}^{n} \prod_{k=1}^{C_i} P_{ijk}^{I(y_{ij} = k)} = \exp\!\left[\sum_{i=1}^{n}\sum_{k=1}^{C_i} I(y_{ij} = k)\,\log P_{ijk}\right],$$
where $I(\cdot)$ is an indicator function.

To complete the model, we specify the prior, $\pi(\theta_j)$. A natural choice is a conjugate normal prior, $\pi(\theta_j) = N(\mu_\theta, \tau_\theta)$, where $\tau_\theta$ denotes the standard deviation. We found that a standard normal prior works well in most settings. In our third application, however, we discuss selecting a prior based on a calibration sample.
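
To make the response model concrete, the following R sketch computes the GRM category probabilities and the log-likelihood of a response profile for a single respondent. The function names (grm_probs, log_lik) and the toy item parameters are our own illustrative choices, not part of the catSurv interface.

```r
# Category probabilities for one item under the graded response model.
# kappa: interior thresholds (kappa_{i1}, ..., kappa_{i,C-1}); a: discrimination;
# theta: latent trait. Returns P_{ijk} for k = 1, ..., C.
grm_probs <- function(theta, a, kappa) {
  # Cumulative probabilities P*_{ijk} = Pr(y_ij <= k | theta_j); the -Inf and
  # Inf endpoints impose P*_{ij0} = 0 and P*_{ijC} = 1.
  p_star <- plogis(c(-Inf, kappa, Inf) - a * theta)
  diff(p_star)  # P_{ijk} = P*_{ijk} - P*_{ij,k-1}
}

# Log-likelihood of a response profile y (categories coded 1, ..., C_i).
log_lik <- function(theta, y, a_list, kappa_list) {
  if (length(y) == 0) return(0)  # no items answered yet: only the prior applies
  sum(mapply(function(yi, ai, ki) log(grm_probs(theta, ai, ki)[yi]),
             y, a_list, kappa_list))
}

# Toy example: two five-category items with illustrative parameters.
a_list     <- list(1.2, 0.8)
kappa_list <- list(c(-2, -0.5, 0.5, 2), c(-1.5, -0.5, 1, 2.5))
grm_probs(theta = 0.3, a = 1.2, kappa = c(-2, -0.5, 0.5, 2))
log_lik(theta = 0.3, y = c(4, 2), a_list, kappa_list)
```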

3.3 Details for One Adaptive Battery

We can now provide the details of the adaptive algorithm applied below. We have implemented all of the most common approaches for estimating $\hat{\theta}_j$ in our freely available R package catSurv. Available methods include maximum likelihood, weighted maximum likelihood, and maximum a posteriori estimation. In our examples, we use the expected a posteriori (EAP) approach, a standard choice for those adopting a Bayesian perspective.

Assuming that person $j$ has provided answers to at least one item ($\mathbf{y}_j$), we calculate the EAP estimate as
$$\hat{\theta}_j^{(EAP)} = E(\theta_j \mid \mathbf{y}_j) = \frac{\int \theta_j \, \pi(\theta_j) \, L(\theta_j) \, d\theta_j}{\int \pi(\theta_j) \, L(\theta_j) \, d\theta_j}. \qquad (1)$$
Thus, $\hat{\theta}_j^{(EAP)}$ is the expected value of the posterior distribution. The posterior variance is
$$\mathrm{Var}(\hat{\theta}_j) = E\!\left((\theta_j - \hat{\theta}_j^{(EAP)})^2 \mid \mathbf{y}_j\right) = \frac{\int (\theta_j - \hat{\theta}_j^{(EAP)})^2 \, \pi(\theta_j) \, L(\theta_j) \, d\theta_j}{\int \pi(\theta_j) \, L(\theta_j) \, d\theta_j}. \qquad (2)$$

As calculating these quantities involves solving only one-dimensional integrals, we can estimate both using numerical methods.
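
As a concrete illustration, both quantities can be approximated by summing over a grid of values for the latent trait. The sketch below builds on the grm_probs and log_lik functions from the previous sketch; eap_estimate is again our own illustrative name rather than the catSurv interface.

```r
# Expected a posteriori estimate (1) and posterior variance (2), approximated
# on a fixed grid of theta values under a N(mu, tau) prior.
eap_estimate <- function(y, a_list, kappa_list, mu = 0, tau = 1,
                         grid = seq(-5, 5, length.out = 201)) {
  # Unnormalized posterior pi(theta) * L(theta) evaluated on the grid.
  post <- sapply(grid, function(th)
    dnorm(th, mu, tau) * exp(log_lik(th, y, a_list, kappa_list)))
  post <- post / sum(post)                       # grid spacing cancels in the ratio
  theta_hat <- sum(grid * post)                  # equation (1)
  var_hat   <- sum((grid - theta_hat)^2 * post)  # equation (2)
  list(theta = theta_hat, var = var_hat)
}

# Example using the toy items defined above.
eap_estimate(y = c(4, 2), a_list, kappa_list)
```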

The next step is to choose an item based on our current estimate of $\theta_j$. Adaptive batteries choose items to optimize some predefined objective function. Popular options include maximum Fisher’s information, maximum expected observed information, and maximum expected Kullback-Leibler divergence, among others (Choi and Swartz 2009). All of these options are available in our catSurv software. Here, we use the minimum expected posterior variance (MEPV) item selection criterion. Choi and Swartz (2009) note that this approach performs “equally well” relative to the more commonly used methods, but that “the MEPV method would be preferred from a Bayesian perspective” (p. 436); thus, we choose it largely to stay within a simple Bayesian framework.

First, we use the current estimate $\hat{\theta}_j$ to calculate $P_{mjk}$ for each possible response $k$ to a candidate (unasked) item $m$. Second, we calculate the posterior variance we would obtain under each possible response to item $m$ using (1) and (2). Third, we combine these elements to estimate the expected posterior variance (EPV) for the candidate item,
$$EPV_m = \sum_{k} P_{mjk} \, \mathrm{Var}(\hat{\theta}_j \mid y_{mj} = k).$$

In words, $EPV_m$ is the posterior variance for $\hat{\theta}_j$ we would obtain under each possible response to item $m$, weighted by the probability of observing that response, where $P_{mjk}$ is conditioned on our current estimate $\hat{\theta}_j$. Finally, we select the item with the lowest EPV value.

After the item is chosen and administered to a respondent, the final step is to check a stopping rule. In the examples provided later on, the algorithm stops offering items when the number of questions reaches a prespecified threshold $n_{\max}$. However, our software also allows researchers to use other criteria based on the precision of the current estimate of $\theta_j$ or the expected information to be gained from the remaining items in the battery.
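
Putting these pieces together, the following sketch implements MEPV item selection and the fixed-length stopping rule, reusing grm_probs, log_lik, and eap_estimate from the sketches above. The function names (epv_candidate, run_adaptive) and the answer_fun argument are illustrative; they are not the catSurv interface, which implements these steps internally.

```r
# Expected posterior variance (EPV) of a candidate item m, given answers y to
# the items already asked (indices in 'asked').
epv_candidate <- function(m, y, asked, a_list, kappa_list) {
  cur <- eap_estimate(y, a_list[asked], kappa_list[asked])
  # Probability of each possible answer to item m at the current estimate.
  p_mk <- grm_probs(cur$theta, a_list[[m]], kappa_list[[m]])
  # Posterior variance under each hypothetical answer k, weighted by p_mk.
  v_k <- sapply(seq_along(p_mk), function(k)
    eap_estimate(c(y, k), a_list[c(asked, m)], kappa_list[c(asked, m)])$var)
  sum(p_mk * v_k)
}

# Administer an adaptive battery of length n_max; answer_fun(i) supplies the
# respondent's answer to item i (e.g., a lookup into recorded responses).
run_adaptive <- function(answer_fun, a_list, kappa_list, n_max = 3) {
  asked <- integer(0); y <- integer(0)
  while (length(asked) < n_max) {                    # stopping rule
    candidates <- setdiff(seq_along(a_list), asked)
    epv <- sapply(candidates, epv_candidate, y, asked, a_list, kappa_list)
    nxt <- candidates[which.min(epv)]                # minimum expected posterior variance
    y <- c(y, answer_fun(nxt))
    asked <- c(asked, nxt)
  }
  c(eap_estimate(y, a_list[asked], kappa_list[asked]), list(items = asked))
}

# Example with the two toy items above: a respondent who answers category 4 to every item.
run_adaptive(function(i) 4, a_list, kappa_list, n_max = 2)
```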

4. APPLICATIONS

In this section, we demonstrate the advantages of AIs in an empirically informed simulation, an experiment conducted with convenience samples, and a case study using data from the 2016 ANES Pilot Study. The simulation and experiment illustrate the superior performance of AIs relative to fixed-reduced batteries. The ANES analysis serves as an in-depth case study of how AIs can be developed and fielded with a nationally representative sample.

4.1 Simulation: Narcissism, Machiavellianism, Empathy, and Systemizing

We first demonstrate the benefits of AIs using a dataset of responses to four personality inventories. Several previous articles have engaged in simulation studies of CAT (e.g., Weiss and Kingsbury 1984; van der Linden and Pashley 2010; Montgomery and Cutler 2013). However, an additional demonstration is useful because (a) the vast majority of these studies have focused on the binary response model (but see Hol, Vorst, and Mellenbergh 2007; Choi and Swartz 2009); (b) in almost all cases, the simulated batteries and responses are not based on actual battery calibrations or response sets (e.g., Chang and Ying 1999); and (c) the performances of the methods are typically assessed relative to each other (rather than established fixed batteries) using metrics (viz. item exposure rates) not relevant to survey research (e.g., Barrada, Olea, Ponsoda, and Abad 2010). Our goal, therefore, is to show the advantages of AIs relative to fixed reduced-form batteries based on real-world examples where the reduced batteries have been published in the literature. By using simulations, we can compare latent estimates under hypothetical counterfactuals where respondents received either a fixed, an adaptive, or a random reduced battery.

In our simulation, we relied on data collected by personality-testing.info, maintained by Eric Jorgenson, in which tens of thousands of respondents were recruited online to provide responses to prominent personality inventories. We selected four personality inventories for which there exists a validated reduced-form version (see table 1). All question wordings and response rates are provided in the supplementary material online. These reduced-form scales have been published in peer-reviewed journals, and several have been used extensively in academic research. For our analytical approach, it is essential that the reduced-form battery consists exclusively of a subset of items from the larger battery. Additional information about the scales and the data is shown in table 2.

Table 2.

Description of Large Personality Inventories in Simulation Study

                    Full battery length   Fixed battery length   Response categories   Training (n)   Test (n)
Narcissism                  40                    16                                       8,700         1,740
Machiavellianism            20                     5                                      10,249         2,050
Empathy                     40                    15                                      10,145         2,029
Systemizing                 40                    25                                      10,145         2,029

Note.—See table 1 for additional details. All data obtained from: http://personality-testing.info.

To begin the analysis, we first needed item parameters from the GRM and a prior distribution for the position of respondents on the latent trait. We fit a GRM for each battery to a randomly selected training sample of 5/6 of the respondents using the grm function from the ltm R package (Rizopoulos 2006; R Core Team 2017). This function identifies the model by assuming that the first item included in the dataset loads positively on the latent trait and that the $\theta_j$ parameters are distributed according to the standard normal distribution. Since we anticipated that the distribution of latent traits in the training and test samples would be similar, we used the standard normal prior.
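
As an illustration of this calibration step, the sketch below splits a hypothetical data frame of full-battery responses (here called responses, with ordinal answers coded 1 through C) into training and test sets and fits the GRM to the training portion; note that grm reports item parameters in its own parameterization, which may need to be mapped onto the discrimination and threshold parameters used in section 3.

```r
library(ltm)

# 'responses' is a hypothetical data frame: one column per item in the full
# battery, with ordinal answers coded 1, ..., C.
set.seed(42)
train_idx <- sample(nrow(responses), size = round(5/6 * nrow(responses)))
train <- responses[train_idx, ]
test  <- responses[-train_idx, ]

# Fit the graded response model to the training sample and inspect the
# estimated item parameters, which are then carried into the adaptive algorithm.
fit <- grm(train)
coef(fit)
```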

Next, we turned to the remaining sample (the test sample) and used individuals’ recorded answers to estimate their scores under the assumption that we know only their responses to questions as chosen by (a) the reduced-fixed scale, (b) the reduced-adaptive scale, and (c) a randomly selected reduced battery. Our goal is to determine whether the reduced-fixed scale or the reduced-adaptive scale results in more accurate (less biased) estimates. In order to put our estimates on a meaningful scale, we evaluate bias relative to a naïve approach of selecting question items at random.

So, for example, if the fixed battery calls for asking question-items twenty, fourteen, and three, we first calculated all respondents’ scores using their real responses to just those three questions. Second, we let the adaptive algorithm choose the first item for all respondents, and we recorded each respondent’s “answer” using the real responses in the dataset. The algorithm then customized the selection of the next item for each individual, and so on. Finally, we administered a randomly constructed short battery. That is, for each individual, we chose three items from the full battery at random and calculated scores based on the observed responses to just those items.

To evaluate the performance of each reduced battery, we also estimated respondents’ positions on the latent trait using their recorded responses to the entire battery. For our calculations, we treat these scores as the respondents’ “true” positions and benchmark the various reduced batteries in terms of how well they approximate these estimates. Note that we use a GRM fit to the full response profiles in the test sample to estimate scores on the latent traits for both the reduced and full batteries. This ensures all estimates are on the same scale while also avoiding an undue advantage for the adaptive battery by relying on parameters estimated from the training set.
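
The benchmarking itself reduces to a comparison of root mean squared errors. A minimal sketch, assuming vectors of test-sample estimates (theta_full, theta_adaptive, theta_fixed, theta_random; the names are hypothetical) with one entry per respondent:

```r
# RMSE of reduced-battery estimates against the full-battery ("true") scores.
rmse <- function(est, truth) sqrt(mean((est - truth)^2))

# Share of the random-battery RMSE removed by a given reduced battery.
improvement <- function(est, truth, random) 1 - rmse(est, truth) / rmse(random, truth)

imp_adaptive <- improvement(theta_adaptive, theta_full, theta_random)
imp_fixed    <- improvement(theta_fixed,    theta_full, theta_random)
imp_adaptive - imp_fixed   # difference in improvement reported in table 3
```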

We look first at the narcissistic personality inventory (NPI) (Raskin and Terry 1988), which measures one’s “grandiose yet fragile sense of self and entitlement as well as a preoccupation with success and demands for admiration” (Ames et al. 2006, pp. 440–441). Although the original battery contained forty items, Ames et al. (2006) developed a sixteen-item version (NPI-16). In the first column of table 3, we compare the performance of the NPI-16 with an adaptive inventory in terms of root mean squared error (RMSE), treating estimates from responses to the entire battery as respondents’ “true” positions ($\theta_j$). Since asking any item from a validated battery will reduce bias to some degree, we evaluate the bias of the fixed and adaptive batteries relative to a random battery of the same length. Table 3 shows that the RMSE of the adaptive NPI battery is 51 percent lower than that of the random battery, whereas the fixed battery provides only a 30 percent improvement. Thus, the difference in these improvements (the difference in differences) shows a 21 percent advantage for the adaptive battery.

Table 3.

Assessing Fit of Adaptive vs Fixed Batteries in Empirically Informed Simulation

                                                    Inventory name
                                                 NPI        MACH       Empathy    Systemizing
Battery length                                   16           5          15           25
Random (RMSE)                                    0.38        0.45        0.38         0.21
Adaptive (RMSE)                                  0.18        0.42        0.15         0.13
% Improvement over random                       51.41%      8.56%      59.81%       39.50%
Random (RMSE)                                    0.38        0.45        0.38         0.21
Fixed (RMSE)                                     0.27        0.52        0.58         0.17
% Improvement over random                       30.26%    −15.09%     −53.86%       16.71%
Difference in improvement for adaptive vs. fixed 21.14%     23.65%     113.68%       22.79%

Note.—Values are the root mean squared error for respondents simulated to have answered fixed-reduced batteries (see table 1) or adaptive batteries of the same length. Estimates were also calculated as if each respondent received a random battery of the same length by sampling from each response set. Point estimates were calculated relative to estimates generated for each respondent using the full inventory. In each case, a Wilcoxon Rank-Sum test finds the adaptive battery provides less bias than the fixed battery (p < 0.05).

Next we turn to Machiavellianism, a measure inspired by the depiction of the manipulative, immoral, and power-hungry ruler in Niccolo Machiavelli’s The Prince. A widely used scale in the literature is the twenty-item MACH-IV scale proposed by Christie, Geis, and Berger (1970). We compare the adaptive inventory method to the five-item Trimmed MACH (Rauthmann 2013). The second column of table 3 shows the results. Clearly, the adaptive battery is significantly better in reducing errors. Indeed, the Trimmed MACH scale performs worse than simply choosing survey items at random (a 15 percent increase in RMSE), whereas the adaptive method provides roughly an 8.5 percent decrease.

Third, we examine the empathizing and systemizing batteries developed by Baron-Cohen et al. (2003). Empathizing is defined as, “the way in which we understand the social world, the emotions and thoughts of others, and how we respond to these social cues.” By contrast, “systemizing is concerned with understanding rules, how things work and how systems are organized” (Ling, Burton, Salt, and Muncer 2009, p. 539). Each scale originally contained forty items and twenty “buffer” questions that we excluded. We compare the adaptive inventory method with the fifteen-item reduced empathizing scale proposed by Muncer and Ling (2006). For systemizing, we compare an adaptive inventory with the twenty-five–item reduced battery proposed by Wakabayashi, Baron-Cohen, Wheelwright, Goldenfeld, Delaney, et al. (2006).

The results are shown in the third and fourth columns of table 3. Clearly the items in the reduced-form empathizing scale were not well selected. The third column shows that the fixed scale is nearly 54 percent worse relative to randomly selecting items. On the other hand, the adaptive inventory does much better, reducing RMSE by about 60 percent. The fourth column does not reveal such a stark contrast for the systemizing scale, but here again it is clear that the dynamic battery does well against both a random battery and a published fixed-reduced battery. Further, the RMSE rate is quite low, showing that the adaptive inventory produces 40 percent less bias than random selection.

4.2 Experimental Study

One feature of the simulations above is that even many of the reduced batteries include too many questions for a standard survey. Therefore, we turn to a more realistic setting where a researcher has space for only a handful of survey items. Specifically, we present results from an experiment conducted using convenience samples recruited via Amazon’s Mechanical Turk (AMT) service to compare the performance of fixed-reduced batteries with adaptive inventories of the same length. In the fall of 2014, we administered full-length versions of five personality inventories that have been included in reduced forms on the American National Election Study (ANES) to 1,204 subjects. The batteries were need for cognition (NFC), need to evaluate (NTE), need for affect (NFA), the social dominance orientation (SDO) items measuring dominance attitudes (Peña and Sidanius 2002), and right wing authoritarianism (RWA) (see table 1). Question wordings and response rates for all of our surveys are provided in the supplementary material online. Using these responses, we calibrated an adaptive inventory for each battery using the ltm package in R as described in the previous section.

In the spring of 2015, we then recruited 1,335 new respondents who were randomly assigned, before each battery was administered, to receive either a fixed-reduced battery as used by the ANES or an AI of the same length. For RWA, 639 respondents answered the adaptive battery, and 684 answered the fixed battery. The corresponding numbers for the other scales are as follows: SDO (adaptive = 682, fixed = 652), NFC (adaptive = 667, fixed = 666), NTE (adaptive = 682, fixed = 649), NFA (adaptive = 650, fixed = 661). After completing the reduced battery, all subjects then answered all remaining questions in the full battery in a random order.

We then estimated scores using only questions selected by the fixed batteries and the AIs and compared them, as with the simulation study in Section 4.1, using respondents’ “true” positions (estimated using responses to the complete battery) as a common benchmark. Finally, to put these numbers on a meaningful scale, we estimated respondents’ positions on the trait using a random subset of responses. To ensure all estimates were on the same latent scale, we generated estimates using a GRM fit only with full response profiles from the second sample.

Before turning to the results, it is worth noting that this experiment represents a far more difficult test for the adaptive batteries than the simulations above. To begin, these batteries are short (between two and five questions), giving the AI little opportunity to learn about respondents and choose items. Further, relative to the earlier examples, the underlying batteries themselves are small (between eight and thirty items), meaning there are fewer items for the algorithm to choose from. Finally, since we estimate respondents’ “true” positions using the full battery, error rates should be considerably smaller for almost any method of item selection.

Table 4 shows the root mean squared error for respondents answering either the fixed or adaptive scales. The AIs provide more accuracy than the fixed batteries, with improvements over random selection for the adaptive versus fixed batteries ranging from a modest 0.2 percent for the NFC battery to a substantial 13.3 percent improvement for the NTE battery. In all, these results show that AIs provide more accurate estimates than widely used fixed batteries, even when there is only space for a few items.

Table 4.

Assessing Fit of Adaptive vs Fixed Batteries in Experimental Study

                                                    Inventory name
                                                 NFA        NTE        NFC        SDO        RWA
Battery length
Random (RMSE)                                    1.08       1.03       1.08       0.41       1.39
Adaptive (RMSE)                                  0.47       0.47       0.49       0.36       0.44
% Improvement over random                       56.55%     54.53%     54.74%     10.27%     68.63%
Random (RMSE)                                    1.13       0.94       1.09       0.42       1.41
Fixed (RMSE)                                     0.55       0.55       0.49       0.40       0.48
% Improvement over random                       51.52%     41.20%     54.51%      5.78%     65.75%
Difference in improvement for adaptive vs fixed  4.96%     13.34%      0.23%      4.50%      2.88%
N = 1,335

Note.—Values are the root mean squared error for respondents randomly assigned to either answer the fixed batteries as they appeared on the ANES (see table 1) or adaptive batteries of the same length. Randomization occurred before each battery. Estimates were also calculated as if each respondent received a random battery of the same length by sampling from each response set. Point estimates were calculated relative to estimates generated for each respondent using the full inventory. In each case, a Wilcoxon Rank-Sum test finds the adaptive battery provides less bias than the fixed battery with p < 0.05 for NFA, SDO, and RWA and p < 0.10 for NTE and NFC.

We can demonstrate that this improved accuracy has important consequences beyond mere measurement. To do this, we focus on the RWA measure, which originally had thirty items but was reduced to five on the ANES 2013 Internet Recontact Study. Figure 2 shows the distributions estimated for individuals assigned to the fixed and adaptive battery conditions. The shaded distributions show the density estimated using only the reduced battery, while the unshaded distributions show the density estimated after these same respondents completed the entire thirty-item inventory. The figure shows that the fixed battery does a particularly poor job estimating positions on the low end of the spectrum, shown by the difference between the shaded and unshaded densities in the left panel.

Figure 2.

Revealed Right Wing Authoritarianism (RWA) Estimates for Adaptive and Fixed Measures. These figures show the distribution of RWA as estimated using the five-item reduced batteries (shaded histograms) and using the complete thirty-item inventory (unshaded histograms). Estimates for respondents randomly assigned to answer a fixed battery (n = 684) are on the left while estimates for respondents randomly assigned to answer the adaptive battery (n = 639) are on the right. The adaptive battery does a superior job in recovering the positions of respondents with more extreme values on the latent scale.

Our aim is to show that by inaccurately measuring RWA with fixed-reduced scales, we can inadvertently distort (in either direction, depending on how the underlying trait is mismeasured) our understanding of how RWA relates to other important factors. This distortion is ameliorated by using an adaptive battery. To illustrate this, we measured several constructs theoretically related to RWA, including presidential approval, ideology, defense spending attitudes, civil liberties attitudes, symbolic racism, modern racism, and prejudice toward Arabs and Muslims (Sidanius et al. 2004).

We estimated separate regressions by treatment condition (adaptive or fixed battery) using RWA as an explanatory variable and these related constructs as dependent variables, controlling for race, gender, and level of education. We then estimated the “true” value for these regression coefficients using respondents’ scores as estimated from the full battery and calculated the difference between the coefficients (and 95 percent confidence intervals [CIs]) estimated using the reduced-battery and full-battery measures of RWA. The results, shown in figure 3, illustrate that the measure of RWA from the fixed battery inflates the regression coefficients due to censoring, leading us to conclude that RWA is a stronger predictor than is actually the case. By contrast, the coefficient estimates from a five-item AI measure of RWA hardly differ from the coefficients we would obtain using the full thirty-item battery.

Figure 3.

Difference in Regression Estimates for RWA and Seven Related Constructs. This figure shows that regression coefficients measuring the relationship between right wing authoritarianism (RWA) and related constructs are greater than the true coefficient estimates when measures of respondents’ latent positions are poorly estimated by fixed-reduced batteries. The vertical axis shows the degree to which regression coefficients between RWA and various outcomes differ when using a five-item reduced scale relative to regression coefficients when RWA is estimated using the full thirty-item inventory. The names of the various dependent variables are shown on the x-axis. The closed circles and dashed lines are point estimates and 95 percent confidence intervals for subjects randomly assigned to answer a fixed-reduced battery (n = 684), and the open diamonds and solid lines show the same for subjects randomly assigned to answer an AI of the same length (n = 649). All regressions control for gender, race, and level of education. All question wordings are provided in the supplementary material online.

4.3 Case Study: 2016 ANES Pilot Study

In our third application, we present a detailed case study of an AI measuring the need for cognition that was included on the 2016 ANES Pilot Study. (A more technical guide for calibrating and administering an AI with catSurv is shown in the supplementary material online.) In addition to providing an illustrative example, the purpose of this section is to test the validity of AI measures on a nationally representative survey conducted by a professional polling firm (YouGov).

Cacioppo and Petty (1982) originally proposed the need for cognition scale as a method for measuring “the tendency for an individual to engage in and enjoy thinking” (p. 116). While originating in social psychology, this trait has been used extensively in political science (e.g., Druckman 2004). The original battery was a thirty-four–item inventory that was subsequently reduced to an eighteen-item “efficient” battery (Cacioppo and Petty 1984). It is from this eighteen-item inventory that Bizer et al. (2000) chose the items for inclusion on the ANES.

To calibrate the adaptive personality inventory, we combined data from three separate samples. First, we used data from the December 2014 wave of The American Panel Survey (TAPS). This is a monthly online panel survey for which panelists were recruited as a national probability sample with an address-based sampling frame in the fall of 2011 by GfK-Knowledge Networks. After removing respondents who completed less than 25 percent of the items, we had 1,506 respondents.

To supplement TAPS, we used responses to the eighteen-item NFC battery from the two convenience samples recruited via Amazon’s Mechanical Turk (AMT) online workforce used in the experiment described previously. While not a representative sample, AMT provides an easy way to administer the survey to a larger set of respondents. As Embretson (1996) notes, one of the “new rules” of item response models is that “unbiased estimates of item properties may be obtained from unrepresentative samples” (p. 342).

We fit a GRM with the combined sample and selected a prior based on the TAPS sample. Then we precalculated a complete branching scheme. Figure 4 depicts portions of the scheme for the four-item NFC AI. The labels on the branches indicate possible answers. (The NA indicates item nonresponse.) For example, a respondent who answers “1” to NFC23 will be asked NFC32, and a respondent who then answers “5” will be asked NFC29.

Figure 4.

Selected Portions of a Complete Branching Scheme for the Four-Item Need for Cognition Adaptive Personality Inventory. The figure describes selected sub-trees of the complete branching scheme for the four-item need for cognition AI included on the 2016 ANES Pilot Study. The labels on the branches indicate possible respondent answers. An “NA” indicates item nonresponse.

For longer batteries, a full enumeration of the scheme might be difficult. However, since this battery is only four items in length, the tree contains only $6^3 = 216$ complete branchings. Indeed, the entire tree can be represented as a table with 259 rows. We provided the table to YouGov in advance, and the survey was administered to 1,200 respondents drawn from an opt-in online panel.
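
To give a sense of how such a table is built, the sketch below enumerates every possible answer history for a battery of a given length and records which item would be asked next. The enumeration logic is generic; next_item here is a trivial placeholder standing in for the MEPV selection step sketched in section 3, and the six answer codes (five response categories plus NA for item nonresponse) mirror the NFC example.

```r
# Enumerate a complete branching scheme: one row per answer history, giving the
# item to administer next. next_item(asked, answers) returns the index of the
# next item; in a real application it would implement MEPV selection.
enumerate_branching <- function(next_item, n_max = 4, codes = c(1:5, NA)) {
  rows <- list()
  recurse <- function(asked, answers) {
    nxt <- next_item(asked, answers)
    rows[[length(rows) + 1]] <<- data.frame(
      history = paste(paste(asked, answers, sep = "="), collapse = ";"),
      next_item = nxt, stringsAsFactors = FALSE)
    if (length(asked) + 1 < n_max) {
      for (code in codes) recurse(c(asked, nxt), c(answers, code))
    }
  }
  recurse(integer(0), integer(0))
  do.call(rbind, rows)
}

# Placeholder selection rule for illustration only: ask items in a fixed order.
toy_next <- function(asked, answers) length(asked) + 1

tree <- enumerate_branching(toy_next, n_max = 4)
nrow(tree)  # 1 + 6 + 36 + 216 = 259 rows for a four-item battery
```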

Since the ANES Pilot did not include the fixed battery, it is not possible to compare the adaptive and fixed batteries as we did in the applications mentioned before. However, it is possible to evaluate predictive validity. In particular, we test whether NFC is a moderator for the effect of issue framing as has been argued in the existing literature (e.g., Druckman 2004). Specifically, we hypothesize that the framing treatment should be less effective on individuals who score highly on NFC.

We take advantage of a framing experiment on the ANES Pilot Study. (We provide a similar analysis of a second framing experiment in the supplementary material online.) In the experiment, respondents were randomly assigned to answer the question, “Do you favor, oppose, or neither favor nor oppose allowing [Syrian refugees/refugees fleeing the Syrian civil war] to come to the United States?” (Emphasis added), where 587 received the “Syrian refugees” frame and 613 received the civil war frame. Respondents indicated their level of support on a seven-point scale. We test the hypothesis that the civil war frame will make respondents less opposed to allowing Syrian refugees to enter the United States, but that this effect will be moderated by respondents’ level of NFC. Exact question wordings and response rates are provided in the supplementary material online.

The main results are presented in table 5, which shows the coefficients of interest from a weighted least squares regression where the dependent variable is the degree of opposition to Syrian refugees on a seven-point scale. We also controlled for a feeling thermometer toward Muslims, support for intervening in Syria to combat ISIS, racial resentment, party identification, ideology, gender, education, race, and ethnicity. The first column in table 5 shows that the civil war framing does not by itself appear to have a statistically reliable effect on opposition to Syrian refugees being admitted to the United States. However, model 2 shows that there is a significant interaction between this treatment and NFC as measured by the AI (p = 0.046). Figure 5 shows the estimated marginal effect of the civil war framing on opposition to Syrian refugees for differing levels of NFC. Consistent with expectations, the plot indicates that the framing experiment had little or no effect for respondents with high levels of NFC, but that it had a significant and negative effect for respondents lower on this trait.
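
In sketch form, the moderation test is a weighted least squares interaction model of the following kind (the data frame and variable names are hypothetical, and the control variables are abbreviated):

```r
# Model 2: opposition to Syrian refugees regressed on the framing treatment,
# the adaptive NFC score, their interaction, and controls, using survey weights.
m2 <- lm(oppose_refugees ~ civil_war_frame * nfc_adaptive +
           muslim_thermometer + support_isis_intervention + racial_resentment +
           party_id + ideology + gender + education + race_ethnicity,
         data = anes_pilot, weights = weight)
summary(m2)
```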

Figure 5.

Interaction Plot Estimating the Effect of the Civil War Frame on Opposition to Syrian Refugees for Differing Levels of Need for Cognition. Lines represent point estimates and shaded region represents a 95 percent confidence interval. Parameter estimates for this model are shown in table 5.

Table 5.

Effect of Civil War Framing on Opposition to Syrian Refugees

                          Model 1      Model 2
Intercept                  4.563        4.553
                          (0.319)      (0.319)
Civil war framing         –0.094       –0.092
                          (0.088)      (0.088)
Need for Cognition        –0.083       –0.200
                          (0.061)      (0.84)
Civil war × NFC                         0.224
                                       (0.112)
N                          1,064        1,064
R²                         0.530        0.532

Note.—Estimates are from weighted least squares regression using survey weights. We also controlled for a feeling thermometer toward Muslims, support for intervening in Syria to combat ISIS, racial resentment, party identification, ideology, gender, education, race, and ethnicity. These coefficients are suppressed for clarity. All question wordings are shown in the supplementary material online.

5. CONCLUSION

Survey researchers face a constant trade-off between the desire to measure concepts well and the need to keep surveys short. While this tension will always exist, AIs obviate the need for public opinion researchers to choose between administering a large, costly multi-item scale and a single reduced scale that may drastically reduce measurement precision. Our results show that AIs allow researchers to administer fewer questions while achieving superior statistical precision and accuracy relative to any fixed-reduced scale. At a minimum, we believe that AIs can dramatically expand scholars’ ability to explore the role of personality traits in public opinion and political behavior. However, we believe AIs could be applied to many tasks beyond measuring personality.

Nonetheless, there are several potential limitations to AIs and areas for continued research. First, survey time is perhaps the greatest constraint on improving the measurement of latent traits, and the relative advantage of adaptive batteries over static batteries increases with battery length; this may lead some to question the usefulness of the method when space permits only a handful of items. One answer to this concern is that adaptive batteries provide superior measurement of latent constructs even when space allows for only three or four items, as we have shown. An additional approach is to include informative priors based on earlier survey responses as part of the CAT algorithm (van der Linden 1999). This allows the algorithm to begin tailoring question selection to each respondent at the outset, further improving performance.
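As a concrete illustration of this empirical-initialization idea, the following sketch scores a respondent on a two-parameter logistic model using expected a posteriori (EAP) estimation with a respondent-specific prior mean. The item parameters, responses, and prior mean are all made up for illustration; the same logic carries over to the graded response model used for AIs.

```r
# Sketch of EAP scoring with an informative, respondent-specific prior
# (van der Linden 1999). All numbers below are hypothetical.

item_disc <- c(1.2, 0.8, 1.5)   # 2PL discrimination parameters
item_diff <- c(-0.5, 0.0, 0.7)  # 2PL difficulty parameters
responses <- c(1, 1, 0)         # observed 0/1 answers so far

# Prior mean predicted from earlier survey content (e.g., education or
# political interest); here simply fixed at an illustrative value.
prior_mean <- 0.4
prior_sd   <- 1

theta_grid <- seq(-4, 4, length.out = 201)

# Log-likelihood of the observed response pattern at each grid point
loglik <- sapply(theta_grid, function(theta) {
  p <- plogis(item_disc * (theta - item_diff))
  sum(dbinom(responses, size = 1, prob = p, log = TRUE))
})

# Posterior over the grid: likelihood times the informative normal prior
post <- exp(loglik) * dnorm(theta_grid, prior_mean, prior_sd)
post <- post / sum(post)

eap <- sum(theta_grid * post)                  # posterior mean (EAP)
psd <- sqrt(sum((theta_grid - eap)^2 * post))  # posterior standard deviation
c(EAP = eap, SD = psd)
```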

A second limitation of our implementation of AIs pertains to inventories, such as the Big Five, that are used to measure multiple traits. In such batteries, different questions are used to measure each dimension; our approach therefore currently requires administering a separate AI for each trait. Future work could extend AIs to multidimensional adaptive testing, which has several advantages when the researcher believes the latent dimensions are correlated. Research shows that multidimensional adaptive testing provides better balancing of the content administered for each dimension and that it yields more efficient and precise measurement than administering independent AIs for each dimension of the trait (Segall 1996).

A third concern is that random error will interfere with the performance of adaptive surveys, since noisy responses may lead to the “wrong” question being selected—especially in early stages of the battery. One particularly promising approach to addressing this issue is using a stratified multistage adaptive algorithm, where less discriminating items are used early in the adaptive process and highly discriminating items are reserved for later stages when respondents’ locations in the latent space are more accurately estimated (e.g., Chang and Ying 1999; Chen, Ankenmann, and Chang 2000).
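To illustrate the a-stratified idea, the sketch below partitions a hypothetical item pool into strata by discrimination and, at each stage, selects from the assigned stratum the unadministered item whose difficulty is closest to the provisional trait estimate. The pool and selection rule are illustrative simplifications, not the algorithm implemented in our software.

```r
# Sketch of a-stratified item selection (Chang and Ying 1999) with a
# hypothetical item pool; low-discrimination items are used in early
# stages and highly discriminating items are reserved for later stages.

set.seed(1)
pool <- data.frame(
  item = 1:30,
  a    = runif(30, 0.5, 2.5),  # discrimination
  b    = rnorm(30)             # difficulty
)

# Partition the pool into three strata of increasing discrimination
pool$stratum <- cut(rank(pool$a), breaks = 3, labels = FALSE)

select_item <- function(pool, stage, theta_hat, administered) {
  # Candidates: unadministered items in the stratum assigned to this stage
  cand <- pool[pool$stratum == stage & !(pool$item %in% administered), ]
  # Within the stratum, choose the item whose difficulty is closest to
  # the current provisional estimate of the respondent's position
  cand$item[which.min(abs(cand$b - theta_hat))]
}

# Example: first stage (least discriminating stratum), provisional theta of 0
select_item(pool, stage = 1, theta_hat = 0, administered = integer(0))
```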

Fourth, the advantages of CAT depend on the accuracy of the item-level parameters. Indeed, within the CAT framework, poorly estimated item parameters may have particularly pernicious effects on the quality of the final measure (van der Linden and Glas 2000). Survey researchers may, therefore, be particularly interested in uncovering parameter drift, wherein items are no longer functioning as expected based on the calibration sample. Fortunately, numerous solutions have been proposed in the literature for uncovering such changes, often termed “differential item functioning” (e.g., Kim, Cohen, and Park 1995; Glas 2010; Wang, Tay, and Drasgow 2013).

A final limitation of AIs is that they require pre-testing of battery items to calibrate the model. While this may seem burdensome, two factors make it a reasonable requirement. First, calibrating these models can be done using large convenience samples. Adaptive inventory performance will be improved if the models can be “normed” to national samples such that our prior beliefs are correctly calibrated toward the target population; however, this is not strictly necessary. Ideally, researchers will work collaboratively to pair large convenience samples with nationally representative samples to calibrate and test AIs.

Second, pretesting costs may be ameliorated by making survey data and item calibrations widely available to other researchers. The calibrations in this study, for instance, will be included in the replication archive for this article at the time of publication. Clearly, additional research is called for to develop, calibrate, and field-test specific AIs measuring other constructs. Our hope is that once these are developed, scholars will disseminate them to the wider academic community—facilitating adoption of this promising technology in public opinion research.

SUPPLEMENTARY MATERIALS

Supplementary materials are available online at academic.oup.com/jssam.

Previous versions of this article were presented at the 2015 Asian Political Methodology Meeting in Taipei, Taiwan, and the 2015 Annual Summer Meeting of the Society for Political Methodology in Rochester, New York. We are grateful to Josh Cutler, Tom Wilkinson, Haley Acevedo, Alex Weil, Ryden Butler, Matt Malis, and Min Hee Seo for their programming assistance. Valuable feedback for this project was provided by Harold Clarke, Brendan Nyhan, and audience members at Washington University in St. Louis, the University of Chicago, Dartmouth College, New York University, and Princeton University. Funding for this project was provided by the Weidenbaum Center on the Economy, Government, and Public Policy and the National Science Foundation (SES-1558907).

References

Altemeyer, B. (1988), Enemies of Freedom: Understanding Right-Wing Authoritarianism, San Francisco, CA: Jossey-Bass.

American National Election Studies (2013), American National Election Studies 2013 Internet Recontact Study Questionnaire, American National Election Study (ANES) Series, available at http://anesold.isr.umich.edu/studypages/anes_panel_2013_inetrecontact/anes_panel_2013_inetrecontact_qnaire.pdf. Accessed July 31, 2019.

Ames, D. R., Rose, P., and Anderson, C. P. (2006), “The NPI-16 as a Short Measure of Narcissism,” Journal of Research in Personality, 40, 440–450.

Arceneaux, K., and Vander Wielen, R. J. (2013), “The Effects of Need for Cognition and Need for Affect on Partisan Evaluations,” Political Psychology, 34, 23–42.

Bafumi, J., and Herron, M. C. (2010), “Leapfrog Representation and Extremism: A Study of American Voters and Their Members in Congress,” American Political Science Review, 104, 519–542.

Baker, F. B., and Kim, S.-H. (2004), Item Response Theory: Parameter Estimation Techniques, New York: Marcel Dekker.

Baron-Cohen, S., Richler, J., Bisarya, D., Gurunathan, N., and Wheelwright, S. (2003), “The Systemizing Quotient: An Investigation of Adults with Asperger Syndrome or High-Functioning Autism, and Normal Sex Differences,” Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 358, 361–374.

Barrada, J. R., Olea, J., Ponsoda, V., and Abad, F. J. (2010), “A Method for the Comparison of Item Selection Rules in Computerized Adaptive Testing,” Applied Psychological Measurement, 34, 438–452.

Bizer, G. Y., Krosnick, J. A., Holbrook, A. L., Wheeler, S. C., Rucker, D. D., and Petty, R. E. (2004), “The Impact of Personality on Cognitive, Behavioral, and Affective Political Processes: The Effects of Need to Evaluate,” Journal of Personality, 72, 995–1028.

Bizer, G. Y., Krosnick, J. A., Petty, R. E., Rucker, D. D., and Wheeler, S. C. (2000), “Need for Cognition and Need to Evaluate in the 1998 National Election Survey Pilot Study,” National Election Studies Report, available at https://www.electionstudies.org/wp-content/uploads/2018/05/nes008997.pdf. Accessed July 31, 2019.

Cacioppo, J. T., and Petty, R. E. (1982), “The Need for Cognition,” Journal of Personality & Social Psychology, 42, 116–131.

Cacioppo, J. T., and Petty, R. E. (1984), “The Efficient Assessment of Need for Cognition,” Journal of Personality Assessment, 48, 306–307.

Chang, H.-H., and Ying, Z. (1999), “A-Stratified Multistage Computerized Adaptive Testing,” Applied Psychological Measurement, 23, 211–222.

Chen, S.-Y., Ankenmann, R. D., and Chang, H.-H. (2000), “A Comparison of Item Selection Rules at the Early Stages of Computerized Adaptive Testing,” Applied Psychological Measurement, 24, 241–255.

Chen, S.-K., Hou, L., and Dodd, B. G. (1998), “A Comparison of Maximum Likelihood Estimation and Expected a Posteriori Estimation in CAT Using the Partial Credit Model,” Educational and Psychological Measurement, 58, 569–595.

Choi, S. W., and Swartz, R. J. (2009), “Comparison of CAT Item Selection Criteria for Polytomous Items,” Applied Psychological Measurement, 33, 419–440.

Chong, D., and Druckman, J. N. (2013), “Counterframing Effects,” Journal of Politics, 75, 1–16.

Christie, R., Geis, F. L., and Berger, D. (1970), Studies in Machiavellianism, New York: Academic Press.

Druckman, J. N. (2004), “Political Preference Formation: Competition, Deliberation, and the (Ir)Relevance of Framing Effects,” American Political Science Review, 98, 671–686.

Embretson, S. E. (1996), “The New Rules of Measurement,” Psychological Assessment, 8, 341–349.

Erikson, R. S., Panagopoulos, C., and Wlezien, C. (2004), “Likely (and Unlikely) Voters and the Assessment of Campaign Dynamics,” Public Opinion Quarterly, 68, 588–601.

Glas, C. A. W. (2010), “Item Parameter Estimations and Item Fit Analysis,” in Elements of Adaptive Testing, eds. W. J. van der Linden and C. A. W. Glas, pp. 269–288, New York: Springer.

Gosling, S. D., Rentfrow, P. J., and Swann, W. B. (2003), “A Very Brief Measure of the Big-Five Personality Domains,” Journal of Research in Personality, 37, 504–528.

Hol, A. M., Vorst, H. C. M., and Mellenbergh, G. J. (2007), “Computerized Adaptive Testing for Polytomous Motivation Items: Administration Mode Effects and a Comparison with Short Forms,” Applied Psychological Measurement, 31, 412–429.

Jarvis, W. B. G., and Petty, R. E. (1996), “The Need to Evaluate,” Journal of Personality and Social Psychology, 70, 172–194.

John, O. P., Donahue, E. M., and Kentle, R. L. (1991), The “Big Five” Inventory—Versions 4a and 52, Berkeley, CA: University of California, Berkeley, Institute of Personality and Social Research.

Kim, S.-H., Cohen, A. S., and Park, T.-H. (1995), “Detection of Differential Item Functioning in Multiple Groups,” Journal of Educational Measurement, 32, 261–276.

Kingsbury, G. G., and Weiss, D. J. (1983), “A Comparison of IRT-Based Adaptive Mastery Testing and a Sequential Mastery Testing Procedure,” in New Horizons in Testing: Latent Trait Test Theory and Computerized Adaptive Testing, ed. D. J. Weiss, New York: Academic Press.

Ling, J., Burton, T. C., Salt, J. L., and Muncer, S. J. (2009), “Psychometric Analysis of the Systemizing Quotient (SQ) Scale,” British Journal of Psychology, 100, 539–552.

Maio, G. R., and Esses, V. M. (2001), “The Need for Affect: Individual Differences in the Motivation to Approach or Avoid Emotions,” Journal of Personality, 69, 583–614.

McCrae, R. R., and John, O. P. (1992), “An Introduction to the Five-Factor Model and Its Applications,” Journal of Personality, 60, 175–215.

Mondak, J. J., Hibbing, M. V., Canache, D., Seligson, M. A., and Anderson, M. R. (2010), “Personality and Civic Engagement: An Integrative Framework for the Study of Trait Effects on Political Behavior,” American Political Science Review, 104, 85–110.

Montgomery, J. M., and Cutler, J. (2013), “Computerized Adaptive Testing for Public Opinion Surveys,” Political Analysis, 21, 172–192.

Muncer, S. J., and Ling, J. (2006), “Psychometric Analysis of the Empathy Quotient (EQ) Scale,” Personality and Individual Differences, 40, 1111–1119.

Peña, Y., and Sidanius, J. (2002), “US Patriotism and Ideologies of Group Dominance: A Tale of Asymmetry,” The Journal of Social Psychology, 142, 782–790.

Pratto, F., Sidanius, J., Stallworth, L. M., and Malle, B. F. (1994), “Social Dominance Orientation: A Personality Variable Predicting Social and Political Attitudes,” Journal of Personality and Social Psychology, 67, 741.

R Core Team (2017), R: A Language and Environment for Statistical Computing, Vienna: R Foundation for Statistical Computing, available at https://www.R-project.org/.

Raskin, R., and Terry, H. (1988), “A Principal-Components Analysis of the Narcissistic Personality Inventory and Further Evidence of Its Construct Validity,” Journal of Personality and Social Psychology, 54, 890–902.

Rauthmann, J. F. (2013), “Investigating the MACH-IV with Item Response Theory and Proposing the Trimmed MACH*,” Journal of Personality Assessment, 95, 388–397.

Rizopoulos, D. (2006), “ltm: An R Package for Latent Variable Modeling and Item Response Theory Analyses,” Journal of Statistical Software, 17, 1–25.

Samejima, F. (1969), “Estimation of Latent Ability Using a Response Pattern of Graded Scores,” Psychometrika Monograph Supplement, 34(4), 100.

Schwartz, S. H. (1992), “Universals in the Content and Structure of Values: Theoretical Advances and Empirical Tests in 20 Countries,” in Advances in Experimental Social Psychology (Vol. 25), ed. M. P. Zanna, pp. 1–65, San Diego, CA: Academic Press.

Segall, D. O. (1996), “Multidimensional Adaptive Testing,” Psychometrika, 61, 331–354.

Segall, D. O. (2005), “Computerized Adaptive Testing,” in Encyclopedia of Social Measurement (Vol. 1), ed. K. Kempf-Leonard, pp. 429–438, Oxford: Elsevier.

Sidanius, J., Pratto, F., Van Laar, C., and Levin, S. (2004), “Social Dominance Theory: Its Agenda and Method,” Political Psychology, 25, 845–880.

Toplak, M. E., West, R. F., and Stanovich, K. E. (2011), “The Cognitive Reflection Test as a Predictor of Performance on Heuristics-and-Biases Tasks,” Memory & Cognition, 39, 1275.

van der Linden, W. J. (1998), “Bayesian Item Selection Criteria for Adaptive Testing,” Psychometrika, 63, 201–216.

van der Linden, W. J. (1999), “Empirical Initialization of the Trait Estimator in Adaptive Testing,” Applied Psychological Measurement, 23, 21–29.

van der Linden, W. J., and Glas, C. A. W. (2000), “Capitalization on Item Calibration Error in Adaptive Testing,” Applied Measurement in Education, 13, 35–53.

van der Linden, W. J., and Pashley, P. J. (2010), Elements of Adaptive Testing, New York: Springer.

Wakabayashi, A., Baron-Cohen, S., Wheelwright, S., Goldenfeld, N., Delaney, J., Fine, D., Smith, R., and Weil, L. (2006), “Development of Short Forms of the Empathy Quotient (EQ-Short) and the Systemizing Quotient (SQ-Short),” Personality and Individual Differences, 41, 929–940.

Wang, W., Tay, L., and Drasgow, F. (2013), “Detecting Differential Item Functioning of Polytomous Items for an Ideal Point Response Process,” Applied Psychological Measurement, 37, 316–335.

Weiss, D. J. (1982), “Improving Measurement Quality and Efficiency with Adaptive Testing,” Applied Psychological Measurement, 6, 473–492.

Weiss, D. J., and Kingsbury, G. G. (1984), “Application of Computerized Adaptive Testing to Educational Problems,” Journal of Educational Measurement, 21, 361–375.

Yook, K. C., and Everett, R. (2003), “Assessing Risk Tolerance: Questioning the Questionnaire Method,” Journal of Financial Planning, 16, 48.
