Message Pretesting Using Assessments of Expected or Perceived Persuasiveness: Evidence About Diagnosticity of Relative Actual Persuasiveness

Do formative assessments of the expected or perceived effectiveness of persuasive messages provide a good guide to the relative actual effectiveness of those messages? The correlational evidence usually invoked on this question is less than ideal. The most relevant evidence compares two messages' relative standing on perceived message effectiveness (PME) and actual message effectiveness (AME) as assessed in separate samples. Across 151 such comparisons, the direction of difference in PME matched that of AME in 58% of cases (ns). Diagnostic accuracy did not differ significantly depending on the size or significance of the PME difference, the size of the PME sample, whether PME assessments came from experts or target-audience representatives, the referent of the PME assessment, or whether the PME assessment involved comparing messages.

and how different forms of iterative feedback can improve web designers' effectiveness (Dow et al., 2010).
A common practice in formative research is that of eliciting participant preferences for alternative messages or formats. For example, in the context of risk communication, pretest participants may be asked for assessments of alternative ways of presenting risk information, with these preferences then shaping subsequent message design (e.g., Morgan, Fischhoff, Bostrom, & Atman, 2002, pp. 111-120).
This article focuses on the realization of this kind of practice in the specific context of persuasive communication design: pretesting messages or message concepts by asking respondents how persuasive or convincing a message will be, or how important various factors are in influencing the behavior of interest. Such data are used to inform message design decisions; designers use the messages thought to be more effective, or create messages focused on the factors described as having the largest influence.
The purpose of this article is to assess the research evidence bearing on this practice. The central question is whether such pretest data are dependably diagnostic of differences in actual message persuasiveness. In what follows, the current focus is initially concretized by considering examples of formative assessments of perceived or expected persuasiveness. The discussion then addresses how to identify the best evidence bearing on the diagnosticity of those assessments. The article then turns to locating and analyzing such evidence.

Examples of assessments
Formative research often gathers information concerning the likely effects of persuasive messages by asking respondents about expected or perceived persuasiveness. The purpose of these assessments is to provide guidance about future actual message effectiveness (AME), understood in this context as a message's producing the intended persuasive effects on attitudes, intentions, or behaviors.
Some investigators have used a straightforward one- or two-item assessment. For example, in a study of messages to promote stair climbing, Webb and Eves (2007, p. 51) asked respondents "to rate how much each message would encourage them to use the stairs." In a study of antismoking advertisements, Pechmann, Zhao, Goldberg, and Reibling (2003, p. 6) asked seventh- and tenth-graders to assess ads using a single question: "Overall, I think this ad is effective for kids my age" (with a 5-point response scale anchored by "strongly agree" and "strongly disagree"). Byrne, Katz, Mathios, and Niederdeppe (2015) had participants evaluate cigarette package warnings by responding to two items: "The images I just viewed are convincing" and "The images I just viewed would have the intended effect." (Similarly, see Dillard & Ha, 2016; Ganzach, Weber, & Ben-Or, 1997, Study 2; Malo, Gilkey, Hall, Shah, & Brewer, 2016; Noar, Palmgreen, Zimmerman, Lustria, & Li, 2010; Volk et al., 2015.)

Some researchers have obtained message evaluations using multi-item scales with a variety of items, with some (but not all) items specifically focused on persuasiveness (or convincingness, effectiveness, etc.), reporting that such effect-focused items were sufficiently highly correlated with others to warrant combining the items. For example, Popova, Neilands, and Ling (2014) used five scales with end-anchors of convincing-unconvincing, effective-ineffective, believable-unbelievable, realistic-unrealistic, and memorable-not memorable (α = .95). (Similarly, see Bogale, Boer, & Seydel, 2010; Jasek et al., 2015; McLean et al., 2016; Santa & Cochran, 2008; cf.
Dillard, 2013.) In some studies, participants have been asked to rank-order messages in terms of effectiveness (e.g., Mendez et al., 2012) or to indicate the most effective message from a set (e.g., Healey & Hoek, 2016; Hernandez et al., 2014). In related procedures, message design has been guided by pretest respondents' assessments of the importance of various factors in influencing the behavior of interest. For example, Latimer et al. (2012) had participants rate, on a 10-point scale, how much each of nine different factors affected their desire to quit smoking (factors such as short-term health risks, long-term health risks, and financial costs); this information then shaped the selection of content for subsequent antismoking videos. (Similarly, see Bartlett, Webb, & Hawley, 2017; Glynn et al., 2003; Nolan, Schultz, Cialdini, Goldstein, & Griskevicius, 2008, Study 1.) 1

Formative-research participants are commonly representatives of the eventual target audience of interest. For example, in developing a condom-use campaign aimed at African American women, Hood, Shook, and Belgrave (2017) undertook formative research with African American women as participants. But in some studies, participants are relevant experts: professionals working in the substantive domain of interest, marketing experts, and the like. For example, Taylor (2015, p. 1169) had 86 infection-control professionals assess six message strategies aimed at encouraging handwashing, asking them to rate whether "the statement will lead to health care workers washing their hands more often."

"Perceived effectiveness"

At least some of the assessments of interest here have previously been studied under the label of "perceived effectiveness" (PE) (e.g., Dillard, Shen, & Vail, 2007) or "perceived message effectiveness" (PME) (Yzer, LoRusso, & Nagler, 2015). As Yzer et al. (2015) have pointed out, there is considerable diversity among the measures that have been collected under these labels. But one element common to many such measures, though not all, has been obtaining people's perceptions of whether the message influenced them.
The current project is both broader and narrower than an interest in PE thus understood, that is, perceived effects on oneself. It is broader because the interest here is with any sort of assessment that bears on likely (future) persuasiveness, not just those concerning whether respondents believed that the message was effective in influencing them. For example, a question such as "which of these two messages would be more persuasive?" does not ask respondents whether a given message persuaded them, and so by some definitions might not be taken to be a measure of PE.
The current project is also narrower, because PE-for-oneself represents a phenomenon worthy of study in its own right. For example, researchers might pursue questions concerning the causes and effects of people's believing they were persuaded by a message. Such questions are valuable, but are not the current interest.
The specific focus of the present project is the practice of asking formative-research participants for assessments of expected or perceived persuasiveness, either for the self or for others. For reasons of convenience and familiarity, the assessments of interest are given the acronym PME, with the understanding that this includes measures that under some definitions would not be included under that label. 2 The commonality of this practice bespeaks a belief that such assessments are useful for message design purposes, and specifically a belief that such assessments are diagnostic of differences in AME. The next section considers what evidence might underwrite such a belief.

Correlational evidence
Because the question at hand concerns the predictability of AME from PME, PME-AME correlations would seem to be a natural source of evidence. For example, Dillard, Weber, and Vail's (2007) meta-analysis reviewed 40 cases, reporting that the mean correlation between PME and AME was .41. Their conclusion was that "overall, the results empirically demonstrate the value of PE judgments in formative research" (p. 613).
Such correlational findings are commonly invoked as underwriting the formative use of PME data. As Yzer et al. (2015, p. 125) put it, "Clearly, if PE measures can predict the likely effects of a health message with sufficient precision, then PE can at the very least help filter out ineffective messages before allocating resources to message implementation." (For other examples, see Brennan, Durkin, Wakefield, & Kashima, 2014; Choi & Cho, 2016; Davis, Nonnemaker, Duke, & Farrelly, 2013; Davis, Uhrig, Bann, Rupert, & Fraze, 2011; Noar et al., 2016.) Similarly, weak or negative PME-AME correlations have been offered as a reason for thinking that PME data will not be diagnostic of AME (e.g., O'Keefe, 2002, p. 28).
But the utility of PME-AME correlations for assessing the diagnosticity of PME data varies depending on how the PME-AME correlation is computed. PME-AME correlations might be computed either within the data for a given message or across the data for a set of messages. (For an earlier treatment of this distinction, see Dillard & Ha, 2016.)

Within-message PME-AME correlations

When a PME-AME correlation is computed within the data for a single message, that correlation does not provide evidence relevant to the question of the diagnosticity of PME data for formative decisions. Even if messages' PME ratings are individually (within-message) very strongly positively correlated with their AME ratings, that does not necessarily mean that the relative PME standing of two messages will match the relative AME standing of those messages.
Abstractly put, the reason is that such correlations do not contain information about the means of the variables involved. To see this, imagine a small data set in which PME and AME data (with each variable scored from 0 to 100) are available for two messages, with n = 5 for each message. For message A, the participants have the following (PME, AME) pairs of scores: (77, 47), (76, 46), (75, 45), (74, 44), and (73, 43). For message B, the participants have the following (PME, AME) pairs of scores: (52, 62), (51, 61), (50, 60), (49, 59), and (48, 58). For message A, the PME-AME correlation is +1.00; for message B, the PME-AME correlation is also +1.00. Message A has a better mean PME score (75.0) than message B (50.0), but message B has a better mean AME score (60.0) than message A (45.0).
Even though in this hypothetical data set PME and AME are perfectly positively correlated within each message, the messages' relative standing on PME is the opposite of their relative standing on AME. As this illustrates, positive within-message PME-AME correlations do not, and cannot, show that relative PME standing will match relative AME standing.
For similar reasons, weak or negative within-message PME-AME correlations also cannot provide good evidence. Imagine a second small data set (with PME and AME again scored from 0 to 100). For message C, the five participants have the following (PME, AME) pairs of scores: (82, 66), (81, 68), (80, 70), (79, 72), and (78, 74). For message D, the five participants have the following (PME, AME) pairs of scores: (62, 20), (61, 25), (60, 30), (59, 35), and (58, 40). For message C, the PME-AME correlation is −1.00; for message D, the PME-AME correlation is also −1.00. Even so, the relative standing of the two messages on PME matches their relative standing on AME: message C has a better mean PME score (80.0) than message D (60.0), and message C also has a better mean AME score (70.0) than message D (30.0). As this illustrates, it is possible for within-message PME-AME correlations to be strongly negative and yet for PME data to give the right answer about which of two messages will actually be more effective.
In short, within-message PME-AME correlations are not relevant to the question of whether messages' relative standing on PME is diagnostic of their relative standing on AME.

Across-message PME-AME correlations
When PME and AME data are collected concerning two messages, the PME-AME correlation computed for data combined across messages does provide evidence relevant to the question of the diagnosticity of PME data. Strong positive across-message PME-AME correlations are an indication that messages' relative standing on PME will be diagnostic of their relative standing on AME; weak or negative correlations are a sign of poor diagnosticity. The two hypothetical data sets above illustrate this: Across the data for messages A and B, the PME-AME correlation is -.96, correctly indicating poor diagnosticity; across the data for messages C and D, the PME-AME correlation is .92, correctly suggesting good diagnosticity. 3

Correlational evidence reconsidered

Whether PME-AME correlations are relevant to the diagnosticity of PME assessments depends on whether those correlations are within-message or across-message correlations. But the distinction between these two kinds of correlation does not appear to have been sufficiently appreciated. For example, Dillard, Weber, et al.'s (2007) meta-analysis of PME-AME correlations included both irrelevant within-message correlations (e.g., Hullett, 2004) and relevant across-message correlations (e.g., Hullett, 2002).
Some PME-AME correlations that have been offered as relevant to the formative use of PME assessments for message selection have been irrelevant within-message correlations. For example, Davis et al.'s (2011) study of an HIV testing campaign found that perceived ad effectiveness predicted subsequent intentions, and so concluded that their PME measures "may be useful for quantitatively pretesting messages in future campaigns" (p. 58). But in this research, all participants were exposed to the same campaign materials. That is, the PME-AME correlations were within-message correlations, and hence are not relevant to the use of PME data for pretesting alternative messages. But other proffered PME-AME correlations have been the relevant sort, across-message correlations (e.g., Davis et al., 2013; Popova et al., 2014).
However, even though across-message PME-AME correlations are relevant to the assessment of the utility of PME measures in formative research, such correlations are an imperfect source of evidence. To compute across-message PME-AME correlations, the same participants must supply both PME and AME data, as when participants are exposed to a message and then complete both PME measures and AME measures (e.g., Chen, McGlone, & Bell, 2015). This might upwardly bias across-message correlations. If a participant reports that the sunscreen message she saw was very effective and then is asked "do you intend to wear sunscreen in the future?", one might reasonably expect some rough consistency.

Comparison of relative PME and AME standing
Against this backdrop, an alternative source of evidence naturally recommends itself. Because the concrete formative task uses the relative standing of two messages with respect to PME as a guide to the relative standing of those messages with respect to AME, more suitable evidence will consist simply of data that permit one to see whether relative PME standing matches relative AME standing.
Most formative studies that have collected PME data do not provide relevant evidence, because, understandably, those studies do not collect appropriate AME data. A message designer who collects PME data during formative research will choose the message that seems most likely to be effective, meaning that no comparative AME data would be available.
Thus the kind of study that will provide the most suitable data is one in which PME data are collected on at least two messages from one set of participants, and AME data are collected on those same messages but from a different set of participants. Such studies maximize the realism (external validity) of the research design, at least in the sense of paralleling the circumstances faced in formative research.
A simple metric

Given a set of such studies, one can compute what amounts to a batting average: the percentage of cases in which two messages' relative PME standing matches their relative AME standing. If relative PME standing correctly predicts relative AME standing in (say) 90% of cases, then PME data will look to be a very good guide for message selection; on the other hand, if relative PME standing were to match relative AME standing only 50% of the time, then message designers would be just as well served by randomly choosing a message as by collecting PME data.
This metric speaks directly to the needs of formative researchers in ways that across-message correlational data do not. Formative decisions are commonly based on a simple comparison of PME means, with or without a significance test (see, e.g., Maddock et al., 2008; Malo et al., 2016; Mendez et al., 2012; Pechmann et al., 2003; Webb & Eves, 2007). It is entirely reasonable for a message designer to want to have an answer to the question "If I follow this decision procedure, how often will I be choosing the more effective message?" Across-message correlations obscure, rather than clarify, the degree of diagnosticity of PME data. It is not obvious just how good an indicator PME assessments are if the average across-message PME-AME correlation is .63 or .09 or .41, but the diagnosticity is apparent if correct predictions are found to occur in 68% or 85% or 51% of cases.

Identification of relevant cases
Literature search

Relevant research reports were located by searching Google, PsycINFO, ProQuest Dissertations and Theses Global, and the Web of Science databases for "perceived message effectiveness" (and the latter for "perceived effectiveness"); through personal knowledge of the literature; and by examining review discussions of formative research (e.g., Abraham & Kools, 2012; Atkin & Freimuth, 2013), citations to key articles (Dillard, Weber, et al., 2007; Yzer et al., 2015; Zhao, Strasser, Cappella, Lerman, & Fishbein, 2011), and reference lists in relevant reports.

Inclusion criteria
To be included, a study had to meet three criteria. First, the study had to provide quantitative PME and AME data on each of two (or more) messages, such that it was possible to compare messages' relative rankings on PME and AME. Excluded by this criterion were studies in which PME was assessed through focus groups or other non-quantitative methods (e.g., Booth-Butterfield et al., 2007); focus groups, although useful in formative research, might be thought to provide insufficiently systematic assessments of PME. Also excluded by this criterion were studies in which the effects of individual messages or message types could not be distinguished (Bigsby, Cappella, & Seitz, 2013) and studies that assessed messages' PME but not AME (e.g., Andsager, Austin, & Pinkleton, 2001).
To be included as a measure of PME, a measure had to provide some manifest assessment of a message's expected or perceived persuasive effectiveness. This criterion thus included indices composed entirely of effect-oriented questions, whether a single item (e.g., Pechmann et al., 2003) or a multi-item index (e.g., Byrne et al., 2015). A multi-item index containing both effect-oriented items and other items (e.g., focused on other attributes such as "logical" or "biased") was included only if there was appropriate evidence of inter-item consistency such as Cronbach's alpha (e.g., Hullett, 2000). This criterion also included assessments in which participants rank-ordered or selected messages or message contents based on expected persuasiveness (e.g., Paul, Redman, & Sanson-Fisher, 1997). This criterion excluded measures that did not directly assess expected or perceived persuasiveness (e.g., measures of message liking, memorability, bias, clarity, and so forth; e.g., Latimer et al., 2012). To determine relative PME standing when more than one measure was available, an effect size (r) was computed for each and then these were averaged (all such averages used the r-to-z-to-r transformation procedure, weighted by n).
To be included as a measure of AME, a measure had to assess one or more of three common persuasion outcomes: attitude, intention, and behavior. To determine relative AME standing when more than one measure was available, an effect size (r) was computed for each and then these were averaged. These three measures were treated as interchangeable because relative persuasiveness appears to be invariant across these three outcomes (O'Keefe, 2013).
Second, the PME data and AME data had to come from different sets of participants; those providing PME data had to be either plausible representatives of a potential target audience or putative experts in the relevant domain, and those providing AME data had to represent a corresponding potential target audience. This criterion maximized the similarity of the included cases to the circumstance of formative research. Excluded by this criterion were studies in which PME and AME data came from the same participants (e.g., Jasek et al., 2015) and studies with PME data from participants who were neither experts nor representatives of a potential target audience (Dillard & Ha, 2016).
Third, the messages being compared had to represent plausible formative message comparisons, that is, ones that might reasonably arise in formative research. This criterion excluded cases in which researchers purposefully set out to create relatively ineffective messages (e.g., Druckman, Peterson, & Slothuus, 2013), such as research on elaboration likelihood model hypotheses about argument quality (e.g., Cacioppo, Petty, & Morris, 1983).
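The effect-size averaging described in the first criterion (r-to-z-to-r, weighted by n) can be sketched as follows; the correlation values and sample sizes here are invented for illustration:

```python
import math

def average_r(rs, ns):
    """Average correlations via Fisher's r-to-z transform,
    weighting each z by its sample size, then back-transforming."""
    zs = [math.atanh(r) for r in rs]
    mean_z = sum(n * z for n, z in zip(ns, zs)) / sum(ns)
    return math.tanh(mean_z)

# e.g., two PME measures available for the same message pair (invented values)
print(round(average_r([0.30, 0.50], [100, 50]), 3))  # 0.371
```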

Main analysis

Unit of analysis
The main unit of analysis was the message pair. A study that reported data for only two messages thus provided one case (one message pair). When a design had three or more messages not arising from a factorial design, all possible message pairs were included. For example, Byrne et al. (2015) studied five substantively different designs for antismoking messaging on cigarette packages, which yielded 10 message pairs. When a design had three or more messages because two or more message variables were manipulated in a factorial design, only the contrasts associated with the message factors were included. For example, Piccolino's (1966) study of safety messages had 12 experimental messages generated by a 3 (threat: high, medium, low) × 2 (realism: high, low) × 2 (specificity: high, low) design. Rather than examining all 66 pairs, only the comparisons associated with each message factor were included. This choice was meant to reflect the likely formative interest in learning about the message factors (as opposed to any particular message).

Metric
The metric of interest was whether, for each message pair, the direction of effect on PME (i.e., which message had the higher PME mean) matched the direction of effect on AME (which message had the higher AME mean). 4 If two messages differed on PME but had identical AME means, that case was scored as having the same direction of effect; in such circumstances, the PME data would not have led to choosing a demonstrably inferior message (and so the PME data would not have led to a poor message design decision). If two messages had identical PME means but differed on AME, that case was scored as having different directions of effect for PME and AME; in such a circumstance, the PME data would not have identified the more effective message.
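The scoring rule just described, including its handling of ties, can be expressed as a small function; this is a sketch, and the mean values in the example are hypothetical:

```python
def directions_match(pme_1, pme_2, ame_1, ame_2):
    """Score one message pair: does the direction of the PME
    difference match the direction of the AME difference?"""
    if ame_1 == ame_2:
        # AME tie: PME cannot point to a demonstrably inferior
        # message, so the pair is scored as a match.
        return True
    if pme_1 == pme_2:
        # PME tie but AME difference: PME fails to identify
        # the more effective message, so it is scored a mismatch.
        return False
    return (pme_1 > pme_2) == (ame_1 > ame_2)

# Hypothetical mean scores for three message pairs
pairs = [(4.2, 3.8, 0.31, 0.24),   # same direction: match
         (4.0, 4.5, 0.40, 0.22),   # opposite directions: mismatch
         (3.9, 3.9, 0.35, 0.30)]   # PME tie, AME difference: mismatch
hits = sum(directions_match(*p) for p in pairs)
print(f"{hits}/{len(pairs)} pairs matched")  # 1/3 pairs matched
```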

Other properties
For each case, six other properties were also recorded (when relevant information was available, either in the research report or through correspondence with authors), because each represented a potential moderator of PME diagnosticity. Several of these concerned the PME effect size, that is, the difference between the PME mean for one message and the PME mean for the other message; the effect size was computed as r and then converted to d for reporting. (When more than one PME measure was available, an average effect size was computed.) First, whether the PME effect size was statistically significant: Diagnosticity might be greater where significant differences are found between the two messages' PME values.
Second, the magnitude of the PME effect size: Larger PME differences between the two messages, whether statistically significant or not, might be expected to be more diagnostic.
Third, the PME sample size, that is, the number of participants contributing to the PME effect size: Independent of whether a statistically significant PME effect is observed, diagnosticity might be expected to be greater as the size of the pretest sample increases.
Fourth, whether the participants providing PME data were representatives of the target audience or were experts: Data from one kind of participant might be more diagnostic.
Fifth, the referent for the PME assessment: As Yzer et al. (2015) noted, PME assessments vary in the specification of the referent. Some assessments ask about the respondent (e.g., "would this motivate you to do X?"). Some assessments ask about convincing other people (sometimes specifying a particular sort of other: "kids your age," "green consumers," etc.). Some leave the referent unspecified, as when respondents are asked to rate how "persuasive" or "convincing" a message is, without specifying for whom. And some combine two or more referents, as when respondents are asked both how convincing a message is and whether it persuaded them.
Sixth, whether the PME assessment was comparative or non-comparative: It might be that PME assessments will be more diagnostic when respondents make what amount to comparative judgments about relative persuasiveness, as opposed to rating a single message. A PME assessment was classified as comparative if a PME respondent assessed both messages (or message kinds) that were compared on AME. So, for example, cases in which the PME assessment involved rank-ordering a set of messages or rating a number of different kinds of message were classified as comparative. Cases in which a PME respondent assessed only one message or message type were classified as non-comparative.

Additional analysis
The collected cases were also analyzed in a second way to address potential issues of statistical independence. Such issues can arise when a given study examines more than two messages. For example, a study with three messages (A, B, and C) yields three message pairs (A vs. B, A vs. C, and B vs. C), but the same participants would contribute to more than one comparison. As long as message pair is the unit of analysis, this problem can be avoided only by discarding cases, and there is no obviously defensible way of choosing cases to be put aside.
Hence the data were also analyzed using study as the unit of analysis, with the rank-order correlation between the PME standings and the AME standings computed for each study. A meta-analytic mean correlation was then computed across studies. This study-based metric is not quite as transparent as that based on message pairs, because the mean rank-order correlation does not convey the diagnosticity of relative PME standing in a straightforward manner. But this additional way of analyzing the data is useful for addressing issues of statistical independence.
Results

Across all 151 message pairs, the direction of difference in PME matched the direction of difference in AME in 58% of the cases (87/151, .576); see Table 1. This proportion is not significantly different from .50 (z = 1.87, p = .061). With 151 cases, power for a population effect of .75 exceeded .99, and for a population effect of .65 was .96. 5
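The significance tests reported throughout this section can be reproduced with the usual two-sided z-test of a proportion against .50; a sketch using the standard library, checked here against the overall result (87 matches in 151 pairs):

```python
import math

def z_test_vs_half(hits, n):
    """Two-sided z-test of an observed proportion hits/n
    against a null value of .50."""
    p_hat = hits / n
    z = (p_hat - 0.5) / math.sqrt(0.25 / n)
    # Two-sided p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

z, p = z_test_vs_half(87, 151)
print(round(z, 2), round(p, 3))  # 1.87 0.061
```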

Significance of PME effects
In the 72 cases in which the observed PME effect was statistically significantly different from zero, the direction of difference in PME matched that of AME in 67% of the cases (48/72, .667). This proportion is significantly different from .50 (z = 2.83, p = .005).

In the 53 cases in which the observed PME effect was not statistically significantly different from zero, the direction of difference in PME matched that of AME in 53% of the cases (28/53, .528). This proportion is not significantly different from .50 (z = .41, p = .680). With 53 cases, power for a population effect of .75 was .97, and for a population effect of .65 was .59.

Size of PME effects
A median split distinguished cases with a relatively large (d ≥ .326, k = 63) or small (d < .326, k = 62) PME effect size. In the 63 cases with a relatively large effect size, the direction of difference in PME matched that of AME in 63% of the cases (40/63, .635). This proportion is significantly different from .50 (z = 2.14, p = .032).

In the 62 cases with a relatively small effect size, the direction of difference in PME matched that of AME in 58% of the cases (36/62, .581). This proportion is not significantly different from .50 (z = 1.27, p = .204). With 62 cases, power for a population effect of .75 was .99, and for a population effect of .65 was .67.

Size of PME sample
A median split distinguished cases with a relatively large (N ≥ 87, k = 74) or small (N < 87, k = 77) PME sample size. In the 74 cases with a relatively large sample, the direction of difference in PME matched that of AME in 61% of the cases (45/74, .608). This proportion is not significantly different from .50 (z = 1.86, p = .063). With 74 cases, power for a population effect of .75 exceeded .99, and for a population effect of .65 was .74.

In the 77 cases with a relatively small sample, the direction of difference in PME matched that of AME in 55% of the cases (42/77, .545). This proportion is not significantly different from .50 (z = .80, p = .425). With 77 cases, power for a population effect of .75 exceeded .99, and for a population effect of .65 was .76. These two proportions (.608 and .545) are not significantly different: χ²(1) = .603, p = .438. For a medium-sized difference between proportions (per Cohen, 1988), power was .85.

Composition of PME sample
In the 133 cases in which PME data were obtained from representatives of the target audience, the direction of difference in PME matched that of AME in 57% of the cases (76/133, .571). This proportion is not significantly different from .50 (z = 1.65, p = .100). With 133 cases, power for a population effect of .75 exceeded .99, and for a population effect of .65 was .94.

In the 18 cases in which PME data were obtained from experts, the direction of difference in PME matched that of AME in 61% of the cases (11/18, .611). This proportion is not significantly different from .50 (z = .94, p = .346). With 18 cases, power for a population effect of .75 was .57, and for a population effect of .65 was .24.

Referent of PME assessment
When the formative-research participants were experts, the PME referent was, understandably, never the self; experts were not asked how they themselves would react. To remove this confound, the analysis of the PME-referent moderator examined only cases in which the participants were representatives of the target audience.
In the 18 cases in which the referent of the PME assessment was the respondent, the direction of difference in PME matched that of AME in 56% of the cases (10/18, .556). This proportion is not significantly different from .50 (z = .47, p = .637). With 18 cases, power for a population effect of .75 was .57, and for a population effect of .65 was .24.
In the 59 cases in which the referent of the PME assessment was someone other than the respondent, the direction of difference in PME matched that of AME in 53% of the cases (31/59, .525). This proportion is not significantly different from .50 (z = .391, p = .696). With 59 cases, power for a population effect of .75 was .99, and for a population effect of .65 was .64.
In the five cases in which the PME assessment had multiple referents, the direction of difference in PME matched that of AME in 60% of the cases (3/5, .600). This proportion is not significantly different from .50 (z = .447, p = .655). With five cases, power for a population effect of .75 was .17, and for a population effect of .65 was .09.
In the 51 cases in which the referent of the PME assessment was unspecified, the direction of difference in PME matched that of AME in 63% of the cases (32/51, .627). This proportion is not significantly different from .50 (z = 1.82, p = .069). With 51 cases, power for a population effect of .75 was .97, and for a population effect of .65 was .58.
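The power figures reported throughout these analyses follow from the standard normal approximation for a two-sided test of a proportion. A minimal sketch of that computation, checked here against the n = 18 and n = 77 values:

```python
import math

def norm_cdf(x):
    # Standard normal cumulative distribution function.
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power_prop_test(n, p_alt, p0=0.5):
    """Approximate power of a two-sided z-test of a proportion at alpha = .05,
    using the normal approximation with the alternative-hypothesis SE."""
    z_crit = 1.959964
    se0 = math.sqrt(p0 * (1 - p0) / n)          # SE under the null
    se1 = math.sqrt(p_alt * (1 - p_alt) / n)    # SE under the alternative
    upper = p0 + z_crit * se0
    lower = p0 - z_crit * se0
    # Probability that the observed proportion falls outside the acceptance region
    return (1 - norm_cdf((upper - p_alt) / se1)) + norm_cdf((lower - p_alt) / se1)

print(round(power_prop_test(18, 0.75), 2))  # ≈ 0.57, as reported for the n = 18 cases
print(round(power_prop_test(18, 0.65), 2))  # ≈ 0.24
print(round(power_prop_test(77, 0.65), 2))  # ≈ 0.76, as reported for the 77 small-sample cases
```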

Comparative vs. non-comparative PME assessments
In the 92 cases in which PME assessments were comparative, the direction of difference in PME matched that of AME in 55% of the cases (51/92, .554). This proportion is not significantly different from .50 (z = 1.04, p = .297). With 92 cases, power for a population effect of .75 exceeded .99, and for a population effect of .65 was .83.
In the 59 cases in which PME assessments were non-comparative, the direction of difference in PME matched that of AME in 61% of the cases (36/59, .610). This proportion is not significantly different from .50 (z = 1.69, p = .091). With 59 cases, power for a population effect of .75 was .99, and for a population effect of .65 was .64.

Additional analysis
A random-effects meta-analysis was undertaken examining the rank-order correlation of messages on PME and AME, using study as the unit of analysis and weighting cases by the number of participants providing PME data (Borenstein & Rothstein, 2005); the list of cases, with codings and reference citations, is archived at the Open Science Framework: osf.io/rh2bf. Across 35 cases, the mean rank-order correlation was −.053, not significantly different from zero (p = .916); the 95% confidence interval is [−.776, .730].
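The archived analysis was run in dedicated meta-analysis software. For readers who want to see the general shape of the computation, the sketch below implements a standard random-effects pooling of correlations (Fisher-z transformation with DerSimonian-Laird between-study variance, weighting each study by n − 3, the inverse of the Fisher-z variance). The correlations and sample sizes in the example are hypothetical, not the archived data.

```python
import math

def random_effects_meta(rs, ns):
    """DerSimonian-Laird random-effects mean of correlations via Fisher z.
    rs: per-study correlations; ns: per-study sample sizes."""
    zs = [0.5 * math.log((1 + r) / (1 - r)) for r in rs]  # Fisher z per study
    ws = [n - 3 for n in ns]                              # fixed-effect weights (1 / variance)
    zbar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    q = sum(w * (z - zbar) ** 2 for w, z in zip(ws, zs))  # heterogeneity statistic
    c = sum(ws) - sum(w ** 2 for w in ws) / sum(ws)
    tau2 = max(0.0, (q - (len(rs) - 1)) / c)              # between-study variance
    ws_re = [1 / (1 / w + tau2) for w in ws]              # random-effects weights
    z_re = sum(w * z for w, z in zip(ws_re, zs)) / sum(ws_re)
    se = math.sqrt(1 / sum(ws_re))
    back = lambda z: (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)  # inverse Fisher z
    return back(z_re), (back(z_re - 1.96 * se), back(z_re + 1.96 * se))

# Hypothetical example: five studies with mixed-sign correlations
mean_r, ci = random_effects_meta([0.3, -0.2, 0.1, -0.4, 0.05], [20, 35, 15, 40, 25])
print(round(mean_r, 3), [round(x, 3) for x in ci])
```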

The present results
These results offer at best a mixed picture concerning the formative use of assessments of perceived or expected persuasiveness. Across all cases, pretesting messages by asking respondents about perceived or expected persuasiveness was no more informative about relative actual persuasiveness than flipping a coin: Such measures matched the observed direction of actual difference only 58% of the time, a value statistically indistinguishable from 50% (despite excellent statistical power for detecting a population diagnosticity of 65%). And the mean correlation between PME rank and AME rank was almost zero (−.05); how messages ranked on PME was unrelated to how they ranked on AME.
There is a hint in these results that message designers might have more confidence in relying on PME assessments when the PME effect size is statistically significant or relatively large; under those conditions, the diagnosticity of PME measures was statistically significantly different from 50%. But even this conclusion must be tempered, because the relevant moderator tests did not yield significant effects: Studies with statistically significant PME effects were no more diagnostic than those with nonsignificant effects, and studies with relatively large PME differences were no more (or less) diagnostic than those with relatively small differences. Similarly, studies with participants representing the target audience were no more (or less) diagnostic than those with experts, studies with comparative PME assessments were no more (or less) diagnostic than those with noncomparative assessments, and diagnosticity did not vary as a function of the referent of the PME assessment.

Moving forward
Read optimistically, these results might at least suggest some potential usefulness of PME assessments. All of the observed mean diagnosticity values exceeded 50%, even if not always statistically significantly different from 50%. And the absence of dependable moderator effects might in some cases be ascribed to weak statistical power. Even so, the present results suggest that there is considerable room for improvement in the use of PME assessments for diagnosing relative message persuasiveness.
As a starting point for improving PME diagnosticity, consider that when pretest respondents assess how "persuasive" a message will be, they are presumably relying on their naive (perhaps nonconscious) conceptions of what makes messages persuasive. When a given message has the properties they associate with persuasiveness, respondents judge it as persuasive. Booth-Butterfield et al.'s (2007) pretesting of messages about risks to firefighters may provide a useful illustration: "Our message pretesting with focus groups (…) conclusively demonstrated that virtually every participant strongly preferred executions that featured more color, graphics, and design qualities. The typical government documents were almost uniformly declared to be less useful, attention getting, and memorable" (p. 87). But two subsequent randomized field experiments "found better reception and processing results with the standard format than with the high-design format" (p. 87). Plainly, pretest respondents had erroneous conceptions of what would make these messages effective.
Indeed, whenever PME pretesting is not diagnostic of actual differences in effectiveness, it might be that the pretest respondents had inaccurate lay theories of persuasiveness, theories that misled them. PME pretests can be expected to be diagnostic only when the relevant lay beliefs (on which the PME judgments are based) are accurate.
This reasoning suggests that the nature of the messages being pretested is a crucial influence on PME diagnosticity. The very same PME measure (e.g., "will this message be persuasive?") used in two different pretesting circumstances might vary dramatically in diagnosticity, not because of some shortcoming of the measure itself, but because of variations in respondents' accuracy in judging different kinds of messages.
Hence one general approach to improving the diagnosticity of PME measures may be to give more attention both to the lay beliefs that underlie PME judgments and to the messages being pretested. This will eventually require not only an articulated account of lay conceptions, but also an appropriate abstract framework for describing message variations, one that distinguishes different message variations on the basis of their susceptibility to accurate lay assessment. Improving the diagnosticity of PME assessments cannot simply be a matter of adjusting PME measurement procedures; researchers will also want to consider whether the variations in the messages being pretested are variations about which respondents are likely to have sound lay theories.
However, there may be limits to the improvement of PME diagnosticity. If message designers are sufficiently good at devising initial candidate messages that there are only small differences between them in effectiveness, then a pretesting procedure would need to be especially sensitive to detect such differences. It may not be possible for any pretesting procedure to be highly diagnostic under such circumstances.⁶ Perhaps the best way to pretest message effectiveness is to do just that: pretest message effectiveness. Over 25 years ago, the Advertising Research Foundation undertook to examine the predictive validity of a number of different advertising pretesting ("copy-testing") procedures, such as ad likeability and recall (Haley & Baldinger, 1991). Among all these, the best single general-purpose copy-testing measure appeared to be assessments of persuasion: brand preference, purchase intention, and the like (Rossiter & Eagleson, 1994).
Thus, in formative research, message designers might dispense with questions about expected or perceived persuasiveness and instead pretest messages for actual effectiveness (e.g., Whittingham, Ruiter, Zimbile, & Kok, 2008). That solution will not always be feasible; for instance, sometimes it will be difficult to recruit many participants from the target audience (for an example, see Siegel, Lienemann, & Rosenberg, 2017), and sometimes the number of candidate messages may be so large as to prevent efficient AME assessment (e.g., Bigsby et al., 2013). But where pretesting using AME assessments is possible, it surely should be considered.

The larger picture
The present discussion of pretesting persuasive messages can be seen as part of an emerging broad discussion of best practices for formative research. Indeed, the kinds of concerns raised here have also arisen in other formative contexts. For example, Barnes, Hanoch, Miron-Shatz, and Ozanne (2016) found that women preferring graphical risk formats had lower risk comprehension when information was presented in that format than when it was presented in a numeric format. This was particularly striking, given that less numerate women were more likely to prefer graphical rather than numeric risk formats. As Barnes et al. concluded, such findings "point to the potential perils of tailoring risk communication formats to patient preferences" (p. 1007).
Results such as these underscore the need for continuing attention to the practices of formative research.Designing effective messages is too important to be left to chance.

Acknowledgment
Thanks to Mark Barnett, Sahara Byrne, Jos Hornikx, Sherri Jean Katz, Christine Paul, and Connie Pechmann for additional information, and to Nancy Grant Harrington, Andy King, Associate Editor Robin Nabi, and three anonymous reviewers for manuscript comments.

Notes
1. All of the preceding examples provide some sort of quantitative basis for assessing expected or perceived persuasive effectiveness, but formative research can instead, or also, use qualitative assessments such as those derived from focus-group discussions (e.g., Booth-Butterfield, Welbourne, Williams, & Lewis, 2007; Maddock, Silbanuz, & Reger-Nash, 2008; Mowbray, Marcu, Godinho, Michie, & Yardley, 2016; Pollard et al., 2016; Riker et al., 2015).
2. Thus, in two ways, the label "perceived message effectiveness" is perhaps a bit misleading for the present purposes. First, any sort of assessment of perceived likely future persuasiveness is relevant to the present undertaking (not just respondents' reports about whether a message was persuasive to them). Second, the interest here is specifically with persuasive effectiveness (as opposed to, say, effectiveness in informing). But "perceived message effectiveness" (PME) is a familiar term and hence is used here.

Table 1. Results: Message-Pair Analysis
Note: The confidence interval (CI) is the 95% adjusted Wald confidence interval.