Truth or Consequences: The Intertemporal Consistency of Adolescent Self-report on the Youth Risk Behavior Survey

Surveys are the primary information source about adolescents’ health risk behaviors, but adolescents may not report their behaviors accurately. Survey data are used for formulating adolescent health policy, and inaccurate data can cause mistakes in policy creation and evaluation. The author used test-retest data from the Youth Risk Behavior Survey (United States, 2000) to compare adolescents’ responses to 72 questions about their risk behaviors at a 2-week interval. Each question was evaluated for prevalence change and 3 measures of unreliability: inconsistency (retraction and apparent initiation), agreement measured as tetrachoric correlation, and estimated error due to inconsistency assessed with a Bayesian method. Results showed that adolescents report their sex, drug, alcohol, and tobacco histories more consistently than other risk behaviors in a 2-week period, the opposite of their tendency over longer intervals. Compared with other Youth Risk Behavior Survey topics, most sex, drug, alcohol, and tobacco items had stable prevalence estimates, higher average agreement, and lower estimated measurement error. Adolescents reported their weight control behaviors more unreliably than other behaviors, which is particularly problematic because of the increased investment in adolescent obesity research and reliance on annual surveys for surveillance and policy evaluation. Most weight control items had unstable prevalence estimates, lower average agreement, and greater estimated measurement error than other topics.

Adolescents engage in risk behaviors such as smoking, illegal drug use, and early or unprotected sex that threaten their future health. Surveys are the primary source of information about many risk behaviors, and the only source for some behaviors (1). Federal, state, and local governments monitor risk behavior prevalence, set policy priorities, and promote legislation by using surveys, including the Youth Risk Behavior Survey (YRBS) (2,3) and Monitoring the Future (4). The reliability of survey information is important for accurately measuring changes over time, determining geographic areas and demographics with a greater prevalence of risk behavior, and targeting and evaluating public health interventions. Inaccurate data can easily lead to mistakes in policy creation and evaluation.
Inconsistent reports may also carry information on adolescents' beliefs about the identity salience of their behaviors, including which behaviors they see as most central to their identities. Respondents are likely to inaccurately report behavior that conflicts with their identities or values (26,27) and beliefs (26,28,29). For example, adults with greater levels of political interest are more likely to overreport voting (30-33), and respondents with more negative views of traffic violations and bankruptcy report fewer of their own traffic violations and bankruptcies (34). Adolescents' retraction of earlier-reported risk behaviors is most common for intimate, deviant, or illegal behaviors (20) and for experimental behaviors initially reported as infrequent (21,22,35). Adolescents seem to revise their pasts as their current behavior changes. Their retrospective reports of substance use are more highly correlated with self-reported present use than with actual past use (36). Adolescents who take a virginity pledge or become born-again Christians are more likely to retract earlier reports of having had sex, and adolescents who have sex or stop being born-again Christians are more likely to retract earlier reports of having taken a virginity pledge (24). Adolescents' self-image may also shape their reports: they are less likely to report weight control practices, both healthy and unhealthy (including exercise, diet, vomiting, and fasting), in interviews than in self-administered surveys (37), and some report drug use when they likely do not use drugs, as suggested by their reports of using fictitious drugs along with many other drugs (38,39).
This study compares adolescents' responses to 72 questions about their risk behaviors at a 2-week interval using methods that may overcome potential threats to validity in an earlier analysis of these data (40) and describe more aspects of survey response inconsistency. It assesses prevalence changes, measures unreliability in 3 ways, and identifies question properties that predict more reliable reporting.

Data
The data were derived from contingency tables from a 2-week test-retest reliability study of the YRBS conducted in 2000 by the Centers for Disease Control and Prevention (CDC) (40). The YRBS was first developed by invited participants in a 1989 CDC workshop, was validated by the Questionnaire Design Research Laboratory at the National Center for Health Statistics with laboratory and field testing with high school students, was revised 3 times before it was first administered in 1991 (41), and was tested twice for reliability (40,42).
The reliability study used a convenience cluster sample of classes from 61 schools in urban (48%), suburban (39%), and rural (13%) settings in 20 geographically dispersed US states plus the District of Columbia. On survey day, 77% of the students were present in class with a signed parental consent form. Of students completing the first survey, 89% completed the second survey. The CDC excluded questionnaires with fewer than 20 valid responses or with the same response option 15 times in a row, yielding a final sample of 4,619 students that overrepresented females, African Americans, grades 9 and 10, and ages 15-16 years and underrepresented whites, Latinos, grades 11 and 12, and ages 13-14 years (Table 1).
Students answered 97 questions from the YRBS in two 40-minute classroom periods between February and April 2000, at approximately a 2-week interval. To assure students' anonymity, the survey was administered by trained data collectors from Macro International (Washington, DC), and students alone had access to identification numbers used to link responses. The survey used a computer-scannable booklet with questions above answer choices to avoid off-by-1 errors. CDC dichotomized questions with multiple response categories into "no risk" and "at risk." The reliability study was conducted for CDC's internal use. CDC policy is not to disseminate data collected for internal use, so the full data set is unavailable (N.D. Brener, CDC, personal communication, 2006). CDC published an analysis of these reliability data that included prevalence data at each survey administration (p1, p2) and Cohen's kappa (κ) for 72 of 97 questions (omitted items include contraception and substance abuse at last sex) rounded to 1 decimal place (40). These published data can be used to recover the contingency tables. The number of respondents who said "yes" at both surveys was computed from p1, p2, and κ, and has an error due to rounding in the original paper of no more than 1 respondent.
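As an illustration of how a 2 x 2 test-retest table can be recovered from published summaries, the sketch below inverts the standard definition of Cohen's kappa. With chance agreement p_e = p1*p2 + (1 - p1)(1 - p2) and observed agreement p_o = 1 - p1 - p2 + 2a, where a is the proportion answering "yes" at both waves, solving kappa = (p_o - p_e)/(1 - p_e) for a gives a = p1*p2 + kappa*(p1 + p2 - 2*p1*p2)/2. The function name, the argument order, and the use of proportions rather than percentages are assumptions made for this example; the paper's own expression is not reproduced here.

```python
def recover_table(p1, p2, kappa, n):
    """Recover 2x2 test-retest counts from wave prevalences and Cohen's kappa.

    p1, p2: proportions answering "yes" at wave 1 and wave 2 (0-1 scale).
    kappa:  Cohen's kappa for the item.
    n:      number of respondents.
    Returns (yes_yes, yes_no, no_yes, no_no) as rounded counts.
    """
    # 1 - p_e simplifies to p1 + p2 - 2*p1*p2 for a 2x2 table.
    yes_yes = p1 * p2 + kappa * (p1 + p2 - 2 * p1 * p2) / 2
    yes_no = p1 - yes_yes             # reported at wave 1, denied at wave 2
    no_yes = p2 - yes_yes             # denied at wave 1, reported at wave 2
    no_no = 1 - p1 - p2 + yes_yes     # denied at both waves
    # Independent rounding means the four counts may not sum exactly to n.
    return tuple(round(n * cell) for cell in (yes_yes, yes_no, no_yes, no_no))


# Hypothetical item: 50% prevalence at both waves, kappa = 0.6, 1,000 respondents.
table = recover_table(0.5, 0.5, 0.6, 1000)
```

Because the published inputs are rounded, any table recovered this way inherits that rounding error.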
The survey questions include items on use of tobacco, alcohol, and illegal drugs; sexual intercourse; symptoms of depression and eating disorders; suicide; violence and weapons use; physical activity; and health-preserving behaviors such as wearing seatbelts, helmets, and sunblock and visiting a dentist and doctor. Questions were coded by possible predictors of inconsistent responses. Question topic was the primary predictor of interest, so questions were coded for whether they concerned deviant, illegal, or stigmatized behavior (43), including sexual intercourse; illegal drugs; alcohol and tobacco; perpetrating a violent crime; being a victim of a crime; and history of suicide, depression, and eating disorders. Other potential predictors were question time frame because memories of more recent events are more accurate (43,44), and true change is more likely for questions about short time frames; readability, including question word count, number of response choices, whether the previous question concerned a different topic (whether the question was preceded by a transition sentence), concerned a different time frame, or had different answer choices; and whether the question was dichotomized from multiple response choices because dichotomization may artificially lower agreement because of loss of information.
The convenience sample was compared with the nationally representative sample in the YRBS by computing z scores of time 1 and 1999 YRBS (45)

Data analysis
Each question was evaluated for prevalence change and 3 measures of unreliability. These measures were inconsistency (retraction and apparent initiation), agreement measured as tetrachoric correlation (TCC), and estimated error due to inconsistency measured as a Bayesian estimate of the standard error due to inconsistent reporting.
First, prevalence change was assessed by the McNemar test. The earlier analysis of these data (40) compared 95% confidence intervals for prevalence constructed with sampling error under the assumption that the 2 observations were independent, which biases results toward finding no difference between groups since independent observations have a higher standard error than nonindependent, repeated observations from the same individuals.
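A minimal sketch of a McNemar test on the discordant cells of a test-retest table follows. It assumes the uncorrected chi-square form with 1 degree of freedom; the text does not say whether a continuity correction or an exact version was used, so that choice, like the function name, is an assumption for illustration.

```python
import math


def mcnemar(yes_no, no_yes):
    """Uncorrected McNemar chi-square test for paired yes/no responses.

    yes_no: respondents answering "yes" at wave 1 and "no" at wave 2.
    no_yes: respondents answering "no" at wave 1 and "yes" at wave 2.
    Returns (statistic, p_value).
    """
    discordant = yes_no + no_yes
    if discordant == 0:
        return 0.0, 1.0  # no changed answers, so no evidence of change
    statistic = (yes_no - no_yes) ** 2 / discordant
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: 30 retractions and 10 apparent initiations.
stat, p = mcnemar(30, 10)
```

Only respondents who changed their answers enter the statistic, which is why the test respects the pairing of the 2 waves rather than treating them as independent samples.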
Second, inconsistency was measured as absolute and relative retraction and as absolute and relative initiation. These measures provide easily interpretable means to compare observed inconsistency with inconsistency expected from chance. Absolute retraction is the proportion of the sample contradicting an earlier reported behavior: an affirmative answer followed by a negative. Relative retraction is the proportion of those who initially reported the behavior and subsequently retract their report: absolute retraction divided by wave 1 prevalence. Absolute apparent initiation is the proportion of the sample that appears to initiate the behavior between waves by reporting the behavior at wave 2 but not at wave 1. Finally, relative initiation is the proportion of wave 2 endorsers who did not report the behavior at wave 1: absolute initiation divided by wave 2 prevalence. Retraction and initiation depend on prevalence: absolute retraction and initiation are bounded from above by the prevalence of the risk behaviors; rare behaviors have more variable relative retraction and initiation because the denominator is small.
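The four inconsistency measures defined above can be computed directly from the cells of a 2 x 2 test-retest table. The sketch below is illustrative only; the function name, dictionary keys, and example counts are hypothetical.

```python
def inconsistency_measures(yes_yes, yes_no, no_yes, no_no):
    """Absolute and relative retraction and apparent initiation.

    Arguments are counts of (wave 1, wave 2) answers, for example
    yes_no = reported the behavior at wave 1 and denied it at wave 2.
    """
    n = yes_yes + yes_no + no_yes + no_no
    wave1_yes = yes_yes + yes_no   # wave 1 endorsers
    wave2_yes = yes_yes + no_yes   # wave 2 endorsers
    return {
        "absolute_retraction": yes_no / n,
        # Share of wave 1 endorsers who denied the behavior at wave 2.
        "relative_retraction": yes_no / wave1_yes if wave1_yes else float("nan"),
        "absolute_initiation": no_yes / n,
        # Share of wave 2 endorsers who had not reported it at wave 1.
        "relative_initiation": no_yes / wave2_yes if wave2_yes else float("nan"),
    }


# Hypothetical item reported by 10% of 1,000 respondents at each wave.
measures = inconsistency_measures(yes_yes=60, yes_no=40, no_yes=40, no_no=860)
```

In this made-up example, prevalence is identical at both waves, yet 40% of wave 1 endorsers retract, which shows how stable prevalence can mask high inconsistency.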
Third, agreement was measured by using TCC instead of the more common agreement measure kappa, used in the original analysis (40). TCC is constructed to be independent of prevalence, so rare and common behaviors may be compared on the same scale (46-51) and low agreement cannot be attributed to either low prevalence or prevalence change between waves. TCC can be interpreted as conventional correlation, with 0.0 chance agreement and 1.0 perfect agreement. TCC can also adjust for potential differences in response tendency by wave, such as if adolescents redefine risk behaviors on retest (26); TCC is high if the primary response tendency difference is a shift. TCC is computed with standard error by the maximum likelihood method in the R statistical package (52).
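To make the tetrachoric correlation concrete, the following sketch estimates it for a single 2 x 2 table using only the Python standard library. It is not the R routine cited above. It assumes each yes/no answer arises from thresholding a latent standard normal variable, fixes the 2 thresholds at the observed marginal proportions, and then finds by bisection the correlation whose bivariate normal probability matches the observed "yes at both waves" proportion. The bivariate normal probability is obtained from the identity that it equals the product of the marginal probabilities plus the integral of the bivariate density over the correlation from 0 to rho, evaluated here with Simpson's rule. All function names are hypothetical, no standard error is computed, and both marginal proportions must lie strictly between 0 and 1.

```python
import math
from statistics import NormalDist


def _bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    q = 1.0 - r * r
    exponent = -(h * h - 2.0 * r * h * k + k * k) / (2.0 * q)
    return math.exp(exponent) / (2.0 * math.pi * math.sqrt(q))


def _bvn_cdf(h, k, rho, steps=400):
    """P(X <= h, Y <= k) for a standard bivariate normal with correlation rho."""
    nd = NormalDist()
    base = nd.cdf(h) * nd.cdf(k)        # value under independence (rho = 0)
    if rho == 0.0:
        return base
    width = rho / steps                 # signed step; steps must be even
    total = _bvn_density(h, k, 0.0) + _bvn_density(h, k, rho)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2      # Simpson's rule weights
        total += weight * _bvn_density(h, k, i * width)
    return base + total * width / 3.0


def tetrachoric(yes_yes, yes_no, no_yes, no_no):
    """Tetrachoric correlation for a 2x2 table of (wave 1, wave 2) counts."""
    n = yes_yes + yes_no + no_yes + no_no
    nd = NormalDist()
    h = nd.inv_cdf((yes_yes + yes_no) / n)   # wave 1 threshold
    k = nd.inv_cdf((yes_yes + no_yes) / n)   # wave 2 threshold
    target = yes_yes / n
    lo, hi = -0.999, 0.999
    for _ in range(60):                      # the probability increases with rho
        mid = (lo + hi) / 2.0
        if _bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical table with 50% prevalence at both waves.
rho = tetrachoric(200, 100, 100, 200)
```

With both thresholds at 0, the joint probability is 1/4 + arcsin(rho)/(2*pi), so a "yes at both waves" proportion of 1/3 corresponds to a correlation of 0.5, which gives a convenient check on the numerical routine.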
Predictors of agreement were found in 2 ways: comparing mean TCC by category and through linear regression. Past results suggest that adolescents are more likely to misreport sensitive or unusual behaviors (20,24), so behavior category was considered the primary predictor of agreement, especially behaviors with higher levels of inconsistency in past research: sex and tobacco, alcohol, and drug use. The mean TCCs of the categories were compared by using the Tukey test for honest significant difference. The linear regression had outcome variable TCC, and the model was built beginning with question topic as predictors and by also considering time frame and the question characteristics described above. If agreement was due to memory or true change, agreement would be associated with time frame, tested in 2 ways: by including time frame as a predictor in regression on TCC and comparing TCC for the same risk behavior by time frame.
Fourth, error due to inconsistency was estimated as a standard error multiplier derived from a Bayesian simulation model (53-55), which is described in the Web Appendix (this supplementary material is posted on the Journal's website (http://aje.oupjournals.org/)). Estimated error due to inconsistency is another method of quantifying the impact of inconsistency on adolescent risk behavior surveillance. Error is derived from a model that makes different assumptions than TCC, but both are independent of prevalence. Regressions were replicated by using this estimated error multiplier as an outcome variable.

Prevalence change
The prevalence of 41 of the 72 behaviors changed in a 2-week interval (Table 2), some in logically impossible directions. No change was expected between waves 1 and 2 in respondents' reports of their behavior prior to age 13 years because all respondents were older than that, yet more respondents reported having used cigarettes (P < 0.0001) and marijuana (P < 0.05), and fewer reported sexual intercourse (P < 0.0001). No decrease was expected in reported lifetime prevalence, but fewer respondents reported having ever used cigarettes (P < 0.0001), alcohol (P < 0.0001), and marijuana (P < 0.01) and having had 4 or more lifetime sexual partners (P < 0.01). Prevalence change was not associated with any question characteristics in 2 logistic regressions and 1 linear regression.

Inconsistency
Even when prevalence does not change, inconsistency may be high. For example, the proportion of respondents reporting pregnancy history (having ever been pregnant or making another person pregnant) was about 8%-9% at both waves. Although prevalence did not change, 45.3% of those initially reporting pregnancy retracted their report 2 weeks later. Furthermore, 42.7% of pregnancies reported at wave 2 seem to have occurred in the 2 weeks between surveys because these pregnancies were not reported at wave 1.
Median relative retraction for the 72 questions was 27% (interquartile range (IQR) = 19.5-38.2); that is, on average 27% of those reporting a behavior at wave 1 denied the behavior at wave 2. Median relative initiation was 28% (IQR = 19.3-44.2); on average 28% of those reporting a behavior at wave 2 had not reported it at wave 1, as if it were initiated in the 2-week interval between surveys.
No retraction and moderate initiation were expected for the 15 items concerning whether respondents engaged in the behaviors in their lifetimes, but median relative retraction was 23.7% (IQR = 11.9-32.7) and median relative apparent initiation was 28.7% (IQR = 15.5-38.8). That is, almost a quarter of those reporting having ever engaged in a behavior at wave 1 denied the behavior at wave 2, and about a quarter of those reporting the behavior at wave 2 had not reported the behavior 2 weeks earlier at wave 1, as if they had initiated the behavior in the interim. No retraction or initiation was expected for items about behavior before age 13 years because all respondents were older than that; however, for the 4 items, 23.3% of respondents at median retracted and 27.7% of respondents at median apparently initiated. Variation regarding rare behaviors may be larger, but when analysis was restricted to the 13 of 19 lifetime and pre-age-13-years behaviors with a prevalence of 10%-90%, 18.4% of respondents retracted (IQR = 6.9-26.4) and 19.4% of respondents apparently initiated (IQR = 19.4-26.3).
Retraction regarding items about the past year only was expected for respondents who performed a behavior 50-52 weeks before wave 1 and not in the 2 weeks between waves. The behavior changes of such respondents would produce a relative retraction of 2/54 (3.7%) and relative initiation of 3.7%, assuming the behavior had a uniform distribution. Relative retraction and initiation for nearly all (17 of 18) questions about the past year were higher than the levels expected if changed reports were due to true change.
Weight control behaviors had the largest retraction rate. More than 20% of those initially reporting that they consider themselves overweight, are trying to lose weight, or exercise and diet to lose weight retracted these reports 2 weeks later; and more than 50% of those initially reporting that they fast, vomit, and take diet pills retracted these reports 2 weeks later. Apparent initiation of these behaviors was similarly high.

Agreement
Agreement, measured by TCC, was high and left skewed (median = 0.87, IQR = 0.80-0.92) (Table 2). The questions with the highest agreement (TCC = 0.99) involved having ever had sex and having used marijuana. Other questions in the top quartile of agreement (TCC > 0.92) included 3 of the 4 items on marijuana; 7 of the 13 items on smoking; having ever used alcohol, cocaine, and methamphetamines; and 2 of the 4 items on suicide. Questions in the bottom quartile of agreement (TCC < 0.80) included having been taught about AIDS or HIV infection in school (TCC = 0.45, an outlier), 6 of the 7 weight control items, and having seen a doctor when not sick. Agreement regarding the remaining weight control question, whether the respondent considers himself or herself to be overweight, was close to the bottom quartile (TCC = 0.82).
Average agreement (TCC) for the topics of tobacco, alcohol, and drugs was significantly higher than for weight control and miscellaneous topics (doctor, dentist, sunscreen, and HIV education) when Tukey's honest significant difference was used. Agreement for the topic of depression was higher than for the miscellaneous topic and was marginally higher than for weight control/physical activity.
Agreement (TCC) for questions on sexual intercourse and on tobacco, alcohol, and illegal drug use was substantially higher than average, and agreement for questions on weight control was substantially lower than average (Table 3). Agreement was not associated with time frame, question length, or other topics and was marginally lower for questions for which the answer choices had been dichotomized from a multi-item scale (P = 0.07, not shown). As expected from its derivation, TCC was not associated with prevalence.
For risk behaviors asked about in multiple items, agreement varied by question time frame within the same risk behavior. Agreement was higher for longer time periods: higher for lifetime than for the past 30 days regarding all 6 risk behaviors for which both time frames were asked; higher for lifetime than for before age 13 years for all 4; higher for the past 30 days than for the past 30 days at school for all 3; higher for the past 30 days than for before age 13 years for all 4; and higher for before age 13 years than for the past 30 days at school for 2 of the 3 risk behaviors (Table 4). Additional discussion of the relative levels of agreement (TCC) between questions can be found in the Web Appendix.
c Absolute retraction is the proportion of all respondents reporting behavior at wave 1 and denying it at wave 2. Relative retraction is the proportion of respondents reporting the behavior at wave 1 who denied the behavior at wave 2.
d Absolute initiation is the proportion of all respondents newly reporting the behavior at wave 2, having not reported the behavior at wave 1. Relative initiation is the proportion of those who reported the behavior at wave 2 who apparently initiated the behavior and had not reported the behavior at wave 1.
e TCC measures average agreement between wave 1 and wave 2 responses, with 0.0 representing chance agreement and 1.0 perfect agreement.
f The standard error multiplier (SE) is estimated from the Bayesian model as the factor by which the usual standard error calculation underestimates total error, including inconsistency.
Intertemporal Consistency of Adolescent Self-Report. Am J Epidemiol 2009;169:1388-1397
The Bayesian simulation model estimated that unreliable data increased standard error at median by a factor of 3 (Table 2); that is, confidence intervals that account for measurement error due to respondents' inconsistent reporting would be at median 3 times as wide as usual confidence intervals. The questions with the lowest error concerned sexual intercourse (standard error multiplier = 1.6), marijuana use (standard error multiplier = 1.7), and smoking cigarettes (standard error multiplier = 2.0-2.1). The questions with the highest error were those related to fasting to lose weight in the past month (standard error multiplier = 4.9), ever being taught about HIV in school (standard error multiplier = 5.1), and rarely/never wearing a motorcycle helmet when riding a motorcycle in the past month (standard error multiplier = 5.6). As found in regressions using TCC as an outcome, sex, drug, alcohol, and tobacco items had lower error than other topics, and weight had higher error (not shown). As expected, the standard error multiplier was not associated with prevalence but was significantly associated with relative retraction.
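As a rough illustration of what a standard error multiplier implies for interval width, the sketch below scales a simple binomial standard error by the multiplier before forming a 95% confidence interval. The simple random sampling formula, the example prevalence of 30%, and the function name are assumptions for this example; the study's cluster sample and the Bayesian model itself are not reproduced.

```python
import math


def confidence_interval(p, n, se_multiplier=1.0, z=1.96):
    """Normal-approximation interval for a proportion, widened by a multiplier.

    p: estimated prevalence (0-1), n: sample size,
    se_multiplier: factor by which the usual standard error is inflated.
    """
    usual_se = math.sqrt(p * (1.0 - p) / n)   # assumes simple random sampling
    total_se = usual_se * se_multiplier
    return p - z * total_se, p + z * total_se


# Hypothetical 30% prevalence among 4,619 respondents.
usual = confidence_interval(0.30, 4619)
widened = confidence_interval(0.30, 4619, se_multiplier=3.0)
width_ratio = (widened[1] - widened[0]) / (usual[1] - usual[0])
```

Because the multiplier scales the standard error linearly, a multiplier of 3 yields an interval 3 times as wide, whatever the prevalence or sample size.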

DISCUSSION
In a 2-week period, adolescents' reports of their sex, drug, alcohol, and tobacco histories were more reliable than their reports of other behaviors; by contrast, in longer intervals, these behaviors were reported much less reliably than other behaviors. Most sex, drug, alcohol, and tobacco items had stable prevalence estimates, higher average agreement, and lower estimated error than other YRBS topics. In short periods, these behaviors may be reported consistently because the behaviors have high identity salience and few adolescents change identities in a 2-week period. In longer periods, these items may be reported less consistently because adolescents report past sex and substance use in accordance with their current identities (24,26,36), so adolescents who change their identities and habits will report inconsistently (20,24).
The validity of adolescent weight control items is particularly critical because of increased investment in obesity research and reliance on annual surveys for surveillance, but adolescents report their weight control behaviors more unreliably than any other behavior. Compared with any other topic, most weight control items had unstable prevalence estimates, lower average agreement (weight control was the only category for which agreement was low for all questions), and higher estimated error.
Adolescents change reporting of their past behaviors as their present behaviors change (24,26,36). If adolescents changed their weight control behaviors more frequently than the 1-month time frame of the weight control questions, low agreement on reports of weight control may be due to adolescents' reporting their most recent behavior rather than their past-month behavior. Low agreement may also be due to changed inhibitions about reporting weight control behaviors on retest, which would be consistent with past findings that adolescents underreport both healthy and unhealthy weight control behaviors, including vomiting to lose weight and dieting to lose weight, in interviews compared with self-administered surveys (37). However, no trend was evident regarding how inhibitions to report weight control might change on retest: more adolescents reported that they consider themselves overweight and are trying to lose weight, but fewer adolescents reported exercise, diet, and fasting to lose weight, and the proportion reporting vomiting or using pills did not change. The first explanation seems more likely: adolescents may change their weight control behaviors more frequently than a question about the past month can capture accurately. Questions about weight control practices may yield more accurate responses if phrased in terms of a more recent time period, such as the past week, as dietary intake questions are currently formatted, with repeated measures necessary for longer-term surveillance.
a TCC measures agreement between wave 1 and wave 2 responses, with 0.0 representing chance agreement and 1.0 perfect agreement. Averages were found in linear regression. R² = 0.32.
Inconsistent responses increase measurement error in proportion to the level of inconsistency. No pattern was evident in the direction or magnitude of prevalence changes from attempted regressions, so prevalence changes may be another manifestation of this measurement error. The error regarding prevalence of an inconsistent behavior such as exercising to lose weight (error multiplier = 4.0) was underestimated by twice as much in magnitude as that for consistent behaviors such as smoking cigarettes (error multiplier = 2.0). This study does not advocate that confidence intervals be constructed to account for all measurement error including inconsistency. Researchers should nonetheless be aware of the limitations of their instruments, as survey experts have advocated (44,56). For example, borderline-significant findings for items with higher estimated measurement error may be attributable to that error.

Limitations
The hypothesis advanced in this paper about high consistency in short intervals being due to the identity salience of these behaviors to adolescents is a post hoc explanation, but it is plausible because identity is thought to be related to inconsistency during long intervals. The identity salience hypothesis could be studied systematically by using the complete data to find associations between inconsistency and gender, grade, race/ethnicity, and age. This analysis was limited to contingency tables, however, because complete data are not available publicly. Because of a lack of access to full data, this study also could not determine whether inconsistency is a property of the individual, with some individuals more likely to be inconsistent, or the question, with inconsistency correlated among related questions, vital information for improving YRBS validity.
Dichotomization of questions with multilevel categorical responses may have artificially lowered agreement because of loss of information. With full data, agreement on these items could be measured by polychoric correlation, a generalized version of TCC (46,47,50,51). In addition, not all questions were included in the original publication, such as those regarding contraception and substance use during sexual behavior (40).
Another limitation is that the geographically diverse convenience sample is not nationally representative. In addition, this sample is somewhat less likely to engage in risk behaviors compared with the nationally representative YRBS sample.
The Bayesian simulation model for estimating error due to inconsistency was underidentified: there are 3 degrees of freedom in the data to estimate 7 parameters, so many combinations of the 7 parameters could create the observed data, but the use of priors for 4 parameters (sensitivity and specificity for each of the 2 waves) restricted the problem. The estimates of all parameters were stable, so it can be concluded that the priors restricted the problem sufficiently that underidentification was not a major concern.

Comparison with earlier analysis
This study replicates some findings of the original analysis of these data, and it adds others. Brener et al. (40) conducted the original data collection rigorously, analyzed the data thoroughly, and explored some of the same issues as those discussed in this paper but, in a few instances, used ambiguous or inappropriate measures. As in the original analysis (40), this study found substance use and sex to be the most consistent topics, no statistically significant consistency difference across all questions by question time frame, and some instances in which inconsistency could be due to true change.
This study is distinct from the earlier analysis (40). It used a more appropriate test for prevalence change, used a less ambiguous measure of agreement so that low agreement could not be attributed to low prevalence, quantified the impact of inconsistency on measurement error, and found a lack of reliability regarding weight control questions and proposed a potential solution.

Conclusions
Adolescents reported sex and substance use consistently in a 2-week interval, but they reported weight control less consistently than any other risk behaviors. This inconsistency is especially problematic because adolescent obesity is a central public health issue and is potentially more dangerous to adolescents' future health than are other risk behaviors. Revising YRBS weight control questions to encompass a shorter time period may allow more accurate surveillance of adolescents' self-initiated weight control and physical activity. Future survey validity research could examine alternatives to current YRBS weight control questions. In the meantime, researchers should be aware of limitations of the current YRBS data, especially regarding weight control.
Survey report consistency may be connected to adolescents' identities. In short periods, adolescents present their sex and substance use consistently, but, in long periods, adolescents may change their social affiliations and these behaviors and thus report inconsistently.

Table 1. Demographic Characteristics (%) a of Test-Retest Survey Respondents (in 2000) vs. a Nationally Representative 1999 YRBS Sample, United States b
Abbreviation: YRBS, Youth Risk Behavior Survey.
a Percentages may not add to 100 because of rounding.

Table 2. Two-Week Response Consistency in the YRBS Reliability Study, 2000, United States (n = 4,619) a

b Comparison of prevalence at waves 1 and 2 is from the McNemar test.

Table 4. Agreement (Tetrachoric Correlation) by Time Frame, YRBS, 2000, United States a
Abbreviation: YRBS, Youth Risk Behavior Survey.
a Standard errors, in parentheses, were computed by using the maximum likelihood estimator in the polychoric correlation library for the R statistical program (52).
b Sex is not reported for the past 30 days, but rather for the past 3 months.

Presented at the 64th Annual Conference of the American Association for Public Opinion Research, Hollywood, Florida, May 14-17, 2009; Federal Committee on Statistical Methodology, Washington, DC, November 8-10, 2007; 134th Annual Meeting of the American Public Health Association, Boston, Massachusetts, November 4-8, 2006; and the American Statistical Association's Joint Statistical Meeting, Seattle, Washington, August 6-10, 2006.
Conflict of interest: none declared.