Test-Retest Reliability of a Sexual Behavior Interview for Men Residing in Brazil, Mexico, and the United States: The HPV in Men (HIM) Study

Understanding the natural history of sexually transmitted infections requires the collection of data on sexual behavior. However, there is concern that self-reported information on sexual behavior may not be valid, especially if study participants are culturally and linguistically distinct. The authors completed a test-retest reliability study of 1,069 men recruited in Brazil, Mexico, and the United States in 2005 and 2006. All of the men completed the same computer-assisted self-interview approximately 3 weeks apart. Refusal rates, kappa coefficients, and intraclass correlation coefficients were calculated for the full sample and by country, age, and lifetime number of female sex partners. Reliability coefficients for each study site and the combined population were high for almost all questions. With few exceptions, the authors found high test-retest reliability with a computer-assisted self-interview on sexual behavior used in 3 culturally and linguistically distinct countries.

data collection; internationality; men; questionnaires; reproducibility of results; sexual behavior
Abbreviations: CASI, computer-assisted self-interviewing; HPV, human papillomavirus; ICC, intraclass correlation coefficient.
Understanding the natural history of sexually transmitted infections and related disease requires the collection of data on sexual behavior. However, there is concern that study participants' self-reports on sexual behavior may not be valid (1-6) because of measurement error from several sources, including the demands of the recall task and the survey method (7). In addition, measuring sexual behaviors with a common instrument across multiple countries may pose a threat to data quality (8), because such situations are affected not only by the survey method and burdens placed on the participant but also by differing population characteristics, for example, social attitudes toward disclosing sexual behavior (7,9-11).
While validating human behavioral surveys, including sexual behavior surveys, is difficult, test-retest studies can be used to assess their reliability. These studies assess the consistency of participant responses between 2 time points (4). High consistency does not ensure validity of data, but low consistency can highlight potentially invalid data (7). In other words, reliability is necessary for validity but not sufficient (12).
We are not aware of studies that have assessed CASI reliability in the context of cross-national populations. The objective of the current study was to assess test-retest reliability for a non-audio CASI that collected information on sexual health history and sexual behavior in 3 languages from men recruited in 3 different countries.

Study population
Beginning in March 2005, men were recruited in Brazil (São Paulo), Mexico (Cuernavaca), and the United States (Tampa, Florida) for the HPV in Men (HIM) Study, a cohort study of the natural history of anogenital human papillomavirus (HPV). Men were enrolled if they were between the ages of 18 and 70 years; resided in one of the targeted recruitment areas; had no prior anal cancer, penile cancer, or genital warts; had no current diagnosis of a sexually transmitted disease, including human immunodeficiency virus; had no history of imprisonment, homelessness, or drug treatment in the prior 6 months; and were willing to engage in study visits every 6 months for 4 years. Additional details on the study design and population have been previously published (36,37).
In Brazil, men were recruited from a large clinic in São Paulo that tests for human immunodeficiency virus and sexually transmitted diseases and from the general population through radio and print advertisements. In Mexico, men were recruited through a large health plan in the state of Morelos. In the United States, men were recruited from a large university campus and the general community in Tampa, Florida. Participants received a nominal monetary incentive for their participation. Men found to be illiterate or innumerate during the consenting or interview process were removed from analysis for the current CASI reliability study. All enrolled participants consented to the study protocol, which was approved by the human subjects protection committee of the Ludwig Institute for Cancer Research in Brazil; the ethical committee of the Center for Sexually Transmitted Diseases and AIDS in São Paulo; the National Institute of Public Health of Mexico; and the University of South Florida.
Recruited in 2005 and 2006, the first 1,069 men to complete their run-in (test) and baseline (retest) visits by CASI comprised the participants in the current reliability study. Age varied by study site, with the median age of participants in Brazil and Mexico (33 years in both countries) being higher than that of participants recruited in the United States (23 years). As expected, racial and ethnic characteristics also varied by study site. Overall, approximately one-half (50.8%) of participants reported a nonwhite race, while 41.4% reported a Hispanic ethnicity. Other population characteristics are provided in Table 1.

Procedure
Men expressing interest in the study came to the clinic for an initial visit. After consenting to the research and receiving instructions for using the CASI, participants completed the self-interview and then were sampled at anogenital sites for HPV. The CASI was written in the primary language of the region (Portuguese, Spanish, or English) and elicited information about participant demographic characteristics, substance use, sexual health history, and sexual behaviors implicated in the transmission of HPV. The men completed an identical CASI retest approximately 3 weeks later (the median test-retest interval was 21 days in Brazil, 25 days in Mexico, and 16 days in the United States). A Kruskal-Wallis test confirmed a statistically significant difference in test-retest interval by site (P < 0.0001). Per the protocol, men were not counseled or educated about HPV at either the test or the retest, although the informed consent form contained basic information on HPV and staff answered men's impromptu questions. Men did not receive their first HPV test results until 6 months later, at a subsequent clinic visit.

Interview measures
The interview contained 88 items. The majority of the questions had previously been administered to US men in a paper-and-pencil format and generally were found to have excellent reliability (38).
Participants' sexual health was assessed with 18 questions about past sexually transmitted infections, the existence of a current sex partner, circumcision status, and the sexual health histories of their partners (ever having a partner with a sexually transmitted disease, genital warts, or an abnormal Papanicolaou smear). In addition, 45 sexual behavior items assessed incidence and frequency of penetrative sexual behaviors (vaginal, anal, and oral) with women and men; age at first intercourse; number of female and male partners; frequency of condom use for vaginal and anal sex; incidence and frequency of sex with "steady" and casual partners; time since last vaginal sex and anal sex; and history of paying for sex. Participants were asked to recall their frequency of sexual intercourse and numbers of sex partners for varying periods of time, including the last month, the last 3 months, and over a lifetime. Participants could choose to refuse to respond to any question by clicking a "refuse" button. Participants could answer "Don't know" or "Don't remember" for some nominal items (for example, regarding past sexually transmitted disease diagnoses and use of a condom); however, participants were not given the option of answering "Don't know" for interval and ordinal items (for example, regarding their lifetime number of sex partners).
A subset of interview items was selected for assessment of reliability, with preference being given to items for which reliability coefficients would be less affected by the test-retest interval; therefore, items with only a 1-month recall period were not assessed. A total of 38 variables were assessed, including 9 interval, 4 ordinal, and 25 nominal variables. Variables assessed included 14 sexual health history variables and 24 sexual behavior variables.

Data analysis
For each item, reliability coefficients were calculated for each of the 3 study sites. We calculated combined population coefficients by averaging study-site coefficients after weighting them by the inverse of their variances (39). Because age (40,41) and number of sex partners (40,42) are associated with increasing measurement error, reliability coefficients were stratified by age (<30 years vs. ≥30 years) and lifetime number of female sex partners (≤7 partners (median) vs. >7 partners) reported at retest.
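The inverse-variance weighting described above can be illustrated with a minimal sketch. The function name and the site values below are hypothetical, and the sketch pools coefficients on their original scale; whether the study pooled on a transformed scale is not stated in this section.

```python
def pool_by_inverse_variance(estimates, variances):
    """Return the inverse-variance weighted mean and its variance."""
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    # A precise (low-variance) site coefficient receives a large weight.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    return pooled, 1.0 / total_weight


# Hypothetical site-level coefficients and variances (not taken from the study).
site_estimates = [0.50, 0.90, 0.92]
site_variances = [0.0060, 0.0002, 0.0002]

pooled, pooled_variance = pool_by_inverse_variance(site_estimates, site_variances)
simple_mean = sum(site_estimates) / len(site_estimates)
print(round(pooled, 3), round(simple_mean, 3))
```

With these illustrative inputs, the imprecise first coefficient contributes little, so the pooled value stays near 0.90 while the unweighted mean is pulled down to about 0.77.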
For nominal variables, the kappa (κ) statistic was calculated (43). Because the κ statistic can be unstable in situations where there are sparse data (44), κ was not computed for variables where the number of cases or noncases was less than 5 (7). For ordinal variables, a weighted κ statistic was calculated (45) to allow credit for partial agreement. Benchmarks for interpreting κ and weighted κ values followed those of Landis and Koch (46): poor reliability, κ < 0.00; slight reliability, κ = 0.00-0.20; fair reliability, κ = 0.21-0.40; moderate reliability, κ = 0.41-0.60; substantial reliability, κ = 0.61-0.80; and almost perfect reliability, κ ≥ 0.81.
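As an illustration of these statistics, the following sketch computes κ from paired test and retest answers and labels it with the Landis and Koch benchmarks. It assumes linear agreement weights, 1 - |i - j|/(k - 1), for the weighted version; the function names and the response data are hypothetical and are not taken from the study.

```python
def kappa(test, retest, categories, linear_weights=False):
    """Cohen's kappa; with linear_weights=True, a linearly weighted kappa."""
    k = len(categories)
    n = len(test)
    if n == 0 or n != len(retest) or k < 2:
        raise ValueError("need paired responses and at least 2 categories")
    index = {c: i for i, c in enumerate(categories)}
    # Cross-tabulate test by retest as proportions.
    table = [[0.0] * k for _ in range(k)]
    for a, b in zip(test, retest):
        table[index[a]][index[b]] += 1.0 / n
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        if linear_weights:
            return 1.0 - abs(i - j) / (k - 1)  # partial credit for near misses
        return 1.0 if i == j else 0.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(weight(i, j) * table[i][j] for i, j in cells)
    expected = sum(weight(i, j) * row[i] * col[j] for i, j in cells)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def landis_koch(value):
    """Map a kappa value to the Landis and Koch benchmark label."""
    if value < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if value <= upper:
            return label
    return "almost perfect"


# Hypothetical nominal item: 20 yes/yes, 70 no/no, 5 yes/no, 5 no/yes.
test_nominal = ["yes"] * 25 + ["no"] * 75
retest_nominal = ["yes"] * 20 + ["no"] * 5 + ["no"] * 70 + ["yes"] * 5
k_nominal = kappa(test_nominal, retest_nominal, ["yes", "no"])

# Hypothetical ordinal item with two near-miss disagreements.
levels = ["never", "sometimes", "always"]
test_ordinal = ["never"] * 4 + ["sometimes"] * 4 + ["always"] * 2
retest_ordinal = (["never"] * 3 + ["sometimes"] + ["sometimes"] * 3
                  + ["always"] + ["always"] * 2)
k_unweighted = kappa(test_ordinal, retest_ordinal, levels)
k_weighted = kappa(test_ordinal, retest_ordinal, levels, linear_weights=True)
print(round(k_nominal, 3), landis_koch(k_nominal))
print(round(k_unweighted, 3), round(k_weighted, 3))
```

In the ordinal example every disagreement is between adjacent categories, so the weighted coefficient is higher than the unweighted one, which is the partial credit the weighting is meant to give.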
Interval variables were assessed using the intraclass correlation coefficient (ICC) (47). All ICCs created using nonnormal variables were transformed using Fisher's z transformation before calculation of confidence intervals (48). Confidence intervals were then transformed back to the original scale. ICCs approaching 1.0 indicate high test-retest reliability.
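The following is a minimal sketch of an ICC with a Fisher z confidence interval. It assumes a one-way random-effects ICC for two measurements per participant and a standard error of 1/sqrt(n - 3) on the z scale (borrowed from Pearson's correlation); the ICC form and variance formula actually used in the study (47,48) are not given in this section, and the function names and data are hypothetical.

```python
import math


def icc_oneway(test, retest):
    """One-way random-effects ICC for paired (test, retest) measurements."""
    n = len(test)
    if n < 2 or n != len(retest):
        raise ValueError("need at least 2 paired measurements")
    subject_means = [(a + b) / 2.0 for a, b in zip(test, retest)]
    grand_mean = sum(subject_means) / n
    # Between-subject and within-subject mean squares (k = 2 ratings each).
    ms_between = 2.0 * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum((a - m) ** 2 + (b - m) ** 2
                    for a, b, m in zip(test, retest, subject_means)) / n
    return (ms_between - ms_within) / (ms_between + ms_within)


def fisher_z_interval(r, n, z_crit=1.96):
    """Confidence interval built on the z scale, then back-transformed."""
    # math.atanh raises ValueError when r is exactly 1 or -1.
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)  # assumed standard error, see lead-in
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical answers to an interval item at test and retest.
test_values = [1, 2, 3, 4, 5, 6, 7, 8]
retest_values = [1, 2, 4, 4, 5, 7, 7, 9]

icc = icc_oneway(test_values, retest_values)
lower, upper = fisher_z_interval(icc, len(test_values))
print(round(icc, 3), round(lower, 3), round(upper, 3))
```

Because the interval is symmetric on the z scale and then mapped back with tanh, its limits remain below 1.0 and are asymmetric around the ICC on the original scale.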
During exploratory analysis, extreme outliers were identified in 2 variables: number of different female sex partners in the past 3 months (a value of 11,111,109,632 on both test and retest) and age at first sexual intercourse with women (a value of 1,993 on test). Each observation was removed prior to analyses. Subsequent text regarding outliers identified in scatterplots does not include these 2 observations.
Refusal rates were assessed. Refusals were not included in reliability coefficient calculations.

RESULTS
With exceptions for skip patterns, participants at each study site answered virtually all of the 38 questions under study. The average refusal rate on retest for all questions was 1.0% in Mexico, 1.3% in the United States, and 2.5% in Brazil (data not shown).
For all nominal and ordinal questions, κ and weighted κ reliability coefficients for each study site and the combined population were almost perfect (κ ≥ 0.81) or substantial (κ = 0.61-0.80). Table 2 provides coefficients for 18 items for which reliability was less than 0.81.
Site-specific ICCs for all interval questions were 0.85 or more, with the exception of ICCs in Brazil and Mexico for 3 items asking men to report their numbers of sex partners. Scatterplots identified several extreme outliers in the bivariate distributions of all 3 items. Specifically, for the variable "number of sex partners other than a 'steady' partner in the past 3 months," when 1 outlying participant in the Mexico sample was removed, the Mexico ICC increased from 0.61 to 0.84. For the same item, when 2 outlying participants in the Brazil sample were removed, the Brazil ICC increased from 0.10 to 0.79. For "lifetime number of male anal sex partners," when 1 outlier identified in the Brazil scatterplot was removed from the data set, the ICC for Brazil increased from 0.50 to 0.99. For the variable "number of different female sex partners in the past 3 months," when 1 outlier in the Brazil scatterplot was removed, the ICC for Brazil increased from 0.58 to 0.92.
After taking into account these outliers, test-retest reliability was generally high and consistent across sites: Reliability coefficients differed by no more than 17 percentage points among study sites.
All nominal and ordinal items had substantial or almost perfect reliability regardless of participant age. All interval variables had ICC scores greater than or equal to 0.85 for both age groups, with the exception of the older men's answers to the same 3 interval variables as those discussed above: lifetime number of male anal sex partners, number of different female sex partners in the past 3 months, and number of sex partners other than a "steady" partner in the past 3 months (Table 3). After removing the outliers discussed above (all of which involved men over age 32 years), the ICCs for these 3 questions increased to 0.84 or more.
Reliability coefficients were also stratified by lifetime number of female sex partners. Whether men reported numbers of partners above or below the median number of 7, reliability coefficients for all nominal and ordinal variables were substantial or almost perfect, except for 2: ever having vaginal, anal, or oral sex (κ = 0.39 for men with >7 partners) and ever paying a man for sex (κ = 0.54 for men with ≤7 partners) (Table 3). Two interval variables also showed lower reliability: lifetime number of male anal sex partners (ICC = 0.50 for men with ≤7 partners) and number of sex partners other than a "steady" partner in the past 3 months (ICC = 0.29 for men with >7 partners). Removal of the outliers identified above increased the ICC for these 2 interval variables to 0.85 or more.

DISCUSSION
In this test-retest reliability study of a CASI instrument, 1,069 men in Brazil, Mexico, and the United States were asked the same questions on sexual health history and sexual behavior at test and retest. For each study site and the 3 study sites combined, κ and weighted κ reliability coefficients for nominal and ordinal questions, respectively, were substantial (0.61-0.80) or almost perfect (≥0.81). However, while the combined population ICC scores for all interval variables were greater than or equal to 0.85, the study site-specific reliability of 3 interval variables was of concern: lifetime number of male anal sex partners, number of different female sex partners in the past 3 months, and number of sex partners other than a "steady" partner in the past 3 months. The apparently poorer reliability of these variables was due to the presence of a small number of outliers identified in scatterplots. When these outliers were removed, study site-specific ICCs increased to 0.79 or more. These outlying observations also distorted the ICCs for several interval variables after results were stratified by age and median lifetime number of female sex partners.
The ability of a small number of observations to distort a reliability coefficient is discussed in the literature (7,49); however, to our knowledge, the impact on reliability in practice has rarely been described (50,51). In the current study, 1 or 2 participants with highly discrepant test and retest answers were able to increase the variance of an item by more than 2 orders of magnitude. For example, 1 Brazilian participant reported 2,000 lifetime male anal sex partners at test and only 20 such partners at retest. Removing this individual decreased the variance in the item from 0.00602 to 0.00002 and increased the ICC from 0.50 to 0.98. Such scenarios also underscore the importance of weighting by the inverse of the variance when combining coefficients in order to reduce the contribution of less reliable coefficients.
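The sensitivity of the ICC to a single discrepant pair can be reproduced with synthetic data. The sketch below assumes a one-way random-effects ICC and invents 30 consistent participants plus one participant who reports 2,000 at test and 20 at retest, as in the example above; it is not the study data and does not reproduce the reported coefficients.

```python
def icc_oneway(test, retest):
    """One-way random-effects ICC for paired (test, retest) measurements."""
    n = len(test)
    subject_means = [(a + b) / 2.0 for a, b in zip(test, retest)]
    grand_mean = sum(subject_means) / n
    ms_between = 2.0 * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum((a - m) ** 2 + (b - m) ** 2
                    for a, b, m in zip(test, retest, subject_means)) / n
    return (ms_between - ms_within) / (ms_between + ms_within)


# Synthetic sample: 30 participants whose answers agree closely.
base_test = [i % 6 for i in range(30)]
base_retest = [v + (1 if i % 10 == 0 else 0) for i, v in enumerate(base_test)]

# One added participant with highly discrepant answers (2,000 vs. 20).
outlier_test = base_test + [2000]
outlier_retest = base_retest + [20]

icc_without = icc_oneway(base_test, base_retest)
icc_with = icc_oneway(outlier_test, outlier_retest)
print(round(icc_without, 3), round(icc_with, 3))
```

In this synthetic sample the one discrepant pair inflates the within-participant mean square to nearly the size of the between-participant mean square, so the coefficient falls from close to 1 to close to 0 even though the other 30 pairs are unchanged.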
In total, outliers were observed in data reported by 3 out of 338 men in Brazil, 1 out of 327 men in Mexico, and no men out of 404 in the United States. The numbers of outliers in the 3 countries did not differ significantly (P = 0.11). Nevertheless, a higher number of outliers in Brazil may have occurred if the Brazilian men had less lifetime exposure to computer technology. Since a substantially higher percentage of Brazilian participants were aged 45 years or older as compared with Mexican and US participants, it is possible that these older men had less comfort with the technology and therefore had more reporting errors (17,30,32). It is also possible that a higher number of outliers for Brazil, in comparison with Mexico or the United States, may have occurred if this cross-national instrument was less culturally appropriate for Brazilian men. The questionnaire from which the current study's CASI was created was developed in the southwestern United States near the Mexican border. Creation of the questionnaire in this region may have led to an instrument that was somewhat more culturally appropriate for US and Mexican participants and less so for Brazilian participants. If this was the case, this may also account for the somewhat higher rate of question refusal among Brazilians. A review of the 4 outlier participants' responses for all interview items revealed that 2 of the Brazilian men also gave illogical answers to a number of other questions, while the remaining 2 outliers' interviews were generally unremarkable.
Men found to be illiterate or innumerate during the consenting and interviewing procedures were not included in the current CASI study; however, staff may not have been able to identify all of the persons whose limited literacy increased their risk of providing unreliable responses. Reliability can also be affected by the number of days between test and retest (4). However, the outliers in Brazil had test-retest intervals of less than 22 days. In addition, reliability in this study was largely consistent across sites, even though median test-retest intervals varied from 16 days at the US study site to 25 days at the Mexican study site. After stratification by lifetime number of female sex partners, lower reliability was also identified for 2 nominal variables: ever having vaginal, anal, or oral sex (κ = 0.39 for men with >7 partners) and ever paying a man for sex (κ = 0.54 for men with ≤7 partners). For the variable "ever having vaginal, anal, or oral sex," 7 out of 441 men who reported more than the median number of 7 sex partners at retest reported at test that they had never had anal, vaginal, or oral sex. In addition to inviting concern about validity, this item's κ coefficient may have been rendered unstable because of the small number of cases. It is also plausible that the multiple sexual behaviors addressed in the question confused some men. For the second variable, only 6 men with fewer than the median number of sex partners acknowledged ever paying a man for sex, possibly lending instability to the κ coefficient. Variables with sparse data may simply reflect a lack of behavioral heterogeneity in the population or cultural stigmatization attached to certain behaviors (9). During the design of this study, we decided not to report coefficients where the number of cases or noncases was less than 5, in an effort to report only stable coefficients. In future evaluations of test-retest reliability, investigators may wish to consider increasing this minimum requirement.
However, absent the small number of extreme outliers and the presence of sparse data for 2 questions, virtually all coefficients indicated that the CASI interview items under study captured data on these men's sexual behaviors in a highly reliable manner. This result may have been obtained because participants in each country were primarily educated men living in an urban setting. In collecting data on abortions from Mexican women, Lara et al. (32) found audio-CASI more appropriate for urban and educated participants than for rural residents in Mexico. Success with audio-CASI has also been reported from Brazil, where it was not only acceptable to men recruited at a health clinic in Rio de Janeiro but also elicited more reports of sensitive sexual behaviors (24,52).

Table footnotes:
Abbreviations: CI, confidence interval; HPV, human papillomavirus; ICC, intraclass correlation coefficient.
a Omitted are reliability coefficients for 20 items for which all study site coefficients were greater than or equal to 0.81.
b Unless otherwise noted, the total coefficient was derived by averaging the 3 study site coefficients after weighting them by the inverse of the variance of each site, as described by Fleiss (39).
c Number of participants, excluding refusals and missing observations.
d Kappa and ICC coefficients were z-transformed for estimation of confidence intervals, as described by Rosner (48).
e One outlier (a value of 11,111,109,632 on both test and retest) was removed before calculation of ICCs.
f Weighted κ, following the method of Cicchetti and Allison (45).
g The κ value was unstable because of a low number of cases or noncases; likewise, a stable κ value could not be calculated for any study site for "ever being diagnosed with syphilis" and "frequency of condom use for paid anal/vaginal sex."
h Because of instability in the κ value for at least 1 study site for this item, κ was derived directly from the total sample, rather than from the average of the κ coefficients from each study site.
To our knowledge, no studies have evaluated cross-national test-retest reliability for a sexual behavior survey administered by CASI; however, investigators in 3 reliability studies of adults have reported results for audio-CASI in more homogeneous samples (34,53,54). These studies are difficult to compare with the current study because of different study populations, test-retest intervals, or reporting methods. Schlecht et al. (40) assessed the reliability of sexual behavior data collected by face-to-face interview and self-administered questionnaire using pooled, multinational data from 6 studies. In that analysis, reliability suffered substantially when women in different countries reported their lifetime number of sex partners (ICC = 0.08-0.94). In contrast, the current study found almost perfect reliability when this question was asked of men in Brazil, Mexico, and the United States. These heterogeneous results may be due to the use of different survey methods or to the fact that Schlecht et al. assessed the reliability of separate and distinct studies that not only had different survey methods and study protocols but also different and lengthy test-retest intervals (6 weeks to 5 years) (40). Therefore, the current study may have been more suited for a comparison of cross-national reliability, since identical protocols with relatively similar and short test-retest intervals were used at all study sites.
Refusal rates were generally low, which has been found previously with surveys of human sexuality (28,55). However, the question on lifetime number of female sex partners was refused by 11.7% and 12.6% of Brazilian participants on test and retest, respectively. This rate of question refusal may be due to the targeted recruitment of Brazilian men in a clinic setting. Higher refusal rates from participants and less preference for using audio-CASI in a clinic setting, as compared with a face-to-face interview, have been reported previously (14,52). Also noteworthy is that items requiring the participant to provide a numerical, as opposed to a nominal, answer were approximately twice as likely to be refused in Brazil and Mexico as in the United States, where participants refused numerical items and nominal items at about the same rates (data not shown).
This study had limitations. Reliability cannot be used as a surrogate measure for validity, since item reliability is not sufficient for validity. Additionally, because of the targeted recruitment, these results should not be generalized to the entire populations of the 3 study countries.
Comparisons of sexual behavior by country may be helpful in attempts to deliver large-scale programs for the prevention of sexually transmitted diseases; however, if sexual behavior measures are to be used cross-nationally, they should produce reliable data for each locale (8). With few exceptions, we found high reliability using a single CASI instrument in 3 culturally and linguistically distinct countries. While not guaranteeing validity, these results indicate that for the current instrument, use of CASI among men in diverse settings produces reliable data on sexual behavior.