Abstract

Nonprobability samples have gained mass popularity and use in many research circles, including market research and some political research. One justification for the use of nonprobability samples is that low response rate probability surveys have nothing significant to offer over and above a “well built” nonprobability sample. Utilizing an elemental approach, we compare a range of samples, weighting, and modeling procedures in an analysis that evaluates the estimated bias of various cross-tabulations of core demographics. Specifically, we compare a battery of bias related metrics for nonprobability panels, dual-frame telephone samples, and a high-quality in-person sample. Results indicate that there is roughly a linear trend, with nonprobability samples attaining the greatest estimated bias, and the in-person sample, the lowest. Results also indicate that the bias estimates vary widely for the nonprobability samples compared to either the telephone or in-person samples, which themselves tend to have consistently smaller amounts of estimated bias. Specifically, both weighted and unweighted dual-frame telephone samples were found to have about half the estimated bias compared to analogous nonprobability samples. Advanced techniques such as propensity weighting and sample matching did not improve these measures, and in some cases made matters worse. Implications for “fit for purpose” in survey research are discussed given these findings.

Telephone surveys are dead, claim some survey researchers, and should be replaced with nonprobability samples from Internet panels. That Internet panels are less costly is indisputable. At the heart of the question, however, is the resulting data quality from these types of panels. In particular, while the cost is lower, how do the errors from nonprobabilistic Internet panels and samples compare to the errors obtained using common probability-based samples of the day, that is, low response rate RDD samples? Certainly in a fit-for-purpose paradigm one must balance cost with error, but at some point even the cheapest means of data collection will have error properties that are not tolerable, regardless of the price. Probability-based samples are certainly not immune to cost and error considerations either. In recent years we have seen declining response rates for RDD samples, with many such surveys consistently reporting response rates below 10 percent (Pew Research 2012). Such changes have, in fact, led many to now claim that “there is no real random sampling anymore” (Gelman 2012). Of course, at issue is not whether the samples are random (they are), but whether the actual respondents who are acquired from such samples differ in considerable and substantive ways from nonrespondents, and thus whether respondents can themselves still adequately represent the population from which they were sampled. While research from the prior decade finds that telephonic research has been quite resistant to bias stemming from nonresponse (Keeter et al. 2000; Groves 2006; Keeter et al. 2006; Groves and Peytcheva 2008), response rates for studies in that era were about twice as large as they are for similar studies conducted today.

Of interest here is not just the question of the quality of low response rate telephone surveys but also whether and when nonprobabilistic samples, like those obtained through Internet panels, could serve as viable and reliable replacements or alternatives. While the literature on the comparative quality between nonprobability and probability samples is vibrant in European countries (see Callegaro et al. [2014] for an example), the research within the US context is just beginning to emerge.

One of the earliest comparisons in the United States was by Malhotra and Krosnick (2007), who compared the representativeness of the opt-in Internet surveys of YouGov and Harris to the American National Election Study, a randomly sampled face-to-face survey, and the Current Population Survey. Walker and colleagues (2009) extended the comparisons between probability and nonprobability samples to also include comparisons across nonprobability samples obtained from several online panels on behalf of the Online Quality Research Council of the Advertising Research Foundation. Chang and Krosnick (2009) further extended the comparisons of nonprobability samples to include an online probability sample as well as an RDD telephone sample. Yeager et al. (2011) expanded the number of nonprobability samples to seven and compared these to both an online probability sample and an RDD telephone sample. In all of these studies, a general pattern has been consistently reported: more often than not, estimates derived from probability samples were more accurate than those obtained from nonprobability samples across a wide array of political, attitudinal, factual, behavioral, and demographic outcomes. And while the differences in the accuracy of estimates derived using nonprobability and probability samples have been small in some instances, differences as large as 20 percentage points on measures like health behavior and technology use are also not uncommon (see Walker et al. [2009]).

Importantly, nearly all of the research reported thus far employed fairly standard weighting procedures to limit the biases found in nonprobability samples, and in most cases the weighted and unweighted estimates were roughly equal. Indeed, a major effort looking at the impact of weighting (Tourangeau, Conrad, and Couper 2013) found that while standard iterative proportional fitting (raking) procedures sometimes reduce bias, and often significantly (by up to 60 percent), they also sometimes increase bias. Of great concern is the substantial impact of weighting in nonprobability samples, which can alter estimates by as much as 20 percentage points, far more than is typically seen when similar weighting adjustments are applied to probability samples.

Perhaps to improve upon standard weighting methods, the online survey research industry has developed additional methods for reducing biases in nonprobability samples, including propensity modeling and sample matching. Propensity modeling (Rosenbaum and Rubin 1983; Rosenbaum 1987) is a straightforward technique in which data from a nonprobability sample are combined with data from a reference probability sample. Typically, a logistic regression model is built to “predict” the likelihood of belonging to one sample versus the other, with independent variables that include demographics as well as behavioral and attitudinal measures. Nonprobability samples can then be weighted using the inverse of the predicted probability derived from these propensity models, as described in Dever and Shook-Sa (2015).
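The sketch below illustrates this general idea in Python using scikit-learn. It is only a minimal illustration: the stacked data frame, the `in_nonprob` indicator, and the covariate names are hypothetical, and the inverse-probability formulation shown is one common variant of the adjustment described above, not necessarily the exact procedure used in any given study.

```python
# Minimal propensity-weighting sketch (illustrative, not the authors' exact procedure).
# Assumes a stacked data frame `df` containing both samples, a hypothetical 0/1 column
# `in_nonprob` (1 = nonprobability panelist, 0 = probability reference case), and
# categorical covariates such as age group, education, region, and race/ethnicity.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_weights(df, covariates):
    # Dummy-code the categorical covariates for the logistic model.
    X = pd.get_dummies(df[covariates], drop_first=True)
    y = df["in_nonprob"]

    # "Predict" membership in the nonprobability sample versus the reference sample.
    model = LogisticRegression(max_iter=1000).fit(X, y)
    p_nonprob = model.predict_proba(X)[:, 1]

    # Weight each nonprobability case by the inverse of its predicted probability;
    # probability-sample cases keep a weight of 1 in this sketch.
    weights = pd.Series(1.0, index=df.index)
    is_panel = (df["in_nonprob"] == 1).to_numpy()
    weights[is_panel] = 1.0 / p_nonprob[is_panel]
    return weights
```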

Schonlau et al. (2003) made an early effort at propensity modeling, with mixed success: biases were reduced substantially in about 20 percent of the estimates, while the remaining estimates still held substantial biases even after the adjustments were applied. Similar results were found by Duffy et al. (2005), where propensity adjustments were very helpful for vote intention and horserace questions but offered no marked improvement over standard raking adjustments for outcomes related to political participation, immigration policies, and knowledge and attitudes about cholesterol and technology use. The authors also note an additional concern: bias aside, propensity modeling can introduce significant variance into the estimates due to unequal weighting effects (UWEs).1

In addition to propensity modeling, sample matching is a technique sometimes used to reduce biases in nonprobability samples (Rivers 2007). Simply put, sample matching is a technique in which a probability sample is utilized as a gold standard and nonprobability online panelists are matched on a one-to-one (and in some cases many-to-one) basis to members of the probability sample based on a specified number of variables. Typically, these are demographic measures but can include behavioral or attitudinal metrics as well. Note that this procedure occurs before anyone in the online panel is invited to participate in a specific study, the notion being to whittle the pool of nonprobability panelists down to a set that “matches” probability respondents on key matching variables. Unlike propensity weighting, sample matching is not an explicit weighting technique, but a method that essentially attempts to balance a nonprobability sample to a probability sample based on key target variables. Rivers and Bailey (2009) reported a great degree of success in using panel matching in estimating the 2008 election. Note, however, that estimates of vote intention and the horserace have traditionally fared well in web panels using simple raking and propensity modeling, as noted above, and so at least in this respect the success of sample matching in accurately predicting election outcomes is perhaps unsurprising.

Recent submissions to both the AAPOR and Total Survey Error Conferences have included several papers experimenting with calibration (raking), propensity modeling, and sample matching techniques (Brick et al. 2015; Buskirk and Dutwin 2015a, 2015b; Dever and Shook-Sa 2015; DiSogra et al. 2015; Dutwin and Buskirk 2015; Peters, Driscoll, and Saavedra 2015; ZuWallack et al. 2015). Without reviewing each of these contributions specifically, it is notable that nearly all found that propensity modeling and sample matching were not clear panaceas for reducing error in nonprobability samples, and given the many inconclusive findings, most authors pointed to the need for further research and development of these techniques.

Despite the emerging literature that consistently documents larger errors for nonprobability samples relative to probability samples and varying levels of success or failure with adjustment methods applied to nonprobability samples, some claim that the persistent low response rates attained from phone probability sample surveys render such probabilistic approaches “almost dead” (Hopper 2014). As a possible replacement, many have touted that “online [opt-in] surveys…work really well” (Hopper 2014), and such nonprobability samples are capable of “generating quality data that mirrors census representative data” (Peanut Labs website advertisement on their AAPOR presentation [Petit 2015]). But empirical investigation and support for these claims appears to be relatively absent from the research literature. While there have been a growing number of studies that compare nonprobability and probability samples on their relative merits and data quality (e.g., Callegaro et al. [2014]; Yeager et al. [2011]), there are no research articles, to date, that have explicitly made a point to compare low response rate (which we define as below 10 percent) probability samples to nonprobability samples in terms of basic data quality. Moreover, there is relatively little literature directly comparing propensity weighting and sample matching as possible solutions for improving the quality of nonprobability samples within the context of the United States.

Responding to the tumultuous change faced by the industry in the past decade, and in part to the general dearth of research on the relative merits of low response rate probability samples and nonprobabilistic samples and Internet panels, AAPOR organized a task force and mini-conference in 2015 concerning New Developments in Survey Research. Additionally, the call for this Special Issue of Public Opinion Quarterly was for “assessing the quality of research in the current environment,” along with a specific request to go beyond assessment and also offer new methods of measuring “representivity,” that is, new metrics for evaluating the quality of estimates derived from different samples. This paper is a response to these calls and represents a substantive effort to assess the quality of a number of nonprobability Internet samples in direct comparison to low response rate probability samples using an elemental approach that utilizes the most fundamental survey data available: core demographics. We compare the estimated bias in conditional demographic distributions from low response rate RDD surveys with the estimated bias from nonprobability Internet panel surveys. We further add to the literature by exploring multiple methods for adjusting the nonprobability samples, including propensity weighting and sample matching in addition to raking.

Data

The data utilized for this study come from five principal sources, with data collection spanning two years, from October 2012 through October 2014, for all sources except the National Health Interview Survey (NHIS). Specifically, the survey data used in this paper were obtained from (1) the nonprobability Internet panel survey interviews from the Centris survey of communication, entertainment, and telephony (referred to as Panel 1)2; (2) a national dual-frame telephone RDD omnibus survey (Telephone 1)3; (3) a general population, aged 18–54, dual-frame RDD telephone survey (Telephone 2)4; (4) a companion nonprobability Internet panel sports-tracker survey (referred to as Panel 2)5; and (5) the NHIS of 2013. More specific information about these studies and sample composition, target populations, and response rates is provided in table 1.

Table 1.

Descriptions of the Five Samples, Including Sample Type, Scope, Response Rates (if Applicable), and Types of Adjustments Applied to Each Sample

Sample name | Type | Scope of sample | Sample size | Response rate (RR3) | Methods of adjustment applied
Telephone 1ᵃ | Probability | Adults 18+ | 107,701 | 8.2% | None (unweighted); raked
Telephone 1: Cell phones | Probability | Adults 18+ | 45,778 | 6.7% | None (unweighted); raked
Telephone 2ᵇ | Probability | Adults 18–54 | 20,483 | 8.0% | None (unweighted); raked
National Health Interview Survey (NHIS)ᶜ | Probability | Adults 18+ | 34,557 | 81.7% | None (unweighted); raked
Panel 1 | Nonprobability | Adults 18+ | 81,797 | N/A | None (unweighted); raked
Panel 1 matched | Nonprobability | Adults 18+ | 5,110 | N/A | Matched; matched & raked
Panel 2 | Nonprobability | Adults 18–54 | 63,147 | N/A | None (unweighted); raked; propensity weighted; propensity weighted & raked
Panel 2 matched | Nonprobability | Adults 18–54 | 5,013 | N/A | Matched; matched & raked

ᵃThe SSRS Telephone Omnibus survey is a national, dual-frame, bilingual telephone survey of 1,000 adults fielded weekly. While 40 percent of interviews were gathered via cell phones in 2012, the proportion increased to 50 percent in March 2013. One-third of Hispanic interviews are gathered in Spanish. The response rate for these two years of data collectively was 8.2 percent.

ᵇSportsPoll is a survey of sports, leisure, and other issues, and has run continuously since 1981. The telephone survey gathered 40 percent of its interviews via cell phones at the beginning of 2012 but increased to 50 percent in July 2013. The study is conducted in English and Spanish. The response rate for the survey was 8 percent.

ᶜThe National Health Interview Survey is a multistage area probability survey on a range of health status and behavior. Fielded since 1957, it is used here as our high-quality survey for measures of bias and variance given its purpose to produce official governmental estimates on health and its high response rate, 81.7 percent in 2013.

The two nonprobability samples were obtained from two different Internet panels. There is little information on how respondents are specifically recruited by each panel, beyond the usual means of website banners, pop-ups, and the like. Because each study is a continuous tracking survey, panelists could in time be re-interviewed (no more frequently than once per year). As such, panelist interviews were de-duplicated so that a single respondent contributed only a single case of data; deduplication was conducted by retaining one of a panelist’s cases at random. Only 14 percent of cases were duplicates, no panelist appeared more than twice, and an analysis of the full sample and the de-duplicated sample did not reveal any significant differences on a range of demographic measures.

The 2013 US Current Population Survey (CPS) served as the probability sample source for the sample matching. We used the 2012 US Census Bureau’s American Community Survey (ACS) as our “gold standard” source for population benchmarks in order to assess the estimated biases.6 The ACS data also serve as the source for control totals for calibration via raking, which we describe in more detail in the next section.

Methods

While our study brings together several nonprobability samples and probability samples of varying sizes and scopes, demographic variables are common to all. To facilitate comparisons, we wanted to identify a key set of variables that is not likely to be subject to satisficing, social desirability biases, or other measurement errors that could confound the impact of sample type beyond the effect of mode, which is inherent in the nature of this work (e.g., nonprobability Internet versus low response rate telephone surveys) (Callegaro et al. 2014).

Instead, we explore core demographics as the basis for our analysis. Where one lives in the United States, one’s age, race/ethnicity, and level of education are attributes foundational to a vast array of other potential attitudinal and behavioral metrics. They have served as key covariates in myriad research projects across the social sciences. Raking, propensity models, and sample matching all widely employ these variables to reduce errors associated with nonresponse, coverage, and other sources. Of course, therein lies “the rub” in utilizing these measures to compare different samples. Raking to main effects of key demographics constrains the weighted distribution to match the marginal distribution of each of these key demographics (up to error tolerances), thereby eliminating any utility these variables have for sample comparisons. However, this method in no way constrains the individual cells within a cross-tabulation table nor the row-wise or column-wise conditional distributions of two demographic variables considered together (e.g., the percentage of males who are young or the percentage of young adults who are male). Our elemental approach therefore evaluates conditional estimates computed from the cross-tabulations of pairs of demographic variables.

The notion behind our elemental cross-tabulation approach is simple: to quantify the bias of the estimates of one demographic variable within each level of the second demographic variable or, more simply, to examine the conditional distribution of one demographic variable given a category of another. For example, if we look at the cross-tabulation of age group by race/ethnicity (e.g., non-Hispanic White, non-Hispanic Black, Hispanic, non-Hispanic Other), then our key estimates are not the marginal distributions of age or race (which are fixed if calibration/raking is applied to the data using these demographics), but rather the distribution of age group within each of the four race groups and similarly, the distribution of race within each of the four age groups. This is akin to column versus row percentages in a cross-tabulated table of age group by race. In this way, we might expect that the bias from raking to known demographic main effect benchmarks would be minimized to the greatest extent possible by virtue of the strong association between our outcomes (conditional distributions of demographic variables) and these demographic variables.
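To make the distinction concrete, the short sketch below computes both sets of conditional estimates from a respondent-level file; the column names and the optional weight column are illustrative assumptions rather than the study’s actual variable names.

```python
# Sketch of the "elemental" conditional estimates for one pair of demographics,
# assuming a respondent-level data frame `df` with hypothetical columns such as
# "age_group" and "race_eth", and an optional column of survey weights.
import pandas as pd

def conditional_distributions(df, var_a, var_b, weight_col=None):
    # Weighted (or unweighted) counts for the A-by-B cross-tabulation.
    if weight_col is None:
        counts = pd.crosstab(df[var_a], df[var_b])
    else:
        counts = pd.crosstab(df[var_a], df[var_b],
                             values=df[weight_col], aggfunc="sum")

    # "A within B": distribution of A within each category of B (column percentages).
    a_within_b = 100 * counts / counts.sum(axis=0)
    # "B within A": distribution of B within each category of A (row percentages).
    b_within_a = 100 * counts.div(counts.sum(axis=1), axis=0)
    return a_within_b, b_within_a
```

Raking on the main effects of A and B fixes the margins of this cross-tabulation, but it leaves both of these conditional distributions free to differ from the benchmarks.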

For our analyses, we consider four specific demographic variables: age group (18–34, 35–49, 50–64, 65+); race/ethnicity (non-Hispanic White, non-Hispanic Black, Hispanic, non-Hispanic Other); education (Less than High School, High School, Some College, College or Beyond); and region (Northeast, South, Midwest, West). For each possible pair (A, B) of demographic variables, we will estimate conditional distributions using the cross-tabulation table of demographic variable A (rows) and demographic variable B (columns). The column and row percentages computed from this table will be referred to collectively as “A within B” and “B within A,” respectively. Using the four demographic variables, there are 12 possible sets of estimates formed by the six cross-tabulations of the demographic variables that in turn result in a total of 216 target percentages that are evaluated from each sample source.7 We evaluate the accuracy of these estimates using three different statistics, as described by Callegaro et al. (2014), including (a) the mean estimated absolute bias, (b) the standard deviation of the estimated absolute biases, and (c) the maximum estimated absolute error. Every metric is expressed in percentage points (pps).

The estimated absolute bias for each of the row and column percentages from the cross-tabulation of demographic variable A with demographic variable B is computed simply as the absolute value of the difference between the specific row/column percentage computed from the sample data and the corresponding percentage computed from the ACS data. Finally, the overall mean estimated absolute bias for a given sample represents the arithmetic mean of the average estimated absolute biases from the 12 possible sets of row and column percentages over all cross-tabulation tables.
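Written compactly (in our notation, not the authors’): if $\hat{p}_{a \mid b}$ is the sample estimate of the percentage in category $a$ of variable A among respondents in category $b$ of variable B, and $p^{\mathrm{ACS}}_{a \mid b}$ is the corresponding ACS benchmark, then

$$
\widehat{\mathrm{EAB}}_{a \mid b} \;=\; \bigl|\hat{p}_{a \mid b} - p^{\mathrm{ACS}}_{a \mid b}\bigr|,
\qquad
\overline{\mathrm{EAB}}_{A \mid B} \;=\; \frac{1}{|S_{A \mid B}|}\sum_{(a,b)\,\in\, S_{A \mid B}} \widehat{\mathrm{EAB}}_{a \mid b},
$$

where $S_{A \mid B}$ is the set of all cells in the “A within B” collection; the overall mean estimated absolute bias for a sample is then the arithmetic mean of the 12 set-level means $\overline{\mathrm{EAB}}_{A \mid B}$.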

The standard deviation of the estimated absolute bias measure for a given collection of column percentages from “A within B” represents the standard deviation of the estimated absolute bias measures in these statistics contained within this cross-tabulation. Small values of the standard deviation metric for the collection “A within B” imply that the biases in the column percentages are generally consistent across the table, while large values of this metric imply that the biases in the column percentages are inconsistent in magnitude across the table.

Finally, the maximum estimated absolute bias (also referred to as the largest absolute error [Callegaro et al. 2014]) is defined for each set of column percentages from “A within B” as the largest estimated absolute bias among this set of estimates. This metric is similarly defined for row percentages from “B within A.”
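A small sketch of how the three metrics can be computed for one such set, assuming `sample_pct` and `acs_pct` are aligned tables of row or column percentages (for instance, the output of the `conditional_distributions` sketch above and its ACS counterpart):

```python
# Compute the three summary metrics for one set of estimates (e.g., "age within race").
# `sample_pct` and `acs_pct` are assumed to be identically indexed pandas DataFrames
# of percentages; the sample standard deviation (ddof=1) is our own assumption.
import numpy as np

def eab_metrics(sample_pct, acs_pct):
    # Estimated absolute bias (EAB), in percentage points, for every estimate in the set.
    eab = (sample_pct - acs_pct).abs().to_numpy().ravel()
    return {
        "mean_eab": eab.mean(),     # mean estimated absolute bias
        "sd_eab": eab.std(ddof=1),  # standard deviation of the EABs within the set
        "max_eab": eab.max(),       # maximum estimated absolute bias (largest absolute error)
    }
```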

The 216 target percentages were computed and evaluated from each of the six sample sources using both unweighted and weighted data. Overall, there are 18 sample/weighting combinations. First, there are the six unweighted samples: Panel 1, Panel 2, Telephone 1, Telephone 1 limited to only cell phones, Telephone 2, and the NHIS. Next, there are two matched samples: Panel 1 matched and Panel 2 matched. The first panel was matched to a simple random sample of 5,010 anonymized cases (approximately a 3.5 percent sample) selected from the 2013 CPS, while the second panel was matched to a second, independently drawn simple random sample of 5,013 anonymized cases (also approximately a 3.5 percent sample) selected from the 2013 CPS.8 To obtain the one-to-one matches, a simple matching coefficient index was used to identify the nonprobability sample case that best matched each probability case on eight categorical variables: region, sex, age group, race, education, homeownership, marital status, and current employment status.9
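The sketch below illustrates a greedy one-to-one match of this kind; the variable names, the greedy search order, and the tie-breaking rule are our own illustrative assumptions, not the authors’ implementation.

```python
# Greedy one-to-one sample matching sketch using the simple matching coefficient
# (the share of categorical variables on which two cases agree).
import numpy as np
import pandas as pd

# Hypothetical names for the eight matching variables described in the text.
MATCH_VARS = ["region", "sex", "age_group", "race", "education",
              "homeownership", "marital_status", "employment_status"]

def match_panel_to_reference(panel, reference, match_vars=MATCH_VARS):
    panel_codes = panel[match_vars].astype(str).to_numpy()
    available = np.ones(len(panel), dtype=bool)    # panelists not yet matched
    matched_labels = []

    for _, ref_row in reference[match_vars].astype(str).iterrows():
        # Simple matching coefficient of every available panelist with this CPS case.
        agreement = (panel_codes == ref_row.to_numpy()).mean(axis=1)
        agreement[~available] = -1.0               # exclude already-matched panelists
        best = int(agreement.argmax())
        matched_labels.append(panel.index[best])
        available[best] = False

    # The matched nonprobability sample: one panelist per probability-sample case.
    return panel.loc[matched_labels]
```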

Finally, there is the propensity weighted sample. Specifically, in the case of Panel 2, additional “webographic” variables (see appendix) were available, allowing us to compute propensity models for membership in the probability sample based on main effects of both demographic and “webographic” variables. Such variables are typically used to discriminate between nonprobability and probability respondents (Schonlau, van Soest, and Kapteyn 2007). Propensity weights were computed as the inverse of the predicted probabilities derived from the propensity models and mark the ninth possible combination of weighting and sample: Panel 2 propensity weighted. Since no “webographic” questions were available for Panel 1, it was not propensity weighted.10

Each of the nine samples was then weighted by calibrating the base weights to identical targets, including education (less than high school, high school diploma, some college, bachelor’s degree, and graduate degree), gender, age (18–29, 30–49, 50–64, 65+),11 and region, giving us a total of 18 sample types (nine unweighted versions and nine weighted/raked versions). For the nonprobability samples, the base weights were either 1 or the applicable propensity weight. The base sampling weights for all telephone datasets were computed using a single-frame estimator approach (Bankier 1986; Kalton and Anderson 1986; Buskirk and Best 2012) to account for multiple sampling frames and disproportionate probabilities of selection as functions of the number of persons and phones associated with a household.
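For readers unfamiliar with calibration by raking, the following is a bare-bones sketch of iterative proportional fitting: starting from the base weights, each margin is adjusted in turn toward its control total until the weights stabilize. The column names, the dictionary of control totals, and the convergence rule are illustrative assumptions, not the exact software used in this study.

```python
# Bare-bones raking (iterative proportional fitting) sketch: adjust base weights so
# that weighted margins match population control totals for each raking variable.
import pandas as pd

def rake(df, base_weights, targets, max_iter=100, tol=1e-6):
    # `base_weights` is a pandas Series aligned with `df`; `targets` maps a column
    # name (e.g., "education", "gender", "age_group", "region") to a pandas Series
    # of population totals indexed by category.
    w = base_weights.astype(float).copy()
    for _ in range(max_iter):
        max_change = 0.0
        for var, control in targets.items():
            current = w.groupby(df[var]).sum()     # current weighted margin
            factors = control / current            # per-category adjustment factor
            new_w = w * df[var].map(factors)
            max_change = max(max_change, float((new_w - w).abs().max()))
            w = new_w
        if max_change < tol:                       # stop once weights stabilize
            break
    return w
```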

In addition to the bias-related metrics described above, we also measure the impact of the various weighting/adjustment methods for each of the samples using a more traditional approach, that is, with measures of unequal weighting effects (UWEs). While the UWE is generally defined for probability designs, we apply a simple computation of it (1 plus the squared coefficient of variation of the final weights) across all sample types as a means to consistently measure the impact of the adjustments on the “effective” sample sizes and to speak to the question of cost.12
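Under that simple formulation, the UWE can be computed directly from the final weights, as in the short sketch below (the use of the population standard deviation here is our own assumption).

```python
# Unequal weighting effect (UWE) computed as 1 + CV^2 of the final weights,
# following the simple formulation described in the text.
import numpy as np

def unequal_weighting_effect(weights):
    w = np.asarray(weights, dtype=float)
    cv = w.std() / w.mean()        # coefficient of variation of the weights
    return 1.0 + cv ** 2

# The "effective" sample size is then roughly n / UWE, which is how the UWE
# speaks to the cost question raised in the text.
```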

Results

For the first part of our analyses, we will focus on the three metrics across the 12 sets of row/column percentage estimates computed from the demographic cross-tabulations. In the second part of our analyses, we focus on comparing samples based on aggregated statistics. We start with an overview of the estimated mean absolute bias depicted in figure 1 for each of these 12 sets by all 18 sampling/weighting combinations. A number of patterns emerge in this graph. First, there are four sets of row/column percentages with estimated mean absolute biases in excess of 10 percentage points, all from the unweighted nonprobability panels. In general, there appears to be a clear bifurcation in the average estimated absolute biases among the unweighted samples. The nonprobability samples attain relatively high estimated absolute biases, on average, while the unweighted telephone samples and the NHIS attain low estimated bias. Among weighted or matched samples, the pattern appears to be similar, though of course with much lower overall levels of estimated absolute bias, on average, than is seen in the unweighted samples.

Figure 1.

Mean Estimated Absolute Biases for each of the 12 Sets of Row/Column Percentages by Sample/Weighting Type.

The estimated mean absolute biases displayed in figure 1 are also provided along with the other two summary metrics, computed for each of the 12 sets of row/column percentage estimates, for the unweighted/matched and propensity weighted nonprobability samples in table 2; the raked/weighted nonprobability samples in table 3; the unweighted probability samples in table 4; and the weighted/raked probability samples in table 5.

Table 2.

Statistical Metrics for Each of the 12 Sets of Row/Column Percentage Estimates for the Unweighted/Matched/Propensity Weighted Nonprobability Samples (EAB = estimated absolute bias; each cell shows mean EAB / std. dev. of EABs / max EAB, in percentage points)

Demographic cross-tabulation | Panel 1 unweighted | Panel 2 unweighted | Panel 1 matched | Panel 2 matched | Panel 2 propensity weighted
Race within education | 5.07 / 7.67 / 28.70 | 16.29 / 10.78 / 32.65 | 3.45 / 4.08 / 17.32 | 3.79 / 3.93 / 7.78 | 3.80 / 2.81 / 6.55
Age within education | 5.80 / 5.95 / 17.72 | 9.47 / 12.22 / 38.86 | 6.71 / 5.73 / 17.09 | 5.32 / 6.73 / 24.93 | 7.54 / 6.33 / 22.33
Region within education | 1.88 / 1.70 / 7.74 | 3.25 / 2.00 / 7.59 | 2.32 / 1.70 / 5.98 | 3.10 / 2.67 / 7.78 | 1.90 / 1.83 / 7.01
Race within age | 6.36 / 4.07 / 14.40 | 1.98 / 2.01 / 7.49 | 2.32 / 1.37 / 4.85 | 1.95 / 1.02 / 4.29 | 8.14 / 5.63 / 19.83
Education within age | 9.24 / 4.88 / 15.65 | 15.92 / 11.08 / 32.37 | 3.86 / 3.23 / 7.75 | 3.29 / 2.98 / 6.46 | 4.27 / 3.88 / 12.03
Region within age | 1.58 / 0.93 / 3.45 | 1.11 / 1.13 / 2.95 | 2.07 / 1.39 / 3.36 | 1.89 / 1.19 / 4.05 | 7.14 / 3.96 / 14.74
Age within race | 7.58 / 6.02 / 24.14 | 4.64 / 2.35 / 9.09 | 6.01 / 3.84 / 12.91 | 1.58 / 0.73 / 3.27 | 2.82 / 2.12 / 6.93
Region within race | 2.40 / 1.79 / 5.65 | 4.20 / 2.64 / 6.60 | 2.82 / 1.95 / 5.83 | 1.25 / 1.10 / 4.09 | 1.63 / 1.19 / 3.85
Education within race | 11.80 / 6.92 / 18.82 | 5.00 / 7.82 / 23.10 | 5.02 / 4.47 / 7.51 | 3.49 / 4.87 / 17.96 | 4.06 / 6.23 / 17.91
Race within region | 4.86 / 3.95 / 12.54 | 2.70 / 1.59 / 3.60 | 2.44 / 2.24 / 9.84 | 2.64 / 1.99 / 4.22 | 1.34 / 0.97 / 2.72
Age within region | 4.95 / 4.43 / 10.05 | 3.13 / 1.64 / 4.22 | 5.99 / 4.13 / 13.15 | 3.02 / 2.21 / 4.19 | 1.68 / 1.02 / 3.11
Education within region | 9.12 / 4.55 / 15.17 | 15.00 / 10.79 / 29.67 | 3.83 / 2.37 / 6.99 | 2.08 / 2.09 / 4.75 | 2.03 / 1.44 / 4.57
Table 3.

Statistical Metrics for Each of the 12 Sets of Row/Column Percentage Estimates for the Weighted/Raked Nonprobability Samples (EAB = estimated absolute bias; each cell shows mean EAB / std. dev. of EABs / max EAB, in percentage points)

Demographic cross-tabulation | Panel 1 raked | Panel 2 raked | Panel 1 matched & raked | Panel 2 matched & raked | Panel 2 propensity weighted & raked
Race within education | 4.00 / 5.29 / 18.96 | 2.89 / 3.53 / 8.71 | 3.35 / 3.69 / 15.33 | 3.61 / 2.16 / 7.99 | 3.38 / 3.71 / 9.69
Age within education | 3.03 / 3.37 / 9.30 | 8.05 / 7.53 / 25.61 | 3.98 / 5.43 / 20.62 | 4.83 / 6.21 / 22.55 | 7.58 / 6.53 / 22.03
Region within education | 0.89 / 0.82 / 3.70 | 1.15 / 1.30 / 4.25 | 1.89 / 1.41 / 5.06 | 1.20 / 1.83 / 6.22 | 0.83 / 0.95 / 2.81
Race within age | 2.38 / 2.11 / 6.74 | 2.14 / 1.79 / 6.77 | 1.11 / 0.68 / 2.15 | 1.49 / 2.10 / 7.09 | 1.85 / 1.60 / 4.65
Education within age | 2.42 / 2.50 / 7.43 | 6.99 / 3.25 / 10.82 | 2.71 / 2.43 / 9.50 | 4.55 / 3.64 / 10.81 | 6.93 / 3.16 / 11.53
Region within age | 0.51 / 0.39 / 1.32 | 1.18 / 1.25 / 3.25 | 0.90 / 0.70 / 1.64 | 0.75 / 0.62 / 1.91 | 1.22 / 1.32 / 3.83
Age within race | 3.67 / 3.74 / 14.88 | 2.16 / 1.97 / 6.35 | 2.32 / 2.60 / 10.19 | 3.04 / 2.47 / 8.48 | 1.82 / 1.74 / 5.74
Region within race | 1.78 / 1.66 / 4.39 | 0.66 / 0.38 / 1.39 | 2.82 / 2.27 / 7.80 | 3.40 / 2.60 / 9.11 | 0.83 / 0.71 / 1.50
Education within race | 3.59 / 3.54 / 7.27 | 4.30 / 5.98 / 16.80 | 2.68 / 3.02 / 12.03 | 4.83 / 6.90 / 24.76 | 4.81 / 6.15 / 16.97
Race within region | 0.92 / 0.76 / 1.97 | 1.11 / 1.21 / 2.69 | 1.90 / 1.87 / 7.84 | 1.18 / 1.28 / 4.81 | 0.90 / 1.01 / 2.34
Age within region | 0.68 / 0.56 / 2.15 | 1.26 / 1.09 / 3.56 | 0.91 / 0.69 / 1.36 | 0.78 / 0.78 / 2.46 | 1.29 / 1.09 / 3.03
Education within region | 1.17 / 0.72 / 2.36 | 1.32 / 1.22 / 3.19 | 1.38 / 0.88 / 2.88 | 2.93 / 1.63 / 5.48 | 0.91 / 0.86 / 1.71
Table 4.

Statistical Metrics for Each of the 12 Sets of Row/Column Percentage Estimates for the Unweighted Probability Samples (EAB = estimated absolute bias; each cell shows mean EAB / std. dev. of EABs / max EAB, in percentage points)

Demographic cross-tabulation | Telephone 1 (cell) unweighted | Telephone 1 unweighted | Telephone 2 unweighted | NHIS unweighted
Race within education | 2.62 / 2.26 / 4.27 | 2.53 / 2.19 / 7.39 | 5.14 / 3.26 / 11.96 | 3.34 / 2.21 / 5.50
Age within education | 5.08 / 4.22 / 14.86 | 7.03 / 3.73 / 14.91 | 4.31 / 2.70 / 7.60 | 2.47 / 1.92 / 6.11
Region within education | 1.82 / 1.29 / 3.51 | 1.63 / 1.45 / 3.13 | 2.26 / 1.50 / 2.81 | 1.55 / 0.98 / 3.58
Race within age | 1.69 / 1.53 / 3.71 | 2.10 / 1.42 / 4.77 | 3.84 / 2.39 / 7.82 | 3.90 / 2.70 / 5.44
Education within age | 3.31 / 2.22 / 6.41 | 3.79 / 2.40 / 6.77 | 5.78 / 4.02 / 11.81 | 1.42 / 1.01 / 2.83
Region within age | 1.99 / 1.08 / 3.55 | 1.40 / 0.91 / 2.63 | 3.26 / 1.65 / 5.65 | 1.60 / 1.12 / 3.87
Age within race | 5.92 / 5.24 / 22.36 | 5.50 / 3.58 / 13.55 | 2.60 / 2.39 / 7.06 | 3.08 / 1.93 / 6.74
Region within race | 3.55 / 4.41 / 9.96 | 3.03 / 4.56 / 10.88 | 2.22 / 2.54 / 9.38 | 1.60 / 1.38 / 3.53
Education within race | 3.93 / 2.51 / 7.82 | 4.11 / 2.36 / 6.50 | 2.55 / 2.69 / 9.16 | 1.40 / 1.32 / 3.66
Race within region | 3.15 / 2.83 / 5.22 | 2.48 / 2.89 / 10.81 | 2.84 / 3.36 / 6.53 | 2.92 / 2.54 / 6.79
Age within region | 5.36 / 3.79 / 14.75 | 5.13 / 3.25 / 12.63 | 2.16 / 1.44 / 2.79 | 2.44 / 1.75 / 6.22
Education within region | 2.63 / 1.50 / 4.68 | 3.41 / 1.37 / 4.40 | 4.96 / 3.69 / 10.59 | 1.09 / 0.81 / 2.91
Table 5.

Statistical Metrics for Each of the 12 Sets of Row/Column Percentage Estimates for the Weighted/Raked Probability Samples (EAB = estimated absolute bias; each cell shows mean EAB / std. dev. of EABs / max EAB, in percentage points)

Demographic cross-tabulation | Telephone 1 (cell) raked | Telephone 1 raked | Telephone 2 raked | NHIS weighted
Race within education | 1.44 / 1.01 / 3.07 | 1.34 / 1.06 / 3.40 | 1.90 / 1.96 / 6.37 | 0.89 / 0.67 / 2.23
Age within education | 2.10 / 1.79 / 4.66 | 2.34 / 2.06 / 4.72 | 2.48 / 2.51 / 7.23 | 0.86 / 0.64 / 2.17
Region within education | 0.99 / 0.95 / 1.81 | 0.76 / 0.70 / 1.56 | 0.75 / 0.87 / 1.74 | 0.79 / 0.47 / 1.81
Race within age | 0.75 / 0.66 / 2.52 | 1.44 / 1.29 / 3.73 | 2.37 / 3.05 / 11.08 | 0.35 / 0.22 / 0.65
Education within age | 1.77 / 1.79 / 5.09 | 2.02 / 1.86 / 5.61 | 3.12 / 2.52 / 6.12 | 1.10 / 0.94 / 2.87
Region within age | 1.13 / 0.96 / 2.60 | 0.59 / 0.43 / 1.12 | 0.72 / 0.65 / 2.14 | 0.93 / 0.68 / 3.10
Age within race | 2.12 / 3.12 / 12.06 | 2.55 / 3.33 / 13.43 | 3.30 / 2.65 / 8.25 | 0.62 / 0.55 / 0.98
Region within race | 3.29 / 3.79 / 8.71 | 2.79 / 3.87 / 9.83 | 3.05 / 3.22 / 12.00 | 0.73 / 0.65 / 2.11
Education within race | 1.38 / 1.90 / 6.19 | 1.28 / 1.92 / 5.82 | 1.65 / 1.48 / 5.56 | 0.98 / 1.31 / 3.32
Race within region | 2.11 / 2.24 / 6.47 | 1.57 / 1.48 / 5.00 | 1.77 / 2.69 / 5.79 | 0.44 / 0.38 / 0.56
Age within region | 1.17 / 1.13 / 4.23 | 0.62 / 0.53 / 1.82 | 0.92 / 0.70 / 2.02 | 0.92 / 0.85 / 3.06
Education within region | 0.76 / 0.63 / 1.60 | 0.69 / 0.52 / 1.68 | 2.67 / 1.38 / 4.27 | 0.91 / 0.81 / 2.04

If we focus on the nonprobability samples, we see, in table 2, that the matched versions of each of the nonprobability samples offered improvements over the unweighted versions for a majority of the 12 sets of row/column percentage estimates across the three metrics. The degree of improvement for the mean absolute bias metric varied a bit more across the two nonprobability samples (e.g., seven of 12 for Panel 1 and 11 of 12 for Panel 2). But, as we observe comparing table 2 with table 3, matching alone was generally not better than raking alone. In fact, for only two of the 12 sets was the estimated mean absolute bias improved for the Panel 1 matched sample compared to the Panel 1 raked sample (and for five of 12 for the comparable Panel 2 samples).

While the matched and raked nonprobability samples generally fared better than matching alone for at least eight of 12 sets across the three metrics for the Panel 1 sample, the results were not as conclusive for Panel 2 (seven of 12 sets had improvements in estimated mean absolute bias and standard deviation, but only four of 12 for maximum estimated absolute bias). The propensity weighted Panel 2 sample did generally fare better than the unweighted sample on most of the row/column percentage estimate sets. However, the estimates from propensity weighting combined with raking were generally inferior in terms of the three metrics when compared to the other nonprobability samples/methods.

In general, a closer look across table 3 reveals that there is no single method (e.g., raking, matching + raking, propensity weighting + raking) that consistently performs better across the 12 row/column percentage sets on any of the three metrics for either of the two nonprobability samples. While some of the “newer” methods (e.g., propensity weighting and matching) offered some improvements for nonprobability samples in terms of absolute biases, these “improved” estimates generally had larger and more variable errors compared to those from the probability samples. In particular, a comparison of table 2 and table 4 reveals that the nonprobability samples had the largest maximum estimated absolute bias for 10 out of the 12 sets of row/column percentage estimates and higher standard deviations of the estimated absolute biases for 11 out of 12 of these sets. A comparison between tables 3 and 5 shows that even after applying weighting/raking to the nonprobability samples (and their adjusted versions), these weighted samples still have the largest maximum estimated absolute bias statistics for 11 of the 12 sets of column/row percentage estimates compared to the weighted/raked probability samples. And while weighting helped reduce the variability in the estimated absolute biases measured in each of the 12 sets, the weighted/raked nonprobability samples collectively still had higher standard deviations for eight of the 12 sets of row/column percentage estimates.

To further simplify and clarify our analysis, we now focus on measures of overall estimated absolute bias and average standard deviation for each of the 18 sample/weighting combinations obtained by aggregating these metrics across the 12 row/column percentage tables computed for each sample. The overall mean estimated absolute biases for the unweighted and matched/weighted samples are given in tables 6 and 7, respectively. In table 6, we see that the probability samples overall hold substantially less inherent (that is, unweighted) estimated absolute bias compared to the nonprobability samples. The mean overall average estimated absolute bias for the two web panels is 6.4 pps, compared to 3.5 pps for the dual-frame samples, 3.4 pps for the cell-only sample, and 2.2 pps for the NHIS. Put another way, the nonprobability samples hold about 1.8 times more estimated absolute bias, on average, than the dual-frame samples. As measured by percent difference from the ACS, nonprobability samples were about 30 percent different in their estimates, compared to 16 percent different for the dual-frame samples.

Table 6.

Statistical Metrics Averaged over the 12 Sets of Row/Column Percentages for Each of the Unweighted Samples (EAB = estimated absolute bias)

Unweighted sample | Overall average EAB for row/column percentages | Overall median EAB for row/column percentages | Overall average standard deviation of EABs
Panel 1 unweighted | 5.89 | 5.44 | 4.40
Panel 2 unweighted | 6.89 | 4.42 | 5.50
Telephone 1 (cell) unweighted | 3.42 | 3.23 | 2.74
Telephone 1 unweighted | 3.51 | 3.22 | 2.51
Telephone 2 unweighted | 3.49 | 3.05 | 2.64
NHIS unweighted | 2.23 | 2.02 | 1.64
Table 7.

Statistical Metrics Averaged over the 12 Sets of Row/Column Percentages for Each of the Matched/Weighted Samples (EAB = estimated absolute bias)

Matched/weighted sample | Overall average EAB for row/column percentages | Overall median EAB for row/column percentages | Overall average standard deviation of EABs
Panel 1 matched | 3.90 | 3.64 | 3.04
Panel 2 matched | 2.78 | 2.83 | 2.63
Panel 2 propensity weighted | 3.86 | 3.31 | 3.12
Panel 1 raked | 2.09 | 2.08 | 2.12
Panel 2 raked | 2.77 | 1.73 | 2.54
Panel 1 matched & raked | 2.16 | 2.11 | 2.14
Panel 2 matched & raked | 2.71 | 2.99 | 2.68
Panel 2 propensity weighted & raked | 2.70 | 1.55 | 2.40
Telephone 1 (cell) raked | 1.58 | 1.41 | 1.66
Telephone 1 raked | 1.50 | 1.39 | 1.59
Telephone 2 raked | 2.06 | 2.14 | 1.97
NHIS weighted | 0.79 | 0.87 | 0.68
Matched/weighted sampleOverall average EAB for row/ column percentagesOverall median EAB for row/column percentagesOverall average standard deviation of EABs
Panel 1 matched3.903.643.04
Panel 2 matched2.782.832.63
Panel 2 propensity weighted3.863.313.12
Panel 1 raked2.092.082.12
Panel 2 raked2.771.732.54
Panel 1 matched & raked2.162.112.14
Panel 2 matched & raked2.712.992.68
Panel 2 propensity weighted & raked2.701.552.40
Telephone 1 (cell) raked1.581.411.66
Telephone 1 raked1.501.391.59
Telephone 2 raked2.062.141.97
NHIS weighted0.790.870.68
Table 7.

Statistical Metrics Averaged over the 12 Sets of Row/Column Percentages for each of the Matched/Weighted Samples

Matched/weighted sampleOverall average EAB for row/ column percentagesOverall median EAB for row/column percentagesOverall average standard deviation of EABs
Panel 1 matched3.903.643.04
Panel 2 matched2.782.832.63
Panel 2 propensity weighted3.863.313.12
Panel 1 raked2.092.082.12
Panel 2 raked2.771.732.54
Panel 1 matched & raked2.162.112.14
Panel 2 matched & raked2.712.992.68
Panel 2 propensity weighted & raked2.701.552.40
Telephone 1 (cell) raked1.581.411.66
Telephone 1 raked1.501.391.59
Telephone 2 raked2.062.141.97
NHIS weighted0.790.870.68
Matched/weighted sampleOverall average EAB for row/ column percentagesOverall median EAB for row/column percentagesOverall average standard deviation of EABs
Panel 1 matched3.903.643.04
Panel 2 matched2.782.832.63
Panel 2 propensity weighted3.863.313.12
Panel 1 raked2.092.082.12
Panel 2 raked2.771.732.54
Panel 1 matched & raked2.162.112.14
Panel 2 matched & raked2.712.992.68
Panel 2 propensity weighted & raked2.701.552.40
Telephone 1 (cell) raked1.581.411.66
Telephone 1 raked1.501.391.59
Telephone 2 raked2.062.141.97
NHIS weighted0.790.870.68

Table 7 summarizes our weighted, modeled, and matched samples. Notably, Panel 1’s overall average estimated absolute bias is reduced from 5.9 percentage points (unweighted) to 2.1 percentage points once raked. Perhaps unsurprisingly, the NHIS had the smallest overall average estimated absolute bias of any of the samples, and there was no practical difference in the overall average estimated absolute bias between the cell-only sample and its corresponding dual-frame sample.

Generally speaking, weighting significantly reduced the overall average estimated absolute bias for all samples, and the reduction was roughly similar across samples, with some notable exceptions. The overall estimated absolute bias for the two nonprobability samples was, on average, reduced by a factor of 2.5, while the telephone samples saw an average reduction by a factor of 1.9 (and the NHIS by a factor of 2.8). The overall average estimated absolute bias of the two raked nonprobability samples was, on average, greater than that of the probability samples (NHIS excluded): 2.4 versus 1.7 percentage points.13

Of course, our expectation for adjustment techniques such as propensity weighting and sample matching was that they would reduce bias beyond what raking alone can achieve. If so, this could place the bias in the nonprobability samples on a par with the telephone samples. In fact, however, while these procedures did reduce estimated bias relative to the unweighted samples, they did not do so as effectively as simple raking. Simple raking reduced the overall average estimated absolute bias of the two nonprobability samples to 2.4 pps, whereas propensity weighting alone only reduced it to 3.9 pps; propensity weighting plus raking, to 2.7 pps; matching (across both samples), to 3.4 pps; and matching plus raking (across both samples), to 2.5 pps. Notably, Panel 1 matched and raked attained the same level of estimated mean absolute bias as the simple rake alone, although the matched sample was considerably smaller. Likewise, the Panel 2 propensity-weighted sample attained the same level of estimated bias as the matched Panel 1 sample, 3.9 pps.
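To make the raking step concrete, the sketch below implements a basic iterative proportional fitting (raking) routine of the kind discussed here: weights are adjusted one margin at a time until the weighted distributions match target proportions. The function, the column names, and the target margins in the commented usage lines are illustrative assumptions on our part; they are not the authors' actual raking dimensions, targets, or software.

```python
import pandas as pd

def rake(df, targets, max_iter=100, tol=1e-6):
    """Rake weights so weighted margins match target proportions (iterative proportional fitting).

    df: one row per respondent, with the categorical raking variables as columns.
    targets: dict mapping variable name -> {category: target proportion}.
    Weights start at 1 (replace with base or propensity weights as needed) and are
    adjusted one margin at a time until every margin is within `tol` of its target.
    """
    w = pd.Series(1.0, index=df.index)
    for _ in range(max_iter):
        max_gap = 0.0
        for var, margin in targets.items():
            total = w.sum()
            factors = {}
            for category, target in margin.items():
                current = w[df[var] == category].sum() / total
                factors[category] = target / current if current > 0 else 1.0
                max_gap = max(max_gap, abs(current - target))
            w = w * df[var].map(factors).fillna(1.0)
        if max_gap < tol:
            break
    return w * len(df) / w.sum()  # normalize so the weights average to 1

# Hypothetical usage with two raking dimensions (targets are illustrative, not the article's):
# df["rake_wt"] = rake(df, {
#     "education": {"<HS": 0.12, "HS": 0.28, "Some college": 0.31, "BA+": 0.29},
#     "age_group": {"18-34": 0.30, "35-54": 0.34, "55+": 0.36},
# })
```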

The overall averages of the standard deviations of the estimated absolute biases are also reported in tables 6 and 7 for the unweighted and the matched/weighted samples, respectively. For the unweighted samples (table 6), we find a pattern similar to that of the estimated biases: the nonprobability samples show substantially greater variability in the estimated absolute biases, on average, than the probability samples. Taken together, the unweighted nonprobability panels have an average standard deviation of about 5.0 pps, compared to 2.6 pps for the probability samples (NHIS excluded). A closer look at table 7 reveals that this gap in variability shrinks under the various weighting/matching methods. While the probability samples still have less variability in the estimated absolute biases, on average, than the nonprobability samples, the differences between the two types of samples are less pronounced. Taken together, the various weighted and matched nonprobability samples attain an average standard deviation of about 2.6 pps, compared to 1.7 pps for the probability samples (NHIS excluded). The largest average standard deviation in estimated absolute biases across all samples came from Panel 2 (unweighted). Finally, the variability in the estimated absolute biases for the cell phone sample was on par with the full telephone samples, and the NHIS had by far the lowest variability of all weighted samples (0.7 pps).

As for the potential impact on individual estimates resulting from UWEs, the pattern is much the same as we have seen thus far. Specifically, while the UWEs for the telephone samples are relatively low (1.3), the raking had to, in effect, “work harder” for the nonprobability samples, which attained a UWE of 2.8 (see figure 2). We note that the UWEs for the dual-frame samples account for multiple inclusion probabilities across two sampling frames as well as the impact of the raking. Worse still was propensity weighting, which resulted in a UWE of 5.9. One advantage of sample matching, however, is that by “stacking the deck” before inviting people to respond to surveys, sample-matched data require far less adjustment through raking. As such, the average UWE across the two matched and raked samples was quite low, at 1.2.
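Footnote 1 defines the unequal weighting effect as Kish's design effect due to unequal weighting, UWE = 1 + CV²(w) = n·Σw²/(Σw)², with effective sample size n/UWE. The sketch below computes both quantities; the simulated lognormal weights are purely illustrative stand-ins for the article's actual weights, chosen only to show that more variable weights yield a larger UWE and a smaller effective n.

```python
import numpy as np

def unequal_weighting_effect(weights):
    """Kish's UWE (footnote 1): 1 + CV^2(w) = n * sum(w^2) / (sum(w))^2."""
    w = np.asarray(weights, dtype=float)
    return len(w) * np.sum(w ** 2) / np.sum(w) ** 2

def effective_sample_size(weights):
    """Effective number of interviews after weighting: n / UWE."""
    return len(weights) / unequal_weighting_effect(weights)

# Illustrative only: simulated weights with mild versus heavy variation.
rng = np.random.default_rng(0)
mild_wts = rng.lognormal(sigma=0.5, size=1000)    # mildly variable weights
heavy_wts = rng.lognormal(sigma=1.0, size=1000)   # much more variable weights
for w in (mild_wts, heavy_wts):
    # The more variable the weights, the larger the UWE and the smaller the effective n.
    print(round(unequal_weighting_effect(w), 2), round(effective_sample_size(w)))
```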

Figure 2. Unequal Weighting Effects.

Discussion and Conclusions

Anecdotally and empirically, researchers have found that estimates derived from nonprobability panel samples can be less accurate than those derived from what we would today consider high response rate probability samples. In our analysis, the unweighted samples from nonprobability Internet panels showed substantially higher estimated bias than did the low response rate probability samples. Beyond that, what is notable is the combination of (1) the overall size of the unweighted nonprobability estimated biases and the gap between probability and nonprobability samples in estimated mean absolute bias; (2) the variability in the estimated absolute biases of the nonprobability samples; and (3) the fact that, in comparison to the probability samples, substantial biases in the nonprobability samples persisted even after applying raking, propensity weighting, or matching. More specifically, the nonprobability samples in our analysis had mean estimated absolute bias measures about two times those of the low response rate probability samples, and, at 6.4 pps (unweighted), this average estimated absolute bias often translates into a substantive difference between the survey estimate and the corresponding benchmark value computed from the ACS.

Not only were the estimated biases larger for the nonprobability samples, they were also more varied. The standard deviations of the estimated biases for statistics computed across the 12 cross-tabulations of the demographic variables were on average about two times larger for the nonprobability samples, taken together, as compared to the two dual-frame probability samples. The magnitude of the estimated biases for a given row or column percentage in the cross-tabulation of the demographic variables for the nonprobability samples appears to fluctuate widely and no systematic pattern seems to emerge, with some variables exhibiting high estimated biases for one nonprobability sample but not for the other. In contrast, the estimated absolute biases computed from the probability samples appear to be consistently low, as evidenced by smaller mean estimated absolute biases and standard deviations. This may suggest that, with regard to nonprobability samples, one is “rolling the dice” in terms of how accurate any one particular estimate may be.

Because the demographic variables in our analysis were used as main effects in all raking applications, our investigation of biases represents a “best-case scenario” in which raking would be expected to work very well in reducing biases. But even in this best-case scenario, substantial biases persisted for some of the nonprobability samples. In this sense, we believe that our elemental approach illustrates a “lower bound” on the overall estimated biases one might expect to see in practice when working with any samples whose data are used to estimate more than just conditional demographic distributions.

While focusing on the best-case scenario in terms of bias in final estimates, this investigation has also exposed the deficits in each type of sample. The larger errors in the nonprobability samples we investigated were driven primarily by education and race, and secondarily by age (skewing too young), since these “main effect” distributions are quite biased to begin with. Only age (skewing too old) is similarly biased in the telephone samples, and one can attain an essentially self-weighting age distribution in a dual-frame survey by allocating a larger share (70–80 percent) of the sample to cell phones and the remainder to landlines. Given these deficits, it is no surprise that raking the nonprobability samples reduced biases overall. Notably, the overall average estimated absolute bias was reduced more for the nonprobability samples than for the telephone samples. Despite this, the nonprobability samples do not “catch up” to the telephone samples once weighted. In exchange for the reduction in absolute bias, higher UWEs were seen for the raked nonprobability samples, even though the base weights were all one for all but the propensity-weighted sample. In the end, the final average estimated absolute bias among the nonprobability samples, even after adjustment, was 2.5 pps, which certainly might be good enough for a great deal of research, again noting both the concept of fit for purpose and the reality that nonprobability research is now valued in the billions of dollars (Lorenzetti 2014). But in some contexts even this amount of bias may not fall within acceptable limits. While the telephone samples are still biased at twice the rate of the NHIS, the variability of the estimated absolute mean biases for the unweighted probability samples is quite similar between the telephone samples and the NHIS, with a relatively low overall average estimated absolute bias of about 1.5 pps. In short, probability samples appear to carry far less of the “roll the dice” risk noted earlier for nonprobability samples.

One surprising result in our analysis concerns whether sample matching and/or propensity weighting reduce bias beyond what raking can accomplish alone. We found that while raking was a quite effective method of reducing bias for the conditional distribution estimates, combining raking with either of the other two methods (propensity weighting or sample matching) in most cases only matched the performance of raking alone.

One principal difference among these weighting and modeling approaches and samples is the UWEs they produce. If we are to judge nonprobability and probability samples on statistical power, which seems as fair a metric as any alternative, then a nonprobability sample with a simple rake will require a little over twice (2.1 times) the number of completed interviews to attain an effective sample size of 1,000 as the probability dual-frame telephone sample (2,770 versus 1,340). This difference likely still allows nonprobability panels to have a lower cost per effective interview than the dual-frame telephone sample. That said, propensity weighting in our data introduced yet another factor of 2.1 in required sample size relative to the raked nonprobability sample, such that our propensity-weighted nonprobability sample would require roughly four times the size of our probability samples to attain a comparable range of error. Sample matching of nonprobability samples actually attained a slightly lower UWE than the probability telephone samples, but the cost metric here has to take into account the many people recruited to the nonprobability panel but not matched, and thus not utilized for a given study. In our examples, we matched one respondent for every seven panelists in Panel 1, and one respondent for every 12 panelists in Panel 2.
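The arithmetic behind these sample-size comparisons, and the cost-per-effective-interview idea in footnote 12, can be written out directly. In the sketch below, the UWE values 1.34 and 2.77 are the ones implied by the 1,340 and 2,770 figures in the text; the per-interview costs are hypothetical numbers of our own, used only to show how the calculation works.

```python
def required_sample_size(target_effective_n, uwe):
    """Completed interviews needed to reach a target effective sample size."""
    return round(target_effective_n * uwe)

def cost_per_effective_interview(cost_per_interview, uwe):
    """Footnote 12: total cost divided by effective n, which reduces to CPI * UWE."""
    return cost_per_interview * uwe

# UWEs implied by the text: ~1.34 (raked dual-frame telephone) vs. ~2.77 (raked nonprobability panel).
print(required_sample_size(1000, 1.34), required_sample_size(1000, 2.77))   # 1340 2770

# Hypothetical per-interview costs (not from the article), illustrating footnote 12's point
# that a cheap interview can still be costly per effective interview if the UWE is large.
print(cost_per_effective_interview(35.0, 1.34), cost_per_effective_interview(4.0, 2.77))
```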

There are a range of limitations to note regarding our analysis. One might be concerned about differences across the samples that contribute to differences in the estimates but are not due to the nature of the sample itself. For example, there are potential mode effects between samples from Internet panels and telephone samples (Callegaro et al. 2014). However, there is little reason to suspect mode effects for questions as straightforward as age and the other demographics explored here, and the wording of these questions is nearly identical across modes in most cases. Another argument could be made that the comparisons here are confounded by the fact that the Internet panels consist only of panelists who have Internet access. However, the samples from the Internet panels in question strive to represent the full population, and the coverage error they experience for missing respondents without Internet access is part of the total survey error consideration, as integral to our analysis as any other error. In a similar vein, we compared only the cell phone respondents of Telephone 1, a sample with its own coverage problems, to the other samples in this analysis. Similarly, our use of the Telephone 2 sample and the companion nonprobability tracker, both of which were limited to adults aged 18 to 55, might have “stacked the deck” in favor of the nonprobability samples, since such samples typically underrepresent older age cohorts. In the end, however, it likely “helped” the telephone samples too, preventing them from “getting too old” to the same degree.

While our cross-section of nonprobability samples may not represent all such samples from all panel sources, the mechanisms at play in terms of self-selection into a panel should, at least theoretically, be similar and related to some of the bias we see in this study. While self-selection might not explain all of the biases we see in our nonprobability samples, even after adjustments, it is likely a key component of the overall bias observed. As in probability sampling, nonprobability samples can and do vary in size, design, recruitment methods, quality, and overall composition. But unlike probability samples, nonprobability samples often represent a collection of respondents whose path to being in the sample is not entirely trackable (e.g., from where inside the “router” cometh a panelist?). Indeed, demographic factors may be related to self-selection into nonprobability panels in much the same way as they are related to nonresponse in probability-based surveys—but, as our findings suggest, the resulting bias in demographic subdomain distributions, even after controlling for the base demographics, is not completely resolved. Moreover, the “webographic” variables used for Panel 2 do capture some behavioral characteristics thought to be associated with web/nonprobability panel participation. However, including these in propensity weights—either with or without additional calibration—did not completely eliminate the biases.

Going forward, more systematic research is needed to better understand the correlates of self-selection, or the lack thereof, in nonprobability samples. Certainly, a massive body of research explains many correlates, and even models, of response in the context of probability samples across many modes and sampling designs; what is unclear is whether these variables or types of explanatory models can adequately account for or describe self-selection mechanisms. While we have attempted to apply techniques known in the survey literature to account for and compensate for nonresponse, these techniques assume that the underlying framework is one of selection (via probability designs) rather than self-selection. Clearly, there are commonalities between these concepts, but what this study begins to elucidate is that accounting for self-selection via methods developed for nonresponse is not adequate in all cases (see Mercer et al. in this issue). The framework for nonprobability samples is not as constrained and, one might argue, is much more variable or wide open than for probability-based designs. We believe that one should not underestimate the role of the probability framework—even in the context of low response rates—in developing stable estimates, including those that rely on calibration or other adjustment methods to compensate for the lack of perfect response rates.

Many have been quick to say that, given the low response rates currently attained in telephone survey research, the very concept of “probability sampling” is null and void. Our analysis was conceived, in part, as a direct test between low response rate telephone samples and nonprobability samples, and clearly something between the two is different. When charges are levied against research based on probability sample designs with low response rates, it is important to understand that the response rate alone does not determine whether a probability sample yields a representative set of respondents. What is critical is the degree to which nonresponse is systematic as opposed to random. In our view, the research industry still does not understand when nonresponse in probability sample surveys is random or systematic, and specifically where it falls on this continuum. Research that has found little to no bias when comparing high and low response rate studies suggests that nonresponse is perhaps less systematic than many suspect. This does not gainsay the threat posed by nonresponse. But in this paper, low response rate probability-based telephone surveys attained about two and a half times less estimated bias, and required about half the sample size for equivalent statistical power, compared to the nonprobabilistic samples tested. Whether these benefits are worth the cost of the probabilistic approach, we leave to the investigator and her research exigencies.

References

Bankier, Michael D. 1986. “Estimators Based on Several Stratified Samples with Applications to Multiple Frame Surveys.” Journal of the American Statistical Association 81:1074–79.
Biemer, Paul. 2010. “Total Survey Error: Design, Implementation, and Evaluation.” Public Opinion Quarterly 74:817–48.
Brick, J. Michael, Jon Cohen, Sarah Cho, Scott Keeter, Kyley McGeeney, and Nancy Mathiowetz. 2015. “Weighting and Sample Matching Effects for an Online Sample.” Paper presented at the AAPOR Conference, Hollywood, FL, USA.
Buskirk, Trent, and Jonathan Best. 2012. “Venn Diagrams, Probability 101, and Sampling Weights Computed from Dual-Frame Telephone RDD Designs.” JSM Proceedings, Survey Research Methods Section, American Statistical Association, 3696–710.
Buskirk, Trent, and David Dutwin. 2015a. “Probability Samples—Meet Your Match! A Comparison of Two Distance Measures for Linking Nonprobability and Probability Based Samples.” Paper presented at the International Total Survey Error Conference, Baltimore, MD, USA.
———. 2015b. “Selected or Self-Selected? Part 2: Exploring Non-Probability and Probability Samples from Response Propensities to Participant Profiles to Outcome Distributions.” Paper presented at the AAPOR Conference, Hollywood, FL, USA.
Callegaro, Mario, Ana Villar, David Yeager, and Jon Krosnick. 2014. “A Critical Review of Studies Investigating the Quality of Data Obtained with Online Panels Based on Probability and Nonprobability Samples.” Online Panel Research: A Data Quality Perspective, 23–53. doi:10.1002/9781118763520.ch2.
Chang, LinChiat, and Jon A. Krosnick. 2009. “National Surveys via RDD Telephone Interviewing Versus the Internet: Comparing Sample Representativeness and Response Quality.” Public Opinion Quarterly 73:641–78.
Craig, Benjamin M., Ron D. Hays, Simon A. Pickard, David Cella, Dennis A. Revicki, and Bryce B. Reeve. 2013. “Comparison of US Panel Vendors for Online Surveys.” Journal of Medical Internet Research 15:e260.
Dever, Jill, and Bonnie Shook-Sa. 2015. “The Utility of Weighting Methods for Reducing Errors in Opt-In Web Studies.” Paper presented at the International Total Survey Error Conference, Baltimore, MD, USA.
DiSogra, Charles, Stacie Greby, K. P. Srinath, Andrew Burkey, Carla Black, John Sokolowski, Xan Yue, Sarah Ball, and Sara Donahue. 2015. “Matching an Internet Panel Sample of Health Care Personnel to a Probability Sample.” Paper presented at the AAPOR Conference, Hollywood, FL, USA.
Duffy, Bobby, Kate Smith, George Terhanian, and John Bremer. 2005. “Comparing Data from Online and Face-To-Face Surveys.” International Journal of Market Research 47:615–39.
Dutwin, David, and Trent Buskirk. 2015. “Selected or Self-Selected? Part 1: A Comparison of Methods for Reducing the Impact of Self-Selection Biases from Non-Probability Surveys.” Paper presented at the AAPOR Conference, Hollywood, FL, USA.
Gelman, Andrew. 2012. “Statistics in a World Where Nothing Is Random.” December 12. Available at http://andrewgelman.com/2012/12/17/statistics-in-a-world-where-nothing-is-random/.
Groves, Robert M. 2006. “Nonresponse Rates and Nonresponse Bias in Household Surveys.” Public Opinion Quarterly 70:646–75.
Groves, Robert M., and Emilia Peytcheva. 2008. “The Impact of Nonresponse Rates on Nonresponse Bias.” Public Opinion Quarterly 72:167–89.
Hopper, Joe. 2014. “Why Phone Surveys Are Almost Dead.” Versta Research (blog). Available at http://www.verstaresearch.com/blog/why-phone-surveys-are-almost-dead.
Kalton, Graham, and Dallas W. Anderson. 1986. “Sampling Rare Populations.” Journal of the Royal Statistical Society, Series A 149:65–82.
Keeter, Scott, and Courtney Kennedy. 2006. “Gauging the Impact of Growing Nonresponse on Estimates from a National RDD Telephone Survey.” Public Opinion Quarterly 70:759–79.
Keeter, Scott, Carolyn Miller, Andrew Kohut, Robert Groves, and Stanley Presser. 2000. “Consequences of Reducing Nonresponse in a Large National Telephone Survey.” Public Opinion Quarterly 67:125–48.
Kish, Leslie. 1992. “Weighting for Unequal Pi.” Journal of Official Statistics 8:183–200.
Lorenzetti, Laura. 2014. “SurveyMonkey Is Worth $2 Billion After New $250 Million Fundraising Round.” December 15. Available at http://fortune.com/2014/12/15/surveymonkey-is-worth-2-billion-after-latest-fundraising-round.
Malhotra, Neil, and Jon A. Krosnick. 2007. “The Effect of Survey Mode and Sampling on Inferences about Political Attitudes and Behavior: Comparing the 2000 and 2004 ANES to Internet Surveys with Nonprobability Samples.” Political Analysis 15:286–323.
Mercer, Andrew, Frauke Kreuter, Scott Keeter, and Elizabeth Stuart. 2017. “Theory and Practice in Nonprobability Surveys: Parallels Between Causal Inference and Survey Inference.” Public Opinion Quarterly.
Peters, Kurt R., Heather Driscoll, and Pedro Saavedra. 2015. “Evaluating a Propensity Score Adjustment for Combining Probability and Non-Probability Samples in a National Survey.” Paper presented at the AAPOR Conference, Hollywood, FL, USA.
Pew Research Center. 2012. “Assessing the Representativeness of Public Opinion Surveys.” May 15. Available at http://www.people-press.org/2012/05/15/assessing-the-representativeness-of-public-opinion-surveys.
Petit, Annie. 2015. Peanut Labs presentation at AAPOR. Available at http://web.peanutlabs.com.
Rivers, Doug. 2007. “Sampling for Web Surveys.” Paper presented at the Proceedings of the Joint Statistical Meetings, Salt Lake City, UT, USA.
Rivers, Doug, and Delia Bailey. 2009. “Inference from Matched Samples in the 2008 U.S. National Elections.” Paper presented at the Proceedings of the Joint Statistical Meetings, Washington, DC, USA.
Rosenbaum, Paul R. 1987. “Model Based Direct Adjustment.” Journal of the American Statistical Association 82:387–94.
Rosenbaum, Paul R., and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70:41–55.
Schonlau, Matthias, Arthur van Soest, and Arie Kapteyn. 2007. “Are ‘Webographic’ or Attitudinal Questions Useful for Adjusting Estimates from Web Surveys Using Propensity Scoring?” RAND Working Paper WR-506.
Schonlau, Matthias, Kinga Zapert, Lisa Payne Simon, Katherine Sanstad, Sue Marcus, John Adams, Hongjun Kan, Rachel Turner, and Sandra Berry. 2003. “A Comparison Between Responses from a Propensity-Weighted Web Survey and an Identical RDD Survey.” Social Science Computer Review 21:1–11.
Tourangeau, Roger, Frederick G. Conrad, and Mick P. Couper. 2013. The Science of Web Surveys. New York: Oxford University Press.
Walker, Robert, Raymond Pettit, and Joel Rubinson. 2009. “A Special Report from the Advertising Research Foundation: The Foundations of Quality Initiative: A Five-Part Immersion into the Quality of Online Research.” Journal of Advertising Research 49:464–85.
Williams, Jo. 2012. “Survey Methods in an Age of Austerity: Driving Value in Survey Design.” International Journal of Market Research 54:35–47.
Yeager, David S., Jon A. Krosnick, LinChiat Chang, Harold S. Javitz, Matthew S. Levendusky, Alberto Simpser, and Rui Wang. 2011. “Comparing the Accuracy of RDD Telephone Surveys and Internet Surveys Conducted with Probability and Non-Probability Samples.” Public Opinion Quarterly 75:709–47.
ZuWallack, Randal, James Dayton, Naomi Freedner-Maguire, Katherine J. Karriker-Jaffe, and Thomas K. Greenfield. 2015. “Combining a Probability Based Telephone Sample with an Opt-In Web Panel.” Paper presented at the AAPOR Conference, Hollywood, FL, USA.

1

The unequal weighting effect (or the design effect due to unequal weighting [Kish 1992]) is the component of the design effect that is attributable to the variation in the sampling weights and is not dependent on any particular survey outcome.

2

Administered by Luth.

3

The SSRS omnibus.

4

SportsPoll.

5

Administered by Research Now.

6

As is typical in weighting, the benchmarks lag the principal year of the data collection, since benchmarks are usually not available at the time a survey is completed. Given the longer time frame of publication, however, we were also able to replicate the entire study using 2013 benchmarks. Results are highly similar, with no directional changes but some variation in the size of effects.

7

If we consider the 4-by-4 cross-tabulation of Region by Race, for example, there are four row percentages computed within each of the four levels of Region that describe the distribution of Race within Region (16 in all). Similarly, there are 16 column percentages that describe the distribution of Region within Race. Thus, this 4-by-4 table alone contributes 32 comparisons.
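As a small illustration of this counting (our own sketch, with made-up counts), the function below computes row and column percentages for a cross-tabulation; each resulting cell is one comparison against the corresponding ACS benchmark.

```python
import numpy as np

def row_col_percentages(counts):
    """Row and column percentages from a cross-tabulation of (weighted) counts.

    counts: 2-D array, e.g., a 4 (Region) x 4 (Race) table.  Row percentages give
    the distribution of the column variable within each row, and vice versa; each
    cell is one comparison against the corresponding ACS benchmark value.
    """
    counts = np.asarray(counts, dtype=float)
    row_pcts = 100 * counts / counts.sum(axis=1, keepdims=True)
    col_pcts = 100 * counts / counts.sum(axis=0, keepdims=True)
    return row_pcts, col_pcts

# Made-up 4x4 table of weighted counts, just to confirm the number of comparisons.
rows, cols = row_col_percentages(np.arange(1, 17).reshape(4, 4))
print(rows.size + cols.size)  # 32, as described above
```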

8

Panel 1 matched thus contains 5,010 “matched” cases from the larger Panel 1 sample, and similarly, Panel 2 contains 5,013 “matched” cases from the larger Panel 2 sample. In both cases, we chose the 3.5 percent simple random sample to use as the basis of the match in order to allow the ratio of available nonprobability to probability cases to be large enough to ensure an adequate number of possible matches obtained from the nonprobability samples. This decision was related to the “common support” conditions for matching across sample types (Rivers 2007; Rivers and Bailey 2009).

9

To determine a match for each probability case, a distance was computed between the probability case and all available nonprobability cases using the simple matching coefficient, which computes the proportion of the eight variables on which two cases have different values. The matched case was the nonprobability case with the smallest distance to the probability case. Ties were broken randomly, and matches were processed sequentially after randomly reordering the probability sample.
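The sketch below is one way to implement this matching step. It assumes matching without replacement (our reading of "all available nonprobability cases"); the arrays in the example are made up, and the function is an illustration rather than the authors' actual matching code.

```python
import numpy as np

def match_samples(prob_cases, nonprob_cases, rng=None):
    """Sequential nearest-neighbor matching using the proportion-of-disagreements distance.

    prob_cases, nonprob_cases: 2-D integer arrays of categorical codes, with one
    column per matching variable (eight in the article).  Probability cases are
    processed in random order; for each one, the distance to every still-available
    nonprobability case is the proportion of variables on which the two cases
    differ, the closest case is taken (ties broken at random) and then removed
    from the pool.  Matching without replacement is our assumption here.
    Returns, for each probability case, the index of its matched nonprobability case.
    """
    rng = rng or np.random.default_rng()
    prob_cases = np.asarray(prob_cases)
    nonprob_cases = np.asarray(nonprob_cases)
    available = np.ones(len(nonprob_cases), dtype=bool)
    matches = np.empty(len(prob_cases), dtype=int)
    for i in rng.permutation(len(prob_cases)):
        dist = (nonprob_cases != prob_cases[i]).mean(axis=1)  # proportion of differing variables
        dist[~available] = np.inf                             # already-used cases are unavailable
        best = np.flatnonzero(dist == dist.min())
        chosen = rng.choice(best)                             # break ties randomly
        matches[i] = chosen
        available[chosen] = False
    return matches

# Tiny made-up example: 3 probability cases matched into a pool of 5 panel cases.
rng = np.random.default_rng(1)
print(match_samples(rng.integers(0, 3, (3, 8)), rng.integers(0, 3, (5, 8)), rng))
```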

10

We did not trim or stratify the propensity weight, since our goal was to create a propensity weight that would reduce as much bias as possible.
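For readers unfamiliar with the general approach, the sketch below shows one common propensity-weighting recipe: stack a reference (probability) sample with the panel, model membership in the reference sample, and weight panel cases by the estimated odds of reference-sample membership, with no trimming or stratification. This is our own illustrative formulation, not necessarily the exact specification used for Panel 2, and the column names in the usage comment are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_weights(reference_df, panel_df, covariates):
    """One common propensity-weighting recipe (a sketch, not the authors' exact specification).

    Stack a reference (probability) sample with the nonprobability panel, model
    membership in the reference sample from demographic and 'webographic'
    covariates, and weight each panel case by the odds p / (1 - p) that a case
    with its covariates belongs to the reference sample.  No trimming or
    stratification is applied, mirroring footnote 10.
    """
    stacked = pd.concat([reference_df[covariates], panel_df[covariates]], ignore_index=True)
    in_reference = np.r_[np.ones(len(reference_df)), np.zeros(len(panel_df))]
    X = pd.get_dummies(stacked, drop_first=True)                # dummy-code the categorical covariates
    model = LogisticRegression(max_iter=1000).fit(X, in_reference)
    p = model.predict_proba(X.iloc[len(reference_df):])[:, 1]   # P(reference sample | covariates), panel cases
    w = p / (1 - p)
    return w * len(panel_df) / w.sum()                          # normalize the panel weights to mean 1

# Hypothetical usage; the column names are ours, not the article's:
# panel_df["prop_wt"] = propensity_weights(
#     telephone_df, panel_df,
#     ["age_group", "education", "race", "region", "tries_new_products"],
# )
```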

11

The tracking studies (Panel 2 and Telephone 2) focused on 18–54-year-old adults, and as such truncated the age target variable at the third break.

12

For example, if a given sample/weighting approach is inexpensive to field but has a very large UWE, then the true cost of the data it yields is based not on the cost per interview but on the cost per effective interview, that is, the total cost of all interviews divided by the effective sample size.

13

In terms of real percentages, there are also notable differences. As can often happen in raking, cells that are underrepresented in the raw data can become overcorrected once weighted. For example, while 12 percent of African Americans in Panel 1 report not having a high school diploma (ACS = 16 percent), once weighted this cell balloons to 21 percent. By comparison, Telephone 1 rose from 13 percent unweighted to hit the benchmark of 16 percent once weighted. This is not to say that there were no examples of the weighted telephone samples doing poorly. As might be expected, the telephone samples overrepresent older Americans, and when weighted this can result in imprecision in the estimates of race/ethnicity. So, whereas 12 percent of Hispanics in Telephone 1 reported being age 65 and older (compared to 13 percent in the ACS), because raking down-weights all persons age 65 and older in the telephone samples, this weighted estimate drops to 6 percent.

Appendix 1. Questions Used in the Analysis

Age

Telephone 1 and Panel 1: What is your age?

Telephone 2: In order for us to ensure representation from all age groups, let me ask you what year you were born?

Panel 2: In order for us to ensure representation from all age groups, let me ask you what year you were born?

NHIS: What is your date of birth?

Education

Telephone 1 and Panel 1: What is the last grade of school you completed?

Telephone 2 and Panel 2: What is the last grade or level of school you completed?

What is the last grade or class that you completed in school?

NHIS: What is the highest level of school you have completed or the highest degree you have received?

Race

Telephone 1, Panel 1, and Telephone 2: Do you consider yourself white, black or African American, Asian, Native American, Pacific Islander, mixed race, or some other race?

NHIS: What race or races do you consider yourself to be? Please select 1 or more of these categories. White, Black/African American, Indian (American), Alaska Native, Native Hawaiian, Guamanian or Chamorro, Samoan, Other Pacific Islander, Asian Indian, Chinese, Filipino, Japanese, Korean, Vietnamese, Other Asian, some other race.

Ethnicity

Telephone 1 and Panel 1: Are you of Hispanic origin or background?

Telephone 2 and Panel 2: Are you of Hispanic origin or not?

NHIS: Do you consider yourself to be Hispanic or Latino?

Region

Based on self-reported zip code in all data sources.

Webographic questions

Please tell me how much you personally agree or disagree with each of the following statements.

  4 Agree Strongly

  3 Agree Somewhat

  2 Disagree Somewhat

  1 Disagree Strongly

  8 (DO NOT READ) Don’t Know

  9 (DO NOT READ) Refused

 a. I usually try new products before other people do

 b. I often try new brands because I like variety and get bored with the same old thing

 c. When I shop I look for what is new

 d. I like to be the first among my friends and family to try something new

 e. I like to tell others about new brands or technology

 f. Once I find a product I like I tend to stick with it

 g. I am always looking for deals and discounts

Author notes

*Address correspondence to David Dutwin, SSRS, 53 West Baltimore Pike, Media, PA 19063; e-mail: [email protected].