Recapture or Precapture? Fallibility of Standard Capture-Recapture Methods in the Presence of Referrals Between Sources

Capture-recapture methods, largely developed in ecology, are now commonly used in epidemiology to adjust for incomplete registries and to estimate the size of difficult-to-reach populations such as problem drug users. Overlapping lists of individuals in the target population, taken from administrative data sources, are considered analogous to overlapping “captures” of animals. Log-linear models, incorporating interaction terms to account for dependencies between sources, are used to predict the number of unobserved individuals and, hence, the total population size. A standard assumption to ensure parameter identifiability is that the highest-order interaction term is 0. We demonstrate that, when individuals are referred directly between sources, this assumption will often be violated, and the standard modeling approach may lead to seriously biased estimates. We refer to such individuals as having been “precaptured,” rather than truly recaptured. Although sometimes an alternative identifiable log-linear model could accommodate the referral structure, this will not always be the case. Further, multiple plausible models may fit the data equally well but provide widely varying estimates of the population size. We demonstrate an alternative modeling approach, based on an interpretable parameterization and driven by careful consideration of the relationships between the sources, and we make recommendations for capture-recapture in practice.

accounted for by using measured covariates. An early suggestion to merge dependent data sources (33) has now been subsumed within the more general approach of incorporation of source-by-source interaction terms in a log-linear model (2,29,34).
We will demonstrate that this standard approach is often inadequate in the presence of referrals of individuals between sources, a mechanism we believe to be very common in CRC studies of human populations. When a proportion of individuals "captured" in 1 source is referred to another, these individuals can be thought of as having been "precaptured" rather than truly "recaptured." Because they have simply been passed from 1 source to another, they cannot be informative about the population size. Whereas more general source dependencies in epidemiologic CRC have their counterpart in ecological CRC (broadly comparable to "trap fascination" or "trap avoidance" (4)), direct referrals would be analogous to catching a sample of fish in a net, setting some proportion of these fish aside, and letting them contribute to the second sample.
The critical issues are those of model selection and parameter identification. Often, referrals will not be the only mechanism creating between-source dependencies, so that the full dependency structure might be quite complex. In a CRC analysis of S data sources, inclusion of all possible interaction terms in a log-linear model would require 2 S parameters-1 more than the number of observations (29). It is therefore necessary to enforce some constraint, such as assuming that at least 1 of the possible interaction terms is 0. The usual approach is to consider only models with no S-way interaction (2). This is motivated by the view that higher-order interactions are unlikely in the absence of their lower-order relatives, sometimes referred to as the "hierarchy principle" (4,29). However, Cormack (35) describes the no S-way interaction assumption as an "act of faith." Variable catchability (heterogeneity) and migration are 2 phenomena that have been recognized to induce interactions of an order higher than 1 (1,36).
Using the 3-source case as an example, we will show that between-source referrals will often correspond to a 3-way interaction term in the log-linear model. Sometimes an alternative saturated log-linear model would adequately accommodate the referral structure. However, alternative saturated models can lead to widely different estimates of the population size and, because all fit the data perfectly by definition, it is impossible to choose between them without expert knowledge of the interrelationships between the sources. Further, in slightly more complex scenarios, we will show that the true model structure cannot be represented by any identifiable log-linear model. Lists of PDUs identified in each of the following 4 administrative data sources during that time period were obtained and linked: "treatment in the community," "arrest for possession," "probation," and "prison." For ease of exposition, we discard the prison source but return to discuss this later. We also aggregate over the available covariate values (age group, sex, and geographical area). For a description of the data sources and the process used for matching individuals, see the article by Hay et al. (37).
The aggregated data are shown in Table 1. We refer to the sources as S1, S2, and S3, and we denote the observed number of individuals in each cell as x ijk , where the indices indicate presence (subscript = 1) or absence (subscript = 0) in each of the 3 sources. For example, x 101 denotes the number of individuals observed in S1 and S3 but not in S2. On occasion, we will write Su = 1 and Su = 0 to represent presence or absence in Source u (u = 1,2,3).
Clearly, we would expect some dependencies among these 3 data sources. In particular, we would anticipate a positive dependency between arrest and probation, because we would expect individuals committing 1 crime to be more likely to commit another. A negative dependency between treatment and the 2 criminal justice system sources is also possible, although these might be independent. The dependency structure is further complicated by the presence of referrals between sources. Specifically, individuals arrested for drug possession may be put on probation, and referrals to drug treatment are made from each of the criminal justice system sources.
Using the standard log-linear modeling framework, if we adopt the no S-way interaction rule, there are only 8 possible models to consider. These are indicated in Table 2, where we denote, for example, an S1 by S2 interaction by "S1xS2". Model 1, which we refer to as the base case or simply "base," assumes independence of the 3 sources. Alternatively, we can include 1 (models 2-4), 2 (models 5-7), or all 3 (model 8) source-by-source interaction terms. Model 8 is a saturated model, because it involves fitting 7 parameters to the 7 observed data points. In addition, we present results from 3 alternative saturated log-linear models (models 9-11), which violate the no S-way interaction rule and therefore are not usually considered in CRC applications. Each involves excluding 1 of the 2-way interaction terms in favor of accommodating a 3-way interaction.
As saturated models, models 8-11 will all fit the data perfectly, but they will provide different extrapolations to the missing cell. It can be shown that the maximum likelihood estimators for the missing cell based on models 9-11 are in fact identical to those from models 5-7. For example, models 5 and 9 both estimate x 000 by x 010 x 001 /x 011. This arises from both models assuming independence between S2 and S3 in the S1 = 0 cells. In estimating x 000 , all 4 S1 = 1 cells are ignored. However, model 9, unlike model 5, does not enforce the assumption of independence between S2 and S3 among the S1 = 1 cells, and hence will result in better model fit (at the cost of an additional parameter).
Model selection in epidemiologic CRC is often carried out by minimizing some information criterion such as the Akaike information criterion (AIC) (1,29). Alternatively, the simplest model is selected that fits the data according to a likelihood ratio test statistic G 2 , referred to a χ 2 distribution (1,29). In Table 2 we show the AIC and G 2 statistics for each of the 11 models, in addition to estimates of the population size. Estimates from models incorporating the available covariates were similar (not shown). We adopt the most common definition of the AIC (38,39), but note that an alternative version is popular in the CRC literature. The 2 definitions differ only in terms of a constant; because AICs should be interpreted only relative to each other rather than in absolute terms, the distinction is not important. Regression models were fitted using Stata, version 13, statistical software (StataCorp LP, College Station, Texas), assuming a Poisson likelihood. We coded all models such that the presence or absence in a source was indicated by 1 or 0, respectively. Other parameterizations can be used, but care must be taken to interpret the interaction models accordingly. Table 2 shows that the 4 saturated models jointly have by far the lowest AIC. The G 2 statistic is seen to be very large relative to its degrees of freedom for all other models, indicating that none of these provides an adequate fit to the data. The 4 saturated models are seen to imply widely different estimates of the total population size N, ranging from 288,000 to 927,000 and with 95% confidence intervals not overlapping.
Following the no S-way interaction rule, models 9-11 would not be considered, so model 8 would be selected. However, this is seen to provide an estimate of 927,000 (95% confidence interval (CI): 840,700, 1,024,800) PDUs in England, which is more than 3 times the magnitude of the previously published estimate (27) and would imply that 2.7% of adults aged 15-64 years were heroin or crack cocaine users. This is unsupported by other evidence (such as surveys of the proportion of people in treatment and the number of drug-related deaths) and is considered infeasible by experts. Perhaps the prevalence estimate from 1 of the other saturated models is more accurate, but there is no obvious reason to select 1 of these over the others. Moreover, we will demonstrate below that it is perfectly plausible that none of these 4 saturated models provides an accurate estimate of the population size, even in the absence of realistic additional complications, such as variable catchability or errors made in the matching process.
If the "prison" source is also included then, following the hierarchy principle, there are 113 possible log-linear models (2). The best fitting of these was found to be the saturated model with six 2-way and four 3-way interactions between sources. This model gave a still more implausible estimate of 1,614,000 (95% CI: 1,167,900, 2,265,000) PDUs in England. Again, alternative saturated models provided very different estimates. The methodological problems are therefore clearly not confined to 3-source models.

BIAS DUE TO BETWEEN-SOURCE REFERRAL
To assess whether models incorporating referrals between sources can be fitted within the standard log-linear framework, we compare the form of the expected cell counts under various scenarios, focusing again on the 3-source case for simplicity. We denote the conditional probability of an individual appearing in source u given presence or absence in source v, before any referrals take place, by p u|Sv = 1 and p u|Sv = 0 , respectively. When p u|Sv = 1 = p u|Sv = 0 , we simply write p u .
We consider 3 simple referral scenarios as examples. The expected cell counts under each of these, before and after referrals, are shown in Table 3. In the first 2 scenarios, there are assumed to be no "standard" interactions (i.e., if it were not for the referrals, then the data sources would be independent). In referral scenario 1, a proportion q 13 of individuals in S1 are referred to S3. This occurs independently of whether they were seen in S2 or would otherwise have been seen in S3 anyway. From the postreferral expected cell counts, it can be seen that this is equivalent to a standard interaction between S1 and S3, with p 3|S1=1 = p 3 + q 13 (1 − p 3 ) and p 3|S1=0 = p 3 . Hence, this very simple scenario corresponds to model 3 in Table 2.
However if, in addition, a proportion q 23 of individuals were independently referred from S2 into S3, then the expected cell counts would be as shown in the second part of a For example, the row with S1 = 1, S2 = 1, and S3 = 0 corresponds to the expected number of individuals observed in source 1 and source 2 but not in source 3 (the expectation of x 110 ).
b No standard interactions, but referral of a proportion q 13 of individuals from source 1 into source 3. (Prereferral: base. Postreferral: base + referrals from S1 into S3). c No standard interactions, but referral of a proportion q 13 of individuals from source 1 into source 3 and, independently, a proportion q 23 of individuals from source 2 into source 3. (Prereferral: base. Postreferral: base + referrals from S1 into S3 + referrals from S2 into S3). d Source 1 by source 2 and source 2 by source 3 interactions, plus referral of a proportion q 13 of individuals from source 1 into source 3. (Prereferral: base + S1xS2 + S2xS3. Postreferral: base + S1xS2 + S2xS3 + referrals from S1 into S3). Table 3 (referral scenario 2). As we would expect, the probability of an individual appearing in S3 now depends on appearance or otherwise in each of S1 and S2. However, we show in Appendix 1 that the corresponding log-linear model is not model 7 (base + S1xS3 + S2xS3) but actually model 11 (base + S1xS3 + S2xS3 + S1xS2xS3), violating the hierarchy principle. Specifically, this means that the referrals have induced a relationship between sources 1 and 2 among individuals observed in source 3, but not among individuals not seen in source 3. Clearly, even a simple scenario involving referrals can correspond to a 3-way interaction term in the log-linear model, so that none of the 8 standard models is appropriate.
In the final section of Table 3, we consider the scenario of 2 standard interactions (S1xS2 and S2xS3) followed by referral of a proportion q 31 of individuals from S3 into S1. We show in Appendix 1 that the corresponding log-linear model requires both a 3-way interaction term and also an S1xS3 interaction term. With only 7 observed data points, it is not possible to fit the appropriate log-linear model, which would need to include 4 interaction terms, and therefore 8 parameters in total.
In summary, in a 3-source CRC analysis, some simple hypothetical referral scenarios can be parameterized as loglinear models with only 2-way interaction terms, whereas others induce a 3-way interaction. Sometimes, for example in our referral scenario 3, it will not be possible to accommodate the correct data structure using any identifiable loglinear model.

ARTIFICIAL 3-SOURCE DATA SETS
We now consider 2 artificial data sets in order to demonstrate that standard methods can be very misleading. The 2 data sets, displayed in Table 4, were simulated from multinomial models under referral scenarios 2 and 3. Refer to the table's legend for full details. In both cases, we treat the cell count x 000 as missing and estimate it using each of the 11 log-linear models. Population size estimates and model fit statistics are shown in Tables 5 and 6.
As with our real data, we see that the 4 saturated models (models 8-11) jointly have by far the lowest AIC statistics. All other models have very large G 2 statistics relative to their degrees of freedom, indicating inadequate fit to the data. Once again, of the saturated models, the default choice would be model 8 because of the no S-way interaction rule. For the first data set, the resulting estimate of the missing cell is 60,400 (95% CI: 56,900, 64,200), with the 95% confidence interval not incorporating the true value of 86,600 (Table 4). For the second data set, the estimate of the missing cell is 163,700 (95% CI: 153,800, 174,100), which is much greater than the true value (Table 4) of only 57,100. For each data set, the alternative saturated models generate widely differing estimates of the hidden population size.
For artificial data set 1, we know from the section "Bias due to between-source referral" that the appropriate log-linear model is model 11, and we see in Table 5 that this does indeed recover the true value accurately, estimating the missing cell count to be 85,500 (95% CI: 82,000, 89,100). However, without expert knowledge of the relationships among the 3 data sources and the mathematical analysis above, it would be impossible to choose among models 8-11 if all were considered. As noted above, model 7 also gives the correct maximum likelihood estimate. However, this model fits the observed data poorly and, as such, would not be selected by analysts in practice.
We showed in the section titled "Bias due to between-source referral" and in Appendix 1 that none of the 11 log-linear models corresponds to the truth for the model used to generate artificial data set 2. In fact, none of the 4 saturated models in Table 6 produces a 95% confidence interval that includes the true value. If, however, we knew the underlying structure of the data, then it would be possible to obtain an unbiased estimate by formulating a model directly in terms of meaningful parameters ( probabilities and proportions) of the type in Table 3. As an example, in Appendix 2 we present WinBUGS (40) code for fitting the correct model to artificial data set 2.
Here we assume uninformative uniform(0,1) prior distributions for each of the p and q parameters and a vague half-normal distribution for the log of the total population size. Fitting this model, the resulting estimate of the unseen population size was 56,900 (95% credible interval: 55,000, 58,800), implying a total population size of 199,800 (95% credible interval: 197,900, 201,700), which is extremely close to the true value of 200,000. Note that, because this "correct" model is a For example, the row with S1 = 1, S2 = 1, and S3 = 0 shows x 110 , the number of individuals observed in source 1 and source 2 but not in source 3. b Simulated from referral scenario 2 in Table 3 (no standard interactions, but referral of a proportion q 13 of individuals from source 1 into source 3 and, independently, a proportion q 23 of individuals from source 2 into source 3), assuming the following parameter values: N = 200,000, p 1 = 0.4, p 2 = 0.1, p 3 = 0.2, q 13 = 0.2, and q 23 = 0.3 (N = total population size, and p u = prereferral probability of appearing in source u).
c Simulated from referral scenario 3 in Table 3 (standard source 1  1 of many possible saturated models (others including models 8-11 in Table 6), we would not possibly know to select it without external knowledge of the data structure. DISCUSSION We have demonstrated that, in the presence of referral mechanisms between sources, the standard log-linear modeling approach to CRC is too restrictive and can lead to grossly misleading prevalence estimates. The absolute bias will be largest when the proportion of the population observed is low. Model averaging (41) is not an appropriate solution if none of the models considered is "correct" and/or if only saturated models fit the data, because potentially widely varying estimates from different saturated models would be assigned equal weight. Abbreviations: AIC, Akaike information criterion; CI, confidence interval. a All population size estimates have been rounded to the nearest 100. b Estimates of the missing cell count should be compared with the true value of 86,595 (Table 4). c This is simply the estimated missing cell count plus the observed number of 113,405 (Table 4). Estimates should be compared with the true value of 200,000. d G 2 = the likelihood ratio test statistic, displayed with the degrees of freedom for the corresponding χ 2 test (P values are omitted because, for all nonsaturated models, P < 0.001). Abbreviations: AIC, Akaike information criterion; CI, confidence interval. a All population size estimates have been rounded to the nearest 100. b Estimates of the missing cell count should be compared with the true value of 57,114 (Table 4). c This is simply the estimated missing cell count plus the observed number of 142,886 (Table 4). Estimates should be compared with the true value of 200,000. d G 2 = the likelihood ratio test statistic, displayed with the degrees of freedom for the corresponding χ 2 test (P values are omitted because, for all nonsaturated models, P < 0.001).
As early as 1968, Wittes and Sidel (42) noted cause for concern if it is suspected that information from 1 CRC source has been obtained directly from another. This was before the advent of log-linear models. Inclusion of interaction terms in a log-linear model is now the standard methodology for accounting for source dependencies (2,29), with the highestorder interaction term assumed to equal 0. We are not the first to point out that this approach is not infallible (35,36). In particular, Hook and Regal (2) have suggested a need for caution when the standard saturated model is selected by some criterion such as the AIC. However, the fact that alternative saturated models can provide such dramatically different estimates of the population size does not appear to have been recognized. For example, Regal and Hook (36) demonstrated an application with likely violation of the "no S-way interaction" rule, but the inferred interaction was small, so that ignoring it was found to have little effect on the estimate of the population size.
We highlight precapture as a particular source of concern because we believe that many CRC studies in epidemiology are likely to involve referrals between sources and might therefore be subject to bias. Policies that seek to move drug users identified in the criminal justice system into treatment (43) complicate attempts to estimate the prevalence of drug addiction using CRC. More generally, applications of CRC in epidemiology frequently combine reports from clinical and laboratory sources, from primary care and secondary care settings, or from different medical specialists (44). It seems almost inevitable that some referrals will occur between such sources. For example, Van Hest et al. (45) used infectious disease notifications, laboratory reports, and data on hospitalizations to estimate tuberculosis prevalence, but they noted that "[tuberculosis] services in England are organized around close collaboration between clinicians, microbiologists and public health professionals." Similarly, CRC has been used to estimate the incidence of congenital cataract by combining information from pediatricians and pediatric ophthalmologists (46). However, cataract is often a manifestation of a wider dysmorphic syndrome. Good practice dictates that children who present first to pediatricians will be referred to ophthalmologists, whereas those presenting first to ophthalmologists are referred to pediatricians to be investigated for other problems associated with these syndromes (47).
Again, it seems likely that there are referrals among hospitalizations, outpatient visits, and nursing home admissions, which are 3 of the sources used by Turabelidze et al. (48) in estimating multiple sclerosis prevalence. Papoz et al. (49) noted that a positive dependency between a list of patients from physicians' practices and a list of patients given prescriptions is "hardly surprising, since the physician is the prescriber." The tendency for 2 malaria reporting systems to alert each other was observed by Cathcart et al. (50), and Wittes (32) discusses a potential scenario in which some names are actually copied from 1 list to another.
We have demonstrated that, in certain circumstances, an alternative saturated log-linear model might successfully account for referrals. Models 9-11 are based on the assumption of independence of 2 of the sources in the subset of individuals not seen in the third, but on dependence in the subset appearing in the third. Use of these models requires a priori justification of why this might be more plausible for the specific data set than the assumption of no 3-way interaction. More generally, in agreement with Regal and Hook (36) and Cormack (51), we encourage model choice to be guided by an a priori in-depth consideration of the likely relationships among the data sources, guided by discussions with relevant experts, rather than by blind application of some set of statistical criteria. In particular, we urge analysts to explicitly consider whether referrals between data sources are likely. This might be an additional rule to add to the list of recommendations of Regal and Hook (52,53). If no referrals are present, then standard log-linear methods can be used, although it might be helpful to instead write models directly in terms of meaningful parameters. This facilitates assessment of the plausibility of the estimated parameters, to be used alongside standard model fit statistics to assess model adequacy.
If referrals are believed to be present, then expected cell counts should be formulated in terms of probabilities and proportions, as in Table 3. We have demonstrated that a model parameterized in this way can be fitted using the Bayesian software WinBUGS, but we note that it might also be possible to fit this model using a maximum likelihood approach (36). Our software choice was primarily for pragmatic reasons: WinBUGS automatically provides a credible interval for the population size in addition to a point estimate and makes it simple to enforce constraints such that probability parameters lie between 0 and 1.
For our motivating example, we are not confident that any of the alternative saturated log-linear models ( Table 2) is "correct." For example, the structure assumed by model 9 allows for referrals from both arrest and probation into treatment, but it does not allow for the "standard" interactions that we also expect between these data sources, nor for referrals from arrest into probation. The true expected dependence structure is, unfortunately, too complex for all parameters to be identifiable without additional information. The best advice to those planning a CRC is to avoid this difficult problem by careful selection of data sources, such that complex dependence structures are avoided. If this is impossible, then we encourage the collection of auxiliary information at the point of data matching, such as referral source if known, and/or the specific dates at which individuals appeared in each source. Using this information, it may be possible to quantify the probability that those in overlap sets have been referred, and hence to "remove" likely precaptures prior to analysis.
An alternative solution might be to seek external information to inform the referral parameters. Here, a Bayesian approach could be particularly useful because it offers scope for the incorporation of such external information as informative prior distributions. The external information required need not inform the referral proportions directly; external data can be used to inform a function of parameters within the model (54). However, a great degree of care is needed, because the information must reflect as closely as possible the way in which the specific data were collected, in particular the length of the time window. More generally, a Bayesian framework can be used to incorporate formally other sources of information on prevalence, such as data on drug-related deaths (55). A "multiparameter evidence synthesis" framework (56,57), in which multiple sources of data are incorporated into a single coherent model, offers potential for formal assessment of the consistency of evidence from different sources. Careful modeling and accounting for known biases of course remain essential.