Abstract

Understanding factors that contribute to the increased likelihood of pathogen transmission between two individuals is important for infection control. However, analyzing measures of pathogen relatedness to estimate these associations is complicated due to correlation arising from the presence of the same individual across multiple dyadic outcomes, potential spatial correlation caused by unmeasured transmission dynamics, and the distinctive distributional characteristics of some of the outcomes. We develop two novel hierarchical Bayesian spatial methods for analyzing dyadic pathogen genetic relatedness data, in the form of patristic distances and transmission probabilities, that simultaneously address each of these complications. Using individual-level spatially correlated random effect parameters, we account for multiple sources of correlation between the outcomes as well as other important features of their distribution. Through simulation, we show the limitations of existing approaches in terms of estimating key associations of interest, and the ability of the new methodology to correct for these issues across datasets with different levels of correlation. All methods are applied to Mycobacterium tuberculosis data from the Republic of Moldova, where we identify previously unknown factors associated with disease transmission and, through analysis of the random effect parameters, key individuals, and areas with increased transmission activity. Model comparisons show the importance of the new methodology in this setting. The methods are implemented in the R package GenePair.

1. Introduction

The availability and affordability of whole-genome sequencing technology has made the use of genomic data increasingly common in epidemiological modeling studies (Loman & Pallen, 2015; Polonsky et al., 2019). Sequencing data from bacteriological pathogens provide insight into the level of similarity between the pathogens (e.g., Mitchell et al. (2011)) and the spread of infectious diseases through populations, and numerous methods have been developed to infer individual-to-individual transmission. These methods yield dyadic outcomes describing genetic relatedeness between pathogens for each pair of individuals. The simplest approach uses the difference of single nucleotide polymorphisms (SNPs) between two samples and assigns any pairs separated by fewer than a threshold number of SNPs to a transmission cluster (Stimson et al., 2019). Other approaches make use of phylogenetic trees, a reconstruction of the estimated evolutionary history of a pathogen based on genetic sequencing data (Delsuc et al., 2005). For example, patristic distances, the summed length of the tree branches separating two isolates, is a similar measure of genetic distance, but this approach leverages prior biological assumptions about the substitution rate to generate a more accurate measure than SNP distance (Poon, 2016). Many other advanced approaches exist, including those that make use of transmission trees, based on phylogenetic analyses, to estimate the asymmetric probability of direct transmission between pairs of cases (Didelot et al., 2017; Klinkenberg et al., 2017; Campbell et al., 2019), and a likelihood-based approach that incorporates information from a third individual when estimating genetic relatedness between a pair of individuals (Wang, 2007).

Understanding how potentially modifiable factors contribute to the increased likelihood of disease transmission between two individuals is important for infection control. However, while a robust set of methods exists to estimate novel measures of genetic relatedness between case pairs, advanced statistical methods to analyze these dyadic data and their associations with other factors may require extensions due to the unique features of the outcomes (e.g., spatial correlation, distributional characteristics).

Fixed effect regression-based approaches that do not account for correlation between the dyadic outcomes may have serious inferential limitations. For example, we may expect correlation between dyadic outcomes formula and formula, describing the genetic relatedness between individuals formula and formula, respectively, because individual j is present in both pairs; particularly, if individual j is a major driver of transmission. Ignoring this positive correlation during analysis can result in overly optimistic measures of uncertainty for regression parameter estimates, leading to incorrect conclusions regarding statistical significance (Hoff, 2003).

Several different analysis frameworks have been developed for modeling network dependence in dyadic data (Kenny et al., 2020). Random effect regression models have been widely used in this setting due to their ability to flexibly characterize the correlation, ability to handle different outcome types, and relative ease in making statistical inference (Warner et al., 1979; Wong, 1982; Gill & Swartz, 2001; Hoff et al., 2002; Hoff, 2003, 2005, 2021). However, current models often ignore other important sources of correlation that may be unique to the infectious disease setting (e.g., spatial correlation due to unmeasured transmission dynamics) and/or are limited in their ability to describe non-standard distributions that may be seen when analyzing novel genetic relatedness outcomes (e.g., zero-inflation), with a few exceptions. Beck et al. (2006) and Neumayer and Plümper (2010) discussed the use of spatial lag regression models for analyzing spatially-referenced dyadic data, while Austin et al. (2013) and Ciminelli et al. (2019) integrated spatially correlated random effects within a network analysis by extending the latent position distance model of Hoff et al. (2002). Hanks and Hooten (2013) introduced the use of Gaussian Markov random fields within circuit theory to model spatial dependencies in a landscape genetics study. With respect to non-standard distributions, Simpson and Laurienti (2015) introduced a non-spatial (in terms of the random effect parameters) two-part mixed model for analyzing the probability of connection in whole-brain network dyadic data.

In this work, we develop hierarchical Bayesian spatial methods for analyzing dyadic genetic relatedness data in the form of patristic distances and transmission probabilities by extending the existing random effect frameworks to better reflect features of the genetic relatedness outcomes. Specifically, instead of inducing spatial correlation in the data through the latent position distance components of the model as in Austin et al. (2013) and Ciminelli et al. (2019), we incorporate spatial structure directly into the individual-level random effect parameters. We fully investigate the implications of this choice on the induced correlation structure of the data. Additionally, we accommodate the distinctive distributional features of the considered genetic relatedness outcomes during modeling. Finally, we develop efficient Markov chain Monte Carlo (MCMC) sampling algorithms for several genetic relatedness outcomes, along with an R package, to facilitate posterior sampling in future work.

Using a simulation study, we show the importance of these methods for conducting accurate statistical inference for key regression associations and the limitations of existing approaches. We also apply our methods to a unique dataset of Mycobacterium tuberculosis isolates and associated patient data from the Republic of Moldova and show the benefits of the new methodology with respect to model fit and uncovering new insights into potentially important transmission factors. Analyzing the posterior distributions of individual-specific random effect parameters is shown to be important for understanding how individuals in the population personally contribute to the transmission dynamics and the role of spatial versus individual variability in this process. Given the increasing availability and interest in these types of data, and the limitations of existing approaches, we anticipate that these methods will represent important tools for researchers looking to correctly identify factors associated with genetic relatedness between individuals in future studies.

2. Motivating Data and Application

In this study, we analyze data previously described by Yang et al. (2022). In that work, the authors performed whole genome sequencing on M. tuberculosis isolates from 2,236 of the 2,770 non-incarcerated adults diagnosed with culture-positive tuberculosis (TB) in the Republic of Moldova between January 1, 2018 and December 31, 2019. They constructed a maximum likelihood phylogenetic tree and identified broad, putative transmission clusters with a threshold of 0.001 substitutions per site.

From these analyses, Yang et al. (2022) produced estimates of genetic relatedness among sequences using two metrics. First, they computed patristic distance between any pair of isolates within a cluster. Patristic distance is a measure of genetic relatedness between two sequences in a phylogenetic tree expressed in substitutions per site. Second, the authors estimated the probability that one individual infected another individual. These probabilities are obtained using TransPhylo (Didelot et al., 2017), a Bayesian approach that augments timed phylogenetic trees with who-infected-whom information while accounting for the possibility of unsampled individuals. TransPhylo uses an MCMC framework to obtain a posterior collection of transmission trees, accounting for the time from infection to infecting others (generation time) and the time from infection to sampling. For most pairs formula, there is no posterior transmission tree that includes a transmission event from i to j or from j to i. Where there are samples in the posterior in which i or j infected the other, the posterior probability is not necessarily symmetric. This can occur, for example, because i was sampled much earlier than j, making it more likely that i infected j. For all possible pairs within a transmission cluster, symmetric estimates of patristic distance (i.e., formula) and asymmetric estimates of transmission probability (i.e., formula necessarily) are available.

Demographic data are also available for each TB case. These data include individual characteristics such as age (in years), sex, education status (less than secondary, secondary or higher), working status (employed, unemployed), and residence type (urban, not urban). With these data, we calculate characteristics of the pair of individuals, such as an indicator for whether the individuals in the pair reside in the same village, the Euclidean distance between their villages of residence (in kilometers), the difference between their dates of diagnosis (in days), and the absolute difference between their ages (in years).

In this work, we analyze data from individuals in the largest putative transmission cluster. After removing six individuals with missing covariates, this cluster includes 99 individuals resulting in 4,851 symmetric patristic distance pairs and 9,702 asymmetric transmission probability pairs. Characteristics of the dyadic data are shown in Table 1, while in Figure 1 we display distributions of the different genetic relatedness outcomes along with the village locations for the included individuals.

TABLE 1

Summary of the dyadic genetic relatedness data in the Republic of Moldova study population by transmission probability (TP) status.

EffectTP =0 (formulaTP >0 (formula)
Distance between villages (km)87.49 (67.92)94.76 (78.13)
Same village (% Yes)0.561.28
Date of diagnosis difference (Day)225.09 (233.75)202.41 (211.00)
Age difference (Year)13.57 (14.00)13.33 (15.00)
Sex (%):
Both male56.5458.89
Both female6.114.63
Mixed pair37.3536.48
Residence location (%):
Both urban14.6214.18
Both rural37.5538.16
Mixed pair47.8347.67
Working status (%):
Both employed0.990.77
Both unemployed80.8180.50
Mixed pair18.1918.73
Education (%):
Both < secondary10.3010.02
Both ≥ secondary45.3346.21
Mixed pair44.3743.77
EffectTP =0 (formulaTP >0 (formula)
Distance between villages (km)87.49 (67.92)94.76 (78.13)
Same village (% Yes)0.561.28
Date of diagnosis difference (Day)225.09 (233.75)202.41 (211.00)
Age difference (Year)13.57 (14.00)13.33 (15.00)
Sex (%):
Both male56.5458.89
Both female6.114.63
Mixed pair37.3536.48
Residence location (%):
Both urban14.6214.18
Both rural37.5538.16
Mixed pair47.8347.67
Working status (%):
Both employed0.990.77
Both unemployed80.8180.50
Mixed pair18.1918.73
Education (%):
Both < secondary10.3010.02
Both ≥ secondary45.3346.21
Mixed pair44.3743.77

Note: Means, with interquartile ranges given in parentheses, are shown for continuous variables and percentages are shown for categorical variables.

TABLE 1

Summary of the dyadic genetic relatedness data in the Republic of Moldova study population by transmission probability (TP) status.

EffectTP =0 (formulaTP >0 (formula)
Distance between villages (km)87.49 (67.92)94.76 (78.13)
Same village (% Yes)0.561.28
Date of diagnosis difference (Day)225.09 (233.75)202.41 (211.00)
Age difference (Year)13.57 (14.00)13.33 (15.00)
Sex (%):
Both male56.5458.89
Both female6.114.63
Mixed pair37.3536.48
Residence location (%):
Both urban14.6214.18
Both rural37.5538.16
Mixed pair47.8347.67
Working status (%):
Both employed0.990.77
Both unemployed80.8180.50
Mixed pair18.1918.73
Education (%):
Both < secondary10.3010.02
Both ≥ secondary45.3346.21
Mixed pair44.3743.77
EffectTP =0 (formulaTP >0 (formula)
Distance between villages (km)87.49 (67.92)94.76 (78.13)
Same village (% Yes)0.561.28
Date of diagnosis difference (Day)225.09 (233.75)202.41 (211.00)
Age difference (Year)13.57 (14.00)13.33 (15.00)
Sex (%):
Both male56.5458.89
Both female6.114.63
Mixed pair37.3536.48
Residence location (%):
Both urban14.6214.18
Both rural37.5538.16
Mixed pair47.8347.67
Working status (%):
Both employed0.990.77
Both unemployed80.8180.50
Mixed pair18.1918.73
Education (%):
Both < secondary10.3010.02
Both ≥ secondary45.3346.21
Mixed pair44.3743.77

Note: Means, with interquartile ranges given in parentheses, are shown for continuous variables and percentages are shown for categorical variables.

Patristic distances (substitutions per site; panel 1), transmission probabilities that are >0 (panel 2), and village locations (panel 3) from the largest putative cluster in the Republic of Moldova data analysis. In panel 3, gray, blue, orange, and red points represent villages with one, two, three, and five cases, respectively.
Figure 1

Patristic distances (substitutions per site; panel 1), transmission probabilities that are >0 (panel 2), and village locations (panel 3) from the largest putative cluster in the Republic of Moldova data analysis. In panel 3, gray, blue, orange, and red points represent villages with one, two, three, and five cases, respectively.

3. Methods

We develop models to analyze spatially-referenced dyadic genetic relatedness outcomes (i.e., patristic distances, transmission probabilities) while accounting for multiple sources of correlation between responses and important features of the outcomes. The methods are designed to yield accurate statistical inference for the regression parameters while also identifying individuals and locations that play a more important role in the transmission dynamics.

3.1 Patristic Distances

We model the patristic distance (log scale) between individuals i and j as a function of pair- and individual-level covariates as well as spatially-referenced, individual-specific random effect parameters such that

1

formula, where formula is the error term; n is the total number of individuals in the study; formula is the symmetric patristic distance between a unique pair of individuals i and j (i.e., formula for all formula); formula is a vector of covariates (including intercept) describing differences between individuals i and j (e.g., spatial distance), with β being the corresponding vector of regression parameters; formula is a vector of covariates specific to individual i (excluding intercept), where the impact of the covariates on the response, described by the γ vector, is assumed to be the same across all individuals; and formula is a spatially-referenced, individual-specific random effect parameter which describes individual i's role in transmission events within the population (both as infector and infectee). Small formula values indicate that across all outcome pairs involving individual i, the patristic distance is smaller on average, suggestive of an increased likelihood of being in transmission pairs.

We allow for the possibility of spatial correlation between responses by modeling the random effect parameters using a spatially-referenced Gaussian process such that

2

where formula is decomposed into two pieces; one that is purely spatial, formula, and the other that is non-spatial, formula. This allows for each parameter to be individual-specific and not driven solely by spatial location. In the case where individuals are co-located, the function h(.) maps the spatial location of an individual to an entry within a smaller set of formula unique locations such that formula, where it is possible that formula for some formula. When all individuals have a unique location, formula for all i and therefore, formula.

The vector of purely spatial random effect parameters, η, is modeled using a Gaussian process centered at zero (i.e., formula is an m length vector of zeros) with correlation structure defined by the Euclidean distances between spatial locations (i.e., formula), where formula controls the range of the spatial process and τ2 the total variability of the spatial process. Small values of ϕ correspond to a larger range. The Gaussian process represents an extremely flexible option for describing distance-based spatial correlation and has been used extensively in past hierarchical Bayesian spatial modeling (Banerjee et al., 2014). Finally, formula represent the individual-specific parameters that account for the possibility that two people could be in similar spatial locations but have different patterns of behavior that impact their likelihood of being transmitted to and/or transmitting to others.

3.1.1 Prior Distributions

We assign weakly informative prior distributions to the remaining model parameters when appropriate. The regression parameters are specified as formula for formula and formula, where formula and formula are the lengths of the formula and formula vectors, respectively; the variance parameters as formula; and the spatial correlation parameter as formula. We scale the spatial distances used in the analysis to allow the prior distribution for ϕ to be minimally informative (i.e., ranging from relatively weak to strong spatial correlation at both short and long spatial distances).

3.2 Transmission Probabilities

Next, we introduce a method for analyzing transmission probabilities which, unlike patristic distances, contain potentially rich information regarding the direction of transmission. As a result, the outcomes are not necessarily symmetric as they were in Section 3.1 (i.e., formula necessarily). Here, we introduce a framework for analyzing transmission probabilities that includes similar inferential goals as the model in (1, 2) while also accounting for the asymmetry of the outcome as well as other important data features (e.g., zero-inflation).

Specifically, we model the probability that individual j infected individual i (i.e., formula) as a function of individual- and pair-specific covariates, and individual and location-specific random effect parameters, using a zero-inflated distribution. This specification includes a binary component to account for the large proportion of exact zero transmission probabilities (see Figure 1), and a continuous piece to model the non-zero probabilities. We define the probability density function (pdf) for a transmission probability, formula, as

3

where all pairs are now included in the analysis (i.e., formula; formula; formula); 1(.) is an indicator function taking the value of one when the input statement is true and the value of zero otherwise; formula describes the probability that the transmission probability is exactly equal to zero; and formula is a pdf introduced to model the non-zero probabilities on the logit scale.

We use a logistic regression framework to connect these underlying probabilities with covariates and random effect parameters such that

4

where formula and formula were previously described in Section 3.1. Because we now observe two responses for each pair (i.e., formula and formula), we are able to separate the included parameters into the different roles of the individuals within a pair; specifically, we can estimate the infector or “giver” (g) and the infectee or “receiver” (r) terms. For example, formula describes the probability that individual j transmits to i, so formula from (Equation 4) represents the impact of the giver's (individual j's) covariates on the probability of transmission, while formula describes the impact of the receiver's (individual i's) characteristics. Similarly, now each individual has two different additive random effect parameters, formula; one for the giver and receiver roles, respectively. The formula interaction term accounts for correlation between observations from the same pair of individuals in different roles (i.e., formula), where formula, formula (Hoff, 2005).

In order to understand if the magnitude of a non-zero transmission probability is also impacted by covariates and random effect parameters, the pdf in (Equation 3) includes a separate regression framework for the non-zero probabilities. Specifically, formula, the pdf introduced to model the non-zero probabilities on the logit scale, is defined as

5

where formula represents the variance of the distribution and the remaining terms in (Equation 5) have been previously described. The “w” subscripts in (Equation 5) serve to differentiate these parameters from those used by the binary model in (Equation 4) (i.e., “z” subscripts).

In total, each individual in the analysis has four additive random effect parameters, formula, representing the residual (i.e., after adjustment for known risk factors) likelihood of transmitting (g) or being transmitted to (r) in the binary (z) and positive (w) transmission probability regressions. Large values of the z subscript random effect parameters suggest an increased chance of a non-zero transmission probability, while large values of the w subscript parameters indicate an increasingly positive transmission probability.

As in (Equation 2), the introduced random effect parameters serve to adjust for pair- and proximity-based correlation in the outcomes. Unlike in Simpson and Laurienti (2015), we anticipate that the collection of parameters corresponding to the same individual may themselves be correlated and introduce a multivariate model as a result. The model for one set of the parameters is similar to (Equation 2) and is given as (for formula

where formula once again account for individual variability, and the remaining parameters, formula, formula, formula, are defined similarly. To account for cross-covariance and spatial correlation among the parameters, we specify a multivariate Gaussian process for the mean parameters such that

6

The complete collection of mean parameters across all m unique spatial locations is denoted by η (similar to (Equation 2)); formula represents the collection of mean parameters across each component of the model and different roles, specific to unique location formula; formula was previously described in (Equation 2); ⊗ is the Kronecker product; and Ω represents a four-by-four unstructured covariance matrix describing the cross-covariance among the set of four random effect parameters specific to a unique spatial location.

3.2.1 Prior Distributions

We complete the model specification by assigning prior distributions to the introduced model parameters. As in Section 3.1.1, the regression parameters are specified as formula, formula, formula, formula, formula, formulaformula for formula and formula; the variance parameters as formula, formula, formula, formula, formula, formula, formulaformula; the spatial correlation parameter as formula; and the cross-covariance matrix as formula, a weakly informative choice resulting in uniform cross-correlations a priori (Gelman et al., 2013).

3.3 Induced Correlation Structure

The inclusion of individual-specific and spatially correlated random effect parameters in the dyadic outcome models detailed in Sections 3.1 and 3.2 results in a positive correlation between the responses, whose magnitude varies depending on (i) if the two pairs share a common individual and (ii) the geographic distance between the people in the pairs. To better understand the induced correlations, we consider two different cases specifically for patristic distances; when there is, and is not, a shared individual between the pairs. Derivations for the transmission probability model are similar but more complicated due to the zero-inflated distribution and two regression frameworks used. Therefore, we present simulation-based correlation estimates for this model over a range of spatial covariance settings. See Web Appendix 1 of the supporting information for full details.

4. Simulation Study

We design a simulation study to investigate the implications of ignoring multiple forms of correlation between dyadic genetic relatedness outcomes when making inference on regression parameters of interest. Additionally, we aim to determine if a common Bayesian model comparison tool can be used to identify datasets that contain non-negligible levels of correlation and also differentiate the sources of the correlation. In order to avoid repetition of text, the process described throughout Section 4 is specifically for the patristic distances model. However, we carried out the same steps for transmission probabilities using the corresponding equations from Section 3.2. These full details are provided in Web Appendix 2 of the Supporting information.

4.1 Data Generation

We simulate data from the model in (Equation 1) under three different scenarios. In Setting 1, we specify that there is no unmeasured correlation by setting formula for all i. In Setting 2, we simulate data from (Equations 1) and (2) with formula for all i, resulting in only non-spatial variability in the random effect parameters. In Setting 3, we also simulate data from (Equations 1) and (2) with formula and formula for all i, resulting in spatially correlated random effect parameters with correlation that decreases to 0.05 at the maximum distance observed in the data (i.e., the effective range).

When simulating data from these models, we use results from our data application in the Republic of Moldova (Section 5) to ensure that we were working with realistic outcomes. Specifically, we use the same sample size (formula), same covariates (formula, formula), and choose the true parameter values needed to simulate from (Equations 1) and (2) based on posterior estimates obtained from the data application. In Web Table 1 of the Supporting information, the specific values used in each simulation study are given.

When simulating data from Settings 2 and 3, we define the total variance of the random effect process as formula and use the values in Web Table 1 of the Supporting information to estimate this quantity. In Setting 2, all of this variability is assumed to be non-spatial, while in Setting 3 it is attributed entirely to spatial correlation. We generate a new vector of θ parameters for each dataset from the setting-specific model. We also create a unique set of spatial locations, with no co-located individuals (i.e., formula), for each dataset. In a sensitivity analysis, we repeated the entire study while allowing for co-located individuals as observed in the data application in the Republic of Moldova. In total, we simulate 100 datasets from each setting.

4.2 Competing Models

We apply three competing models to every dataset. The first model (i.e., Fixed) represents a simplified fixed effects regression form of (Equation 1), where formula for all i (i.e., no random effect parameters). As Fixed matches the data generated from Setting 1, we expect it to perform well in that setting. However, in Settings 2 and 3, it will likely struggle in estimating the associations of interest as it ignores all correlation. The second model (i.e., non-spatial) also represents a variant of (Equations 1) and (1), where formula for all i (matching data generation Setting 2). Therefore, Non-spatial accounts for correlation due to the nature of dyadic responses, but ignores the potential for spatial correlation in the data. Prior to this work, it was unknown how inference for the regression parameters is impacted when dyadic correlation is accounted for but spatial correlation is ignored. The final competing model (i.e., Spatial) is our newly developed model in Section 3.1 which can address both sources of correlation simultaneously.

We monitor several pieces of information collected from the analyzed datasets to compare the methods. First, we calculate the mean absolute error (MAE) and mean squared error (MSE) for every regression parameter in the model using the posterior mean as the point estimate. Next, we calculate 95% quantile-based equal-tailed credible intervals (CIs) for each regression parameter and monitor how often this interval included the true value (ideally around 95% of the time) and its length. Finally, we formally compare the models using Watanabe Akaike information criterion (WAIC), a metric that balances model fit and complexity, where smaller values suggest that a model is preferred (Watanabe, 2010).

4.3 Results

We apply each method to each dataset and collect 20,000 posterior samples after removing the first 5,000 iterations prior to convergence of the model. Additionally, we think the remaining samples by a factor of two to reduce posterior autocorrelation, resulting in 10,000 samples with which to make posterior inference. The priors from Sections 3.1.1 and 3.2.1 were used other than when we encountered convergence issues when applying Non-spatial and Spatial in the transmission probabilities framework to a few of the datasets generated from Settings 2 and 3. In that case, formula or formula was used to stabilize estimation of the formula parameters.

The full results are shown in Table 2 for all models, where we report the average MAE, average MSE, average empirical coverage (EC), and average CI length across all regression parameters and simulated datasets, as well as the average difference in WAIC values, with respect to Spatial, across all simulated datasets.

TABLE 2

Simulation study results.

PatristicTrans. Probs.
MetricSettingStandardNon-spatialSpatialStandardNon-spatialSpatial
MAE11.02 (0.03)1.03 (0.03)1.04 (0.03)1.62 (0.20)1.53 (0.19)1.45 (0.17)
26.77 (0.23)6.21 (0.23)6.19 (0.23)0.40 (0.00)0.35 (0.01)0.34 (0.01)
35.56 (0.20)4.84 (0.19)2.31 (0.08)0.43 (0.03)0.56 (0.21)0.35 (0.08)
MSE10.30 (0.02)0.30 (0.02)0.31 (0.02)109.90 (18.89)99.21 (16.66)83.81 (13.80)
210.10 (0.75)9.69 (0.76)9.68 (0.75)0.36 (0.01)0.36 (0.02)0.36 (0.02)
36.18 (0.53)5.50 (0.49)1.21 (0.09)2.65 (2.27)33.20 (30.29)5.04 (3.52)
EC10.94 (0.01)0.97 (0.00)0.97 (0.00)0.92 (0.01)0.92 (0.01)0.93 (0.01)
20.41 (0.02)0.93 (0.01)0.94 (0.01)0.44 (0.01)0.93 (0.00)0.94 (0.00)
30.39 (0.01)0.92 (0.01)0.96 (0.01)0.47 (0.01)0.90 (0.01)0.94 (0.01)
CI length10.05 (0.00)0.06 (0.00)0.06 (0.00)4.33 (0.41)4.26 (0.40)4.39 (0.40)
20.10 (0.00)0.28 (0.00)0.30 (0.00)0.38 (0.00)1.61 (0.01)1.62 (0.01)
30.08 (0.00)0.21 (0.00)0.13 (0.00)0.52 (0.09)1.80 (0.43)1.33 (0.19)
Δ WAIC1−31.74 (0.97)−2.95 (0.07)10.25 (1.91)−1.53 (1.91)
26138.68 (56.90)0.57 (0.02)9013.90 (87.72)26.04 (22.47)
33983.08 (98.48)4.89 (0.10)8349.41 (298.75)112.00 (10.66)
PatristicTrans. Probs.
MetricSettingStandardNon-spatialSpatialStandardNon-spatialSpatial
MAE11.02 (0.03)1.03 (0.03)1.04 (0.03)1.62 (0.20)1.53 (0.19)1.45 (0.17)
26.77 (0.23)6.21 (0.23)6.19 (0.23)0.40 (0.00)0.35 (0.01)0.34 (0.01)
35.56 (0.20)4.84 (0.19)2.31 (0.08)0.43 (0.03)0.56 (0.21)0.35 (0.08)
MSE10.30 (0.02)0.30 (0.02)0.31 (0.02)109.90 (18.89)99.21 (16.66)83.81 (13.80)
210.10 (0.75)9.69 (0.76)9.68 (0.75)0.36 (0.01)0.36 (0.02)0.36 (0.02)
36.18 (0.53)5.50 (0.49)1.21 (0.09)2.65 (2.27)33.20 (30.29)5.04 (3.52)
EC10.94 (0.01)0.97 (0.00)0.97 (0.00)0.92 (0.01)0.92 (0.01)0.93 (0.01)
20.41 (0.02)0.93 (0.01)0.94 (0.01)0.44 (0.01)0.93 (0.00)0.94 (0.00)
30.39 (0.01)0.92 (0.01)0.96 (0.01)0.47 (0.01)0.90 (0.01)0.94 (0.01)
CI length10.05 (0.00)0.06 (0.00)0.06 (0.00)4.33 (0.41)4.26 (0.40)4.39 (0.40)
20.10 (0.00)0.28 (0.00)0.30 (0.00)0.38 (0.00)1.61 (0.01)1.62 (0.01)
30.08 (0.00)0.21 (0.00)0.13 (0.00)0.52 (0.09)1.80 (0.43)1.33 (0.19)
Δ WAIC1−31.74 (0.97)−2.95 (0.07)10.25 (1.91)−1.53 (1.91)
26138.68 (56.90)0.57 (0.02)9013.90 (87.72)26.04 (22.47)
33983.08 (98.48)4.89 (0.10)8349.41 (298.75)112.00 (10.66)

Note: Δ WAIC = WAIC − WAICformula. Averages across the 100 simulated datasets are reported with standard errors given in parentheses. Bold entries indicate the “best” value within a setting and outcome type across models (i.e., smallest values for MAE and MSE, values closest to 0.95 for EC). MAE and MSE results for the patristic distances model are multiplied by 100 and 1,000, respectively, for presentation purposes. CI, credible interval; EC, empirical coverage; MAE, mean absolute error; MSE, mean squared error; WAIC, Watanabe Akaike information criterion.

TABLE 2

Simulation study results.

PatristicTrans. Probs.
MetricSettingStandardNon-spatialSpatialStandardNon-spatialSpatial
MAE11.02 (0.03)1.03 (0.03)1.04 (0.03)1.62 (0.20)1.53 (0.19)1.45 (0.17)
26.77 (0.23)6.21 (0.23)6.19 (0.23)0.40 (0.00)0.35 (0.01)0.34 (0.01)
35.56 (0.20)4.84 (0.19)2.31 (0.08)0.43 (0.03)0.56 (0.21)0.35 (0.08)
MSE10.30 (0.02)0.30 (0.02)0.31 (0.02)109.90 (18.89)99.21 (16.66)83.81 (13.80)
210.10 (0.75)9.69 (0.76)9.68 (0.75)0.36 (0.01)0.36 (0.02)0.36 (0.02)
36.18 (0.53)5.50 (0.49)1.21 (0.09)2.65 (2.27)33.20 (30.29)5.04 (3.52)
EC10.94 (0.01)0.97 (0.00)0.97 (0.00)0.92 (0.01)0.92 (0.01)0.93 (0.01)
20.41 (0.02)0.93 (0.01)0.94 (0.01)0.44 (0.01)0.93 (0.00)0.94 (0.00)
30.39 (0.01)0.92 (0.01)0.96 (0.01)0.47 (0.01)0.90 (0.01)0.94 (0.01)
CI length10.05 (0.00)0.06 (0.00)0.06 (0.00)4.33 (0.41)4.26 (0.40)4.39 (0.40)
20.10 (0.00)0.28 (0.00)0.30 (0.00)0.38 (0.00)1.61 (0.01)1.62 (0.01)
30.08 (0.00)0.21 (0.00)0.13 (0.00)0.52 (0.09)1.80 (0.43)1.33 (0.19)
Δ WAIC1−31.74 (0.97)−2.95 (0.07)10.25 (1.91)−1.53 (1.91)
26138.68 (56.90)0.57 (0.02)9013.90 (87.72)26.04 (22.47)
33983.08 (98.48)4.89 (0.10)8349.41 (298.75)112.00 (10.66)
PatristicTrans. Probs.
MetricSettingStandardNon-spatialSpatialStandardNon-spatialSpatial
MAE11.02 (0.03)1.03 (0.03)1.04 (0.03)1.62 (0.20)1.53 (0.19)1.45 (0.17)
26.77 (0.23)6.21 (0.23)6.19 (0.23)0.40 (0.00)0.35 (0.01)0.34 (0.01)
35.56 (0.20)4.84 (0.19)2.31 (0.08)0.43 (0.03)0.56 (0.21)0.35 (0.08)
MSE10.30 (0.02)0.30 (0.02)0.31 (0.02)109.90 (18.89)99.21 (16.66)83.81 (13.80)
210.10 (0.75)9.69 (0.76)9.68 (0.75)0.36 (0.01)0.36 (0.02)0.36 (0.02)
36.18 (0.53)5.50 (0.49)1.21 (0.09)2.65 (2.27)33.20 (30.29)5.04 (3.52)
EC10.94 (0.01)0.97 (0.00)0.97 (0.00)0.92 (0.01)0.92 (0.01)0.93 (0.01)
20.41 (0.02)0.93 (0.01)0.94 (0.01)0.44 (0.01)0.93 (0.00)0.94 (0.00)
30.39 (0.01)0.92 (0.01)0.96 (0.01)0.47 (0.01)0.90 (0.01)0.94 (0.01)
CI length10.05 (0.00)0.06 (0.00)0.06 (0.00)4.33 (0.41)4.26 (0.40)4.39 (0.40)
20.10 (0.00)0.28 (0.00)0.30 (0.00)0.38 (0.00)1.61 (0.01)1.62 (0.01)
30.08 (0.00)0.21 (0.00)0.13 (0.00)0.52 (0.09)1.80 (0.43)1.33 (0.19)
Δ WAIC1−31.74 (0.97)−2.95 (0.07)10.25 (1.91)−1.53 (1.91)
26138.68 (56.90)0.57 (0.02)9013.90 (87.72)26.04 (22.47)
33983.08 (98.48)4.89 (0.10)8349.41 (298.75)112.00 (10.66)

Note: Δ WAIC = WAIC − WAICformula. Averages across the 100 simulated datasets are reported with standard errors given in parentheses. Bold entries indicate the “best” value within a setting and outcome type across models (i.e., smallest values for MAE and MSE, values closest to 0.95 for EC). MAE and MSE results for the patristic distances model are multiplied by 100 and 1,000, respectively, for presentation purposes. CI, credible interval; EC, empirical coverage; MAE, mean absolute error; MSE, mean squared error; WAIC, Watanabe Akaike information criterion.

As expected, all competing models perform similarly in terms of MAE, MSE, EC, and CI length in Setting 1, with Spatial having improved MAE and MSE results for the transmission probabilities framework. For the patristic distances model, WAIC also correctly favors Fixed on average as it most closely matches the way the data were generated. However, with respect to WAIC, Fixed is slightly outperformed by Spatial in this setting for transmission probabilities.

In Settings 2 and 3 where correlation is present, Spatial consistently outperforms Fixed and Non-spatial across most metrics. Fixed displays troubling behavior in these settings, particularly with respect to EC. The 95% CIs are only capturing the true parameter values around 39%–47% of the times and are much shorter on average than those from the competing methods. This suggests that failing to account for correlation may result in CIs that are too narrow, possibly leading to an inflated type I error rate.

The MAE results suggest that regression parameter inference may also suffer when correlation is ignored and/or mischaracterized. For example, in Setting 3 where the data exhibit spatial correlation, Spatial produces substantially smaller MAEs than Non-spatial (and Fixed), showing the importance of accounting for spatial correlation. However, in Setting 2 where there is no spatial correlation, Non-spatial and Spatial perform almost identically with respect to MAE. The MSE results show similar patterns other than in Setting 3 for the transmission probabilities where Fixed yields a smaller estimate, likely due to a few outliers obtained when using Spatial.

The WAIC results in Settings 2 and 3 are decisively in favor of Spatial over Fixed, with differences between Spatial and Non-spatial increasing under spatially correlated data. When combined with results from Setting 1, this provides evidence to suggest that WAIC may be a useful tool for differentiating datasets in terms of the presence and composition of correlation. In Web Table 2 of the Supporting information, we display results from the sensitivity analysis where co-located individuals are included in the simulated datasets. Overall, the results are similar to those shown in Table 2.

5. Mycobacterium Tuberculosis in the Republic of Moldova

We apply the methods developed in Sections 3.1 and 3.2 to better understand factors related to M. tuberculosis transmission dynamics in the Republic of Moldova. Specifically, formula from (1, 4, 5) includes the pair-specific covariates described in Section 2 and Table 1 (i.e., intercept, same village indicator, distance between villages (kilometers), difference in diagnosis dates (days), difference in ages (years) while formula includes the individual-specific covariates (i.e., age (years), sex, education (less than secondary, secondary or higher), working status (employed, unemployed), and residence type (urban, not urban). In addition to the new methods, we also present results from the competing approaches detailed in Section 4.2 (i.e., Fixed and Non-spatial). In a sensitivity analysis for the transmission probability data, we compare a more parsimonious version of the full model based on a modified tobit regression framework. Full details on this approach are given in Web Appendix 3 of the Supporting information. We use WAIC to compare the different models, identify the source of correlation, and ultimately determine the need for the new methodology.

All models are fit in the Bayesian setting using MCMC sampling techniques, with the full conditional distributions detailed in Web Appendix 4 of the supporting information. For each method, we collect 10,000 samples from the joint posterior distribution after removing the first 50,000 iterations prior to convergence and thinning the remaining 200,000 posterior samples by a factor of 20 to reduce posterior autocorrelation. For each parameter, convergence was assessed using Geweke's diagnostic (Geweke, 1991), while effective sample size was calculated to ensure we collected sufficient post-convergence samples to make accurate statistical inference; neither tool suggested any issues. We present posterior means and 95% quantile-based equal-tailed CIs when discussing posterior inference.

5.1 Patristic Distances

Results from the patristic distance analyses are shown in Figure 2 and Table 3. From Figure 2, it is clear that WAIC favors Spatial over the other approaches. In the simulation study, WAIC was shown to consistently identify the correct data generating setting, suggesting that there is likely non-negligible and spatially structured correlation in these data. Consequently, Fixed should not be used as it is likely to underestimate regression parameter uncertainty, resulting in CIs that are often too narrow, while Non-spatial may result in regression parameter estimates with inflated MAE and MSE. This can partly be seen in Figure 2, where the Fixed CIs are much shorter on average. The point estimates seen in Figure 2 are generally consistent across all methods in this case.

TABLE 3

Results from the Spatial patristic distance analyses in the Republic of Moldova.

EffectEstimate (95% CI)
Distance between villages (50 km)1.00 (0.99, 1.01)
Date of diagnosis difference (1/2 year)1.01 (1.00, 1.01)
Age difference (10 years)1.00 (0.99, 1.01)
Age (10 years)1.00 (0.95, 1.04)
Sex:
Mixed pair vs. both female0.99 (0.88, 1.11)
Both male vs. both female0.99 (0.78, 1.24)
Residence location:
Mixed pair vs. both rural0.98 (0.88, 1.08)
Both urban vs. both rural0.97 (0.78, 1.19)
Working status:
Mixed pair vs. both unemployed0.98 (0.84, 1.14)
Both employed vs. both unemployed1.00 (0.72, 1.35)
Education:
Mixed pair vs. both ≥ secondary1.05 (0.95, 1.17)
Both < secondary vs. both ≥ secondary1.12 (0.90, 1.37)
EffectEstimate (95% CI)
Distance between villages (50 km)1.00 (0.99, 1.01)
Date of diagnosis difference (1/2 year)1.01 (1.00, 1.01)
Age difference (10 years)1.00 (0.99, 1.01)
Age (10 years)1.00 (0.95, 1.04)
Sex:
Mixed pair vs. both female0.99 (0.88, 1.11)
Both male vs. both female0.99 (0.78, 1.24)
Residence location:
Mixed pair vs. both rural0.98 (0.88, 1.08)
Both urban vs. both rural0.97 (0.78, 1.19)
Working status:
Mixed pair vs. both unemployed0.98 (0.84, 1.14)
Both employed vs. both unemployed1.00 (0.72, 1.35)
Education:
Mixed pair vs. both ≥ secondary1.05 (0.95, 1.17)
Both < secondary vs. both ≥ secondary1.12 (0.90, 1.37)

Note: Posterior means and 95% quantile-based equal-tailed credible intervals are shown for the exponentiated regression parameters (i.e., ratio of expected patristic distances per specified change in covariate value). The displayed credible intervals for the bolded entries exclude one.

TABLE 3

Results from the Spatial patristic distance analyses in the Republic of Moldova.

EffectEstimate (95% CI)
Distance between villages (50 km)1.00 (0.99, 1.01)
Date of diagnosis difference (1/2 year)1.01 (1.00, 1.01)
Age difference (10 years)1.00 (0.99, 1.01)
Age (10 years)1.00 (0.95, 1.04)
Sex:
Mixed pair vs. both female0.99 (0.88, 1.11)
Both male vs. both female0.99 (0.78, 1.24)
Residence location:
Mixed pair vs. both rural0.98 (0.88, 1.08)
Both urban vs. both rural0.97 (0.78, 1.19)
Working status:
Mixed pair vs. both unemployed0.98 (0.84, 1.14)
Both employed vs. both unemployed1.00 (0.72, 1.35)
Education:
Mixed pair vs. both ≥ secondary1.05 (0.95, 1.17)
Both < secondary vs. both ≥ secondary1.12 (0.90, 1.37)
EffectEstimate (95% CI)
Distance between villages (50 km)1.00 (0.99, 1.01)
Date of diagnosis difference (1/2 year)1.01 (1.00, 1.01)
Age difference (10 years)1.00 (0.99, 1.01)
Age (10 years)1.00 (0.95, 1.04)
Sex:
Mixed pair vs. both female0.99 (0.88, 1.11)
Both male vs. both female0.99 (0.78, 1.24)
Residence location:
Mixed pair vs. both rural0.98 (0.88, 1.08)
Both urban vs. both rural0.97 (0.78, 1.19)
Working status:
Mixed pair vs. both unemployed0.98 (0.84, 1.14)
Both employed vs. both unemployed1.00 (0.72, 1.35)
Education:
Mixed pair vs. both ≥ secondary1.05 (0.95, 1.17)
Both < secondary vs. both ≥ secondary1.12 (0.90, 1.37)

Note: Posterior means and 95% quantile-based equal-tailed credible intervals are shown for the exponentiated regression parameters (i.e., ratio of expected patristic distances per specified change in covariate value). The displayed credible intervals for the bolded entries exclude one.

Results from the patristic distance analyses in the Republic of Moldova. Posterior means and 95% quantile-based equal-tailed credible intervals are shown for the raw regression parameters from each of the competing methods. Red lines indicate that the interval excludes zero. WAIC values are provided along with the model complexity term in parentheses (i.e., p), with smaller values of WAIC preferred. WAIC, Watanabe Akaike information criterion.
Figure 2

Results from the patristic distance analyses in the Republic of Moldova. Posterior means and 95% quantile-based equal-tailed credible intervals are shown for the raw regression parameters from each of the competing methods. Red lines indicate that the interval excludes zero. WAIC values are provided along with the model complexity term in parentheses (i.e., pformula), with smaller values of WAIC preferred. WAIC, Watanabe Akaike information criterion.

Posterior inference from Spatial for the exponentiated regression parameters in Table 3 suggest that the indicator of whether the pair of individuals was in the same village is the only association whose CI excludes one. Patristic distances from pairs of individuals from the same village are around 45% (40%, 49%) smaller on average than those from pairs of individuals in different villages.

5.2 Transmission Probabilities

Results from the transmission probability analyses are shown in Figure 3 and Table 4. Similar to the patristic distance analyses, WAIC also favors Spatial over the competing approaches in this setting. Once again, Fixed may be producing CIs that are too narrow while the Non-spatial point estimates may have inflated MAEs and MSEs. Graphical results from Spatial show several more significant associations than observed with the competing approaches. Recall that in the transmission probabilities framework, Non-spatial specifies that formula for all i, meaning that it accounts for neither spatial correlation nor cross-covariance. This could explain the relatively large differences between results from the two methods.

TABLE 4

Results from the Spatial transmission probability analyses in the Republic of Moldova.

EffectBinaryContinuous
Distance between villages (50 km)0.99 (0.89, 1.10)0.93 (0.87, 1.00)
Same village (Yes vs. No)5.41 (2.22, 11.11)10.84 (5.54, 19.12)
Date of diagnosis difference (1/2 year)0.77 (0.68, 0.86)0.75 (0.69, 0.82)
Age difference (10 years)1.03 (0.93, 1.13)0.93 (0.86, 1.01)
Giver role:
Age (10 years)1.28 (1.04, 1.57)0.94 (0.86, 1.03)
Sex:
Male vs. female0.82 (0.46, 1.36)1.15 (0.87, 1.51)
Residence location:
Urban vs. rural0.35 (0.11, 0.82)1.23 (0.90, 1.65)
Working status:
Employed vs. unemployed4.69 (1.23, 12.26)1.05 (0.71, 1.49)
Education:
< Secondary vs. ≥ secondary3.01 (1.49, 5.61)1.08 (0.82, 1.39)
Receiver role:
Age (10 years)1.28 (1.01, 1.62)1.00 (0.90, 1.10)
Sex:
Male vs. female0.65 (0.33, 1.19)1.20 (0.90, 1.59)
Residence location:
Urban vs. rural0.25 (0.06, 0.66)1.27 (0.90, 1.72)
Working status:
Employed vs. unemployed6.18 (1.32, 18.66)1.20 (0.80, 1.73)
Education:
< Secondary vs. ≥ secondary3.70 (1.60, 7.66)1.15 (0.86, 1.48)
EffectBinaryContinuous
Distance between villages (50 km)0.99 (0.89, 1.10)0.93 (0.87, 1.00)
Same village (Yes vs. No)5.41 (2.22, 11.11)10.84 (5.54, 19.12)
Date of diagnosis difference (1/2 year)0.77 (0.68, 0.86)0.75 (0.69, 0.82)
Age difference (10 years)1.03 (0.93, 1.13)0.93 (0.86, 1.01)
Giver role:
Age (10 years)1.28 (1.04, 1.57)0.94 (0.86, 1.03)
Sex:
Male vs. female0.82 (0.46, 1.36)1.15 (0.87, 1.51)
Residence location:
Urban vs. rural0.35 (0.11, 0.82)1.23 (0.90, 1.65)
Working status:
Employed vs. unemployed4.69 (1.23, 12.26)1.05 (0.71, 1.49)
Education:
< Secondary vs. ≥ secondary3.01 (1.49, 5.61)1.08 (0.82, 1.39)
Receiver role:
Age (10 years)1.28 (1.01, 1.62)1.00 (0.90, 1.10)
Sex:
Male vs. female0.65 (0.33, 1.19)1.20 (0.90, 1.59)
Residence location:
Urban vs. rural0.25 (0.06, 0.66)1.27 (0.90, 1.72)
Working status:
Employed vs. unemployed6.18 (1.32, 18.66)1.20 (0.80, 1.73)
Education:
< Secondary vs. ≥ secondary3.70 (1.60, 7.66)1.15 (0.86, 1.48)

Note: Posterior means and 95% quantile-based equal-tailed credible intervals are shown for the exponentiated regression parameters (i.e., odds ratios). The displayed credible intervals for the bolded entries exclude one.

TABLE 4

Results from the Spatial transmission probability analyses in the Republic of Moldova.

EffectBinaryContinuous
Distance between villages (50 km)0.99 (0.89, 1.10)0.93 (0.87, 1.00)
Same village (Yes vs. No)5.41 (2.22, 11.11)10.84 (5.54, 19.12)
Date of diagnosis difference (1/2 year)0.77 (0.68, 0.86)0.75 (0.69, 0.82)
Age difference (10 years)1.03 (0.93, 1.13)0.93 (0.86, 1.01)
Giver role:
Age (10 years)1.28 (1.04, 1.57)0.94 (0.86, 1.03)
Sex:
Male vs. female0.82 (0.46, 1.36)1.15 (0.87, 1.51)
Residence location:
Urban vs. rural0.35 (0.11, 0.82)1.23 (0.90, 1.65)
Working status:
Employed vs. unemployed4.69 (1.23, 12.26)1.05 (0.71, 1.49)
Education:
< Secondary vs. ≥ secondary3.01 (1.49, 5.61)1.08 (0.82, 1.39)
Receiver role:
Age (10 years)1.28 (1.01, 1.62)1.00 (0.90, 1.10)
Sex:
Male vs. female0.65 (0.33, 1.19)1.20 (0.90, 1.59)
Residence location:
Urban vs. rural0.25 (0.06, 0.66)1.27 (0.90, 1.72)
Working status:
Employed vs. unemployed6.18 (1.32, 18.66)1.20 (0.80, 1.73)
Education:
< Secondary vs. ≥ secondary3.70 (1.60, 7.66)1.15 (0.86, 1.48)
EffectBinaryContinuous
Distance between villages (50 km)0.99 (0.89, 1.10)0.93 (0.87, 1.00)
Same village (Yes vs. No)5.41 (2.22, 11.11)10.84 (5.54, 19.12)
Date of diagnosis difference (1/2 year)0.77 (0.68, 0.86)0.75 (0.69, 0.82)
Age difference (10 years)1.03 (0.93, 1.13)0.93 (0.86, 1.01)
Giver role:
Age (10 years)1.28 (1.04, 1.57)0.94 (0.86, 1.03)
Sex:
Male vs. female0.82 (0.46, 1.36)1.15 (0.87, 1.51)
Residence location:
Urban vs. rural0.35 (0.11, 0.82)1.23 (0.90, 1.65)
Working status:
Employed vs. unemployed4.69 (1.23, 12.26)1.05 (0.71, 1.49)
Education:
< Secondary vs. ≥ secondary3.01 (1.49, 5.61)1.08 (0.82, 1.39)
Receiver role:
Age (10 years)1.28 (1.01, 1.62)1.00 (0.90, 1.10)
Sex:
Male vs. female0.65 (0.33, 1.19)1.20 (0.90, 1.59)
Residence location:
Urban vs. rural0.25 (0.06, 0.66)1.27 (0.90, 1.72)
Working status:
Employed vs. unemployed6.18 (1.32, 18.66)1.20 (0.80, 1.73)
Education:
< Secondary vs. ≥ secondary3.70 (1.60, 7.66)1.15 (0.86, 1.48)

Note: Posterior means and 95% quantile-based equal-tailed credible intervals are shown for the exponentiated regression parameters (i.e., odds ratios). The displayed credible intervals for the bolded entries exclude one.

Results from the transmission probability analyses in the Republic of Moldova. Posterior means and 95% equal-tailed quantile-based credible intervals are shown for the binary component (top row) and continuous component (bottom row) raw regression parameters from each of the competing methods. Red lines indicate that the interval excludes zero. WAIC values are provided along with the model complexity term in parentheses (i.e., p), with smaller values of WAIC preferred. WAIC, Watanabe Akaike information criterion.
Figure 3

Results from the transmission probability analyses in the Republic of Moldova. Posterior means and 95% equal-tailed quantile-based credible intervals are shown for the binary component (top row) and continuous component (bottom row) raw regression parameters from each of the competing methods. Red lines indicate that the interval excludes zero. WAIC values are provided along with the model complexity term in parentheses (i.e., pformula), with smaller values of WAIC preferred. WAIC, Watanabe Akaike information criterion.

For the pair-specific covariates in the binary regression model, odds ratio results in Table 4 from Spatial suggest that pairs of individuals in the same village and with similar dates of diagnosis are more likely to have non-zero transmission probabilities on average. Results from the continuous regression model are similar, with the addition of a significant finding for the distance between villages. Pairs of individuals whose villages are closer together geographically are more likely to have larger non-zero transmission probabilities.

For the individual-specific covariates, overall the giver and receiver results are similar within both the binary and continuous regression models, suggesting that the impact of the included factors do not differ based on what role the individual is serving in the pair. For the binary component of the model we see that older, employed, and less formally educated individuals living in rural areas are more likely to be in pairs that have non-zero transmission probabilities. There were no statistically significant individual-level associations identified for the continuous component of the model, likely due to the smaller sample size of non-zero responses resulting in a reduction in power to detect such associations (i.e., see Figure 1).

In Web Figure 2 of the Supporting information, we display the findings from the modified tobit regression model and note that (i) the WAIC value suggests a substantially worse model fit compared to the methods in Figure 3 and (ii) previously identified significant findings observed in Figure 3 are now masked when the model is simplified in this way. Overall, while more complex (as indicated by the effective number of parameters, pformula), the zero-inflated models appear to be important for describing the trends and identifying important associations in our Republic of Moldova data application.

5.3 Random Effect Parameter Analyses

In Web Appendix 5 of the Supporting information, we present two additional analyses for the estimated random effect parameters, formula and formula, from the fitted patristic distances and transmission probabilities models, respectively. First, we present several summaries of the estimated random effect parameters with respect to the observed dyadic genetic distance outcomes, finding an intuitive connection between the random effect parameters and outcomes as well as the clear presence of spatially structured variability in the data (Web Table 3). Results suggest that the frameworks have the ability to identify key individuals associated with increased transmission activity. Next, we use Bayesian kriging techniques (Banerjee et al., 2014) to predict the random effect parameters at unobserved spatial locations for both outcomes. Posterior predicted mean and standard deviation maps are presented in Web Figures 3 and 4, respectively, which allow us to determine areas of increased residual transmission risk for both outcomes. In Web Figure 5, we show trace plots for randomly selected individual-level random effect parameters which suggest that they are well identified in both of the newly developed methods.

6. Discussion

In this work, we presented innovative hierarchical Bayesian spatial methods for modeling two types of dyadic genetic relatedness data. The models account for multiple sources of correlation (i.e., dyadic and spatial) and important features of each outcome (e.g., zero-inflation). Through simulation, we showed that these approaches perform as well as Fixed (i.e., regression while ignoring correlation) and Non-spatial (i.e., no spatial or cross-covariance) when applied to uncorrelated data, and greatly outperform them under correlation in terms of estimating and quantifying uncertainty in the regression parameters. Under any type of correlation, Fixed produces CIs that are too narrow and as a result should not be used for estimating associations between genetic relatedness measures and other factors.

When applying the models to M. tuberculosis data from the Republic of Moldova, we found significant associations between genetic relatedness and spatial proximity, dates of diagnosis, and multiple individual-level factors; many of which represent new insights in this setting. Analysis of the random effect parameters identified individuals and geographic areas that are generally associated with higher levels of transmission, a potentially useful aspect of the model with respect to infection control as heterogeneity in infectiousness among TB patients has been investigated in previous studies (Ypma et al., 2013; Melsew et al., 2019). Additionally, WAIC showed that the correlation between outcomes was at least partially spatially structured, further motivating the development of the new methodology.

The unique data in our study represent a major strength of the work, however, a number of factors inherent to M. tuberculosis make the study of transmission dynamics difficult. First, TB epidemics are relatively slow moving, often occurring over years (Pai et al., 2016), and it is difficult to observe these processes in a two-year study. Second, the time between infection and disease onset varies greatly, from weeks to years (Borgdorff et al., 2011). Third, not all TB cases are “culture positive,” meaning it is more difficult to culture and sequence isolates for every known TB case (Cruciani et al., 2004; Nguyen et al., 2019). Taken together, these factors make it difficult to infer direct transmission events. Even within a putative transmission cluster, the probabilities of direct transmission among case pairs may be low. For this reason, it can be difficult to fit a model to these types of data. We address this challenge by including a binary component in our model specification (i.e., an indicator of a non-zero transmission probability). Finally, as is standard in most epidemiological analyses, our data do not include information on individuals who were exposed but not infected. Therefore, the results should be interpreted conditional on both individuals being infected.

While we use patristic distance as an outcome representative of symmetric dyadic genetic relatedness data, our framework has also been adapted to model SNP distances by modifying the likelihood in (Equation 1). In our R package GenePair (Warren, 2022), we have developed software for fitting the negative binomial and binary (i.e., genetically clustered or not) versions of this model, along with the two versions described in Sections 3.1 and 3.2.

Future work could focus on alternative techniques for accounting for the correlation caused by analyzing dyadic outcomes while also incorporating spatial correlation. Our current methods are based on the idea of shared spatially correlated random effect parameters between pairs of data including the same individual, which was shown to induce an intuitive correlation structure between observations in Section 3.3. Additionally, these parameters were shown to be useful in identifying key individuals and areas that drive transmission in the study. However, there are undoubtedly other approaches for achieving these same goals. More investigation is also required to determine how inference for the fixed effect parameters in (Equations 1), (4), and (5) is potentially impacted when there is spatial confounding between the fixed and random effect parameters in this specific framework (Hodges & Reich, 2010).

A meta-regression approach could also be used in future work for genetic relatedness outcomes that are estimated in preceding analyses and include measures of uncertainty (something not applicable in our data). For example, assuming uncertainty in the estimation of the patristic distances could be accurately quantified, the framework in Section 3.1 could be extended by including an additional hierarchical layer to the model such as formula, where formula represents the estimated patristic distance; formula is the known variance of the estimate; and formula is the true but unobserved patristic distance that is defined by our newly introduced methodology in Section 3.1. A similar setup would be appropriate for the transmission probabilities as well, though quantifying uncertainty in this zero-inflated outcome may prove more difficult. Incorporating this estimation uncertainty would not drastically impact the computational complexity of our frameworks either, since it would only require the imputation of the latent measures as an extra step during each MCMC iteration. However, while modeling this uncertainty is neither conceptually nor computationally complicated, quantifying it accurately is not a trivial task and must be addressed in future work by the methods that aim to quantify genetic relatedness.

We recommend the use of the newly developed methods when the goal of a study is to estimate associations between spatially-referenced genetic relatedness and other variables of interest. Even under the likely unrealistic assumption of no correlation between dyadic responses, the new methodology performs well. When correlation is present, competing approaches can yield potentially overly optimistic insights about statistical significance and/or poorly estimate the associations and should not be used for decision making.

Data Availability Statement

The spatial coordinates of the villages are not available because they may be used to identify individuals in the study. The remaining data are available from the corresponding author upon reasonable request.

Acknowledgments

We thank Valeriu Crudu, Head of Microbiology & Morphology of Laboratory of Phthisiopneumology Institute from Moldova, for providing important insights to our findings in Moldova. This work was supported in part by the National Institute of Allergy and Infectious Diseases (R01 AI147854, R01 AI137093) and the generous support of the American people through the United States Agency for International Development (USAID) through the TREAT TB Cooperative Agreement No. GHN-A-00-08-00004. The contents are the responsibility of the authors and Subgrantee and do not necessarily reflect the views of USAID or the United States Government.

References

Austin
,
A.
,
Linkletter
,
C.
&
Wu
,
Z.
(
2013
)
Covariate-defined latent space random effects model
.
Social Networks
,
35
,
338
346
.

Banerjee
,
S.
,
Carlin
,
B.P.
&
Gelfand
,
A.E.
(
2014
)
Hierarchical modeling and analysis for spatial data
.
CRC Press
.

Beck
,
N.
,
Gleditsch
,
K.S.
&
Beardsley
,
K.
(
2006
)
Space is more than geography: using spatial econometrics in the study of political economy
.
International Studies Quarterly
,
50
,
27
44
.

Borgdorff
,
M.W.
,
Sebek
,
M.
,
Geskus
,
R.B.
,
Kremer
,
K.
,
Kalisvaart
,
N.
&
van Soolingen
,
D.
(
2011
)
The incubation period distribution of tuberculosis estimated with a molecular epidemiological approach
.
International Journal of Epidemiology
,
40
,
964
970
.

Campbell
,
F.
,
Cori
,
A.
,
Ferguson
,
N.
&
Jombart
,
T.
(
2019
)
Bayesian inference of transmission chains using timing of symptoms, pathogen genomes and contact data
.
PLoS Computational Biology
,
15
,
1
20
.

Ciminelli
,
J.T.
,
Love
,
T.
&
Wu
,
T.T.
(
2019
)
Social network spatial model
.
Spatial Statistics
,
29
,
129
144
.

Cruciani
,
M.
,
Scarparo
,
C.
,
Malena
,
M.
,
Bosco
,
O.
,
Serpelloni
,
G.
&
Mengoli
,
C.
(
2004
)
Meta-analysis of BACTEC MGIT 960 and BACTEC 460 TB, with or without solid media, for detection of mycobacteria
.
Journal of Clinical Microbiology
,
42
,
2321
2325
.

Delsuc
,
F.
,
Brinkmann
,
H.
&
Philippe
,
H.
(
2005
)
Phylogenomics and the reconstruction of the tree of life
.
Nature Review Genetics
,
6
,
361
375
.

Didelot
,
X.
,
Fraser
,
C.
,
Gardy
,
J.
&
Colijn
,
C.
(
2017
)
Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks
.
Molecular Biology and Evolution
,
34
,
997
1007
.

Gelman
,
A.
,
Carlin
,
J.B.
,
Stern
,
H.S.
,
Dunson
,
D.B.
,
Vehtari
,
A.
&
Rubin
,
D.B.
(
2013
)
Bayesian data analysis
.
CRC Press
.

Geweke
,
J.
(
1991
)
Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments
, vol. 196. Federal Reserve Bank of Minneapolis,
Research Department Minneapolis
,
MN, USA
.

Gill
,
P.S.
&
Swartz
,
T.B.
(
2001
)
Statistical analyses for round robin interaction data
.
Canadian Journal of Statistics
,
29
,
321
331
.

Hanks
,
E.M.
&
Hooten
,
M.B.
(
2013
)
Circuit theory and model-based inference for landscape connectivity
.
Journal of the American Statistical Association
,
108
,
22
33
.

Hodges
,
J.S.
&
Reich
,
B.J.
(
2010
)
Adding spatially-correlated errors can mess up the fixed effect you love
.
The American Statistician
,
64
,
325
334
.

Hoff
,
P.D.
(
2003
)
Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers
, chapter Random effect models for network data.
National Academies Press
,
Washington DC
. pp.
303
312
.

Hoff
,
P.D.
(
2005
)
Bilinear mixed-effects models for dyadic data
.
Journal of the American Statistical Association
,
100
,
286
295
.

Hoff
,
P.D.
(
2021
)
Additive and multiplicative effects network models
.
Statistical Science
,
36
,
34
50
.

Hoff
,
P.D.
,
Raftery
,
A.E.
&
Handcock
,
M.S.
(
2002
)
Latent space approaches to social network analysis
.
Journal of the American Statistical Association
,
97
,
1090
1098
.

Kenny
,
D.A.
,
Kashy
,
D.A.
&
Cook
,
W.L.
(
2020
)
Dyadic data analysis
.
Guilford Publications
.

Klinkenberg
,
D.
,
Backer
,
J.A.
,
Didelot
,
X.
,
Colijn
,
C.
&
Wallinga
,
J.
(
2017
)
Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks
.
PLoS Computational Biology
,
13
,
1
32
.

Loman
,
N.
&
Pallen
,
M.
(
2015
)
Twenty years of bacterial genome sequencing
.
Nature Review Microbiology
,
13
,
787
794
.

Melsew
,
Y.A.
,
Gambhir
,
M.
,
Cheng
,
A.C.
,
McBryde
,
E.S.
,
Denholm
,
J.T.
,
Tay
,
E.L.
, et al. (
2019
)
The role of super-spreading events in Mycobacterium tuberculosis transmission: evidence from contact tracing
.
BMC Infectious Diseases
,
19
,
1
9
.

Mitchell
,
M.N.
,
Ocamb
,
C.M.
,
Grünwald
,
N.J.
,
Mancino
,
L.E.
&
Gent
,
D.H.
(
2011
)
Genetic and pathogenic relatedness of Pseudoperonospora cubensis and P. humuli
.
Phytopathology
,
101
,
805
818
.

Neumayer
,
E.
&
Plümper
,
T.
(
2010
)
Spatial effects in dyadic data
.
International Organization
,
64
,
145
166
.

Nguyen
,
M.-V.H.
,
Levy
,
N.S.
,
Ahuja
,
S.D.
,
Trieu
,
L.
,
Proops
,
D.C.
&
Achkar
,
J.M.
(
2019
)
Factors associated with sputum culture-negative vs. culture-positive diagnosis of pulmonary tuberculosis
.
JAMA Network Open
,
2
, e187617.

Pai
,
M.
,
Behr
,
M.A.
,
Dowdy
,
D.
,
Dheda
,
K.
,
Divangahi
,
M.
,
Boehme
,
C.C.
, et al. (
2016
)
Tuberculosis
.
Nature Reviews Disease Primers
,
2
, 16076.

Polonsky
,
J.A.
,
Baidjoe
,
A.
,
Kamvar
,
Z.N.
,
Cori
,
A.
,
Durski
,
K.
,
Edmunds
,
W.J.
, et al. (
2019
)
Outbreak analytics: a developing data science for informing the response to emerging pathogens
.
Philosophical Transactions of the Royal Society B
,
374
, 20180276.

Poon
,
A. F.Y.
(
2016
)
Impacts and shortcomings of genetic clustering methods for infectious disease outbreaks
.
Virus Evolution
,
2
(
2
), vew031.

Simpson
,
S.L.
&
Laurienti
,
P.J.
(
2015
)
A two-part mixed-effects modeling framework for analyzing whole-brain network data
.
NeuroImage
,
113
,
310
319
.

Stimson
,
J.
,
Gardy
,
J.
,
Mathema
,
B.
,
Crudu
,
V.
,
Cohen
,
T.
&
Colijn
,
C.
(
2019
)
Beyond the SNP threshold: identifying outbreak clusters using inferred transmissions
.
Molecular Biology and Evolution
,
36
,
587
603
.

Wang
,
J.
(
2007
)
Triadic IBD coefficients and applications to estimating pairwise relatedness
.
Genetics Research
,
89
,
135
153
.

Warner
,
R.M.
,
Kenny
,
D.A.
&
Stoto
,
M.
(
1979
)
A new round robin analysis of variance for social interaction data
.
Journal of Personality and Social Psychology
,
37
,
1742
.

Warren
,
J.L.
(
2022
)
GenePair: Statistical methods for modeling spatially-referenced paired genetic relatedness data
. https://github.com/warrenjl/GenePair [Accessed 23th January 2023].

Watanabe
,
S.
(
2010
)
Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory
.
Journal of Machine Learning Research
,
11
,
3571
3594
.

Wong
,
G.Y.
(
1982
)
Round robin analysis of variance via maximum likelihood
.
Journal of the American Statistical Association
,
77
,
714
724
.

Yang
,
C.
,
Sobkowiak
,
B.
,
Naidu
,
V.
,
Codreanu
,
A.
,
Ciobanu
,
N.
,
Gunasekera
,
K.S.
, et al. (
2022
)
Phylogeography and transmission of M. tuberculosis in Moldova: a prospective genomic analysis
.
PLoS Medicine
,
19
, e1003933.

Ypma
,
R.J.
,
Altes
,
H.K.
,
Van Soolingen
,
D.
,
Wallinga
,
J.
&
Van Ballegooijen
,
W.M.
(
2013
)
A sign of superspreading in tuberculosis: highly skewed distribution of genotypic cluster sizes
.
Epidemiology
,
24
,
395
400
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data