Hierarchical probabilistic models for multiple gene/variant associations based on next-generation sequencing data

Abstract Motivation The identification of genetic variants influencing gene expression (known as expression quantitative trait loci or eQTLs) is important in unravelling the genetic basis of complex traits. Detecting multiple eQTLs simultaneously in a population, based on paired DNA-seq and RNA-seq assays, relies on one of two competing types of models: models which rely on appropriate transformations of RNA-seq data (and are powered by a mature mathematical theory), or count-based models, which represent digital gene expression explicitly, thus rendering such transformations unnecessary. The latter constitutes an immensely popular methodology, which is however plagued by mathematical intractability. Results We develop tractable count-based models, which are amenable to efficient estimation through the introduction of latent variables and the appropriate application of recent statistical theory in a sparse Bayesian modelling framework. Furthermore, we examine several transformation methods for RNA-seq read counts and we introduce arcsin, logit and Laplace smoothing as preprocessing steps for transformation-based models. Using natural and carefully simulated data from the 1000 Genomes and gEUVADIS projects, we benchmark both approaches under a variety of scenarios, including the presence of noise and violation of basic model assumptions. We demonstrate that an arcsin transformation of Laplace-smoothed data is at least as good as state-of-the-art models, particularly at small sample sizes. Furthermore, we show that an over-dispersed Poisson model is comparable to the celebrated Negative Binomial, but much easier to estimate. These results provide strong support for transformation-based versus count-based (particularly Negative-Binomial-based) models for eQTL mapping. Availability and implementation All methods are implemented in the free software eQTLseq: https://github.com/dvav/eQTLseq Supplementary information Supplementary data are available at Bioinformatics online.


Detailed model description
In all cases, we assume an N × M matrix of genotypes X and an N × K matrix of count data Z, where N, M and K are the numbers of samples, genetic markers and transcripts, respectively. We also assume an N × K matrix of expression data Y, which results from transforming Z. We start by describing the Normal model, on which the Poisson and Binomial models are based.

Normal model
Assuming N(a, b) is a univariate Normal distribution with mean a and variance b, this model takes the following form:

$$y_{ik} \sim \mathcal{N}\left(\mu_k + x_i^T \beta_k,\; \tau_k^{-1}\right), \qquad \beta_{jk} \sim \mathcal{N}\left(0,\; (\tau_k \zeta_{jk} \eta_j)^{-1}\right)$$

where y_ik is the expression level of transcript k for subject i; µ_k is the mean expression level of transcript k; β_jk quantifies the effect of marker j on the expression level of transcript k; finally, τ_k, ζ_jk and η_j are precision parameters. The dot product x_i^T β_k measures the influence (positive, negative or zero) of the genetic profile of subject i on the mean expression of transcript k.

Poisson model
Assuming P(a) is a Poisson distribution with mean a, this model takes the following form:

$$z_{ik} \sim \mathcal{P}(m_{ik}), \qquad m_{ik} = e^{y_{ik}}$$

where z_ik is the (untransformed) number of counts for transcript k in subject i, m_ik is the mean expression level of transcript k for subject i and y_ik is a normally distributed latent variable (or pseudo-datum) obeying the Normal model described above. Thus, the Poisson (and Binomial; see below) model can be thought of as an extension of the Normal model through the addition of a Poisson likelihood. Conditional on the pseudo-data Y, the observed counts Z are independent of B, which greatly simplifies the estimation of the latter, as shown in a subsequent section.

Binomial model
Assuming B(a, b) is the Binomial distribution with a trials and probability of success b at each trial, the Binomial model is constructed on top of the Normal, similarly to the Poisson:

$$z_{ik} \sim \mathcal{B}(n_i, p_{ik}), \qquad p_{ik} = \frac{e^{y_{ik}}}{1 + e^{y_{ik}}}$$

where n_i = Σ_k z_ik is the library size of subject i and p_ik is the read-mapping probability for transcript k in subject i. As in the Poisson model, y_ik is a latent variable following the Normal model.

Negative Binomial model
The final and most elaborate model we examine is the Negative Binomial. Assuming NB(a, b) is the Negative Binomial distribution with mean a and dispersion parameter b, the model takes the following form:

$$z_{ik} \sim \mathcal{NB}\left(\mu_k e^{x_i^T \beta_k},\; \phi_k\right), \qquad \log\phi_k \sim \mathcal{N}\left(\mu_\phi,\; \tau_\phi^{-1}\right)$$

where φ_k and µ_k are the dispersion and mean expression level of transcript k, respectively. Shrinkage of φ_k is achieved by imposing a log-Normal prior with mean µ_φ and precision τ_φ. These parameters in turn have their own priors (uniform and Jeffreys, respectively), completing the hierarchical model. Importantly, the prior of µ_k is not defined directly; rather, we impose an uninformative Beta prior on the variable π_k = µ_k / (φ_k^{-1} + µ_k), which, as we shall see below, simplifies the estimation of µ_k. Notice that β_jk (the elements of vector β_k) are modelled as in the Normal model.

Inference
Inference in all models is achieved via Gibbs sampling, which requires sampling in turn and repeatedly from the posterior distribution of each random variable in the model, after conditioning on the data and the remaining variables. Below, we describe these conditional posteriors for each model.

Normal model
A tractable mathematical theory exists for the Normal model, which is the simplest to estimate. In particular, the conditional posterior of β_k is

$$p(\beta_k \mid -) = \mathcal{N}_M\left(A_k^{-1} X^T (y_k - \mu_k 1_N),\; (\tau_k A_k)^{-1}\right)$$

while the precision parameters τ_k, ζ_jk and η_j have Gamma conditional posteriors, where Ga(a, b) is a Gamma distribution with shape a and rate b, N_M(a, b) is a multivariate Normal distribution of dimension M with mean vector a and covariance matrix b, and 1_N is a vector of N ones. For the transcript-specific precision matrix A_k, we have A_k = X^T X + diag(η_1 ζ_1k, ..., η_j ζ_jk, ..., η_M ζ_Mk), where diag(a) indicates an M × M diagonal matrix with a as its main diagonal.

Poisson model
For the Poisson model, the conditional posterior p(y_ik | −) of each latent variable is not available as a well-known, easy-to-sample-from distribution. Thus, we employ a Metropolis step at each iteration of the Gibbs sampler to approximate a sample from this distribution. We use proposals sampled from the prior of y_ik (which is Normal; see Detailed model description) given the current values of the mean and precision parameters. Since the proposal is the prior, the acceptance probability reduces to a ratio of Poisson likelihoods; a proposal y*_ik is accepted with probability

$$P_{\text{accept}} = \min\left\{1,\; e^{-(m^*_{ik} - m_{ik})} \left(\frac{m^*_{ik}}{m_{ik}}\right)^{z_{ik}}\right\}$$

where m_ik = e^{y_ik} and m*_ik = e^{y*_ik}. Conditional on y_ik, the observed counts z_ik are independent of the other parts of the model and the remaining posteriors are the same as for the Normal model.
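The Metropolis-within-Gibbs update described above can be sketched as follows. This is a minimal illustration, not the eQTLseq implementation; the function name and signature are our own.

```python
import numpy as np

def update_latent_poisson(y, z, mean, prec, rng):
    """One Metropolis update of the latent variable y_ik in the Poisson model.

    Proposals are drawn from the Normal prior of y_ik (given the current
    mean and precision), so the acceptance ratio reduces to the ratio of
    Poisson likelihoods with rates m = exp(y) and m* = exp(y*)."""
    y_star = rng.normal(mean, 1.0 / np.sqrt(prec))  # proposal from the prior
    # log acceptance ratio: z*(y* - y) - (e^{y*} - e^{y})
    log_alpha = z * (y_star - y) - (np.exp(y_star) - np.exp(y))
    return y_star if np.log(rng.uniform()) < log_alpha else y
```

Iterating this update yields a Markov chain whose stationary distribution is the conditional posterior of y_ik given the count z_ik and the Normal prior.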

Binomial model
Similarly, for the Binomial model, the conditional posterior p(y_ik | −) is not available in a convenient form and a Metropolis step is employed using proposals y*_ik from the prior, as in the case of the Poisson model. The acceptance probability is

$$P_{\text{accept}} = \min\left\{1,\; \left(\frac{p^*_{ik}}{p_{ik}}\right)^{z_{ik}} \left(\frac{1 - p^*_{ik}}{1 - p_{ik}}\right)^{n_i - z_{ik}}\right\}$$

where p_ik and p*_ik are the read-mapping probabilities implied by y_ik and y*_ik, respectively, and n_i is the size of the library for the i-th sample (see Detailed model description above). As in the Poisson model, given y_ik the remaining posteriors are the same as for the Normal.

Negative Binomial model
This model is the most difficult to estimate. The most important methodological contribution of this paper is applying recent statistical theory (Polson et al., 2013) to show that the conditional posterior p(β_k | −) is a multivariate Normal distribution:

$$p(\beta_k \mid \omega_k, -) = \mathcal{N}_M\left(S_k X^T (\kappa_k - \Omega_k c_k),\; S_k\right), \qquad S_k = \left(X^T \Omega_k X + D_k\right)^{-1}$$

where κ_ik = (z_ik − φ_k^{-1})/2, c_ik = log(µ_k φ_k), Ω_k = diag(ω_1k, ..., ω_Nk), D_k = diag(τ_k ζ_1k η_1, ..., τ_k ζ_Mk η_M), and ω_k is a vector of auxiliary variables sampled from a Polya-Gamma distribution PG(b, c) with parameters b > 0 and c ∈ R. Briefly, sampling from p(β_k | −) requires first sampling ω_k from a Polya-Gamma distribution given the current value of β_k, followed by sampling a new value of β_k from a multivariate Normal distribution given the newly sampled ω_k. The derivation of the above equations is given in the next section.
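eQTLseq samples the Polya-Gamma variables following Zhou et al. (2012). For a self-contained illustration, the sketch below instead approximates a PG(b, c) draw by truncating the infinite-sum-of-Gammas representation of Polson et al. (2013); the truncation level K and all names are our own choices, not those of the software.

```python
import numpy as np

def sample_pg(b, c, rng, K=200):
    """Approximate draw from the Polya-Gamma distribution PG(b, c), using
    the truncated series representation (Polson et al., 2013):
        PG(b, c) = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    where g_k ~ Gamma(b, 1). Truncating at K terms introduces a small
    downward bias of order 1 / K in the mean."""
    k = np.arange(1, K + 1)
    g = rng.gamma(b, 1.0, size=K)
    return (g / ((k - 0.5) ** 2 + (c / (2.0 * np.pi)) ** 2)).sum() / (2.0 * np.pi ** 2)
```

A useful sanity check is the known mean E[PG(b, c)] = (b / 2c) tanh(c / 2), which the truncated sampler reproduces closely.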
For µ_k, we first sample from the posterior of π_k = µ_k (φ_k^{-1} + µ_k)^{-1}, which is Beta. Given π_k, a new value of µ_k is easily derived as µ_k = π_k (1 − π_k)^{-1} φ_k^{-1}. A more detailed explanation of the above derivation is given in the next section. Finally, the posterior of φ_k is not one of the common distributions and we must again employ a Metropolis step at each iteration of the Gibbs sampler. Proposals φ*_k are sampled from the log-Normal prior of φ_k using the most recent values of µ_φ and τ_φ; since the proposal is the prior, a proposal is accepted with probability equal to the ratio of Negative Binomial likelihoods,

$$P_{\text{accept}} = \min\left\{1,\; \prod_i \frac{\mathcal{NB}\left(z_{ik};\, m_{ik},\, \phi^*_k\right)}{\mathcal{NB}\left(z_{ik};\, m_{ik},\, \phi_k\right)}\right\}$$

where m_ik = µ_k e^{x_i^T β_k}; the gamma function Γ appears when the Negative Binomial probability mass functions are written out in full. The hyper-parameters µ_φ and τ_φ are easily updated by sampling from the following posteriors:

$$\mu_\phi \mid - \sim \mathcal{N}\left(\frac{1}{K}\sum_k \log\phi_k,\; (K\tau_\phi)^{-1}\right), \qquad \tau_\phi \mid - \sim \text{Ga}\left(\frac{K}{2},\; \frac{1}{2}\sum_k (\log\phi_k - \mu_\phi)^2\right)\,.$$

Further details on the Negative Binomial model
Deriving the conditional posterior p(β_k | −) for the Negative Binomial model requires the following identity (Polson et al., 2013):

$$\frac{(e^{\psi})^a}{(1 + e^{\psi})^b} = 2^{-b} e^{\kappa\psi}\, \mathbb{E}_\omega\!\left[e^{-\omega\psi^2/2}\right] \quad (1)$$

where κ = a − b/2 and ω ∼ PG(b, 0) is a random sample from a Polya-Gamma distribution. The implied conditional posterior of ω given ψ is p(ω | ψ) = PG(b, ψ). The above suggests a data-augmentation strategy, where we sample ω from its posterior given ψ, followed by sampling ψ, given ω, from the conditional posterior p(ψ | ω) ∝ e^{κψ} e^{−ωψ²/2} p(ψ), where p(ψ) is the prior of ψ. In the case of the Negative Binomial model, the likelihood of a single data point can be written as follows:

$$p(z_{ik} \mid -) \propto \frac{(e^{\psi_{ik}})^{z_{ik}}}{(1 + e^{\psi_{ik}})^{z_{ik} + \phi_k^{-1}}}, \qquad \psi_{ik} = x_i^T \beta_k + \log(\mu_k \phi_k)\,.$$

Then, invoking Bayes' rule and identity (1), and recalling that the prior of β_k is multivariate Normal with precision matrix D_k = diag(τ_k ζ_1k η_1, ..., τ_k ζ_jk η_j, ..., τ_k ζ_Mk η_M), it follows after some algebra that

$$p(\beta_k \mid \omega_k, -) = \mathcal{N}_M\left(S_k X^T (\kappa_k - \Omega_k c_k),\; S_k\right), \qquad S_k = \left(X^T \Omega_k X + D_k\right)^{-1}$$

with κ_ik = (z_ik − φ_k^{-1})/2, c_ik = log(µ_k φ_k) and Ω_k = diag(ω_1k, ..., ω_Nk). We sample from the Polya-Gamma distribution as in Zhou et al. (2012).
Regarding the conditional p(µ_k | −), we first compute the posterior of π_k = µ_k (φ_k^{-1} + µ_k)^{-1}, which, under an uninformative Beta prior, is also Beta:

$$\pi_k \mid - \sim \text{Beta}\left(\sum_i z_{ik} + 1,\; N\phi_k^{-1} + 1\right)\,.$$

At each iteration, we first sample π_k from the above Beta distribution, and then we calculate µ_k = π_k (1 − π_k)^{-1} φ_k^{-1} given the newly sampled π_k.

Computational considerations
Inference in the above models requires calculating, at each iteration, the Cholesky decomposition of A_k for all k, and solving three triangular linear systems involving the Cholesky factor L_k (a lower-triangular matrix) of A_k. Cholesky decomposition has complexity O(M³), while solving each of the triangular systems has complexity O(M²), where M is the number of rows/columns of A_k (and the number of genetic markers we examine). In order to reduce the computational cost of these calculations, at each iteration we ignore the rows (genetic markers) of B whose elements all lie below a threshold β_thr = 10⁻⁶ in absolute value. This effectively reduces the dimensionality of the problem and dramatically increases the speed of the Gibbs sampler. In addition, the above computations are performed in parallel for different k, which further reduces the execution time of the sampler.
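The reduced-dimension update can be sketched as follows. This is a minimal NumPy/SciPy illustration with our own names, not the eQTLseq implementation; for clarity, the transcript-wise likelihood precision τ_k is set to one, so the posterior is N(A⁻¹Xᵀr, A⁻¹) with A = XᵀX + diag(d).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sample_beta_reduced(X, r, d, beta, thr=1e-6, rng=None):
    """Draw one sample from N(A^{-1} X^T r, A^{-1}), A = X^T X + diag(d),
    after dropping markers whose current |beta_j| falls below thr."""
    rng = np.random.default_rng() if rng is None else rng
    active = np.abs(beta) >= thr                 # markers kept this iteration
    Xa, da = X[:, active], d[active]
    A = Xa.T @ Xa + np.diag(da)                  # reduced precision matrix
    L, lower = cho_factor(A, lower=True)         # O(M_a^3) Cholesky
    mean = cho_solve((L, lower), Xa.T @ r)       # two triangular solves
    # one more triangular solve gives a draw with covariance A^{-1}
    u = np.linalg.solve(np.tril(L).T, rng.standard_normal(active.sum()))
    beta_new = np.zeros_like(beta)
    beta_new[active] = mean + u
    return beta_new
```

Dropping near-zero rows shrinks the Cholesky factorisation from O(M³) to O(M_a³), where M_a is the number of currently active markers.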

Normalization
When testing vst and voom, the RNA-seq data (either simulated or natural) are normalized using the default method in each package. For the remaining models, we use the relative log expression (RLE) method, which is the default in DESeq2.
It should be emphasised that, for eQTL mapping (as for differential gene expression), explicitly normalising read counts with respect to gene length is not necessary. This is because, in both problems, we wish to identify relative variations in gene expression across a number of samples, in all of which any particular gene or transcript has the same constant length. Thus, correcting for gene length is not required. Although more reads are expected to map to longer genes at a given level of expression, the only anticipated "side-effect" of this is increased power to detect variations in the expression of longer genes, due to the presence of a stronger, easier-to-detect signal (i.e. a larger number of reads).
However, gene length may be important in other applications, which focus on comparison of gene expression within the same sample. In this case, correcting for gene length may be necessary to avoid bias. This can be done using appropriate normalisation methods such as reads per kilobase of transcript and per million of mapped reads (RPKM) or fragments per kilobase of transcript and per million of mapped reads (FPKM). Laplace smoothing can be subsequently applied on the corrected counts. More generally, any normalisation method that results in a matrix of corrected counts can precede transformation using Laplace smoothing.
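The Laplace-smoothed arcsin transformation mentioned above can be sketched as follows. This is a minimal illustration assuming a pseudocount of one and per-sample read proportions; the exact smoothing and scaling used by eQTLseq may differ.

```python
import numpy as np

def arcsin_transform(Z):
    """Laplace-smoothed arcsine transformation of a counts matrix Z
    (samples x genes).

    Laplace smoothing adds a pseudocount of 1 to every entry, which avoids
    zero proportions; the per-sample read proportions are then
    variance-stabilised with the arcsine-square-root transformation."""
    Zs = Z + 1                                  # Laplace smoothing
    p = Zs / Zs.sum(axis=1, keepdims=True)      # per-sample read proportions
    return np.arcsin(np.sqrt(p))
```

The resulting matrix Y can then be analysed with the Normal model described above.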

Data simulation
For simulating genotypes, we use data from the 1000 Genomes project (1000 Genomes Project Consortium et al., 2015). Specifically, we extracted the genotypes for all bi-allelic single nucleotide variants (SNVs) with minor allele frequency (MAF) larger than 5% in the region chr7:100000-200000. This left us with genotypes for 435 SNVs in 2503 samples with known empirical MAFs (Supplementary Figure 7a). In order to generate a matrix of genotypes X, we randomly select without replacement M SNVs from this dataset and, for each SNV, we use its empirical MAF in a Binomial distribution to generate genotypes for N subjects. For large N, the MAF of each of the M simulated variants matches the empirical MAF of the corresponding variant in the 1000 Genomes project. For a test case where the genotype data from the 1000 Genomes project are used as is, see Supplementary Figure 11.
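The genotype simulation step can be sketched as follows; a minimal illustration with our own function name, assuming genotypes are coded as minor-allele counts in {0, 1, 2} drawn as Binomial(2, MAF).

```python
import numpy as np

def simulate_genotypes(mafs, M, N, rng=None):
    """Draw M SNVs (without replacement) from a pool of empirical MAFs and
    generate genotypes for N subjects as Binomial(2, maf) allele counts."""
    rng = np.random.default_rng() if rng is None else rng
    chosen = rng.choice(mafs, size=M, replace=False)  # empirical MAFs of M SNVs
    X = rng.binomial(2, chosen, size=(N, M))          # genotypes in {0, 1, 2}
    return X, chosen
```

For large N, the empirical allele frequency of each simulated column converges to the chosen MAF, as stated in the text.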
Ideally, simulation of read counts should not rely on any distributional assumptions. Although there has been some progress in non-parametric simulation of read counts in the context of differential expression studies (e.g. Benidt and Nettleton (2015)), it is still not obvious how to simulate transcript/variant associations (as required in this study) without resorting to some form of parametric model. Given this limitation, we have adopted a Negative Binomial model, which is widely accepted as the standard approach for simulating read counts. Towards this aim, we use RNA-seq data from 60 HapMap individuals of European descent (Montgomery et al., 2010; Frazee et al., 2011), which include 11353 genes with non-zero read counts in at least one sample. We fitted a Negative Binomial model to the normalized counts of individual genes, thus estimating mean and dispersion parameters for each of them (Supplementary Figure 7b). In order to simulate a read-counts matrix Z, we select randomly without replacement mean and dispersion estimates (µ_k and φ_k, respectively) for K genes. Furthermore, we simulate a sparse matrix of regression coefficients B, whose non-zero elements take the values β_jk = ±(1 + e_jk) with equal probability for the two signs, where e_jk is a random sample from the Exponential distribution with rate λ = 1 (Supplementary Figure 7c). Given the genotypes X, the non-zero elements in the k-th column of this matrix are normalized by first dividing by max(|Xβ_k|) and then multiplying by log s, where s > 1 is the maximum fold change in the mean expression level of transcript k across all subjects. This normalization ensures that the fold change for each gene in all subjects is no larger than s and no smaller than 1/s (Supplementary Figure 7d). In this paper, we assume that s takes the values 2, 4 or 8, corresponding to small, medium and large effect sizes, respectively.
Finally, given the genotypes X, the coefficients B and the {µ_k, φ_k} estimates for K genes, we can simulate an N × K matrix of read counts with elements z_ik drawn from a Negative Binomial distribution with mean µ_k e^{x_i^T β_k} and dispersion φ_k (Supplementary Figures 7e,f). Further details on the capacity of the Negative Binomial distribution to faithfully model the empirical RNA-seq data are given in Supplementary Figure 8.
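The final simulation step can be sketched as follows. The sketch maps the (mean, dispersion) parameterisation onto NumPy's (n, p) parameterisation of the Negative Binomial; function and variable names are our own.

```python
import numpy as np

def simulate_counts(X, B, mu, phi, rng=None):
    """Simulate an N x K matrix of read counts from a Negative Binomial with
    per-entry mean mu_k * exp(x_i^T beta_k) and gene-wise dispersion phi_k.

    NumPy parameterises the Negative Binomial by (n, p) with mean
    n(1 - p)/p, so we set n = 1/phi_k and p = n / (n + m_ik); this gives
    mean m_ik and variance m_ik + phi_k * m_ik^2, as required."""
    rng = np.random.default_rng() if rng is None else rng
    m = mu * np.exp(X @ B)        # N x K matrix of means
    n = 1.0 / phi                 # inverse dispersion ("number of successes")
    p = n / (n + m)
    return rng.negative_binomial(n, p)
```

Setting B = 0 recovers counts whose gene-wise means equal µ_k, which is a convenient sanity check.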

Performance metrics
The root mean square error (RMSE) in Figure 2 is defined as:

$$\text{RMSE} = \sqrt{\frac{1}{|T|}\sum_{(j,k)\in T} \left(\beta^*_{jk} - \hat{\beta}^*_{jk}\right)^2}$$

where the set T restricts j and k to the true-positive elements of the estimated coefficients matrix B̂ only. The normalized elements β*_jk and β̂*_jk are derived from the elements of the true (β_jk) and estimated (β̂_jk) matrices by dividing each with Σ_jk |β_jk| and Σ_jk |β̂_jk|, respectively. This normalization removes any scaling that may have been introduced by the use of different models, thus making comparisons between them permissible.
For calculating the concordance correlation coefficient (CCC) in Figure 3, we treat the N × K expression matrix Z as a vector of length NK. In the case of the Poisson, Binomial and Negative Binomial models, Z is first log-transformed (after adding 1 to all entries in order to avoid infinities). In all other cases, we use the transformed expression matrix Y as input. CCC is then defined as:

$$\text{CCC} = \frac{2\, S_{Y\hat{Y}}}{S_Y^2 + S_{\hat{Y}}^2 + (\bar{Y} - \bar{\hat{Y}})^2}$$

where Y and Ŷ indicate test and predicted expression data, respectively. The means (Ȳ, Ŷ̄), variances (S²_Y, S²_Ŷ) and covariance (S_YŶ) are defined as usual.
All other metrics follow their common definitions, as described below. First, for each simulation, we computed the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) by comparing the estimated marker/transcript pairs with zero/non-zero coefficients against the corresponding simulated ones. This comparison is straightforward after reducing the estimated and "true" (i.e. simulated) sparse coefficient matrices (B̂ and B, respectively) to ones with elements in the set {−1, 0, 1}, which merely indicate the presence/absence and directionality (up- or down-regulation) of a marker/transcript association. Given estimated numbers for TP, TN, FP and FN, we can calculate Matthews correlation coefficient (MCC), true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), negative predictive value (NPV) and accuracy (ACC) using the following formulas:

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

$$\text{TPR} = \frac{TP}{TP+FN}, \quad \text{TNR} = \frac{TN}{TN+FP}, \quad \text{PPV} = \frac{TP}{TP+FP}, \quad \text{NPV} = \frac{TN}{TN+FN}, \quad \text{ACC} = \frac{TP+TN}{TP+TN+FP+FN}\,.$$
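The sign-pattern reduction and the resulting metrics can be sketched as follows. One convention is our own assumption: an association recovered with the wrong sign is counted as a false negative.

```python
import numpy as np

def confusion_metrics(B_true, B_est):
    """Reduce the true and estimated coefficient matrices to sign patterns
    in {-1, 0, 1} and compute MCC, TPR, TNR, PPV, NPV and ACC.

    Assumption of this sketch: an association detected with the wrong
    sign counts as a false negative."""
    s_true, s_est = np.sign(B_true), np.sign(B_est)
    tp = float(np.sum((s_true != 0) & (s_est == s_true)))
    tn = float(np.sum((s_true == 0) & (s_est == 0)))
    fp = float(np.sum((s_true == 0) & (s_est != 0)))
    fn = float(np.sum((s_true != 0) & (s_est != s_true)))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "MCC": (tp * tn - fp * fn) / denom if denom > 0 else 0.0,
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }
```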

Various metrics
We further test model performance using various additional metrics (Supplementary Figure 1). These are the true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), negative predictive value (NPV) and accuracy (ACC), which is equal to the sum of true positives and true negatives divided by the sum of all positives and negatives. Overall, the count-based models have relatively low TPR and PPV at low to medium sample sizes, but very high TNR and NPV at all sample sizes, which results in significantly higher accuracy relative to the Normal models. Among the latter, arcsin consistently demonstrates the best performance, particularly at small to medium sample sizes.

Strength and distribution of gene/variant associations
We examine how model performance is affected by the strength of gene/variant associations (i.e. the effect size), their number and their distribution at small (N = 250) and large (N = 2000) sample sizes. We test three different effect sizes: small, medium and large, which are defined according to the fold change they induce in gene expression (see the precise mathematical definition in Supplementary Methods). Regarding the number of associations, we test data with 1, 2, 4, 8, 16, 32 and 64 associations, ensuring that both hotspots (i.e. single genetic markers influencing multiple genes) and polygenic effects (i.e. multiple genetic markers affecting the same gene) were included (Supplementary Figure 9). As illustrated in Supplementary Figure 2, increasing the effect size has a positive effect on the performance of all models. At small samples and at all effect sizes, the arcsin model shows the best performance. At large samples, the Negative Binomial and Poisson models take the lead (along with the arcsin model) at small and medium effect sizes. At large effect sizes, the Poisson and arcsin models are pulled towards the mean, leaving the Negative Binomial as the model with the best performance. A similar pattern is observed when we examine the number and distribution of gene/variant associations (Supplementary Figure 3), with the Negative Binomial, Poisson or arcsin as the best-performing model in most cases.

The effect of noise
We examine the effect of contaminating the transcript expression data with two different types of noise in the form of extreme outliers (Soneson and Delorenzi, 2013). Type I noise was generated by considering each element of the expression matrix Z independently and multiplying it, with probability 50%, by a random number between 5 and 10. Type II noise was generated by selecting, for each gene, one sample at random and multiplying the counts corresponding to that gene by a random number between 5 and 10. With Type I noise, we expect 50% of the counts in each sample to be outliers, while Type II noise ensures that the counts for all genes are outliers in exactly one of the samples. Type I noise represents a rather extreme scenario compared to Type II, although both types of noise include equally strong outliers. As illustrated in Supplementary Figure 4, Type I noise has a negative effect on model performance at all sample sizes, while Type II noise has a specific negative effect on the Negative Binomial model at small sample sizes only. The arcsin model still shows top performance at small samples in both noise scenarios, but the Negative Binomial model has relatively good performance only in the Type I noise scenario. The performance of the Poisson model is either below average (Type I noise) or average (Type II noise). At large sample sizes, the Negative Binomial, Poisson and arcsin models retain their lead under both noise scenarios.
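The two contamination schemes can be sketched as follows; a minimal illustration with our own names, assuming a counts matrix with samples in rows and genes in columns.

```python
import numpy as np

def add_outlier_noise(Z, kind="I", rng=None):
    """Contaminate a counts matrix (samples x genes) with extreme outliers.

    Type I: each entry is independently multiplied, with probability 0.5,
    by a random factor in [5, 10).
    Type II: for each gene (column), one randomly chosen sample is
    multiplied by a random factor in [5, 10)."""
    rng = np.random.default_rng() if rng is None else rng
    Z = Z.astype(float).copy()
    N, K = Z.shape
    if kind == "I":
        mask = rng.random(Z.shape) < 0.5
        Z[mask] *= rng.uniform(5, 10, size=mask.sum())
    else:
        rows = rng.integers(0, N, size=K)           # one outlier sample per gene
        Z[rows, np.arange(K)] *= rng.uniform(5, 10, size=K)
    return Z
```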
Furthermore, we examine a second kind of noise, which takes the form of random genotyping errors (Supplementary Figure 5). We simulate these errors by randomly modifying the number of alleles in a subset of the genotypes matrix X to one of their two possible alternatives. For example, if element (i, j) in this matrix is 1 (which indicates a heterozygous call), we randomly change this to either 0 (wild-type homozygous) or 2 (mutant homozygous). We apply this process to either 15% or 30% of the elements of matrix X. Not surprisingly, increasing the proportion of genotyping errors reduces the performance of all models, particularly at small samples (N=250), where the average MCC value drops by more than half at 30% error rate. At this error rate, arcsin, followed by nbin, demonstrates the best performance. At large samples (N=2000), reduction in average MCC values at higher amounts of genotypic noise does not seem to modify relative performance, with nbin and pois being the best-performing models, closely followed by arcsin.

Violation of model assumptions
Over-dispersion, i.e. the situation where the variance of the read counts for each gene is larger than the mean, is one of the signature characteristics of RNA-seq. In this section, we examine the extreme scenario where only half of the genes are over-dispersed, while the other half has variance equal to the mean, as assumed by the standard (i.e. non-over-dispersed) Poisson model. At small samples (Supplementary Figure 6), the lack of over-dispersion in a significant proportion of genes severely harms the performance of the Negative Binomial model, but boosts the performance of the Poisson and Normal models, smoothing out the difference between them. At large samples, estimation of the Negative Binomial model had not finished after 30 hours of computation, due to the introduction of many false positives, and it was excluded from the analysis. In this scenario, the Poisson model becomes the top performer, followed by the Normal model with either log- or vst-transformed data. This is the only case we observed where arcsin performs worse than other Normal models.
The fact that the Normal model appears robust to violations of basic distributional assumptions is expected, because its application requires first transforming the data (regardless of the form of its original distribution) in order to make it more Normal-like. This is an additional point in favour of a transformation-based statistical approach.

Suppl. Fig. 6: The effect of over-dispersion on model performance. We examine the impact on model performance at both small (N=250) and large (N=2000) samples, when 50% of the genes do not demonstrate over-dispersion, but rather their variance is equal to the mean, as in the standard Poisson model.

Suppl. Fig. 11: The effect of correlation between genetic markers. We examine the scenario where the template matrix of genotypes (see Data simulation) is used as is and only the matrix of expression data is simulated. (a) Covariance between all pairs of genetic markers in the data. Marker 75 is embedded in a block of highly correlated markers, while marker 225 is located in a region of relatively uncorrelated markers (see arrows at the bottom). (b) Markers 75 and 225 are associated with transcripts 363 (down-regulation) and 253 (up-regulation), respectively (see vertical solid lines). While eQTLseq correctly identifies marker 225 as an eQTL, it also identifies as such marker 78, instead of the highly correlated marker 75 (see black dots). As with other variable selection methods (e.g. LASSO), when presented with a group of highly correlated variables, eQTLseq selects a single one among them in a possibly non-deterministic manner. In this scenario, we used the Normal model with a logarithmic transformation of the simulated read counts. Although eQTLseq can be used with correlated data, as demonstrated above, its intended use is with pre-filtered genotypic data consisting of mostly uncorrelated markers.