Polee: RNA-Seq analysis using approximate likelihood

Abstract The analysis of mRNA transcript abundance with RNA-Seq is a central tool in molecular biology research, but analyses often fail to account for the uncertainty in these estimates, which can be significant, especially when trying to disentangle isoforms or duplicated genes. Preserving uncertainty necessitates a full probabilistic model of all the sequencing reads, which quickly becomes intractable, as experiments can consist of billions of reads. To overcome these limitations, we propose a new method of approximating the likelihood function of a sparse mixture model, using a technique we call the Pólya tree transformation. We demonstrate that substituting this approximation for the real thing achieves most of the benefits with a fraction of the computational cost, leading to more accurate detection of differential transcript expression and transcript coexpression.


Properties of the transformation
First, for convenience, we restate the definition of the Pólya tree transformation. The transformation $T$ from $(u_1, y) \in (\mathbb{R}_+, (0,1)^{n-1})$ to $x \in \mathbb{R}^n_+$ is defined in terms of intermediate values $u_1, \ldots, u_{2n-1}$, one for each node, which have the following relation:
$$u_{\mathrm{left}(i)} = y_i u_i \qquad (1)$$
$$u_{\mathrm{right}(i)} = (1 - y_i) u_i \qquad (2)$$
The result of the transformation is then simply the intermediate values at the leaf nodes: $x_i = u_{i+n-1}$, for $i = 1, \ldots, n$. In everything that follows, we adopt the indexing convention that nodes $1, \ldots, n-1$ are internal nodes and $n, \ldots, 2n-1$ are leaf nodes. Furthermore, the internal nodes are numbered so that no node has an index smaller than its parent.
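To make the recursion concrete, the following is a minimal sketch of the transformation in Python with NumPy. The `left` and `right` child arrays and the 0-based node layout are our own illustration, not Polee's implementation:

```python
import numpy as np

def polya_tree_transform(u1, y, left, right, n):
    # Nodes 0..n-2 are internal, n-1..2n-2 are leaves (the paper's
    # convention shifted to 0-based), and parents precede children.
    u = np.empty(2 * n - 1)
    u[0] = u1
    for i in range(n - 1):
        u[left[i]] = y[i] * u[i]           # Equation 1
        u[right[i]] = (1.0 - y[i]) * u[i]  # Equation 2
    return u[n - 1:]  # leaf values: x_i = u_{i+n-1}

# A balanced tree with n = 4 leaves: node 0 has children 1 and 2,
# node 1 has leaf children 3 and 4, node 2 has leaf children 5 and 6.
left, right = np.array([1, 3, 5]), np.array([2, 4, 6])
```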
Lemma 3.1. If $\mathrm{leaves}(i)$ is the set of indexes of leaf nodes in $i$'s subtree, then
$$u_i = \sum_{j \in \mathrm{leaves}(i)} x_{j-n+1}$$
Proof. This simply says that each intermediate stick piece is the same length as the sum of the parts it is ultimately broken into, and is easy to see through induction.
Let $i$ be the index of a leaf node. The subtree's only leaf is $i$, and $u_i = x_{i-n+1}$, so the lemma is trivially true. Next, consider any internal node $i$ for which the lemma is true of its children. That is,
$$u_{\mathrm{left}(i)} = \sum_{j \in \mathrm{leaves}(\mathrm{left}(i))} x_{j-n+1} \qquad \text{and} \qquad u_{\mathrm{right}(i)} = \sum_{j \in \mathrm{leaves}(\mathrm{right}(i))} x_{j-n+1}$$
where $\mathrm{leaves}(i) = \mathrm{leaves}(\mathrm{left}(i)) \cup \mathrm{leaves}(\mathrm{right}(i))$. Then, by Equations 1 and 2,
$$u_i = u_{\mathrm{left}(i)} + u_{\mathrm{right}(i)} = \sum_{j \in \mathrm{leaves}(i)} x_{j-n+1}$$
By mathematical induction, the lemma is true of all $i = 1, \ldots, 2n-1$.

[Figure 2: A comparison of the space and time needed to evaluate the likelihood function and its approximations. A GTEx sample (SRR1340508) was subsampled to, and its likelihood evaluated for, increasing numbers of reads. The upper plot shows the memory in megabytes necessary to evaluate the function; the lower plot, the time in seconds necessary to evaluate the function once.]

[Figure 3: The likelihood function for every sample from Li et al. [2017] was approximated. 1000 samples were drawn with probability proportional to the approximated likelihood, and 1000 samples were drawn using a Gibbs sampler. A Wilcoxon signed-rank test was performed for each annotated transcript, and the distribution of p-values plotted in a boxplot, drawn with upper and lower whiskers corresponding to the 99th and 1st percentiles, respectively. An ideal fit would have a uniform distribution of p-values.]
Theorem 3.2. The transformation $T$ is a bijection between $(\mathbb{R}_+, (0,1)^{n-1})$ and $\mathbb{R}^n_+$.

Proof. Let $(u_1, y)$ and $(u_1', y')$ be two distinct inputs with images $x = T((u_1, y))$ and $x' = T((u_1', y'))$. Suppose $u_1 \neq u_1'$. Then by Lemma 3.1, $\sum_{j=1}^n x_j = u_1 \neq u_1' = \sum_{j=1}^n x_j'$, and therefore $x \neq x'$. Suppose instead $u_1 = u_1'$, and let $i$ be the index of the first $y_i \neq y_i'$. Since nodes are numbered so that ancestors have smaller indices than their children, for every ancestor $j$ of $i$ we have $y_j = y_j'$, therefore $u_i = u_i'$. All intermediate values are positive (this follows from $u_1$ being positive and elements of $y$ residing on the open interval $(0,1)$), therefore
$$u_{\mathrm{left}(i)} = y_i u_i \neq y_i' u_i' = u_{\mathrm{left}(i)}'$$
and by Lemma 3.1 the leaves of $\mathrm{left}(i)$'s subtree must sum to different values, so $x \neq x'$ and $T$ is injective.

To show that $T$ is surjective as well, consider any $x \in \mathbb{R}^n_+$. Equations 1 and 2 can be rewritten as
$$u_i = u_{\mathrm{left}(i)} + u_{\mathrm{right}(i)} \qquad (4)$$
$$y_i = u_{\mathrm{left}(i)} / u_i \qquad (5)$$
Importantly, since elements of $x$ are positive, it follows that all intermediate values are positive, so the denominator of Equation 5 is never zero and $y_i \in (0,1)$.
As Equations 1 and 2 are satisfied if Equations 4 and 5 are, and $u_1, y$ are defined by the latter, it follows that $T((u_1, y)) = x$, so $T$ is surjective, and thus bijective.
Corollary 3.3. If $u_1 = 1$ is fixed, then the resulting transformation is a bijection between $(0,1)^{n-1}$ and $\Delta^{n-1} = \{x \in \mathbb{R}^n_+ \mid \sum_{i=1}^n x_i = 1\}$. Proof. A restriction of a bijection remains a bijection, so we need only show that the image is $\Delta^{n-1}$. It follows directly from Lemma 3.1 that the codomain of the restriction is $\Delta^{n-1}$. Furthermore, any $x \in \Delta^{n-1}$ has a $(u_1, y)$ where $T((u_1, y)) = x$, and again by Lemma 3.1, $u_1 = 1$, so $(u_1, y)$ is in the restricted domain, and thus the image is $\Delta^{n-1}$.
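Continuing the sketch above, fixing $u_1 = 1$ produces a vector on the simplex:

```python
x = polya_tree_transform(1.0, np.array([0.2, 0.6, 0.9]), left, right, n=4)
print(x, x.sum())  # ~[0.12 0.08 0.72 0.08], sums to 1.0
```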
For the purposes of approximating the RNA-Seq likelihood function, we only care about this special case mapping onto the simplex.
For the approximation to be useful for computing probability densities of transformed variables, the determinant of the Jacobian must also be efficiently computable, which we show here.
Theorem 3.4. Let $(u_1, y) \in (\mathbb{R}_+, (0,1)^{n-1})$, and let $J_T$ be the Jacobian matrix of $T((u_1, y))$. Further, let $u_i$ for $i = 2, \ldots, 2n-1$ be the intermediate values as defined in Equations 1 and 2. Then
$$|\det J_T| = \prod_{i=1}^{n-1} u_i$$
Or in words, the absolute value of the Jacobian determinant is simply the product of the internal node intermediate values.
Proof. One way to derive the determinant of $J_T$ is to separate the transformation $T$ into a composition of simpler transformations. The stick breaking metaphor suggests a natural decomposition: each break can be treated as its own transformation. If $T$ is thought of as a sequence of $n-1$ breaks, it can be represented as
$$T = T_{n-1} \circ \cdots \circ T_2 \circ T_1$$
where $T_i$ is the $i$th break and has the associated break proportion $y_i$. This lets us write
$$|\det J_T| = \prod_{i=1}^{n-1} |\det J_{T_i}|$$
where $J_{T_i}$ is the Jacobian matrix for $T_i$.
The single break transformation can be thought of as taking $i$ stick lengths and $n-i$ remaining breaking proportions, and breaking one of the sticks according to the first remaining proportion. Equivalently, $T_i$ reflects taking one step in the tree traversal, computing $u_{\mathrm{left}(i)}, u_{\mathrm{right}(i)}$ from $u_i, y_i$, and then discarding these latter two values. Incidentally, because only two values are replaced, it is not hard to see that each of these transformations is also a bijection between different $n$ dimensional spaces.
To give a concrete example of $T$ decomposed into single break transformations, consider a balanced tree with $n = 4$ leaves, and the input $u_1 = 100$, $y = (0.2, 0.6, 0.9)$. In the following table, each row gives the input to $T_i$, and the subsequent row its output. The final result is a vector in $\mathbb{R}^n_+$.

  i   stick lengths   remaining proportions
  1   100             0.2, 0.6, 0.9
  2   20, 80          0.6, 0.9
  3   12, 8, 80       0.9
      12, 8, 72, 8

Because $T_i$ only replaces $u_i, y_i$ with $u_{\mathrm{left}(i)}, u_{\mathrm{right}(i)}$, the Jacobian matrix is the $n$ by $n$ identity matrix except for a 2 by 2 submatrix, so that
$$|\det J_{T_i}| = \left| \det \begin{pmatrix} y_i & u_i \\ 1 - y_i & -u_i \end{pmatrix} \right| = u_i$$
and thus $|\det J_T| = \prod_{i=1}^{n-1} u_i$.
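This example, and Theorem 3.4, can be checked numerically with the sketch from earlier, using finite differences for the Jacobian; this is purely illustrative:

```python
def T(v):  # v = (u_1, y_1, y_2, y_3), balanced tree with n = 4
    return polya_tree_transform(v[0], v[1:], left, right, n=4)

v = np.array([100.0, 0.2, 0.6, 0.9])
print(T(v))  # [12.  8. 72.  8.]

# Central-difference Jacobian; its determinant should match the
# product of internal node values u_1 * u_2 * u_3 = 100 * 20 * 80.
eps = 1e-6
J = np.column_stack([(T(v + eps * e) - T(v - eps * e)) / (2 * eps)
                     for e in np.eye(4)])
print(abs(np.linalg.det(J)))  # ~160000.0
```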
A Dirichlet process is a special case of a Pólya tree distribution [Ferguson, 1974]. Here we show a variation of this fact: a Dirichlet distribution can be represented using any Pólya tree transformation, regardless of tree topology, if appropriately applied to Beta distributed random variables. This is important to RNA-Seq because a multinomial distribution is often assumed when there is thought to be no read ambiguity. A multinomial distribution renormalized to form a distribution over probability vectors is a Dirichlet distribution. So this result says that if Pólya tree transformed Beta random variables are fit to the multinomial likelihood, nothing is lost except for a constant of proportionality. If the underlying distributions can reasonably mimic Beta distributions, then we can expect the fit to the multinomial to be very good.

Definition 3.5 (Hierarchical Beta). Let $T$ be a Pólya tree transformation onto $\Delta^{n-1}$ and $\alpha \in \mathbb{R}^n_+$. Then $X \sim \mathrm{HierarchicalBeta}(T, \alpha)$ if $X = T(Y)$ where $Y_1, \ldots, Y_{n-1}$ are independent with
$$Y_i \sim \mathrm{Beta}(a_{\mathrm{left}(i)}, a_{\mathrm{right}(i)}), \qquad a_i = \sum_{j \in \mathrm{leaves}(i)} \alpha_{j-n+1}$$
where $\mathrm{leaves}(\cdot)$ corresponds to $T$'s tree topology. Put in words, if we associate $\alpha_j$ with the $j$th leaf node (which is node $j+n-1$ in our numbering scheme), $a_i$ is the sum of $\alpha$ entries corresponding to the leaf node descendants of internal node $i$.
Hierarchical Beta is simply the family of distributions induced by applying a Pólya tree transformation to Beta distributed stick breaking proportions with a specifically chosen parameterization scheme. The larger distribution family formed without constraining the Beta parameters is discussed by Dennis [1991], who calls it the "hyper-Dirichlet type 1" distribution. Recall that a Pólya tree distribution is a (potentially infinite) hierarchical stick breaking process [Lavine, 1992, 1994, Mauldin et al., 1992] with Beta distributed breaks. Hierarchical Beta distributions are thus a specific subset of Pólya tree distributions.
Definition 3.6. A terminal stick break transformation $B_{ijk} : (\mathbb{R}^{n-1}_+, (0,1)) \to \mathbb{R}^n_+$ maps $(x, y)$ to $x'$, where $x'$ is formed from $x$ by removing $x_i$ and inserting $y x_i$ and $(1-y) x_i$ such that they are $x'_j$ and $x'_k$, respectively. This is a bijection, and its inverse $B^{-1}_{ijk}$ we call an initial stick mend. It removes $x'_j$ and $x'_k$ and inserts $x_i = x'_j + x'_k$, and additionally yields $y = x'_j / (x'_j + x'_k)$.

In the proof of Theorem 3.4 we described how a Pólya tree transformation can be thought of as a series of single stick break transformations, which we used to derive the Jacobian determinant. In most trees there is a choice in the order of single stick breaks, equivalent to choosing one of multiple possible preorder traversals of the tree, each of which results in a different representation of the same Pólya tree transformation. Regardless of the ordering, there is always some final break, and this break is a terminal stick break $B_{ijk}$ for some $i, j, k$.
Some notation will be useful to decompose and build up Pólya tree distributions.
Definition 3.7. If $S : (\mathbb{R}^j_+, (0,1)^k) \to (\mathbb{R}^{j+1}_+, (0,1)^{k-1})$ is a single stick break transformation, as described in the proof of Theorem 3.4, which produces one more stick length by using the first breaking proportion, and if $k > 1$, we can drop the last unused breaking proportion to form a new transformation we denote
$$S^- : (\mathbb{R}^j_+, (0,1)^{k-1}) \to (\mathbb{R}^{j+1}_+, (0,1)^{k-2})$$
Similarly, if $k \geq 0$, we can add another unused breaking proportion, which we'll write as
$$S^+ : (\mathbb{R}^j_+, (0,1)^{k+1}) \to (\mathbb{R}^{j+1}_+, (0,1)^{k})$$
This notation lets us peel off the final break of a Pólya tree transformation. As we saw, the last break is a terminal stick break $B_{ijk}$ for some $i, j, k$. So any Pólya tree transformation $T$ with $n > 2$ can be written as $B_{ijk} \circ S$, for a transformation $S : (\mathbb{R}_+, (0,1)^{n-1}) \to (\mathbb{R}^{n-1}_+, (0,1))$. If we dispense with $S$'s unused breaking proportion, we have $S^- : (\mathbb{R}_+, (0,1)^{n-2}) \to \mathbb{R}^{n-1}_+$, which is itself a Pólya tree transformation. Similarly, we can add another breaking proportion with the $T^+$ notation to extend a Pólya tree transformation.
Lemma 3.8 (Hierarchical Beta breaking). Let $X \sim \mathrm{HierarchicalBeta}(T, \alpha)$, where $T$ is a Pólya tree transformation and $\alpha \in \mathbb{R}^n_+$. Let $Z \sim \mathrm{Beta}(a, b)$ where $a + b = \alpha_i$ for some $1 \leq i \leq n$, and let $B_{ijk}$ be a terminal stick break. Then
$$B_{ijk}(X, Z) \sim \mathrm{HierarchicalBeta}(B_{ijk} \circ T^+, \alpha')$$
where $\alpha'$ is formed by removing $\alpha_i$ from $\alpha$ and inserting $a$ and $b$ such that $\alpha'_j = a$ and $\alpha'_k = b$.
Proof. The original Hierarchical Beta distribution is defined by the Pólya tree transformation $T$ applied to $n-1$ Beta distributed random variables $Y_1, \ldots, Y_{n-1}$, parameterized with $\alpha$ as they are in Definition 3.5.
Applying the terminal stick break $B_{ijk}$ simply does one more break according to one more Beta distributed random variable $Z$. So the resulting distribution is defined by the Pólya tree transformation $B_{ijk} \circ T^+$ applied to the Beta random variables $Y' = (Y_1, \ldots, Y_{n-1}, Z)$. It only remains to be seen that $Y'$ follows the Hierarchical Beta parameterization with $\alpha'$.
$Z$ is defined to trivially obey these rules, and restricting $a + b = \alpha_i$ and inserting $a, b$ into $\alpha$ to form $\alpha'$ preserves the parameterization of the rest of the distribution, so we can say $B_{ijk}(X, Z) \sim \mathrm{HierarchicalBeta}(B_{ijk} \circ T^+, \alpha')$.
Lemma 3.9 (Dirichlet mending). If $X \sim \mathrm{Dirichlet}(\alpha)$ and $B^{-1}_{ijk}$ is an initial stick mend, then
$$(X', Y) = B^{-1}_{ijk}(X), \qquad X' \sim \mathrm{Dirichlet}(\alpha'), \qquad Y \sim \mathrm{Beta}(\alpha_j, \alpha_k)$$
where $\alpha'$ is formed from $\alpha$ by removing $\alpha_j$ and $\alpha_k$ and inserting their sum so that $\alpha'_i = \alpha_j + \alpha_k$.
Proof. This follows from two properties of the Dirichlet distribution: subcompositional invariance and amalgamation invariance (see, for example, Pawlowsky-Glahn et al. [2015], p. 121). If $x \sim \mathrm{Dirichlet}(\alpha)$, the former property says, in the narrow case needed here, that for any $1 \leq k_1 < k_2 \leq n$,
$$\left( \frac{x_{k_1}}{x_{k_1} + x_{k_2}}, \frac{x_{k_2}}{x_{k_1} + x_{k_2}} \right) \sim \mathrm{Dirichlet}(\alpha_{k_1}, \alpha_{k_2})$$
or, equivalently,
$$\frac{x_{k_1}}{x_{k_1} + x_{k_2}} \sim \mathrm{Beta}(\alpha_{k_1}, \alpha_{k_2})$$
The latter property says that if $\{1, \ldots, n\}$ is partitioned into $k > 1$ nonempty sets $I_1, \ldots, I_k$, then
$$\left( \sum_{i \in I_1} x_i, \ldots, \sum_{i \in I_k} x_i \right) \sim \mathrm{Dirichlet}\left( \sum_{i \in I_1} \alpha_i, \ldots, \sum_{i \in I_k} \alpha_i \right)$$
For any $X \sim \mathrm{Dirichlet}(\alpha)$ and initial stick mend $B^{-1}_{ijk}$, let $(X', Y) = B^{-1}_{ijk}(X)$. From the definition of an initial stick mend, $Y = \frac{X_j}{X_j + X_k}$. The subcompositional invariance property then gives us $Y \sim \mathrm{Beta}(\alpha_j, \alpha_k)$. Since $\alpha'$ is formed by summing two elements of $\alpha$, by the amalgamation invariance property, $X' \sim \mathrm{Dirichlet}(\alpha')$.

Theorem 3.10. For any Pólya tree transformation $T : (0,1)^{n-1} \to \Delta^{n-1}$ where $n \geq 2$, and any $\alpha \in \mathbb{R}^n_+$,
$$\mathrm{Dirichlet}(\alpha) = \mathrm{HierarchicalBeta}(T, \alpha)$$
In other words, the Dirichlet distribution family and the Hierarchical Beta distribution family are exactly the same, regardless of the tree topology used for the Hierarchical Beta.
Proof. The proof proceeds by induction over $n$. For the base case $n = 2$, $T$ is a single break with $u_1 = 1$ fixed, so $\mathrm{HierarchicalBeta}(T, \alpha)$ is the distribution of $(Y, 1-Y)$ with $Y \sim \mathrm{Beta}(\alpha_1, \alpha_2)$, which is exactly $\mathrm{Dirichlet}(\alpha)$.

For $n > 2$, write $T = B_{ijk} \circ S$ as above. By Lemma 3.9, if $X \sim \mathrm{Dirichlet}(\alpha)$ then
$$(X', Y) = B^{-1}_{ijk}(X), \qquad X' \sim \mathrm{Dirichlet}(\alpha'), \qquad Y \sim \mathrm{Beta}(\alpha_j, \alpha_k)$$
where $\alpha'$ is defined as it is in Lemma 3.9. By the inductive hypothesis, $X'$ is also Hierarchical Beta distributed for any Pólya tree transformation onto $\Delta^{n-2}$, and since $S^-$ is a Pólya tree transformation onto $\Delta^{n-2}$,
$$X' \sim \mathrm{HierarchicalBeta}(S^-, \alpha')$$
and by Lemma 3.8, noting that $(S^-)^+ = S$,
$$X = B_{ijk}(X', Y) \sim \mathrm{HierarchicalBeta}(B_{ijk} \circ S, \alpha) = \mathrm{HierarchicalBeta}(T, \alpha)$$
and so we have $\mathrm{Dirichlet}(\alpha) = \mathrm{HierarchicalBeta}(T, \alpha)$. By induction, this equality holds for all such $\alpha$, $T$, and any $n \geq 2$.
This result can be strengthened to say that not only can any Dirichlet distribution be constructed using any Pólya tree transformation of the same dimensionality, but the Hierarchical Beta construction is the only way to do so. First, we need a simple result about random variables under bijections.
Lemma 3.11. For any random variables $Y = (Y_1, \ldots, Y_n)$ and $Y' = (Y'_1, \ldots, Y'_n)$, and bijection $T$, if $T(Y)$ and $T(Y')$ are identically distributed, then $Y$ and $Y'$ must also be identically distributed.
Proof. To show the contrapositive, suppose $Y$ and $Y'$ are not identically distributed. There is then some $y$ where $p_Y(y) \neq p_{Y'}(y)$. Since
$$p_{T(Y)}(T(y)) = p_Y(y) \left| \det J_T(y) \right|^{-1}$$
where $J_T$ is the Jacobian matrix for $T$, it follows that $p_{T(Y)}(T(y)) \neq p_{T(Y')}(T(y))$. Therefore $T(Y)$ and $T(Y')$ are not identically distributed either.
Next, we can strengthen Theorem 3.10.
Theorem 3.12. For any Pólya tree transformation $T : (0,1)^{n-1} \to \Delta^{n-1}$ and $\alpha \in \mathbb{R}^n_+$, if $Y' = (Y'_1, \ldots, Y'_{n-1})$ are random variables such that $T(Y') \sim \mathrm{Dirichlet}(\alpha)$, then the $Y'_i$ are Beta distributed with the Hierarchical Beta parameterization.

Proof. For any Pólya tree transformation $T$ and $\alpha \in \mathbb{R}^n_+$, we know from Theorem 3.10 that if $Y = (Y_1, \ldots, Y_{n-1})$ are Beta distributed with the Hierarchical Beta parameterization, then $T(Y) \sim \mathrm{Dirichlet}(\alpha)$. From Lemma 3.11, any other random variables $Y' = (Y'_1, \ldots, Y'_{n-1})$ where $T(Y') \sim \mathrm{Dirichlet}(\alpha)$ must be identically distributed to $Y$, and so must also be Beta distributed with the same parameterization.
To summarize these results: any Dirichlet distribution can be constructed using any Pólya tree transformation of the same dimensionality, and furthermore this construction is unique.

Minimizing KL-divergence
Minimizing the KL-divergence between the approximation $q$ and the normalized likelihood function $\bar{P}(\cdot)$ (defined in Equation 3 of the Methods section), given a set of observed reads $r$, uses a standard variational inference procedure, which we briefly recapitulate here.
Recall that the approximation $q$ can be represented as a parameterized transformation $S$ applied to a non-parameterized distribution (i.e., the reparameterization trick). In our case, the non-parameterized distribution is simply a standard $n-1$ dimensional Normal distribution $N(0, I)$. The transformation $S$ includes the sinh-asinh transformation, shifting and scaling, the logistic transformation, and the Pólya tree transformation $T$, which we collapse into a single function here for simplicity.
Explicitly, the combined transformation, for $y \in \mathbb{R}^{n-1}$, is
$$S_{\mu,\sigma,\alpha}(y) = T(\mathrm{logistic}(\mu + \sigma \odot \sinh(\operatorname{asinh}(y) + \alpha)))$$
where $\mu, \sigma, \alpha$ is the parameterization, $\odot$ is the element-wise product, and the logistic, sinh, and asinh functions are also applied element-wise. The logistic function has the standard definition $\mathrm{logistic}(x) = (1 + \exp(-x))^{-1}$, and $T$ is a Pólya tree transformation. Note that $T$ is not parameterized, since it is chosen prior to optimization, as discussed in the Methods section. To simplify notation, we combine the parameter vectors as $\phi = (\mu, \sigma, \alpha)$ and use $S_\phi$ to denote the parameterized transformation.
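Continuing the NumPy sketch from earlier, the combined transformation might look like the following (again illustrative, not Polee's actual implementation):

```python
def S(z, mu, sigma, alpha, left, right, n):
    # sinh-asinh transform, shift and scale, logistic, then Polya tree.
    w = mu + sigma * np.sinh(np.arcsinh(z) + alpha)
    y = 1.0 / (1.0 + np.exp(-w))
    return polya_tree_transform(1.0, y, left, right, n)
```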
The approximate normalized likelihood $q_\phi$ is defined as the transformation $S_\phi$ applied to a multivariate standard Normal $z \sim \mathrm{Normal}(0, I)$. If $p$ is the density function for this Normal distribution, the approximation's density function is then
$$q_\phi(x) = p(S_\phi^{-1}(x)) \left|\det J_{S_\phi^{-1}}(x)\right|$$
where $J_{S_\phi^{-1}}$ is the Jacobian matrix for the inverse $S_\phi$ transformation. With this setup in mind, the objective function can be rewritten in a form amenable to stochastic gradient descent:
$$\mathrm{KL}(q_\phi \parallel \bar{P}) = \mathbb{E}_{z}\left[\log p(z) - \log\left|\det J_{S_\phi}(z)\right| - \log \bar{P}(S_\phi(z))\right]$$
$$= -\mathbb{E}_{z}\left[\log\left|\det J_{S_\phi}(z)\right| + \log \bar{P}(S_\phi(z))\right] + \mathrm{const}$$
$$= -\mathbb{E}_{z}\left[\log\left|\det J_{S_\phi}(z)\right| + \log P(r \mid S_\phi(z))\right] + \mathrm{const}$$
The penultimate step follows because $\mathbb{E}_z[-\log p(z)]$ is a constant (specifically, the entropy of a multivariate standard Normal distribution), and in the final step we replace the normalized likelihood $\bar{P}(\cdot)$ with the actual likelihood $P(r \mid \cdot)$, since the constant of proportionality becomes an irrelevant additive constant here.
Intuitively, the objective chooses an approximation that maximizes the expected log-likelihood, with a Jacobian term accounting for the transformation. What we are left with is an expectation that can be easily optimized through stochastic gradient descent. Each iteration of the optimization algorithm draws $z$ from $N(0, I)$, transforms the vector, evaluates the RNA-Seq log-likelihood function, adds to that the log of the Jacobian determinant of the transformation, and computes the gradients with respect to $\phi$. In practice we use Adam [Kingma and Ba, 2014] to update the $\phi$ estimates each iteration, running 500 iterations, with each iteration estimating gradients using 6 Monte Carlo samples of $z$.
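A condensed sketch of this loop, written against TensorFlow's eager API, is shown below. The `polya_tree_transform` op (returning both $T(y)$ and $\log|\det J_T|$) and the RNA-Seq `log_likelihood` function are hypothetical stand-ins for Polee's actual implementations:

```python
import tensorflow as tf

def fit_approximation(log_likelihood, polya_tree_transform, n,
                      n_iter=500, n_mc=6, lr=0.02):
    mu = tf.Variable(tf.zeros(n - 1))
    log_sigma = tf.Variable(tf.zeros(n - 1))  # optimize log(sigma) so sigma > 0
    alpha = tf.Variable(tf.zeros(n - 1))
    opt = tf.keras.optimizers.Adam(learning_rate=lr)

    for _ in range(n_iter):
        with tf.GradientTape() as tape:
            z = tf.random.normal([n_mc, n - 1])
            # sinh-asinh transform, accumulating its log-Jacobian
            s = tf.sinh(tf.asinh(z) + alpha)
            log_det = tf.reduce_sum(
                tf.math.log(tf.cosh(tf.asinh(z) + alpha))
                - 0.5 * tf.math.log(1.0 + z * z), axis=-1)
            # shift and scale, then logistic, with their log-Jacobians
            w = mu + tf.exp(log_sigma) * s
            log_det += tf.reduce_sum(log_sigma)
            y = tf.sigmoid(w)
            log_det += tf.reduce_sum(tf.math.log(y) + tf.math.log(1.0 - y), axis=-1)
            # Polya tree transformation and its log-Jacobian
            x, pt_log_det = polya_tree_transform(y)
            log_det += pt_log_det
            # negative ELBO, estimated from n_mc Monte Carlo samples
            loss = -tf.reduce_mean(log_likelihood(x) + log_det)
        grads = tape.gradient(loss, [mu, log_sigma, alpha])
        opt.apply_gradients(zip(grads, [mu, log_sigma, alpha]))
    return mu, log_sigma, alpha
```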

Regression Model
Differential expression is determined throughout the paper using a probabilistic linear regression model on log-expression, but with the additional layer of approximate likelihood.
In defining the model we will use the following notation:

$m$: number of samples in the experiment
$n$: number of transcripts
$k$: number of factors/columns in the design matrix
$X$: $m$ by $n$ matrix of log-expression values
$E$: $m$ by $n$ noise matrix
$A$: $m$ by $k$ design matrix
$W$: $k$ by $n$ matrix of regression coefficients
$\mu$: length $n$ log-expression bias vector
$\sigma$: length $n$ transcript standard deviation vector
$r^{(j)}$: some representation of the RNA-Seq reads for the $j$th sample
$\mathbf{1}$: ones vector (length $m$ in all uses below)

The basic regression model is defined by
$$X = \mathbf{1}\mu^T + AW + E, \qquad E_{ij} \sim \mathrm{Normal}(0, \sigma_j^2), \qquad r^{(j)} \sim P(r^{(j)} \mid \mathrm{softmax}(X_j^T))$$
where $X_j^T$ is the $j$th row of $X$, transposed. The softmax function transforms the log-expression to relative expression. There are some subtleties to this move, noted below. The bias vector $\mu$ has a Normal prior with a large variance.
Apart from the inclusion of the extra layer r|X, treating the reads as observed and expression values as unobserved, this is a fairly standard log-linear model of gene expression, similar to the one described by Smyth [2004], for example.
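As a toy illustration of the generative structure (with made-up sizes and parameters; the actual model infers these quantities under the priors described below):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 10, 2                       # samples, transcripts, factors

A = np.zeros((m, k)); A[:, 0] = 1.0; A[3:, 1] = 1.0  # intercept + condition
W = rng.normal(0.0, 1.0, (k, n))         # regression coefficients
mu = rng.normal(0.0, 4.0, n)             # log-expression bias
sigma = np.full(n, 0.2)                  # transcript standard deviations

E = rng.normal(0.0, sigma, (m, n))       # noise
X = mu + A @ W + E                       # log-expression
P = np.exp(X) / np.exp(X).sum(axis=1, keepdims=True)  # softmax: relative expression
```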

Prior on regression coefficients
Operating under the prior expectation that most transcripts are not differentially expressed, $W$ is given a sparsity inducing prior. Elements $W_{ij}$ have the prior given by
$$W_{ij} \mid \lambda_{ij} \sim \mathrm{Normal}(0, \lambda_{ij}^2), \qquad \lambda_{ij} \mid \eta_{ij} \sim \mathrm{Cauchy}^+(0, \tau \eta_{ij}), \qquad \eta_{ij} \sim \mathrm{Cauchy}^+(0, 1)$$
This is a version of what Bhadra et al. [2017] describe as the "Horseshoe+" prior.
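Continuing the toy sketch above, a draw from this hierarchy (as we read the Horseshoe+ construction; `tau` is a hypothetical global scale) redrawing $W$ under the sparse prior:

```python
tau = 0.1
eta = np.abs(rng.standard_cauchy((k, n)))              # eta_ij ~ Cauchy+(0, 1)
lam = tau * eta * np.abs(rng.standard_cauchy((k, n)))  # lambda_ij ~ Cauchy+(0, tau * eta_ij)
W = rng.normal(0.0, lam)                               # W_ij ~ Normal(0, lambda_ij^2)
```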

Modeling variance
Transcript expression standard deviation σ is shrunk towards standard deviation estimates of transcripts with similar expression.
To do this we choose $h = 15$ "knots" at expression values, equally spaced between the minimum and maximum, and compute weights $u_{j\ell}$ between every transcript $j$ and knot $\ell$, using the squared difference between the knot's expression value and the transcript's expression averaged across samples. These weights are pre-computed using initial approximate posterior mean expression estimates, obtained by applying the approximation's transformation to a zero vector. Each knot has associated parameters $\alpha_\ell, \beta_\ell$.
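One plausible reading of this weighting scheme, sketched with a squared-difference kernel (the exact kernel and normalization are our assumption, not taken from Polee):

```python
h = 15
x_mean = X.mean(axis=0)                               # mean log-expression per transcript
knots = np.linspace(x_mean.min(), x_mean.max(), h)
u = np.exp(-(x_mean[:, None] - knots[None, :]) ** 2)  # weight from squared difference
u /= u.sum(axis=1, keepdims=True)                     # normalize across knots
```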

Scale penalization
Because RNA-Seq measures relative expression, detecting changes in absolute expression involves additional assumptions. Various normalization schemes have been proposed to overcome this [Bullard et al., 2010]. Other methods avoid the problem by looking for compositional changes [McGee et al., 2019]. Similar to the normalization schemes, our model attempts to find changes in absolute expression by assuming most transcripts or genes are maintaining relatively constant expression (or constant lack of expression).
Because of the softmax transformation, log-expression values are not on a fixed scale, which is to say that for a given sample $i$, $\sum_{j=1}^n \exp(x_{ij})$ is not constrained to sum to any particular number. How relative scales between samples are inferred determines how changes in transcript composition are interpreted as changes in expression.
Because of the sparsity inducing prior, higher probability is achieved when fewer transcripts are differentially expressed. If this assumption is reasonable, relative scaling will be arrived at automatically so as to minimize differences between samples. Of course, there are circumstances when this assumption fails, particularly in the presence of large changes in the overall quantity of RNA between groups of cells.
In practice we found it helpful for inference to additionally penalize these exponent sums to keep them from straying too far. To find reasonable values, we use initial posterior mean estimates to estimate relative scales that minimize differential expression of highly expressed transcripts, then include an additional Normal prior in the model on these exponent sums, centered on the scale estimates. This is simply one approach to putting an informative prior on relative scale.

Inference
The model is fit using variational inference implemented in TensorFlow. Computing the Pólya tree transformation efficiently with tensor operations, particularly many different transformations with different tree topologies, is not straightforward. To avoid this issue, we implemented a custom TensorFlow operation in C++ which computes transformations in parallel on CPUs. Currently, there is no GPU version of this operation, so while running inference on GPUs greatly speeds it up (for example, running a regression on 96 samples took 44 minutes with a GPU, and 168 minutes without), computing the Pólya tree transformation remains the bottleneck, and an obvious target for further improvements in efficiency.
When calling differential expression with this model, effect size is always considered explicitly. Because the model is continuous, the posterior probability of no change in expression is zero. We instead look for transcripts or genes with a sufficiently large posterior probability of the effect size being above some threshold. Unlike most null-hypothesis test models, which typically (though not necessarily) adopt a null hypothesis of zero change, there is an extra threshold that must be chosen: the minimum effect size. This is a slight complication, but has the advantage of being up front about what is considered a potentially interesting effect.
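For example, given posterior samples of a coefficient (a hypothetical `effect_samples` array, one row per posterior draw), a call might look like:

```python
min_effect = np.log(1.5)   # smallest log fold-change considered interesting
post_prob = (np.abs(effect_samples) > min_effect).mean(axis=0)
called = post_prob > 0.9   # sufficiently large posterior probability
```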

Bias correction
Bias correction methods implemented in Polee follow a methodology similar to our prior work in Jones et al. [2012], but incorporate some of the advancements made in Love et al. [2016] to produce a more complete model. The essential approach is to train a probabilistic classifier to discriminate between observed fragments and random fragments from the same transcripts, then use the resulting probability ratios to compute bias. The procedure is as follows:
1. Randomly sample a subset of reads.
2. Assign these reads to transcripts by maximum likelihood, to produce a set of implied fragments $F_f$.
3. Reposition fragments in $F_f$ uniformly at random within their transcripts to produce a set of hypothetical background fragments $F_b$.
4. Train a probabilistic classifier to estimate $P(f \in F_f \mid c(f))$, where $c(f)$ is the confounding features (e.g., GC content) of the fragment.
5. Compute the bias for a fragment $f$ as
$$\mathrm{bias}(f) = \frac{P(c(f) \mid f \in F_f)}{P(c(f) \mid f \in F_b)}$$
Here $|F_f| = |F_b|$, so the obvious prior is $P(f \in F_f) = 0.5$, in which case the posterior probability is simple to compute from the Bayes factor, or vice versa:
$$\mathrm{bias}(f) = \frac{P(f \in F_f \mid c(f))}{1 - P(f \in F_f \mid c(f))}$$
If there are no technical effects biasing read coverage, this classifier will (setting aside any issues of overfitting or noise) match the prior of 0.5 and the bias will be computed as 1 for every fragment.
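In code, converting the classifier's output into the bias is a one-liner (a sketch, with `p` the classifier's estimate of $P(f \in F_f \mid c(f))$):

```python
def fragment_bias(p):
    # Posterior odds equal the Bayes factor under the 0.5 prior.
    return p / (1.0 - p)
```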
We train separate models for sequence bias at fragment ends, fragment GC content bias, and positional bias, at each step reweighting each training example according to the classification error of the combined model up to that point, in a manner similar to AdaBoost [Bishop, 2006]. Each model alone is relatively simple.
• Fragment end sequence Sequence probability is modeled with a variable-order Markov chain. The order at each position is fit by a greedy hill climbing approach: at each iteration, the order of the chain is incremented at the position that most improves the cross entropy, and iteration stops when no improvement is possible. Separate models are built for the 3' and 5' ends of the fragment, which correspond to first- and second-strand cDNA synthesis, as we observe slightly different bias patterns for the two sides.
• Fragment GC content A simple histogram is used where examples are split into bins so that the number of examples in each bin is approximately equal. The number of bins is optimized over.
• Positional bias Fragments are binned by relative position and transcript length, a histogram is computed, and bins are linearly interpolated (and extrapolated). The number of transcript length bins and relative position bins are optimized over.
The optimization carried out in each sub-model is performed by evaluating cross entropy on a portion of the training data that is set aside. The remainder of the training data is used for setting parameters to their maximum likelihood value, which is trivial for histograms and Markov chain models.