RankProd 2.0: a refactored Bioconductor package for detecting differentially expressed features in molecular profiling datasets

Abstract

Motivation: The Rank Product (RP) is a statistical technique widely used to detect differentially expressed features in molecular profiling experiments such as transcriptomics, metabolomics and proteomics studies. An implementation of the RP and the closely related Rank Sum (RS) statistics has been available in the RankProd Bioconductor package for several years. However, several recent advances in the understanding of the statistical foundations of the method have made a complete refactoring of the existing package desirable.

Results: We implemented a completely refactored version of the RankProd package, which provides a more principled implementation of the statistics for unpaired datasets. Moreover, the permutation-based P-value estimation methods have been replaced by exact methods, providing faster and more accurate results.

Availability and implementation: RankProd 2.0 is available at Bioconductor (https://www.bioconductor.org/packages/devel/bioc/html/RankProd.html) and as part of the mzMatch pipeline (http://www.mzmatch.sourceforge.net).

Supplementary information: Supplementary data are available at Bioinformatics online.

• Assumption 1: Among all the measured features, only a few molecular species are differentially expressed;
• Assumption 2: The measurements are independent between replicate experiments;
• Assumption 3: Most of the changes are independent of each other;
• Assumption 4: The variance is about equal for all features.
Suppose we have an omics dataset where a total of N molecular features (e.g. genes, proteins, metabolites) have been measured in K replicated experiments. Let r_{i,j} be the position of the i-th feature in the j-th replicate experiment in a list ordered according to fold changes (in decreasing order if we are interested in upregulated features, in increasing order otherwise). Under the null hypothesis (no differentially expressed molecular features are present in the dataset), the position of a feature in the list generated by one replicate follows a uniform distribution (i.e. P(r_i = x) = 1/N, where x ∈ {1, ..., N}). Considering all K replicates, it is extremely unlikely to find the same feature at the top of each list just by chance: the probability of a feature being ranked first in every replicate is exactly 1/N^K. The RP (Rank Product) statistic for the i-th feature is defined as the geometric mean of the ranks of the feature over the replicates:

P_i = (∏_{j=1}^{K} r_{i,j})^{1/K}    (1)

The related RS (Rank Sum) statistic is defined as the arithmetic mean of the ranks:

RS_i = (1/K) Σ_{j=1}^{K} r_{i,j}    (2)

The molecular features with the smallest RP (or RS) values are the most likely to be upregulated or downregulated (according to the order chosen when ranking the fold changes, and under the assumptions listed above). Detailed comparative evaluations of the performance of RP and RS on simulated and real experimental datasets have been published by Breitling et al. (2004), Breitling and Herzyk (2005a), Breitling and Herzyk (2005b), Jeffery et al. (2006), Koziol (2010a) and Koziol (2010b).
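Both statistics are straightforward to compute from the per-replicate rank lists. The following Python sketch (illustrative only; the package itself is written in R, and the function names here are not part of it) derives the ranks from per-replicate fold changes and evaluates RP and RS per feature:

```python
from math import prod

def ranks_per_replicate(fold_changes):
    """Rank features within one replicate (rank 1 = strongest upregulation)."""
    order = sorted(range(len(fold_changes)), key=lambda i: -fold_changes[i])
    ranks = [0] * len(fold_changes)
    for pos, i in enumerate(order, start=1):
        ranks[i] = pos
    return ranks

def rank_product(replicates):
    """RP_i: geometric mean of the K ranks of feature i (Eq. 1).

    replicates: list of K replicates, each a list of N fold changes."""
    k = len(replicates)
    rank_lists = [ranks_per_replicate(rep) for rep in replicates]
    n = len(replicates[0])
    return [prod(r[i] for r in rank_lists) ** (1.0 / k) for i in range(n)]

def rank_sum(replicates):
    """RS_i: arithmetic mean of the K ranks of feature i (Eq. 2)."""
    k = len(replicates)
    rank_lists = [ranks_per_replicate(rep) for rep in replicates]
    n = len(replicates[0])
    return [sum(r[i] for r in rank_lists) / k for i in range(n)]
```

A feature that consistently ranks near the top of every replicate list receives a small RP (or RS) value, which is what the significance calculations below operate on.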

P-value evaluation
Selecting the variables with the smallest RP (or RS) values for further study can be sufficient, but it is of great interest to know how statistically significant the changes are and how many of the selected variables are likely to be truly differentially expressed. In other words, the evaluation of the p-values associated with the RP and RS values plays a key role in the interpretation of the results. In the original version of the method, the determination of significance levels was based on a simple but time-consuming permutation-based estimation procedure, both for the RP and the RS (Breitling et al., 2004). This method generates a large number of random experiments with the same number of replicates and features as the real data in order to evaluate the percentage of false positives (pfp):

1. generate p permutations of K ranked lists of dimension (1 × N);
2. calculate the RP (or RS) values of the N features in each permutation;
3. let c be the number of times the rank products in the permutations are smaller than or equal to the observed rank product value of the i-th feature;
4. evaluate the expected value of the rank product: E(RP_i) = c/p;
5. evaluate the percentage of false positives: pfp_i = E(RP_i)/rank_i, where rank_i is the position of the i-th feature in a list sorted by increasing rank products.
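The steps above can be sketched as follows (an illustrative Python reimplementation, not the package's R code; the helper names and the default number of permutations are arbitrary):

```python
import random
from bisect import bisect_right
from math import prod

def null_rank_products(n, k, rng):
    """One permuted dataset: K independent random rankings of N features."""
    ranks = [rng.sample(range(1, n + 1), n) for _ in range(k)]
    return [prod(ranks[j][i] for j in range(k)) ** (1.0 / k)
            for i in range(n)]

def permutation_pfp(observed_rp, k, n_perm=100, seed=0):
    """Estimate the percentage of false positives (pfp) by permutation.

    observed_rp: the N observed RP values; k: number of replicates."""
    rng = random.Random(seed)
    n = len(observed_rp)
    # Pool all null RP values from the p permuted datasets.
    null_values = []
    for _ in range(n_perm):
        null_values.extend(null_rank_products(n, k, rng))
    null_values.sort()
    # rank_i: position of feature i when sorted by increasing RP.
    order = sorted(range(n), key=lambda i: observed_rp[i])
    pfp = [0.0] * n
    for rank_i, i in enumerate(order, start=1):
        c = bisect_right(null_values, observed_rp[i])  # null RPs <= observed
        expected = c / n_perm        # E(RP_i) = c / p
        pfp[i] = expected / rank_i   # pfp_i = E(RP_i) / rank_i
    return pfp
```

The estimate only becomes reliable as n_perm grows, which is exactly the computational burden the exact methods described next remove.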
This procedure automatically copes with the multiple testing problem. However, especially when dealing with a large number of features, as is typical in modern molecular profiling experiments, this method requires a computationally demanding number of permutations to obtain a reliable estimate. To tackle this problem, Koziol (2010a) used a gamma distribution to approximate the distribution of the rank products. This considerably speeds up the calculations, but the approximation has been shown to be imprecise in the tails of the distribution, i.e. for the most interesting (most strongly changed) features (Eisinga et al., 2013). Eisinga et al. (2013) derived the exact probability distribution of the Rank Product statistic. The use of exact p-values offers a significant improvement over both the gamma approximation and the permutation estimate, especially for the small p-values in the tails. Unfortunately, the evaluation of the exact p-values is computationally demanding, especially with large numbers of features and replicates. For this reason, in the new version of the package the evaluation of the p-values for the RP statistic uses an alternative method proposed by Heskes et al. (2014), which provides a very accurate approximation of the p-values in a computationally fast manner by computing strict bounds for the exact p-values and approximating the latter as the geometric mean of the bounds. Regarding the RS, we have incorporated a new method to compute the exact p-value. It is easy to see that, under the null hypothesis, the probability distribution of the RS in an experiment with N features and K replicates is equivalent to the probability distribution of the sum of the outcomes obtained by rolling K dice with N faces each. This classic problem of traditional statistics has a long history that goes back to the 16th century.
It was solved by enumeration for two and three ordinary dice in Cardano's Liber de Ludo Aleae (Cardano, 1663, pp. 264-265), written in 1564 and published many years later (Bellhouse, 2005). The same problem was also tackled by Blaise Pascal and Pierre de Fermat in a number of letters exchanged in 1654, which mark the beginning of modern statistics (Hald, 2003, pp. 67-68). Their findings and additional analyses were published by Huygens (1657, pp. 521-534) in De Ratiociniis in Ludo Aleae. In the early 18th century, de Moivre (1711) used the generating function to obtain the probability of a sum of s points on a roll of K fair N-sided dice. Concurrently, Rémond de Montmort (1713) provided a combinatorial solution for this distribution. For a detailed historical account see Hald (2003, pp. 204-213). The distribution is described as an exercise or example demonstrating the power of the probability generating function as a problem-solving tool in classic and modern textbooks on statistics (Uspensky, 1937, pp. 23-24; Feller, 1968, pp. 284-285; Johnson and Kotz, 1977, pp. 56-57; Berry and Lindgren, 1996, pp. 107-111; DasGupta, 2010, pp. 84-85) and combinatorics (Dobrushkin, 2009, pp. 369-371), and it is discussed in various statistical papers (Tsao, 1955; Takács, 1967; Murty, 1981; Neff, 1982; Bollinger, 1986; Fielder and Alford, 1991; Balasubramanian et al., 1995; Albert, 2002; Ericksen, 2002; McShane and Ratliff, 2003) and resources (Weisstein, 2016). We describe the exact null distribution in some detail, as the information drawn together here is available but widely scattered in the literature.
This may also be one of the reasons the exact distribution has been reinvented many times and in various fields, including genetics (Levinson et al., 2003), psychology (Edgington, 1972; Strube and Miller, 1986), veterinary science (Reiczigel, 1999), agricultural science (Luo et al., 2010), rank sum testing (Özkan, 1987), gaming research (Singh et al., 2011) and statistics (Shen and Marston, 1995). Albert (2002) and Eger (2013), amongst others, derived the exact null distribution using an inclusion-exclusion argument. We have K independent rankings (replicate experiments), each with integer-valued ranks 1, 2, ..., N (features). Under the null hypothesis, the random rankings have a discrete uniform distribution with probability P(R = r) = p_r = N^{-1}, for 1 ≤ r ≤ N. The generating function of the sequence of probabilities of the random rankings is the finite power series

f(t; N) = Σ_{r=1}^{N} p_r t^r = (1/N) Σ_{r=1}^{N} t^r,

where the probabilities p_r of the integer rank values r are the coefficients of t^r. This function can be put in compact form using the formula for the sum of a finite geometric series:

f(t; N) = t(1 − t^N) / (N(1 − t)).

From the convolution theorem (e.g. Feller, 1968, p. 267; Johnson et al., 2005, p. 60), the probability generating function of the sum of K mutually independent rankings, each assuming the ranks 1, 2, ..., N, is given by the K-th power of f(t; N):

f(t; N, K) = [f(t; N)]^K = t^K (1 − t^N)^K / (N^K (1 − t)^K)
           = (t^K/N^K) Σ_{i=0}^{K} (−1)^i C(K, i) t^{Ni} Σ_{l≥0} (−1)^l C(−K, l) t^l,

where C(n, k) denotes the binomial coefficient. The first sum on the right is the binomial expansion of (1 − t^N)^K and the second sum is the power series expansion of (1 − t)^{−K}. Using the upper negation identity (−1)^l C(−K, l) = C(K + l − 1, l), we obtain

f(t; N, K) = (1/N^K) Σ_{i=0}^{K} (−1)^i C(K, i) t^{K+Ni} Σ_{l≥0} C(K + l − 1, l) t^l.    (8)

To calculate the probability P(S = s; N, K) of rank sum s as the coefficient of the corresponding power of the auxiliary variable t, we write this product of sums as a double sum and collect all terms involving powers of t, which gives

f(t; N, K) = (1/N^K) Σ_{i=0}^{K} Σ_{l≥0} (−1)^i C(K, i) C(K + l − 1, l) t^{K + Ni + l}.    (9)

With the change of variable s = K + Ni + l, the expression becomes

f(t; N, K) = (1/N^K) Σ_s [ Σ_i (−1)^i C(K, i) C(s − Ni − 1, K − 1) ] t^s,    (10)

by noting that, whenever l ≥ 0 and K + Ni + l = s, we have C(K + l − 1, l) = C(s − Ni − 1, s − K − Ni) = C(s − Ni − 1, K − 1).
Only finitely many terms in this sum are different from zero. To obtain these non-zero terms one should note that the binomial coefficient C(s − Ni − 1, K − 1) is greater than zero only when s − Ni − K ≥ 0, and this is the case only for i ≤ (s − K)/N. Hence the other terms do not contribute to the summation in f(t; N, K). The probability of obtaining rank sum s is therefore

P(S = s; N, K) = (1/N^K) Σ_{i=0}^{⌊(s−K)/N⌋} (−1)^i C(K, i) C(s − Ni − 1, K − 1),

where ⌊·⌋ is the floor function, i.e. the largest integer less than or equal to the argument. The quantity N^K P(S = s; N, K) gives the number of ways to obtain rank sum s for K mutually independent rankings, each assuming the ranks 1, 2, ..., N, i.e. the number of compositions of an integer s = r_1 + r_2 + ... + r_K, where 1 ≤ r_i ≤ N, for i = 1, 2, ..., K. It may also be described as the number of ways to place s indistinguishable balls in K cells so that each cell receives at least 1, but no more than N, balls (Albert, 2002; Dobrushkin, 2009, p. 369). The cumulative distribution of rank sum s has the generating function f(t; N, K)/(1 − t) and is obtained as

P(S ≤ s; N, K) = (1/N^K) Σ_{i=0}^{⌊(s−K)/N⌋} (−1)^i C(K, i) C(s − Ni, K)

by repeated application of Pascal's identity C(n, k) = C(n − 1, k − 1) + C(n − 1, k) (Feller, 1968, p. 285; Koziol and Feng, 2004). The probability mass function of the discrete random rank sum is symmetric (Tsao, 1955), that is,

P(S = s; N, K) = P(S = K(N + 1) − s; N, K).

For the cumulative probabilities this implies that only half of the values need to be calculated and that the remaining half can be obtained by the identity

P(S ≤ s; N, K) = 1 − P(S ≤ K(N + 1) − s − 1; N, K)

given in Tsao (1956). In particular, for the number of compositions we have

N^K P(S = s; N, K) = N^K P(S = K(N + 1) − s; N, K),

as in Wardlaw (1991), and for the p-values

P(S ≥ s; N, K) = P(S ≤ K(N + 1) − s; N, K).

The exact mean and variance of the random RS variable are

E(S) = K(N + 1)/2 and Var(S) = K(N^2 − 1)/12.

As the distribution is symmetric, the skewness is zero, as are all odd central moments. The excess kurtosis is less than zero (Whitfield, 1953), implying that the distribution has thinner extreme tails than the normal. Various studies have shown, however, that under the null hypothesis the sum of ranks is asymptotically normally distributed (Stuart, 1954; Tsao, 1956; Koziol and Feng, 2004; DasGupta, 2010).
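The derivation above has a direct computational counterpart: raising the generating function f(t; N) to the K-th power amounts to convolving the uniform pmf with itself K times. The following Python sketch (illustrative only; the package itself is written in R) builds the exact null distribution this way using exact rational arithmetic:

```python
from fractions import Fraction

def rank_sum_pmf(n, k):
    """Exact null pmf of the sum of K independent uniform ranks on 1..N.

    Multiplying the pgf f(t; N) by itself K times is a K-fold
    convolution of the uniform pmf, done here with exact fractions."""
    uniform = {r: Fraction(1, n) for r in range(1, n + 1)}
    pmf = {0: Fraction(1)}  # pgf of the empty sum is the constant 1
    for _ in range(k):
        convolved = {}
        for s, p in pmf.items():
            for r, q in uniform.items():
                convolved[s + r] = convolved.get(s + r, Fraction(0)) + p * q
        pmf = convolved
    return pmf
```

For two ordinary dice (N = 6, K = 2) this reproduces the familiar triangular distribution, and the symmetry P(S = s) = P(S = K(N + 1) − s) can be checked directly on the returned dictionary.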
The calculation of the exact p-value of rank sum s given K and N is implemented in R (R Core Team, 2015), as shown in Algorithm 1.

Algorithm 1 Cumulative distribution of the sum (S) of K, N-sided dice
1: procedure dice.cdf(S, N, K)
2:   a ← trunc(K(N + 1)/2)
3:   P ← 0
4:   if S ≤ a then
5:     for i = 0 to ⌊(S − K)/N⌋ do
6:       P ← P + (−1)^i C(K, i) C(S − iN, K)
7:     end for
8:     return P/N^K
9:   else
10:    P ← dice.cdf(K(N + 1) − S − 1, N, K)
11:    return 1 − P
12:  end if
13: end procedure

This vectorized algorithm is extremely fast. It takes about half a second, for example, to calculate the exact p-values of all possible RS values in the range N ≤ RS ≤ N · K for the problem K = 5 and N = 10000 typically encountered in microarray studies. Although the expression for P(S ≤ s; N, K) is algebraically correct, straightforward computation produces numerical errors whenever the dataset is too large. In such cases, the binomial coefficient that has s in the upper index becomes extremely large, and storing such large numbers for subsequent calculation creates numerical overflow due to the precision limitation of fixed-precision arithmetic, resulting in erroneous results (Dobrushkin, 2009, p. 369). There are remedies for this problem. One is to use recurrence relations satisfied by the generating function, discussed by various authors (Tsao, 1956; Bollinger, 1986, 1993; Dobrushkin, 2009, p. 370). However, the recursive algorithms we examined were computationally expensive to run, except for small to moderate values of N and/or K. An alternative and faster way of computing P(S ≤ s; N, K) correctly is to use arbitrary-precision arithmetic, which can handle numbers of arbitrarily large size, limited only by the available computer memory. Therefore, whenever the dataset is large, the R code uses the package Rmpfr (Maechler, 2015) for high-precision arithmetic to calculate the exact p-values. As one can see in Table 1, when the size of the dataset forces us to switch to the Rmpfr version of the function, the computation time drastically increases; when dealing with extremely large datasets, the time needed to compute the p-values quickly becomes impractical. As mentioned before, however, the RS distribution quickly tends to a Gaussian distribution.

[Table 1. Computational time (in milliseconds) needed by RankProd 2.0 to compute the exact p-values for RS = trunc(K(N + 1)/2). When the size of the dataset forces the switch to the Rmpfr version of the function, the computation time drastically increases (bold values).]
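Languages with unbounded integers sidestep the overflow problem directly. As an illustration (Python, not the package's R code), the closed-form cumulative distribution and the symmetry-based upper-tail p-value can be evaluated exactly, since Python's integers and fractions grow as needed:

```python
from fractions import Fraction
from math import comb

def dice_cdf(s, n, k):
    """Exact P(S <= s; N, K) from the inclusion-exclusion closed form.

    Fraction and unbounded integers keep the huge binomial
    coefficients exact, so no overflow or rounding can occur."""
    if s < k:
        return Fraction(0)
    if s >= k * n:
        return Fraction(1)
    total = sum((-1) ** i * comb(k, i) * comb(s - i * n, k)
                for i in range((s - k) // n + 1))
    return Fraction(total, n ** k)

def dice_pvalue(s, n, k):
    """Upper-tail p-value via the symmetry identity
    P(S >= s; N, K) = P(S <= K(N+1) - s; N, K)."""
    return dice_cdf(k * (n + 1) - s, n, k)
```

The exact rational arithmetic here plays the same role as Rmpfr in the R implementation, trading speed for correctness on large inputs.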
It is easy to show that with the Gaussian approximation the largest relative errors are found in the tails, corresponding to the smallest p-values. Conveniently, the computational time needed by the exact method is small precisely for these smallest p-values (i.e. the smallest RS values). Exploiting this, when the size of the dataset is such that the time needed to evaluate all the exact p-values becomes unacceptable, the new package uses the exact distribution only for the tails, whereas all the other p-values are evaluated through the Gaussian approximation. The switch between the two strategies happens after a user-defined amount of time.
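The hybrid strategy can be illustrated as follows. In this Python sketch the switch is governed by a fixed tail width rather than the package's user-defined time budget (the cutoff value is an assumption for illustration only); the Gaussian branch uses the exact null mean K(N + 1)/2 and variance K(N^2 − 1)/12 with a continuity correction:

```python
from fractions import Fraction
from math import comb, erf, sqrt

def exact_cdf(s, n, k):
    """Exact P(S <= s) via the closed-form inclusion-exclusion sum."""
    if s < k:
        return 0.0
    if s >= k * n:
        return 1.0
    total = sum((-1) ** i * comb(k, i) * comb(s - i * n, k)
                for i in range((s - k) // n + 1))
    return float(Fraction(total, n ** k))

def gaussian_cdf(s, n, k):
    """Normal approximation with continuity correction; the null RS sum
    has mean K(N + 1)/2 and variance K(N^2 - 1)/12."""
    mu = k * (n + 1) / 2
    sigma = sqrt(k * (n * n - 1) / 12)
    z = (s + 0.5 - mu) / sigma
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def hybrid_cdf(s, n, k, tail_width=50):
    """Exact evaluation in the tails (where the relative error of the
    normal approximation is largest), Gaussian elsewhere.  tail_width
    is an illustrative cutoff; the package switches on a time budget."""
    if s - k < tail_width or k * n - s < tail_width:
        return exact_cdf(s, n, k)
    return gaussian_cdf(s, n, k)
```

In the center of the distribution the two branches agree closely, so the approximation error is confined to regions where the exact branch is used anyway.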

Comparison with old approach
To assess the numerical accuracy of the new implementation of the p-value evaluation, we compared it with the permutation-based approach implemented in the old version of the package (Fig. 1). In the permutation-based procedure the smallest possible p-value is equal to the inverse of the number of permutations. When getting close to this limit, the permutation approach becomes very imprecise (insets). Obtaining an accurate estimate for the complete range of all possible Rank Products (or Rank Sums) with the old implementation would require a prohibitive number of permutations.

Application to unpaired datasets
The Rank Product method was initially introduced for the analysis of gene expression in paired datasets, specifically two-color microarrays (Breitling et al., 2004). The old version of the RankProd package already contained an ad hoc implementation able to cope with unpaired datasets. Nevertheless, the number of unpaired datasets in molecular profiling experiments is constantly rising, which led to the need for a new and more principled approach. Let A and B stand for two experimental conditions in a molecular profiling experiment. In the dataset under study, the levels of N features are provided for M_A samples in class A and for M_B samples in class B. When dealing with a paired experiment, the procedure to evaluate the fold changes needed to rank the features is straightforward. In the case of an unpaired dataset, however, the ranking is not as trivial: which fold changes should be considered? In the old package, this problem was tackled by generating all the possible pairings of samples, so that the RP (or RS) method is applied to a new dataset containing N features and M_A × M_B "replicates". The use of this approach, however, leads to an increase in the number of false positives detected when compared with the paired case. A novel procedure called Random Pairing (described in Algorithm 2) has been implemented in the new package to address this issue. Instead of considering all the possible pairs at once, this procedure generates a user-defined number of random pairings (by default, this number is fixed as the minimum odd integer greater than or equal to M × M, where M is the minimum of M_A and M_B) and evaluates the RP (or RS) statistic for each of them. Each of these randomly paired datasets has M samples. For each variable, the final RP (or RS) value returned is the median of all the values found during the random pairing process.
The p-value evaluation is then performed as in the case of a paired experiment.
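A minimal sketch of the Random Pairing idea follows (Python, illustrative only; the helper names are hypothetical, and the data are assumed to be on a log scale so that differences of paired samples play the role of log fold changes):

```python
import random
from math import prod
from statistics import median

def rank_features(values):
    """Rank features by decreasing fold change (rank 1 = largest)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        ranks[i] = pos
    return ranks

def random_pairing_rp(class_a, class_b, n_pairings=None, seed=0):
    """Random Pairing sketch: each class is a list of samples, each
    sample a list of N feature levels (assumed log scale)."""
    rng = random.Random(seed)
    m = min(len(class_a), len(class_b))
    if n_pairings is None:
        # default: smallest odd integer >= M * M
        n_pairings = m * m if m * m % 2 == 1 else m * m + 1
    n = len(class_a[0])
    rp_per_feature = [[] for _ in range(n)]
    for _ in range(n_pairings):
        # draw M samples from each class and match them up
        paired_a = rng.sample(class_a, m)
        paired_b = rng.sample(class_b, m)
        pair_ranks = [rank_features([a - b for a, b in zip(sa, sb)])
                      for sa, sb in zip(paired_a, paired_b)]
        for i in range(n):
            rp = prod(pr[i] for pr in pair_ranks) ** (1.0 / m)
            rp_per_feature[i].append(rp)
    # final RP value: median over all random pairings
    return [median(v) for v in rp_per_feature]
```

Taking the median over many random pairings stabilizes the statistic without inflating the effective number of replicates, which is what distinguishes this procedure from the old all-pairs approach.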

Comparison with old approach
The novel Random Pairing procedure handles unpaired datasets in a more principled way than the old ad hoc implementation, and the p-value calculations of RankProd 2.0 for unpaired datasets are more rigorous than before, yielding more reliable significance estimates. As shown in Fig. 2, the p-values for the new and old methods are very highly correlated, but in the tails of the distribution the values for the old method were slightly smaller than for the new Random Pairing method (insets), leading to a larger number of false positives than warranted. For both statistics the correlation is extremely high (R^2_adj = 0.9989 for the RP and R^2_adj = 0.9997 for the RS). As can be seen in Fig. 1, the permutation-based approach does not introduce a bias in the p-values, so these differences are not due to the way the p-values themselves are calculated.