Systematic comparison of ranking aggregation methods for gene lists in experimental results

Abstract

Motivation: A common experimental output in biomedical science is a list of genes implicated in a given biological process or disease. The gene lists produced by a group of studies answering the same, or similar, questions can be combined by ranking aggregation methods to find a consensus or a more reliable answer. Evaluating a ranking aggregation method on a specific type of data before using it is necessary to support its reliability, since the properties of a dataset can influence the performance of an algorithm. Such evaluation on gene lists is usually based on simulated data because no ground truth is known for real data. However, simulated datasets tend to be small compared with experimental data and to neglect key features, including heterogeneity of quality, relevance, and the inclusion of unranked lists.

Results: In this study, a group of existing methods, and variations of them, suitable for meta-analysis of gene lists are compared using simulated and real data. Simulated data were used to explore the performance of the aggregation methods while emulating common scenarios in real genomic data, with varying heterogeneity of quality, noise level, and mixes of unranked and ranked data over 20 000 possible entities. In addition to the evaluation with simulated data, a comparison using real genomic data on the SARS-CoV-2 virus, cancer (non-small cell lung cancer) and bacteria (macrophage apoptosis) was performed. We summarise the results of our evaluation in a simple flowchart for selecting a ranking aggregation method, and in an automated implementation using the meta-analysis by information content (MAIC) algorithm to infer heterogeneity of data quality across input datasets.

Availability and implementation: The code for simulated data generation and for running edited versions of the algorithms: https://github.com/baillielab/comparison_of_RA_methods. Code to perform an optimal selection of methods based on the results of this review, using the MAIC algorithm to infer the characteristics of an input dataset, can be downloaded here: https://github.com/baillielab/maic. An online service for running MAIC: https://baillielab.net/maic.

Supplementary information: Supplementary data are available at Bioinformatics online.


1 Real data collection

Three real datasets were collected for the evaluation: virus (SARS-CoV-2), bacteria (macrophage apoptosis) and cancer (NSCLC).
The virus data come from a SARS-CoV-2-related meta-analysis study for the prioritisation of host genes implicated in COVID-19 [Parkinson et al., 2020]. The dataset used to assess the results was selected from lists collected after the first published version (2020-07-09 to 2020-11-25) and was not used as input to the ranking aggregation algorithms. The 238 genes appearing more than once within the top 100 genes of these later lists were extracted as the truth set. Details of each source for the SARS-CoV-2 data can be found in [Parkinson et al., 2020].
The cancer dataset consists of genes associated with non-small cell lung cancer (NSCLC), which were collected and used for evaluating ranking aggregation methods by Li et al. [Li et al., 2018, Li et al., 2019]. Five lists were extracted from sources used in these two papers. The evaluation list used as the truth set in [Li et al., 2018] is used to assess the performance of the ranking aggregation methods. The third dataset concerns genes related to apoptosis in human macrophages during bacterial infection. Two of the lists are provided by the SHIELD Consortium (SHIELD AMR Research Consortium: https://shieldamrresearch.org/): a ranked list from an infection-associated apoptosis CRISPR screen in human iPSC-derived macrophages, and a ranked list of human orthologues of genes upregulated in murine bronchoalveolar lavage alveolar macrophages (BAL AM) 16 hours after infection with Streptococcus pneumoniae [Preston et al., 2019]. The other lists for this macrophage apoptosis dataset were collected from published papers, including [Huang et al., 2020, Kumar et al., 2010, MacHugh et al., 2012, Yeung et al., 2019, Abebe et al., 2010, Losick and Isberg, 2006, Maertzdorf et al., 2011, Lai et al., 2020].
For all three real datasets, the gold standard used to assess the performance of the methods was extracted from multiple sources that are either more closely related to the research question of the corresponding meta-analysis or published more recently. These sources are therefore assumed to be relatively reliable, and for each dataset they were combined into an unranked list used to assess the performance of the algorithms.
2 Parameter settings for the stochastic generative model and comparison with real data

The datasets for SARS-CoV-2, NSCLC and macrophage apoptosis were used to explore the features of real data, so that the parameters of the stochastic generative model could be selected to emulate real data more closely. After the analysis of real data described below, parameters were selected so that the simulated data emulate the real data well, as shown in Tables 1, 2 and 3. List number, order information, list length, frequency decay of top-ranked entities, list quality and absent genes were assessed.
In the names of the dataset types (S), "r" means the dataset includes only ranked sources whereas "m" refers to a mix of ranked and unranked sources; "0" and "1" denote a large and a small dataset respectively. Based on an analysis of the mean and standard deviation of the accuracy of real sources, M ∈ {0.5, 1, 3, 4, 12} and D ∈ {0.1, 0.5, 1, 3, 12} were used in the evaluation for each of the 4 dataset types to emulate various noise levels and degrees of heterogeneity. Various absent gene rates (γ) were also explored. The key properties of each type of simulated dataset are introduced below. For example, a dataset of type m0 includes a group of ranked lists and a group of unranked lists, described as "MixLarge" in the main document. The corresponding emulated real dataset is also given for each type; the parameters were set to emulate the properties of the real data.
List number: The number of lists used in the SARS-CoV-2 study is 32, comprising 21 unranked lists and 11 ranked lists. The small dataset used for NSCLC has 5 ranked lists, whereas 4 ranked lists plus 2 unranked lists were collected for macrophage apoptosis. Because some unranked sources were used to extract the gold standard for the NSCLC and macrophage apoptosis data, the total number of unranked sources available as input would be higher if data were collected in a similar way, although these two datasets do not include all sources from a thorough review of related studies. For the simulated data, 4 scenarios are analysed, as sketched below. The first two are formal biological meta-analysis cases emulating the collected SARS-CoV-2 data, with 11 ranked lists, with or without 21 unranked lists. The other two emulate the small collected dataset, with 5 ranked lists, with or without 5 unranked lists.
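For reference, the four scenarios can be written as a small configuration; a minimal sketch in Python (the r/m and 0/1 naming follows the convention above; the RankSmall/MixSmall labels are assumed counterparts of the RankLarge/MixLarge descriptions used in the main document):

```python
# The four simulated dataset types: "r" = ranked sources only,
# "m" = mix of ranked and unranked; "0" = large (emulating the
# SARS-CoV-2 data), "1" = small (emulating the NSCLC / macrophage
# apoptosis scale).
DATASET_TYPES = {
    "r0": {"ranked": 11, "unranked": 0},   # RankLarge
    "m0": {"ranked": 11, "unranked": 21},  # MixLarge
    "r1": {"ranked": 5,  "unranked": 0},   # RankSmall (assumed label)
    "m1": {"ranked": 5,  "unranked": 5},   # MixSmall (assumed label)
}
```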
List length: The length of each list can theoretically range from 1 to 20000 (human genome scale). However, studies that included only a single gene were excluded in the SARS-CoV-2 study, and methods like MAIC are designed to accept lists with at least 2 entities, so L = 2 is set as the minimum number of entities in a list. In addition, considering the human genome scale, the upper bound U is set to 20000, and any generated list longer than U is represented as a human genome-wide source with 20000 genes.
Length distribution: For each list in the SARS-CoV-2 study with length denoted by h, the mean and standard deviation of ln(h − 2) were calculated. For ranked lists the mean is 5.3597 and the standard deviation is 1.8055, so ln(m_c) and D_c are set to 5.3597 and 1.8055 respectively. For unranked lists the calculated ln(m_c) and D_c are 3.4365 and 1.6937, the corresponding mean and standard deviation of ln(h − 2). Ranked and unranked lists are analysed separately because unranked lists tend to be shorter. Because of the small number of lists for the NSCLC and macrophage apoptosis data, especially the unranked lists for NSCLC, these are both treated as small datasets and analysed together. The histogram and mean value of ln(h − 2) for the real data are shown in Figure 1, and the selected parameters for ln(m_c) and D_c are shown in Table 1.
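To make the sampling concrete, the following is a minimal sketch of how list lengths could be drawn under these settings; it assumes, as the use of the mean and standard deviation of ln(h − 2) suggests, that h − 2 is log-normally distributed (the function name and structure are illustrative, not the authors' implementation):

```python
import numpy as np

L, U = 2, 20000  # lower and upper bounds on list length

def sample_list_length(ln_mc, d_c, rng):
    """Draw a list length h with ln(h - 2) ~ Normal(ln_mc, d_c),
    clipped to the bounds [L, U]."""
    h = 2 + int(round(np.exp(rng.normal(ln_mc, d_c))))
    return min(max(h, L), U)

rng = np.random.default_rng(0)
# Parameters estimated from the SARS-CoV-2 lists (Table 1)
ranked_lengths = [sample_list_length(5.3597, 1.8055, rng) for _ in range(11)]
unranked_lengths = [sample_list_length(3.4365, 1.6937, rng) for _ in range(21)]
```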
Entity significance µ_k: In the simulated data generation model, µ_k = 0 is used for all noise entities, but the µ_k values for signal entities need to be explored.
The frequency with which each gene appears in the top 1000 of all sources was calculated independently for the 3 real datasets. For the SARS-CoV-2 data, 986 out of 11470 entities show a frequency larger than 0.05 and are considered signal entities, whereas the others are considered noise. An exponential curve y_k = a·exp(b·k) + c is fitted to the signal entities ranked by frequency, with rank k and frequency y_k. The fitted curve can then be used to estimate the significance µ_k of an entity. The same analysis was carried out for the NSCLC and macrophage apoptosis data, shown in Figure 2. As the bottom right of the figure shows, the curves for the 3 real datasets are scaled so that the first value is 2, since only the ratio between positions is of interest here. These curves are similar: all fall quickly at first, then more slowly, and tend to flatten by the 1000th position. We therefore assume that all these data share a similar pattern of significance decay across the signal genes. The scaled value y_k of the curve for SARS-CoV-2, which is fitted using the largest number of lists, is used to set the µ_k values, which go from 2 for µ_1 to 0.6202 for µ_1000. The fitted curves for the simulated data with M = 3 and D ∈ {1, 0.5, 0.1} (explored to simulate appropriate mean noise and variance for the simulated data, as described below) are compared with the curves for the 3 real datasets after scaling.
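The curve fit and scaling can be sketched as follows; the frequency array here is a synthetic stand-in for the observed top-1000 appearance frequencies, so the fitted constants are illustrative only:

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(k, a, b, c):
    """Exponential decay of appearance frequency with rank k."""
    return a * np.exp(b * k) + c

# Stand-in for the observed frequencies of the 986 signal entities,
# ranked by frequency (in the paper these are counted from real lists).
k = np.arange(1, 987)
freqs = 0.6 * np.exp(-0.01 * k) + 0.05

(a, b, c), _ = curve_fit(decay, k, freqs, p0=(0.5, -0.01, 0.05))

# Scale the fitted curve so the top-ranked entity has significance 2,
# giving the mu_k values used in the generative model.
mu = decay(k, a, b, c)
mu = 2 * mu / mu[0]
```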
List quality σ_i: The quality of lists in the real data was compared with simulated data generated with various M and D, to explore appropriate settings for these parameters (shown in Table 1). The genes within the corresponding extracted unranked gene set used to assess the performance of the algorithms are considered the truth, with n_T genes. In order to assess the quality of each list,
the precision (TruePositive / PredictedPositive) of each list is calculated: for a ranked list with more than n_T genes, it is the fraction of true genes among the top n_T genes. For an unranked list with length h larger than n_T, n_T genes are randomly selected from the list as the predicted positives, since no order information is provided within an unranked list. For a list with fewer than n_T genes, the precision is the fraction of true genes appearing in the list. The quality of the lists in the real data is shown in Table 4. This precision value reflects the quality of each list, which is related to σ_i in the simulated data generation model. The average precision is 0.0283, 0.0512 and 0.1053 for NSCLC, macrophage apoptosis and SARS-CoV-2 respectively, with standard deviations (std) of 0.0263, 0.0951 and 0.1222. The gold standard set sizes N_T for NSCLC, macrophage apoptosis and SARS-CoV-2 are 148, 91 and 238 respectively. Because the size of the truth set may influence the accuracy, these numbers are analysed separately. The unranked lists used as the true positives for the real data are extracted from multiple sources as estimates of the corresponding truth; if such an estimate is far from the truth, it can strongly influence the calculated precision values. An upper bound under various true positive sets of the same length (N_T) can be estimated by using the most commonly reported positive genes as the truth, giving 0.3019, 0.2124 and 0.2993 respectively, as shown in Table 4. The expected precision of an uninformative list randomly selected from 20000 genes is N_T/20000 = 0.0074, 0.0045 and 0.0117 respectively (Table 4), which serves as an estimate of the lower bound on precision.
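A minimal sketch of this precision calculation (function and variable names are illustrative):

```python
import random

def list_precision(genes, truth, ranked, rng=None):
    """Precision of one source list against a truth set of size n_T.

    Ranked lists longer than n_T are cut at the top n_T; unranked lists
    longer than n_T have n_T genes sampled at random, since they carry
    no order information. Shorter lists are scored over all their genes.
    """
    rng = rng or random.Random(0)
    n_t = len(truth)
    if len(genes) > n_t:
        predicted = genes[:n_t] if ranked else rng.sample(genes, n_t)
    else:
        predicted = genes
    return sum(g in truth for g in predicted) / len(predicted)
```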
To explore suitable values of the noise level M and heterogeneity D for the simulated data generation model, experiments with various M and D were compared against the real data, with the other parameters set as in Table 1 for a large dataset m0 (11 ranked + 21 unranked) and a small dataset m1 (5 ranked + 5 unranked) as shown in Table 2; γ = 0 is used here. M ∈ {0.5, 1, 2, 3, 4, 12} is explored for D ∈ {0.1, 0.5, 1, 3, 12}. The no-heterogeneity scenario, in which list quality is constant, is also explored (shown as D = 0). These M values were selected to cover various signal-to-noise ratios µ_k/σ_i. Considering the expectation E[ln(σ_i)] = ln(M), µ_k/M is used as an estimate of the scale of the signal-to-noise ratio (SNR); since µ_k ∈ [0.6202, 2] for k = 1, ..., 1000, µ_k = 1 is used for this estimate, giving SNR ∈ {2, 1, 0.5, 0.333, 0.25}. Experiments were repeated 10 times and the averages of the mean accuracy and the standard deviation of accuracy were calculated, shown in Figure 3. The estimated lower and upper bounds were also compared, to check that the parameters cover a large enough range of real cases. The smallest mean accuracies at the top-N_T cutoff are around 0.0106, 0.005 and 0.017 for M = 12, D = 0.1, which are close to the uninformative-list values (the lower bound estimates) mentioned above and low enough to emulate the lists in real data. The mean accuracy rises to around 0.6 when M = 0.5, which is far higher than the upper bound estimates for the 3 real datasets and larger than the accuracy of any collected real list, so the range is high enough to emulate the real data. In terms of the range of heterogeneity, the std of accuracy goes from close to the no-heterogeneity case (D = 0) when D = 0.1 to around 0.4 when D is larger, which is far higher than any std calculated for the real data and therefore high enough to cover real cases.
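As a minimal sketch of how per-list quality could be drawn under these settings, the following assumes, consistent with E[ln(σ_i)] = ln(M) above, that ln(σ_i) is normally distributed with mean ln(M) and spread controlled by D; this distributional form is our reading of the model, not a quoted implementation:

```python
import numpy as np

def sample_list_quality(M, D, n_lists, rng):
    """Draw per-list noise scales sigma_i with E[ln(sigma_i)] = ln(M).

    Assumed form: ln(sigma_i) ~ Normal(ln(M), D). M sets the mean
    noise level and D the heterogeneity of quality across lists;
    D = 0 gives the constant-quality (no-heterogeneity) scenario.
    """
    if D == 0:
        return np.full(n_lists, float(M))
    return np.exp(rng.normal(np.log(M), D, size=n_lists))

rng = np.random.default_rng(0)
# e.g. a noisy, highly heterogeneous large mixed dataset (m0, 32 lists)
sigmas = sample_list_quality(M=3, D=1, n_lists=32, rng=rng)
```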
Considering the analysis above, the settings for M and D used in the evaluation were selected. The analysis also supports that these settings generate datasets covering a wide range of mean noise and heterogeneity, including common real-data cases, over a reasonably large range of SNR in terms of both µ_k/M and (ln(M))²/D.

Absent genes γ: For genes absent from a real study, it is usually hard to judge whether they were omitted from the bottom of the ranking or randomly excluded, because the genes included in a study are usually selected, to some extent, using prior knowledge. A simple way to emulate a combination of these exclusion scenarios is to randomly remove 20% or 50% of genes before cutting the bottom genes, for M = 3 and D ∈ {0.1, 0.5, 1, 3, 12}, for the 4 dataset types in Table 2.
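A minimal sketch of this two-stage removal (random exclusion followed by bottom truncation; names are illustrative):

```python
import numpy as np

def apply_absent_genes(ranked_genes, gamma, target_length, rng):
    """Emulate absent genes: randomly drop a fraction gamma of genes
    (exclusion unrelated to rank), then truncate what remains to the
    target list length (bottom-omitted genes)."""
    genes = np.asarray(ranked_genes)
    keep = rng.random(len(genes)) >= gamma  # random exclusion
    return genes[keep][:target_length]      # then cut the bottom

rng = np.random.default_rng(0)
# e.g. randomly remove 20% of genes before cutting a list to length 500
observed = apply_absent_genes(np.arange(20000), gamma=0.2,
                              target_length=500, rng=rng)
```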

3 Methods selection and implementation
As described in Section 2.3 of the manuscript, ranking aggregation methods with various features that can deal with gene lists were selected and implemented, with the settings for each method summarised in Table 5. One family is order-statistics methods, such as Borda's methods [de Borda, 1781, Lin, 2010]; these calculate statistics like the mean rank of each entity to obtain the aggregated ranking. VoteCounting, used in [Li et al., 2020], also uses simple statistics and is easy to implement.

Bayesian methods have become more popular in recent years [Badgeley et al., 2015, Deng et al., 2014, Yi et al., 2016, Li et al., 2018]. BIRRA [Badgeley et al., 2015] initially marks a group of entities as truth and repeatedly updates this set. BARD [Deng et al., 2014] defines a model with an independent parameter for each list that controls the probability of the positions of true entities among noise entities within that list. In contrast, BiG [Li et al., 2018] and BARC [Yi et al., 2016] use a Thurstone-based model [Thurstone, 1927], which assumes there is a parameter for each entity and that each ranking source ranks the entities by these parameters subject to noise. Given the solid Bayesian theoretical framework, the results can usually be presented as an estimated distribution rather than only a single ranking, but they are often strongly influenced by priors that are manually defined by the user or fixed by the specific method. The computational cost is also likely to be large, since Monte Carlo methods are usually used to estimate the posterior. BARD and BiG are more time-consuming than the other methods, except the Markov chain methods (MC1-3), which can also take days to run on a large dataset.

To complete each run within a reasonable time (around 48 hours), the same filtering approach is used as in [Li et al., 2018] and [Li et al., 2019]: before running BARD, BiG (BiGbottom and BiGNA) or MC1-3 on a dataset, genes not ranked within the top 1000 of any list are removed, because they are unlikely to be ranked within the top 500 of the final result. After this operation, the numbers of entities in the SARS-CoV-2, NSCLC and macrophage apoptosis datasets become 4584, 2495 and 3267 respectively. The maximal number of iterations in the BiG algorithms is set to 2000 to reduce the running time. Figure 5 compares the results of 5000 and 2000 iterations for BiGbottom (BiG treating absent genes as bottom-ranked) and BiGNA (BiG treating the order information of absent genes as unknown); the two settings give essentially the same result, so reducing the number of iterations is unlikely to affect the accuracy of the BiG algorithm much and is reasonable given the saving in computing time. Because category information is not investigated in this study, a no-platform version of BiG is applied for both BiGbottom and BiGNA. The Markov chain methods treat each unique entity across all input lists as a state and rank entities by their stationary distribution probabilities [Dwork et al., 2001].
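A minimal sketch of the top-1000 pre-filtering step (lists as sequences of gene identifiers; names are illustrative):

```python
def prefilter_top_k(lists, k=1000):
    """Keep only genes ranked within the top k of at least one input
    list; genes never appearing in any top 1000 are unlikely to reach
    the top 500 of the final aggregated ranking."""
    keep = set()
    for ranked in lists:
        keep.update(ranked[:k])
    return [[g for g in ranked if g in keep] for ranked in lists]
```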
In addition to the Bayesian and Markov chain methods, there are methods that also use the distribution of the input rankings, or that model the noise causing the variability within a list using distributions such as the Normal distribution, for example Stuart [Stuart et al., 2003] and Robust Rank Aggregation (RRA) [Kolde et al., 2012]. The R package RobustRankAggreg allows the user to set the number of entities; this parameter is set to 20000 in all related implementations (RRA, Stuart, rMED, rMEAN, rGEO, and the 'Mix' versions of Stuart and Borda's methods) to emulate the human genome scale. These 'Mix' versions work exactly the same as the original implementations in the RobustRankAggreg package when dealing with a dataset containing only ranked sources. MAIC also allows setting the number of entities, which was again set to 20000. To account for variation in source quality, some methods try to distinguish between lists by weighting them. Pre-defined weighting schemes, as in CEMC [Lin and Ding, 2009], require information beyond the order information, which is usually not available. Methods like MAIC [Li et al., 2020] instead define latent variables that weight the quality of each source and estimate them without additional information.
RepeatChoice [Ailon, 2010] is also investigated, as a method that can deal with a mix of ranked and unranked data and involves a stochastic procedure.

4 Supplementary result figures
This section presents the supplementary result plots. Figure 5 shows the parameter-setting exploration for the BiGbottom and BiGNA algorithms, comparing 2000 and 5000 iterations. The two settings give similar results for each algorithm on the three real datasets, which supports the chosen number of iterations for the BiG algorithms. Figures 6 to 12 show the supplementary results for the comparison of ranking aggregation methods on simulated and real datasets. Figure 7 shows the top-1000 cutoff results for the 3 real datasets and for selected simulated datasets with various key features, including the number of included lists, whether unranked lists are included, and the heterogeneity of list quality. To avoid duplicates, the 'Mix' versions of Borda's methods and Stuart are only shown in the results for datasets with unranked sources, since they perform exactly the same as the corresponding original versions in the R package RobustRankAggreg on ranked sources. Notably, the relative performance of the investigated methods depends on the dataset. For example, BIRRA performs best among the investigated methods on the large dataset with only ranked sources and high heterogeneity (RankLarge highHet), but it is not the best-performing method on the corresponding datasets with low heterogeneity (RankLarge lowHet) or with a mix of ranked and unranked sources (MixLarge highHet). Another example is that rMixGEO ranks higher in performance on the simulated datasets when heterogeneity is high.
The strong influence of the investigated features on the performance of ranking aggregation methods means that knowing the features of a dataset can help in selecting an appropriate method to achieve better performance. A detailed analysis of the relationship between dataset features and the performance of ranking aggregation methods is provided in the Results and Discussion section of the main document.