YQFC: a web tool to compare quantitative biological features between two yeast gene lists

Abstract Nowadays high-throughput omics technologies are routinely used in biological research. From the omics data, researchers can easily get two gene lists (e.g. stress-induced genes vs. stress-repressed genes) related to their biological question. The next step would be to apply enrichment analysis tools to identify distinct functional/regulatory features between these two gene lists for further investigation. Although various enrichment analysis tools are already available, two challenges remain to be addressed. First, most existing tools are designed to analyze only one gene list, so they cannot directly compare two gene lists. Second, almost all existing tools focus on identifying the enriched qualitative features (e.g. gene ontology [GO] terms, pathways, domains, etc.). Many quantitative features (e.g. number of mRNA isoforms of a gene, mRNA half-life, protein half-life, transcriptional plasticity, translational efficiency, etc.) are available in the yeast, but no existing tools provide analyses on these quantitative features. To address these two challenges, here we present Yeast Quantitative Features Comparator (YQFC) that can directly compare various quantitative features between two yeast gene lists. In YQFC, we comprehensively collected and processed 85 quantitative features from the yeast literature and yeast databases. For each quantitative feature, YQFC provides three statistical tests (t-test, U test and KS test) to test whether this quantitative feature is statistically different between the two input yeast gene lists. The distinct quantitative features identified by YQFC may help researchers to study the underlying molecular mechanisms that differentiate the two input yeast gene lists. We believe that YQFC is a useful tool to expedite the biological research that uses high-throughput omics technologies. Database URL: http://cosbi2.ee.ncku.edu.tw/YQFC/


Introduction
Nowadays high-throughput omics technologies (e.g. genomics, transcriptomics, proteomics, etc.) are routinely used in biological research (1). From the omics data, researchers can easily get two gene lists (e.g. stress-induced genes vs. stress-repressed genes, drug-induced genes vs. drug-repressed genes, hub genes vs. nonhub genes, genes with long mRNA half-life vs. genes with short mRNA half-life, etc.) related to their biological question. The ultimate goal is to understand the underlying molecular mechanisms that differentiate these two gene lists. To achieve this goal, researchers need to know which biological features are good candidates for further investigation. Since enrichment analysis tools can help researchers to identify possible biological features, which are worthy of pursue, the enrichment analysis tool development has become a popular research topic in the bioinformatics field (2).
Many enrichment analysis tools are available for analyzing yeast gene lists. First, DAVID (3) provides gene ontology enrichment, pathway enrichment, protein domain enrichment and chromosome enrichment analyses. Second, Yeast-Mine (4) provides gene ontology enrichment, pathway enrichment and publication enrichment analysis. Third, modPhEA (5) provides phenotype enrichment analysis. Fourth, YGA (6) provides physical/genetic interaction enrichment and functional group enrichment analyses. Fifth, YHMI (7) provides histone modification enrichment analysis. Sixth, YARG (8) provides arsenic-related genes enrichment analysis. Seventh, YEASTRACT (9) provides single transcription factor regulation enrichment analysis. Eighth, YCRD (10) provides cooperative transcription factor pair regulation enrichment analysis. Descriptions of many other enrichment analysis tools could be found in a review paper (2).
Although many enrichment analysis tools are already available for testing various biological features, two challenges remain to be addressed. First, most existing tools are designed to analyze one given gene list, so they (e.g. Yeast-Mine, YHMI, YARG, YEASTRACT, YCRD, etc.) cannot directly compare two given gene lists. Second, almost all existing tools (e.g. YeastMine, DAVID, modPhEA, YHMI, etc.) focus on finding the enriched qualitative features (e.g. gene ontology [GO] terms, pathways, domains, phenotypes, histone modifications, etc.). Many quantitative features (e.g. mRNA expression level, mRNA half-life, transcriptional plasticity, translational efficiency, protein abundance, protein half-life, UTR length, number of mRNA isoforms of a gene, etc.) are available in the yeast, but no existing tools provide analyses on these quantitative features. To address these two challenges, here we develop Yeast Quantitative Features Comparator (YQFC) to directly compare various quantitative features between two yeast gene lists.
In YQFC, we comprehensively collected and processed 85 quantitative features from the yeast literature and yeast databases. For each quantitative feature, YQFC provides three statistical tests (t-test, U test and KS test) to test whether this quantitative feature is statistically different between the two given yeast gene lists. Both tables and figures (box plots and cumulative distribution function [CDF] plots) are provided to visualize the testing results.

Construction and contents
Collection of 85 quantitative features from 7 yeast publications and 6 yeast databases We comprehensively collected and processed 85 quantitative features from 7 yeast publications (11)(12)(13)(14)(15)(16)(17) and 6 yeast databases (YeastMine, YeastNet (18), YAGM (19), SGD (20), BioGRID (21), SPELL (22) Three statistical tests (t-test, U test and KS test) used to compare a quantitative feature between two yeast gene lists For each quantitative feature, YQFC provides three statistical tests (t-test, U test and KS test) to test whether this quantitative feature is statistically different between the two given yeast gene lists (denoted as L1 and L2). For example, assume that a user wants to compare the 5 ′ UTR length between L1 and L2. YQFC provides the following three statistical tests to do this task. First, the t-test (or called Student's t-test) is used to determine if the mean 5 ′ UTR length of the genes in L1 is statistically longer/shorter than the mean 5 ′ UTR length of the genes in L2 (23). Note that the t-test is a parametric test based on the assumption that the observed data come from normal distributions. Second, the U test (or called Mann-Whitney U test) is used to determine if the median 5 ′ UTR length of the genes in L1 is statistically longer/shorter than the median 5 ′ UTR length of the genes in L2 under the location shift assumption (24). Note that the U test is a nonparametric test that does not assume anything about the distribution of the underlying populations. Third, the KS test (or called Kolmogorov-Smirnov test) is used to determine if the cumulative distribution of the 5 ′ UTR length of the genes in L1 is statistically larger/smaller than the cumulative distribution of the 5 ′ UTR length of the genes in L2 (24). Note that the KS test is a nonparametric test that does not CC, co-citation; CDS, coding sequence; CX, co-expression; DC, domain co-occurrence; EPA, expression profile association; FAA, functional annotation association; GI, genetic interaction; GIA, genetic interaction association; LEA, literature evidence association; MPA, mutant phenotype association; pI, isoelectric point; PI, physical interaction; PIA, physical interaction association; PTM, posttranslational modification; TFBA, transcription factor-binding association; TFRA, transcription factor regulation association; UTR, untranslated region. a We collected three datasets of mRNA levels generated from microarray (11) and RNA-seq (12). b We downloaded five datasets of translational efficiency from Csárdi et al. (16) who collected them from the literature. c Amino acid composition contains 20 kinds of amino acids. d Atomic composition contains five kinds of atoms: carbon, hydrogen, nitrogen, oxygen, and sulfur. e Extinction coefficient at 280 nm contains two kinds: (i) all cys residues appear as half cystines and (ii) no cys residues appear as half cystines. f We downloaded 23 datasets of protein abundance measured in the normal growth condition from Ho et al. (17) who collected them from the literature. g We downloaded the datasets of protein abundance measured in 11 different stress conditions (dithiothretiol, H2O2, hydroxyurea, methyl methanesulfonate, etc.) from Ho et al. (17) who collected them from the literature. h We collected three datasets of protein-protein physical interaction from BioGRID (21) and YeastNet (18). PI (physical interaction, 3 datasets h ) network GI (genetic interaction) network CC (co-citation) network CX (co-expression) network DC (domain co-occurrence) network GN (similar genomic context) network GT (similar profiles of genetic interaction partners) network PG (similar phylogenetic profiles) network TS (3-D protein structure of interacting orthologous proteins) network EPA (expression profile association) network FAA (functional annotation association) network GIA (genetic interaction association) network LEA (literature evidence association) network MPA (mutant phenotype association) network PIA (physical interaction association) network TFBA (transcription factor binding association) network TFRA (transcription factor regulation association) network assume anything about the distribution of the underlying populations. Figure 1 illustrates the configuration of YQFC. The web interface of YQFC was developed in Python using the Django MTV framework. The 85 processed quantitative feature data were deposited in MySQL. All tables, box plots, and CDF plots were produced by the JavaScript and feature-rich JavaScript libraries (jQuery, DataTables and Plotly.js) to visualize data on the webpage. Except for the main website (http://cosbi2.ee. ncku.edu.tw/YQFC/), we also have two backup sites (http://cosbi4.ee.ncku.edu.tw/YQFC/ and http://cosbi6.ee. ncku.edu.tw/YQFC/).

Utility and discussion
The usage of YQFC YQFC is a web tool for identifying the distinct quantitative features between two input yeast gene lists. To use YQFC, users have to go through a three-step process ( Figure 2). [Step 1] Users need to input two gene lists to be compared. Standard names, systematic names, or aliases are all acceptable. If users only have one input gene list, they can use our precomplied gene lists (e.g. 6604 ORF genes, 299 tRNA genes, 27 rRNA genes, etc.) to serve as the second input gene list. In other words, YQFC also supports analysis of a single gene list. [Step 2] Users need to select the quantitative features to be analyzed. [Step 3] Since YQFC tests many quantitative features (i.e. multiple hypotheses testing), users have to select a statistical method (Bonferroni correction or false discovery rate (FDR)) for multiple hypotheses correction and set the P-value threshold. Note that using different correction method will give you different corrected P-value. Bonferroni correction is more conservative than FDR That is, Bonferroni correction has a smaller type I error rate, resulting in a smaller power, than FDR does. The P-value threshold determines the statistical significance of how different of a quantitative feature is between the two input gene lists. The more stringent the P-value threshold, the higher the statistical significance of the identified distinct quantitative feature.
After submission, YQFC will perform three statistical tests (t-test, U test and KS test) to test whether a selected quantitative feature is statistically different between the two input gene lists (denoted as L1 and L2). This process will go through all the selected quantitative features. Once the analysis process is complete, YQFC will return the results with two parts. The first part (Figure 3) contains the information of the user's settings (i.e. the number of genes in L1, the number of genes in L2, the selected quantitative features, the selected multiple hypotheses correction method, and the selected P-value threshold). The second part contains the result of each selected quantitative feature shown as two sections: (i) Summary and (ii) Statistical testing results. In the 'Summary' section, users can choose '  (Figure 4a) provides a table containing two kinds of information. First, the numbers of genes (in L1 and L2, respectively), which have the feature values are given. If users click on the number, they will see the names and the feature values of these genes. By clicking on a feature value, users will see the original sources of the feature value. Second, the mean and median feature values of the genes (having feature values) in L1 and L2, respectively, are given. ' Figure  View' (Figure 4b) provides two kinds of plots for visualization. Box plots are used to display variation in the feature values in L1 and L2, respectively. CDF plots are used to show the probability that the feature value X is less than or equal to a specific value x (i.e. Prob (X ≤ x)). In the 'Statistical testing results' section ( Figure 4c), users can see a table with six P-values. Three P-values (calculated by t-test, U test and KS test) represent the statistical significance of claiming the quantitative feature of L1 is larger than that of L2 (denoted as QF (L 1 ) > QF (L 2 )). The other three P-values (calculated by t-test, U test and KS test) represent the statistical significance of claiming the quantitative feature of L1 is smaller than that of L2 (denoted as QF (L 1 ) < QF (L 2 )). To draw the users' attention, the P-values which are less than the P-value threshold are highlighted with the yellow background. To ensure the robustness of the testing results, we suggest the users to investigate the quantitative features with statistical significance in all three statistical tests (t-test, U test and KS test). As for the P-value threshold, we suggest the users to use 0.01 or less. The more stringent the P-value threshold,  Table View" provides a table containing two kinds of information. First, the numbers of genes (in L1 and L2, respectively) which have the feature values are given. If users click on the number, they will see the names and the feature values of these genes. By clicking on a feature value, users will see the original sources of the feature value. Second, the mean and median feature values of the genes (having feature values) in L1 and L2, respectively, are given. (b) "Figure View" provides two kinds of plots for visualization. Box plots are used to display variation in the feature values in L1 and L2, respectively. Cumulative distribution function (CDF) plots are used to show the probability that the feature value X is less than or equal to a specific value x (i.e. Prob(X≤x)). (c) In the "Statistical testing results" section, users can see a table with six p-values. Three p-values (calculated by t-test, U test, and KS test) represent the statistical significance of claiming the quantitative feature of L1 is larger than that of L2 (denoted as QF(L1)>QF(L2)). The other three p-values (calculated by t-test, U test, and KS test) represent the statistical significance of claiming the quantitative feature of L1 is smaller than that of L2 (denoted as QF(L1)<QF(L2)). To draw the users' attention, the p-values which are less than the p-value threshold are highlighted with the yellow background.
the less testing results that can be called statistical significance.

A case study
Microarray experiments have identified approximately 900 environmental stress response (ESR) genes whose mRNA levels respond to various environmental stresses (e.g. heat shock, oxidative stress, osmotic stress, etc.) in a stereotypical manner (25). These ESR genes can be divided into 585 repressed ESR (rESR) genes and 281 induced ESR (iESR) genes. Previous studies have shown that many rESR genes are housekeeping genes (e.g. ribosomal genes and genes involved in growth-related processes), while iESR genes are usually involved in various stress defense mechanisms (6,25). We are interested in knowing which quantitative features can differentiate rESR genes from iESR genes. Therefore, we input these two gene lists (first list: 585 rESR genes and second list: 281 iESR genes) into our YQFC as a case study (Supplementary Table 2).
Many interesting observations have revealed from the analysis results of our YQFC. To ensure the statistical significance and robustness of the observations, we only discuss those with FDR-corrected P-values < 0.001 in all three statistical tests (t-test, U test and KS test). First, in the normal growth condition, rESR genes have higher mRNA level, larger protein abundance and larger translational efficiency than iESR genes do ( Table 2). These observations are not surprising. Since many rESR genes are housekeeping genes, their proteins are highly needed in the normal growth condition. On the contrary, iESR genes are involved in the stress defense mechanisms, so their proteins are not highly needed in the normal growth condition. Second, rESR genes also have larger protein abundance than iESR genes do (Table 2) in various stress conditions (e.g. dithiothretiol, H 2 O 2 , hydroxyurea, and methyl methanesulfonate). This indicates that even in the stress conditions, the needed quantity of housekeeping proteins are still larger than that of the stress response proteins to maintain the cell physiology. Third, in the normal growth condition, rESR genes have shorter mRNA half-life than iESR genes do (Table 3), consistent with a previous study reporting that ribosomal genes (belonging to rESR genes) have short mRNA half-life (26). Fourth, iESR genes have larger transcriptional plasticity, more transcriptional regulators and longer 5 ′ UTR length than rESR genes do (Table 3). Since iESR proteins are only highly needed in the stress conditions, their expression must be under complicated controls. Having larger transcriptional plasticity and more transcriptional regulators indicate that the transcription of iESR genes may be more tightly regulated than that of rESR genes. Our finding is supported by a previous study reporting that the stressinduced genes are under a more complicated transcriptional regulation than the housekeeping genes (27). Since 5 ′ UTR is known to contain translational regulation signals (28), having longer 5 ′ UTR suggests that the translation of iESR genes may be under more complicated regulation than that of rESR genes. For example, uORFs are one kind of translational regulation signals (29). It is known that the proportion of genes that contain uORFs in the 5 ′ UTR is much greater for translationally up-regulated genes than down-regulated genes under various stress conditions (29). Therefore, we suspect that this observation is also hold for iESR genes vs. rESR genes. Indeed, we found that the proportion of genes that contain uORFs in the 5 ′ UTR is much greater for iESR genes than rESR genes (Supplementary Table 3). Fifth, rESR genes have more GO term annotations, higher mRNA isoforms, and more interactors in the physical/genetic networks than iESR genes do. The biological meaning of these observations still needs further investigation. Several other quantitative features that can differentiate rESR genes from iESR genes could be found in Tables 2 and 3.

Quantitative comparison vs. qualitative enrichment analysis
Since quantitative comparison between a single input gene list (L1) and the background gene list (L2) is possible in YQFC, users may want to correlate the quantitative comparison results with the qualitative enrichment analysis results. The qualitative enrichment analysis is to test the enrichment of a qualitative feature (e.g. GO terms) in the single input gene list compared to the background. Here we provide an example to illustrate the differences between these two testing strategies. Assume that we want to compare the 5 ′ UTR length of 281 iESR genes and the background (4418 genes which have 5 ′ UTR length data). For quantitative comparison, YQFC found that the 5 ′ UTR length of 281 iESR genes (L1) is statistically significantly longer than that of the background (L2) when using U test or KS test with P-value threshold 0.01 ( Figure 5). Note that only 233 out of 281 iESR genes have 5 ′ UTR length data.
For qualitative enrichment analysis, first we need to define a set of genes with long 5 ′ UTR length from the background (4418 genes which have 5 ′ UTR length data). Let us define a set of genes with long 5 ′ UTR length by including all the genes whose 5 ′ UTR lengths are within the largest X% (X = 10, 20, 30, 40) of the background. Then we can use the hypergeometric test to check whether the iESR genes are enriched with genes with long 5 ′ UTR length.
If we set the P-value threshold for calling enrichment as 0.01, then we can conclude that iESR genes are enriched with genes with long 5 ′ UTR length only when X = 40 but not 10, 20 or 30 (Table 4). It can be seen that the testing     results of the qualitative enrichment analysis are sensitive to how we define a set of genes with long 5 ′ UTR length. Therefore, we suggest researchers to use quantitative comparison rather than qualitative enrichment analysis when quantitative feature data are available.

Conclusion
In this study, we have developed a web tool called YQFC.
We have comprehensively collected 85 quantitative features (including 12 gene features, 4 mRNA features, 52 protein features and 17 network features) in YQFC. For each quantitative feature, YQFC provides three statistical tests (t-test, U test and KS test) to test whether this quantitative feature is statistically different between the two input yeast gene lists. In the case study, we demonstrated the functionality and usefulness of YQFC by identifying 40 distinct quantitative features between rESR genes and iESR genes (Tables 2 and 3). The distinct quantitative features (e.g. 5 ′ UTR length, mRNA half-life, transcriptional plasticity, translational efficiency, etc.) identified by YQFC may help researchers to study the underlying molecular mechanisms that differentiate rESR genes from iESR genes. We believe that YQFC is a useful tool to expedite the biological research that uses high-throughput omics technologies. In the future, we will keep updating YQFC once new quantitative features become available in the yeast literature or yeast databases. Now is YQFC Version 1.0. When YQFC has a minor update, we will change the version number to 1.1, 1.2, etc. In the future, if YQFC has a major update, we will change the version number to 2.0.