The y-ome defines the thirty-four percent of Escherichia coli genes that lack experimental evidence of function

Experimental studies of Escherichia coli K-12 MG1655 often implicate poorly annotated genes in cellular phenotypes. However, we lack a systematic understanding of these genes. How many are there? What information is available for them? And what features do they share that could explain the gap in our understanding? Efforts to build predictive, whole-cell models of E. coli inevitably face this knowledge gap. We approached these questions systematically by assembling annotations from the knowledge bases EcoCyc, EcoGene, UniProt, RefSeq, and RegulonDB. We identified the genes that lack direct experimental evidence of function (the “y-ome”), which include 1563 of 4653 unique genes (34%), of which 131 have absolutely no evidence of function. An additional 306 genes (6.6%) are pseudogenes or phantom genes. y-ome genes tend to have lower expression levels and are enriched in the termination region of the E. coli chromosome. Where evidence is available for y-ome genes, it most often points to them being membrane proteins and transporters. We resolve the misconception that a gene in E. coli whose primary name starts with “y” is unannotated, and we discuss the value of the y-ome for systematic improvement of E. coli knowledge bases and its extension to other organisms.

all of which are essential for growth, and a full 30% of which lack functional annotation (1,2). Even in E. coli K-12 MG1655, perhaps the best-studied model organism, unannotated genes often appear in experimental studies of strain engineering (3), laboratory evolution (4), and pathogenicity (5). These unannotated genes are one type of "dark matter" in the cell (6,7), and efforts to build predictive models of the genotype-phenotype relationship for whole cells will be hindered by unannotated genes that still affect cell phenotype (8,9).
Historically, unannotated genes in E. coli are known as "y-genes" because they have primary names starting with "y" (10). (They are not to be confused with "Y genes", which can indicate genes on the human Y chromosome (11).) However, genes with primary names that begin with "y" are often functionally annotated. For example, in a recent study where E. coli was engineered to produce fatty acids via reversal of the fatty-acid beta-oxidation pathway, the authors knocked out the genes yqeF and yqhD to increase production of target molecules (3) and included the genes ydiQRST, ydiO, and ydbK in a predictive model of the cell (12). Searching for these genes in public knowledge bases such as EcoCyc (13) reveals that they vary greatly in annotation quality. Some (e.g., yqhD) are well-annotated with direct experimental evidence, while others (e.g., ydiO) have limited functional information. This variation in annotation quality among y-genes suggests that understanding the unannotated genes in E. coli requires a systematic approach that goes beyond the primary gene name.
There are several model organism knowledge bases that represent the collected knowledge of the E. coli K-12 MG1655 genome: EcoCyc (13), EcoGene (14), UniProt (15), and RefSeq (16). Other useful knowledge bases cater to specific classes of gene products, such as RegulonDB, which contains manually curated functional information about transcription factors in E. coli (17). Our initial review of these knowledge bases yielded conflicting information on gene function and level of annotation for many E. coli genes. Any attempt to systematically assess the function of unannotated genes must therefore draw from multiple knowledge bases and resolve these conflicts.
Many research groups have categorized E. coli genes and proteins by annotation quality as a part of their studies. In 2009, Hu et al. constructed a global functional atlas of E. coli proteins (18). First, they identified all unannotated proteins in the K-12 W3110 and MG1655 genomes. In order for a protein-encoding gene to be considered functionally uncharacterized in their analysis, it had to meet the following criteria: (i) the gene name begins with "y", (ii) the gene does not have a known pathway within EcoCyc, and (iii) the gene does not have a functional description in GenProtEC (19) (any gene with a description containing the words "predicted", "hypothetical", or "conserved"). Based on these criteria, it was determined that 1431 of 4225 protein coding sequences were functionally unannotated. In 2015, Kim et al. published a database called EcoliNet that curated and predicted cofunctional gene networks for every protein coding gene in the E. coli genome (20). This study also quantified the number of uncharacterized protein coding genes in E. coli. To assess functional annotation, they used the presence of experimentally supported "biological process" annotations in the Gene Ontology database (21). They concluded that ~2000 protein coding genes in E. coli were functionally unannotated. The most comprehensive effort to assess the level of annotation in bacterial genomes has been Computational Bridges to Experiments (COMBREX) (22,23). The COMBREX knowledge base currently contains information about 4182 protein coding genes in E. coli K-12 MG1655, of which 2378 (57%) have experimentally verified function, 1741 (42%) have predicted but not experimentally verified function, and 63 (2%) have no predicted function. These studies of unannotated genes in E. coli K-12 MG1655 provided inspiration for this work. However, our effort covers both protein-coding and non-protein-coding genes, disregards nomenclature (i.e., whether a gene name begins with "y") as an indicator of annotation quality, and is presented as a reproducible workflow to keep the analysis up-to-date as knowledge bases improve.
It is difficult to define a Y-ome because there are no established rules that specify the level of annotation necessary for a gene to be "well-annotated". We draw on experience in systems biology to precisely define the Y-ome. The contribution of any gene function to cellular phenotype can now be codified computationally for various cellular systems, including metabolism (24), cell signaling (25), gene expression (26), and replication (9). Therefore, we define functional annotation as the information necessary to have an effect on phenotype predictions in a systems biology model. This allows us to define annotation differently for different cellular systems; for example, a metabolic enzyme should be annotated through an enzymatic activity assay, and a transcriptional regulator through a DNA-binding assay and a study of gene regulation. With this approach, we can be sure that the definition of annotation follows from the actual impact that the gene could have on cell phenotype. Such definitions are expected to evolve as cellular modeling methods evolve. Thus, the concept of a Y-ome can keep pace with new developments in the field.
To determine the Y-ome for E. coli K-12 MG1655, we first define it in precise terms. Next, we present a reproducible workflow to compile annotations from E. coli knowledge bases to determine the Y-ome. This workflow included an automated portion and a manual curation step to resolve 292 genes that could not be automatically assigned to a category. The resulting Y-ome includes 34% of E. coli genes. We describe some trends for these Y-ome genes, including their enrichment in the termination region of the E. coli chromosome, lower average expression levels than well-annotated genes, and evidence that certain types of genes (e.g., transporters) are more highly represented. Finally, we resolve the misconception that a gene in E. coli whose primary name starts with "y" is necessarily unannotated.

Definition of the Y-ome
The Y-ome encompasses all genes in an organism that lack functional annotation. However, it is difficult to precisely define whether a functional annotation is sufficient to call the gene "well-annotated." We drew on systems biology to precisely define this boundary. In systems biology, predictive models can be used to link genotype to phenotype. In these models, the definition of functional annotation is that a gene can be mechanistically linked through a network to a measurable phenotypic effect. Thus, we define the Y-ome as the set of genes that lack a known mechanistic effect on cell phenotype. For a given gene, we can use the following test: "Is this gene sufficiently annotated that it can (1) be incorporated into a predictive model and (2) have an impact on the predicted phenotype?" The model in this test can be hypothetical, but it helps frame the precise definition of the Y-ome. Because genotype-phenotype models can take different forms (metabolic models have a different theoretical basis from regulatory models), this definition allows for different kinds of evidence based on the gene function (enzymatic assays for metabolic genes; DNA-binding assays and gene knockout studies for regulatory genes).
Our definition of the Y-ome also restricts annotations to experiments on the organism of interest and excludes annotations that are drawn entirely from gene sequence similarity. Sequence similarity provides evidence of gene function, but generalist proteins can play different roles depending on the cell context (27,28). Therefore, we looked for direct experimental evidence of function with the target organism or multiple lines of evidence beyond just sequence similarity. We made one exception for insertion elements, which were considered well-annotated based on sequence similarity alone.
Genes that are well-annotated (not in the Y-ome) could potentially have secondary functions in the cell (29,30). These genes could therefore still require additional study to improve our understanding of the system, and model-driven approaches can be used to systematically identify them (28). Finally, genes with high sequence similarity might be pseudogenes or cryptic genes. Because they are not expected to affect cell phenotype, we excluded pseudogenes and cryptic genes from the Y-ome in this study. However, if they are found to contribute to cellular phenotype, they should be included.

A workflow for identifying the E. coli Y-ome
To systematically determine the Y-ome for E. coli K-12 MG1655, we developed a semi-automated approach (Fig. 1) to identify unique genes across five E. coli knowledge bases and integrate their data to define a consensus Y-ome. The automated part of this process proceeded in three steps: (1) downloading data from each knowledge base, (2) extracting text-based features (Supplementary Data S2), and (3) using keywords to automatically assign each gene annotation in a knowledge base to the categories "Y-ome," "Well-Annotated," or "Not enough information for automated assignment" (Fig. 2a). Pseudogenes and cryptic genes were kept separate and marked "Excluded." The keywords used to make these assignments are described in the Methods.
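The keyword-based assignment in step (3) can be illustrated with a minimal sketch. The keyword lists below are hypothetical placeholders chosen for illustration; the lists actually used are those described in the Methods and the workflow repository, and the handling of conflicting keywords shown here is an assumption.

```python
# Hypothetical keyword lists (assumptions for illustration only); the real
# lists are given in the Methods and the workflow repository.
WELL_ANNOTATED_KEYWORDS = ("experimental evidence", "assay")
YOME_KEYWORDS = ("putative", "predicted", "uncharacterized", "hypothetical")
EXCLUDED_KEYWORDS = ("pseudogene", "phantom")

UNASSIGNED = "Not enough information for automated assignment"


def categorize(description: str) -> str:
    """Assign one knowledge-base annotation to a category by keyword match."""
    text = description.lower()
    if any(k in text for k in EXCLUDED_KEYWORDS):
        return "Excluded"
    hits_well = any(k in text for k in WELL_ANNOTATED_KEYWORDS)
    hits_yome = any(k in text for k in YOME_KEYWORDS)
    if hits_well and not hits_yome:
        return "Well-Annotated"
    if hits_yome and not hits_well:
        return "Y-ome"
    # No keyword matched, or keywords conflict: defer to later steps.
    return UNASSIGNED
```

Applied to every gene entry in every knowledge base, a function of this shape yields one category per gene per knowledge base, which is the input to the consensus step.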
After assigning genes in each knowledge base to categories, consensus rules were applied to combine the results from the separate knowledge bases. In general, we checked first for an agreement among knowledge bases. For instance, if two knowledge bases indicated a gene was "Well-Annotated" and the others did not have enough information to assign a category, then the consensus was "Well-Annotated". When databases disagreed, then no consensus was possible. In these cases, manual annotations were made (for 292 genes) based on reading the knowledge base entries and consulting the literature. Finally, certain kinds of structured evidence (e.g., the Experimental Evidence section in EcoCyc) were taken as high-quality evidence, and we ignored conflicting details from other knowledge bases in these cases. Pseudogenes and phantom genes were treated separately in the "Excluded" category, and for these genes we also ignored conflict between the knowledge bases. More detail on the consensus rules can also be found in the Methods.
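The agreement check can be sketched as follows. This is a simplified reading of the rules above, not the workflow's actual code: it assumes that any "Excluded" call wins (since conflicts were ignored for pseudogenes and phantom genes) and that a gene with no informative call is sent to manual curation. It does not model the structured-evidence overrides.

```python
from typing import Dict

UNASSIGNED = "Not enough information for automated assignment"


def consensus(calls: Dict[str, str]) -> str:
    """Combine per-knowledge-base categories for a single gene.

    `calls` maps a knowledge-base name to the category it assigned.
    Returns the agreed category, or "Manual curation" when the
    knowledge bases disagree or none is informative (assumption).
    """
    # Assumption: an "Excluded" call overrides conflicts elsewhere.
    if "Excluded" in calls.values():
        return "Excluded"
    informative = {c for c in calls.values() if c != UNASSIGNED}
    if len(informative) == 1:
        return informative.pop()
    return "Manual curation"


example = {
    "EcoCyc": "Well-Annotated",
    "UniProt": "Well-Annotated",
    "EcoGene": UNASSIGNED,
}
print(consensus(example))
```

The example mirrors the case described in the text: two knowledge bases agree, the third is uninformative, and the consensus follows the agreeing pair.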
Based on this analysis, we identified 4653 unique genes across E. coli K-12 MG1655 knowledge bases, and each was assigned to the "Y-ome", "Well-Annotated", or "Excluded" categories (Supplementary Data S1). Of these 4653 genes, 2784 have information that indicates a sufficient level of functional and regulatory evidence to exclude them from the Y-ome (Fig. 2a). Thus, 1563 genes (34%) are in the Y-ome of E. coli K-12 MG1655. No individual knowledge base provides information to fully define the Y-ome, but EcoCyc comes the closest. Of the 1563 Y-ome genes, there were 131 for which we found no information in the knowledge bases (see Methods). The remaining 306 genes were marked as pseudogenes or cryptic genes and assigned to the "Excluded" category.
bioRxiv preprint first posted online May 23, 2018; doi: http://dx.doi.org/10.1101/328591. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Gene expression and chromosome location
It was previously observed by Hu et al. that poorly annotated genes tend to be expressed at a lower level than well-annotated genes (18). We confirmed this with the Y-ome by comparing gene expression of Y-ome genes and well-annotated genes in a compendium of RNA-seq data from our research group. The dataset includes expression values for 4319 E. coli genes across 85 conditions. Genes in the Y-ome tend to have lower expression across the surveyed conditions (Fig. 3), with an average normalized expression of 234 FPKM for Y-ome genes compared to 1583 FPKM for well-annotated genes (t-test p-value < 1×10^-6). Attempts to annotate Y-ome genes may be more successful if they prioritize the highly expressed Y-ome genes that have a greater potential to affect observable phenotypes. Alternatively, experiments that identify growth conditions with greater expression of Y-ome genes could help elucidate their functions because the genes are more likely to have a phenotypic effect under conditions where they are expressed.
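A comparison of this kind can be sketched with the standard library alone. The text does not state which t-test variant was used, so Welch's unequal-variance statistic is an assumption here, and the FPKM values are made up for illustration. The sketch computes only the test statistic, not the p-value.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Made-up per-gene mean FPKM values, for illustration only.
yome_fpkm = [12.0, 40.0, 85.0, 150.0, 300.0]
well_annotated_fpkm = [200.0, 650.0, 900.0, 1800.0, 4000.0]

t_stat = welch_t(yome_fpkm, well_annotated_fpkm)
print(f"mean Y-ome FPKM = {mean(yome_fpkm):.1f}")
print(f"mean well-annotated FPKM = {mean(well_annotated_fpkm):.1f}")
print(f"Welch t = {t_stat:.2f}")
```

A negative statistic indicates lower mean expression in the first group. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a p-value.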
We observed a low density of Y-ome genes near the origin of replication (ORI) of the E. coli chromosome and a high density of Y-ome genes in the termination region (opposite ORI, Fig. 4a). Highly expressed genes are known to be enriched near ORI (31-33), which was observed in our gene expression compendium (Fig. 4b).
These observations tell a simple story of highly expressed genes that have obvious effects on phenotypes under laboratory conditions and are therefore well-annotated, and lowly expressed genes that do not affect phenotypes enough to be easily characterized. However, the Y-ome genes with highest expression (above an arbitrary cutoff of 1000 FPKM) are split between the origin and termination regions (Fig. 4a), which suggests that some other factor might be keeping genes near the termination region from being characterized. High-throughput gene annotation might shed further light on this phenomenon.
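Because the chromosome is circular, distance from ORI has to be measured around the circle. The sketch below shows one way to do this and to compare the Y-ome fraction near the origin and near the terminus. The genome length and ORI coordinate are assumed reference values (the ORI coordinate is approximate), the quarter-genome cutoff is an arbitrary choice, and the gene positions in the example are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

GENOME_LENGTH = 4_641_652  # bp, assumed MG1655 reference length
ORI_POSITION = 3_925_000   # bp, approximate ORI coordinate (assumption)


def distance_from_ori(position: int) -> int:
    """Shortest distance in bp from a position to ORI on the circle."""
    d = abs(position - ORI_POSITION) % GENOME_LENGTH
    return min(d, GENOME_LENGTH - d)


def region(position: int) -> str:
    """Label a position as nearer the origin or nearer the terminus."""
    # The farthest point from ORI lies half a genome away, so a quarter
    # genome is the midpoint between origin and terminus.
    if distance_from_ori(position) < GENOME_LENGTH / 4:
        return "origin"
    return "terminus"


def yome_fraction_by_region(
    genes: Iterable[Tuple[int, bool]]
) -> Dict[str, float]:
    """Fraction of Y-ome genes per region from (position, is_yome) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    yome: Dict[str, int] = defaultdict(int)
    for position, is_yome in genes:
        label = region(position)
        totals[label] += 1
        yome[label] += int(is_yome)
    return {label: yome[label] / totals[label] for label in totals}


# Made-up gene positions, for illustration only.
toy_genes = [(3_900_000, False), (3_950_000, True),
             (1_600_000, True), (1_500_000, True)]
print(yome_fraction_by_region(toy_genes))
```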

Functions of Y-ome genes
The Y-ome of E. coli provides clues to the function of Y-ome genes. The most common terms associated with Y-ome genes can easily be extracted from E. coli knowledge bases (Table 1). These terms indicate that many membrane-associated proteins (485 genes) and particularly transporters (284 genes) remain to be annotated. Membrane-bound proteins and transporters are particularly hard to characterize with certainty (34), but high-throughput methods might change that, as they already have for enzymatic assays (35), gene-environment networks (36), and protein-protein interactions (18). Thus, the Y-ome offers a set of candidate transport-associated genes for high-throughput analysis.
High-throughput analysis could also be relevant for gene sets related to enzymes (269 genes), signaling (261 genes), lipoproteins (100 genes), and biofilms (69 genes). As evidence accumulates in E. coli knowledge bases, this workflow can be rerun to improve the candidate gene sets.
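Term counts of the kind reported in Table 1 can be sketched as a simple substring search over the descriptions collected for each gene. The search stems, gene names, and descriptions below are hypothetical; the actual term extraction is part of the workflow repository and may differ.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical search stems (assumption), loosely following the gene sets
# named in the text.
TERMS = ("membrane", "transport", "enzyme", "signal", "lipoprotein", "biofilm")


def count_terms(descriptions: Dict[str, Iterable[str]]) -> Counter:
    """Count genes whose descriptions mention each term.

    `descriptions` maps a gene name to its description strings from the
    knowledge bases. Each gene is counted at most once per term.
    """
    counts: Counter = Counter()
    for texts in descriptions.values():
        combined = " ".join(texts).lower()
        for term in TERMS:
            if term in combined:
                counts[term] += 1
    return counts


# Made-up genes and descriptions, for illustration only.
toy = {
    "geneA": ["Putative inner membrane transporter"],
    "geneB": ["Predicted lipoprotein", "membrane-anchored"],
}
print(count_terms(toy).most_common())
```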

Conclusion
The Y-ome represents a systematic accounting of genes lacking direct experimental evidence of function and should be valuable for guiding systematic gene function discovery. The accumulation of omics data is still accelerating, so a Big Data approach where multi-omic data types are combined with manual analysis and machine learning to derive new predictions shows promise towards elucidating the Y-ome (35,37). A data-driven discovery workflow could also be extended to poorly characterized organisms where high-throughput approaches to genome annotation can accelerate our accumulation of biological knowledge. The Y-ome is an ideal input to these approaches, and it can provide insights into possible gene function when paired with a genome annotation or gene expression data.
We defined the Y-ome by considering the effect a gene might have on a predictive model of cell phenotype, and this approach has corollary benefits for the longstanding effort to build predictive whole-cell models. Computational models are now being extended to include all cellular functions (8,38). The Y-ome encompasses genes that cannot be included in models because their contributions to cellular phenotypes are not understood (9). While whole-cell models provide a systematic approach to organizing our knowledge of organisms, the Y-ome is a systematic approach to evaluating our lack of knowledge in a genome. Comparing the 2784 well-annotated genes in E. coli to the 1678 genes in the latest genome-scale ME-model (39), it is clear that the models can grow by over a thousand genes before running up against our lack of knowledge (Fig. 2b). However, because unannotated genes are known to affect cell phenotype, the content of the Y-ome will have to eventually be addressed.
In 1998, a year after the first E. coli genome was released, Kenneth Rudd proposed a systematic naming scheme for unannotated open reading frames where each was given a unique name starting with the letter "y" (10). This is a convenient system, but the community did not settle on an official mechanism for assigning new names for these y-genes when functions were established. The tradition has been for newly identified functions to be published along with a proposed primary name, leaving it to peer reviewers to call out duplicate names and other issues. But without a central mechanism for standardized naming, many y-genes have been annotated without receiving new names (174 genes, Fig. 2c). Conversely, poorly annotated genes have received new names not starting with "y" because their function was partially established, determined based on computational predictions, or based on presence in an operon (448 genes, Fig. 2c). With the Y-ome, we can decouple gene names from assessments of functional annotation and provide a more consistent resource for anyone interested in systematic analysis of unannotated genes.
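The two kinds of mismatch between gene name and annotation status (Fig. 2c) amount to a simple cross-tabulation. The sketch below assumes a list of (primary name, category) pairs. Apart from yqhD, which the text describes as well-annotated, the gene names and categories are made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def name_vs_annotation(genes: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count genes whose primary name disagrees with their category.

    `genes` yields (primary_name, category) pairs, where category is
    "Y-ome", "Well-Annotated", or "Excluded".
    """
    counts = {"y_named_well_annotated": 0, "non_y_named_in_yome": 0}
    for name, category in genes:
        starts_with_y = name.startswith("y")
        if starts_with_y and category == "Well-Annotated":
            counts["y_named_well_annotated"] += 1
        elif not starts_with_y and category == "Y-ome":
            counts["non_y_named_in_yome"] += 1
    return counts


# Only yqhD reflects the text; the other entries are hypothetical.
toy = [
    ("yqhD", "Well-Annotated"),
    ("yxxA", "Y-ome"),
    ("abcX", "Y-ome"),
    ("abcY", "Well-Annotated"),
]
print(name_vs_annotation(toy))
```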
The concept of a Y-ome can be applied to any genome, and we hope that the provided workflow will inspire development of the Y-ome of other organisms. Knowledge bases that use the same knowledge base structure across organisms (e.g., EcoCyc/BioCyc) are a good place to start because many features of the workflow can be directly applied to a new organism. The developers and curators of knowledge bases play a central role in enabling this kind of workflow. To wit, the most useful feature across the five knowledge bases we analyzed was the "Evidence" section for EcoCyc genes; this section provides structured data on the experimental evidence for gene function, along with a literature reference. This feature is available now for the human genome on HumanCyc, so quick progress could be made on the human Y-ome.

Methods
A workflow to determine the E. coli Y-ome
experimental support" which were marked as Well-Annotated in the final categorization.
2. RegulonDB contains curated and experimentally-validated annotations of transcription factor function. Thus, genes with "Strong" evidence in RegulonDB were marked as Well-Annotated in the final categorization.
3. When EcoCyc and UniProt were both categorized as Well-Annotated for a given gene, then this gene was automatically marked as Well-Annotated in the final categorization. This heuristic was helpful to identify cases where EcoGene was missing key evidence that the other knowledge bases had picked up (e.g., dhaM).
4. Insertion elements, identified in EcoCyc by gene names beginning with "ins", were considered to be Well-Annotated in the final categorization.

Genes with no information
To identify genes for which no information at all is available, we filtered the database for genes with features drawn from knowledge-base-specific phrase lists that corresponded to genes with no other functional information. For example, "Putative uncharacterized" often appeared in such UniProt entries. As another example, EcoCyc genes with no information have summaries that begin with the phrase "No information about this" (e.g., "No information about this protein was found by a literature search conducted on February 23, 2017" for ybiU). The full list of phrases that were used can be found in the file "notebooks/17.. 11.02 Genes with no information.ipynb" in the workflow repository for this project (see Section Reproducibility).
When genes were annotated only with a protein domain or family, we still included them in the list because such domains (e.g. DUF1479 for ybiU ) often themselves have no functional information associated (DUF1479 has the description "Protein of unknown function" on Pfam: https://pfam.xfam.org/family/PF07350 ).
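The phrase-list filter can be sketched as follows. Only the two phrases quoted above are used; the full lists live in the workflow repository. Requiring every available entry to match a no-information phrase, and treating a knowledge base without a phrase list as informative, are assumptions of this sketch.

```python
from typing import Dict

# Phrases quoted in the text; the complete lists are in the repository.
NO_INFO_PHRASES = {
    "UniProt": ("putative uncharacterized",),
    "EcoCyc": ("no information about this",),
}


def has_no_information(entries: Dict[str, str]) -> bool:
    """True if every knowledge-base entry matches a no-information phrase.

    `entries` maps a knowledge-base name to the gene's description text.
    A knowledge base without a phrase list counts as informative.
    """
    if not entries:
        return True  # assumption: no entries at all means no information
    for knowledge_base, text in entries.items():
        phrases = NO_INFO_PHRASES.get(knowledge_base, ())
        if not any(phrase in text.lower() for phrase in phrases):
            return False
    return True


example = {
    "UniProt": "Putative uncharacterized protein",
    "EcoCyc": "No information about this protein was found by a "
              "literature search conducted on February 23, 2017",
}
print(has_no_information(example))
```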

Gene expression compendium
A compendium of RNA-seq data for E. coli K-12 MG1655 (wild type, single gene mutants, and laboratory evolution endpoints) was used to analyze expression of Y-ome genes. All RNA-seq experiments were conducted using the protocol described by Seo et al. (40). A total of 85 unique strain-condition pairs are included in the compendium, and it will soon be available for public use. Fragments per kilobase of exon per million fragments (FPKM) were calculated using Cufflinks (41), and the mean FPKM across all 85 conditions was calculated to generate a
was used to categorize genes from each database as "Well-Annotated" or "Y-ome" based on the definition of the Y-ome. Pseudogenes and phantom genes were excluded. The resulting Y-ome includes 1563 genes. (b) Y-ome categories were compared to the content of the latest E. coli genome-scale ME-model. 1184 genes that are annotated and not in the ME-model represent an opportunity to increase the scope of whole-cell models. The 76 genes in the ME-model and in the Y-ome might have model-driven evidence of function that could be used to systematically annotate them. (c) 174 genes have primary names that start with "y" but are well-annotated, and 448 genes in the Y-ome have non-"y" primary names.