SAFPred: synteny-aware gene function prediction for bacteria using protein embeddings

Abstract Motivation Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for bacteria. Recently, transformer-based language models—adopted from the natural language processing field—have been used to obtain new representations of proteins, to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications on bacterial genomes. Results To predict gene functions in bacteria, we developed SAFPred, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAFpred also leverages the unique operon structure of bacteria through conserved synteny. SAFPred outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAFPred to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health. Availability and implementation https://github.com/AbeelLab/safpred.


Supplementary Text: SwissProt Dataset for Benchmarking
SAFPred was developed to improve gene function prediction in bacteria, and to evaluate our method on this specialized task we built a bacterial benchmarking dataset based on the SwissProt database (The UniProt Consortium, 2018).
SwissProt Database (release 2021-04, retrieval date 10 November 2021) forms the basis of our benchmarking dataset.First, we limited the entries to proteins of length [40,1000] to minimize the effect of protein length on our study.Then, we restricted the dataset to include only experimental GO annotations, we retained SwissProt entries with at least one GO term with the evidence codes EXP, IDA, IPI, IMP, IGI, IEP, HTP, HDA, HMP, HGI, HEP, IBA, IBD, IKR, IRD, IC, and TAS.Finally, we clustered the proteins using CD-HIT at 95% sequence similarity threshold to reduce redundancy (Li and Godzik, 2006).At the end of this procedure, we had 107,818 proteins in total.This dataset reflects the full scope of organisms present in SwissProt, with 2005 unique tax IDs, 737 of which belong to bacteria including proteobacteria, terrabacteria, archaea and spirochaetes.
To create our bacterial benchmark datasets, we selected the five bacterial organisms that had the largest number of entries in SwissProt: Escherichia coli (EC), Mycobacterium tuberculosis (MT), Bacillus subtilis (BS), Pseudomonas aeruginosa (PA), and Salmonella typhimurium (ST).For each bacteria, we obtained the "Full" dataset by reserving all proteins that belong to the organism to the test set.I.e., the "Full" EC dataset comprises a test set made up of only E. coli genes (3454 entries), and a training set of the remainder entries from SwissProt (104377 entries).
Next, to evaluate SAFPred on more challenging prediction scenarios where the goal is to predict gene function with few or no homologs in the existing databases, we created additional train/test set pairs for each bacteria.We removed proteins from the training set if they were more than 40%, 50%, 60%, 70% and 80% similar to any protein in the corresponding test set.This led to 6 train/test pairs for bacteria; the test set size remained constant, while the training set decreases in size as the sequence similarity threshold increases because more proteins are removed from the training set.In total, we had 30 train/test pairs (Table 1 in the manuscript).
In addition to the training set derived from SwissProt, SAFPred relies on the synteny database, SAFPredDB.Thus, we followed the same procedure to create additional databases corresponding to the preset sequence similarity threshold.We ran BLASTp on all genes in SAFPredDB, and removed members from the database if they were more than 40%, 50%, 60%, 70%, 80% and 95% similar to any protein in the corresponding test set.This operation altered the contents of the entries, thus we recalculated the intergenic distance for each entry in SAFPredDB and split the regions if the intergenic distance exceeded our 300bp threshold.

Comparison of SAFPredDB entries to experimentally determined operons
The goal of constructing our own large-scale synteny database (SAFPredDB) was to model the landscape of functions encapsulated within bacterial synteny in sufficient detail to be able to leverage this resource to improve bacterial function prediction.In this section, we validate SAFPredDB as an adequate database of conserved syntenic regions by comparing it to a smaller operon database published previously titled Operon Database (ODB v4) (Okuda and Yoshizawa 2010).In constructing SAFPredDB, we used information from ODB to refine thresholds such as region length, number of genes in a region, and intergenic distance between adjacent genes in a region (Methods).ODB contains a curated list of experimentally determined operons obtained from the literature.The conserved operon database from ODB, referred to here as "ODB Conserved", is essentially an expansion on their known, experimentally determined operons, where the additional predicted operon instances were identified in additional genomes.Overall, our synteny database was similar to the ODB Conserved database, in terms of region length, number of genes in an entry and intergenic distance within entries (Fig S1 -S3).Thus, with SAFPredDB we provide a database even more extensive than ODB Conserved, representing a more up-to-date and broader range of diversity within the bacterial kingdom.With our approach, we also demonstrated that we could either partially or completely predict existing bacterial operons accurately.In order to assess whether we could reconstruct experimentally-determined operons in ODB for E. coli, based on our predicted regions in SAFPredDB, we extracted ODB operons containing at least one E. coli genes (2845 total operons, including 1071 non-singleton operons).We compared this to the predicted syntenic regions containing at least one E. coli gene in SAFPredDB, including 15,618 non-singleton entries, of which 87% (13,610 entries) partially or fully overlap with the known ODB operons.Full concordance is not expected, as our database is substantially larger and encompasses far more genes.In addition, our database is inherently redundant; a single experimentally known operon from ODB can be split into multiple entries in SAFPredDB.Thus, it is highly likely that the remaining 2008 entries (23% of SAFPredDB) we predicted in addition to the ODB operons can be collapsed into a smaller number of actual syntenic regions or operons.
We selected a few predictions showcasing the accuracy of SAFPredDB in reproducing experimentally known operons (Table S1).Although SAFPredDB often accurately reproduces known operons, it can sometimes capture additional genes in the flanking regions (i.e.IS3 family transposase genes flanking the relBEF toxin-antitoxin system in Table S1).Although this can give us additional information about conserved mechanisms related to the operon's mobilization, this can also lead to false positive annotation transfers using SAFPredDB.

Annotating SAFPredDB entries
Although GTDB is the largest catalog of representative bacterial genomes, it does not include experimental annotations for the gene sequences.Thus, we assigned function to the SAFPredDB entries by transferring GO terms from similar protein sequences in the SwissProt database, using thresholds on both the pairwise identity and significance (Methods).When transferring GO terms to SAFPredDB entries, we aimed to balance minimizing false positives while retaining as many annotations as possible, in order to avoid having a database that was too sparsely annotated to be useful for predicting gene function.In our full SwissProt dataset experiments, 95.6% of the non-singleton entries (388,377 out of 406,293) were annotated with at least one GO term; however, in remote homology experiments, SAFPredDB was a lot more sparse in annotations.Since we removed training proteins that had more than a certain predefined level of sequence homology to any of the test proteins in order to create the remote homology datasets, there were only rare proteins with no homologs left in the database.Given that such rare proteins are less likely to be studied, annotated or even carry out any function within the cell, this loss of functional information was expected.Since SAFPredDB is a major source of annotation that SAFPred relies on, SAFPred's prediction coverage also suffers from this loss of information, and we observed that the difference in coverage between SAFPred and DeepGOPlus widens as more annotations are lost (Table S13 and S14).Furthermore, we note that annotations were transferred at uneven rates for different categories of GO terms, consistent with previous findings in the literature.When sequence homology is used as the basis for transferring annotation, GO terms in the Molecular Function (MFO) category are more likely to be transferred than those for Biological Process (BPO) and Cellular Component (CCO), as MFO can more readily be modeled using the primary sequence, or features derived from it (Zhou et al. 2019).We also report that the number of entries annotated with at least one GO term is the largest for the MFO category, even though the total number of MFO terms available in the SwissProt dataset is significantly smaller than that of BPO (Table S3).

Analysis of false positive enterococcal toxin predictions
To detect novel toxins, we first collected known pore-forming toxin genes observed in Enterococcus or closely related genera.Table S4 lists all such toxins, along with their Uniprot IDs and PDB identifiers, if their structure is available in the PDB database.
Table S4.Known pore-forming toxin genes found in Enterococcus or closely related genera.Protein structures for all but two genes have been solved experimentally and were downloaded from PDB.For Leucotoxin and Alveolysin, AlphaFold was used to predict their structure since they were not found in the databases.

Gene
Sequence Among the 10 toxin genes, we located the heat-liable enterotoxin B chain in an entry in SAFPredDB.
Forty-eight genes were associated with the toxin operon in SAFPredDB, but did not have structural similarity to a known toxin protein fold.The SAFPredDB entry codes for two membrane-associated proteins: 1) the toxin gene itself, and 2) another membrane protein of unknown function.Based on FoldSeek search results, 33 of these 48 genes appeared to perform other functions related to the cell membrane, involving signaling, secretion, and pore formation.To gain further insight into why these 48 additional genes without toxin structural folds were matched to the delta toxin entry in SAFPredDB, we compared the Euclidean similarity of embeddings vectors extracted from all 59 genes to the embedding vector of the known delta toxin operon (Table S5).While we did not find any significant differences in terms of the absolute similarity values between those 11 that had structural similarity to the delta toxin operon and the rest, we noted that the SAFPredDB toxin entry was consistently ranked in the top two among all SAFPredDB entries assigned to the 11 likely toxins, whereas for the remaining 48 genes this was not the case.This suggests that the ranking could be used in future versions of SAFPred as a proxy to infer confidence in region assignments and to reduce false positives.

Fig
Fig S1.The entries in SAFPredDB have a similar size distribution to those in ODB: A. operons in ODB.B. entries in SAFPredDB.Only non-singleton entries are shown.

Fig
Fig S2.The entries in SAFPredDB have similar intergenic distance (bp) distribution to those in ODB: A. operons in ODB B. entries in SAFPredDB.Only non-singleton entries are shown.

Table S1 .
Our synteny prediction algorithm can reproduce experimentally known operons in E. coli.Three example entries from SAFPredDB are shown, together with their corresponding operons in ODB, which were manually curated based on experimental studies.

Table S2 .
GO term annotations in SAFPredDB were more sparse in the remote homology detection experiments.
Percentage of entries in SAFPredDB that were annotated with at least one GO term for each GO term category and every organism in our SwissProt benchmarks.

Table S3 .
Annotation statistics and information content (IC) of entries in SAFPredDB.GO terms in the Molecular Function (MFO) category are more likely to be transferred when using sequence homology for annotation transfer.
b Molecular Function c Cellular Component Uniprot IDs for the first 8 rows, and NCBI protein IDs for the last two rows.

Table S5 .
SAFPred associated 59 enterococcal proteins with the delta toxin entry from SAFPredDB (entry ID 395786), found in Clostridium and Roseburia.The contents of this entry are shown here.Supplementary Table.S7.F max values for the full SwissProt dataset for the molecular function and cellular component ontologies. F

max values for each of five species 1,2
1 EC: Escherichia coli, MT: Mycobacterium tuberculosis, BS: Bacillus subtilis, PA: Pseudomonas aeruginosa and ST: Salmonella typhimurium. 2ighest value for each species and for each GO category are shown in bold.Supplementary Table.S8.S min values for the full SwissProt dataset.S

min values for each of five species 1,2
Area under the precision/recall curve for the full SwissProt dataset, for all 5 bacterial organisms and three GO categories.
1 EC: Escherichia coli, MT: Mycobacterium tuberculosis, BS: Bacillus subtilis, PA: Pseudomonas aeruginosa and ST: Salmonella typhimurium. 2Lowest value for each species and for each GO category are shown in bold.Supplementary Table.S9.

under precision/recall curve for each of five species 1,2
Salmonella typhimurium 2 The highest value for each species and for each GO category are shown in bold Supplementary Table.S10.F max values for the entire E. coli SwissProt benchmark sets.Supplementary Table.S12.F max values for the entire B. subtilis SwissProt benchmark sets.Prediction coverage 1 (in %) for the full SwissProt dataset.The number of test proteins annotated with at least one GO term at the threshold which maximizes the F1-score. 2EC: Escherichia coli, MT: Mycobacterium tuberculosis, BS: Bacillus subtilis, PA: Pseudomonas aeruginosa and ST: Prediction coverage 1 (in %) for the entire SwissProt benchmark evaluation, averaged over 5 bacterial organisms for three GO categories.The standard deviation across the 5 bacterial organisms is indicated in parentheses.The number of test proteins annotated with at least one GO term at the threshold which maximizes the F1-score. 2Because the Pfam method uses a different training set for predictions, it is not possible to compare the change in the coverage across different train/test pairs.

Table S17 .
Maximum % sequence identity of Enterococcus toxin genes to the 11 putative novel toxins predicted by SAFPred.