NAT/NCS2-hound: a webserver for the detection and evolutionary classification of prokaryotic and eukaryotic nucleobase-cation symporters of the NAT/NCS2 family

Abstract Nucleobase transporters are important for supplying the cell with purines and/or pyrimidines, for controlling the intracellular pool of nucleotides, and for obtaining exogenous nitrogen/carbon sources for metabolism. Nucleobase transporters are also evaluated as potential targets for antimicrobial therapies, since several pathogenic microorganisms rely on purine/pyrimidine salvage from their hosts. The majority of known nucleobase transporters belong to the evolutionarily conserved and ubiquitous nucleobase-ascorbate transporter/nucleobase-cation symporter-2 (NAT/NCS2) protein family. Based on a large-scale phylogenetic analysis that we performed on thousands of prokaryotic proteomes, we developed a webserver that can detect and distinguish this family of transporters from other homologous families that recognize different substrates. The server can further categorize these transporters into evolutionary groups with distinct substrate preferences. The webserver scans whole proteomes and graphically displays which proteins are identified as NAT/NCS2, to which evolutionary groups and subgroups they belong, and which conserved motifs they have. For key subgroups and motifs, the server displays annotated information from published crystal structures and mutational studies pointing to key functional amino acids that may help experts assess the transport capability of the target sequences. On our evaluation set, the server was 100% accurate in detecting NAT/NCS2 family members. We also used the server to analyze 9,109 prokaryotic proteomes and identified Clostridia, Bacilli, β- and γ-Proteobacteria, Actinobacteria, and Fusobacteria as the taxa with the largest number of NAT/NCS2 transporters per proteome. An analysis of 120 representative eukaryotic proteomes also demonstrates the server's capability of correctly analyzing this major lineage, with plants emerging as the group with the highest number of NAT/NCS2 members per proteome.

Answer: We have now moved the URLs to the references and only cite the reference number in the main text. "36. JHipster -Generate your Spring Boot + Angular/React applications! [Internet].
[cited 2018 Oct 9]. Available from: https://www.jhipster.tech/ 37. Prediction and Evolutionary Classification Server of prokaryotic and eukaryotic NAT/NCS2 transporters [Internet]. [cited 2018 Oct 9]. Available from: http://bioinf.bio.uth.gr/nat-ncs2/ 43. NIH Human Microbiome Project -Project Catalog [Internet]. [cited 2018 Oct 9]. Available from: https://www.hmpdacc.org/catalog/" 3) include a "Availability of supporting source code and requirements" section (before the data availability section) List the following: Project name: e.g. My bioinformatics project Project home page: e.g. https://github.com/ISA-tools Operating system(s): e.g. Platform independent Programming language: e.g. Java Other requirements: e.g. Java 1.3.1 or higher, Tomcat 4.0 or higher License: e.g. GNU GPL, FreeBSD etc. RRID: if applicable, e.g. RRID: SCR_014986 This needs to be under an Open Source Initiative approved license where practicable compiled running software is made available. If the code is not hosted in a repository the GigaScience GitHub repository is also available for this purpose. Their reports, together with any other comments, are below. Please also take a moment to check our website at https://giga.editorialmanager.com/ for any additional comments that were saved as attachments.
In addition, please register any new software application in the SciCrunch.org database to receive a RRID (Research Resource Identification Initiative ID) number, and include this in your manuscript. This will facilitate tracking, reproducibility and reuse of your tool.
Answer: We now provide this section, as requested. Please note that we have also made some extra changes within the manuscript. We have now acknowledged two sources of funding, that will cover our article processing costs. In addition, Chris Armit from GigaDB has requested some modifications concerning supplementary tables that need to be provided as csv files and not excel files. We now mention these supplementary data within the manuscript accordingly. "All results and sequence IDs are found in supplementary tables 1-6" "…followed by many other γ-Proteobacteria (such as E.coli) with 10 members each (see supplementary tables 1 & 2)." "10 species with 50-100 strains and 9 species had over 100 strains (see supplementary_table_3_strain_volatility.csv)." "The number of NAT/NCS2 proteins per strain ranged from 14 to 0 (see supplementary tables 2 & 3)." "… in bacteria of the gastrointestinal tract (see supplementary table 4)." "… followed by Cluster 1 (14% of the total) and Cluster 2 (3% of the total) (see supplementary tables 5 and 6 for detailed results and analyzed sequences)." The due date for submitting the revised version of your article is 15 Oct 2018.
We look forward to receiving your revised manuscript soon.


Introduction
The NAT/NCS2 (Nucleobase-Ascorbate Transporter / Nucleobase-Cation Symporter-2) protein family encompasses ion-gradient-driven transporters of key metabolites or antimetabolite analogs with diverse substrate preferences, ranging from purine or pyrimidine permeases in various organisms to Na+-dependent vitamin C transporters in humans and other mammals [1][2][3][4][5][6]. Their additional function as providers of nitrogen/carbon sources may also affect energy production, replication and protein synthesis through the salvage pathways for nucleotide synthesis [7][8][9]. In addition to their important direct role in the central metabolism of the cell, these and other nucleobase transporters have attracted interest as potential targets of purine/pyrimidine-based antimicrobials that could either be selectively routed into target cells to act as antimetabolites or selectively inhibit an essential nucleobase transporter of the target cell [10][11][12][13][14]. This protein family is one of the 18 known families of the APC superfamily [15] and belongs to a subset of APC families that conform to a distinct structural/mechanistic pattern. The NAT/NCS2 transporters consist of 14 transmembrane segments (TMs) divided into two inverted repeats (7+7) and arranged spatially into a core domain (TMs 1-4 and 8-11) and a gate domain (TMs 5-7 and 12-14) [16]. The core domain contains all major determinants of the substrate-binding site, whereas the gate domain contributes to alternating access by allowing conformational rearrangements and providing major gating elements. The proteins probably function as homodimers and may use an elevator-like mechanism to achieve alternating access [17,18]. Similar structural features are described for transporters of two other APC families, the Sulfate Permeases (SulP) [19] and the Anion Exchangers (AE), which include the well-studied chloride/bicarbonate exchanger (band 3) of human erythrocytes [20].
The NAT/NCS2 family is split phylogenetically into two subfamilies. The first one, COG2233 or NAT, contains bacterial and fungal permeases for purines (xanthine, uric acid), bacterial permeases for pyrimidines (uracil, thymine), plant and mammalian broad-specificity uracil/purine permeases (not present in human), and the mammalian L-ascorbate transporters SVCT1 and SVCT2. Insight into the transport mechanism of this subfamily has been provided by high-resolution crystal structures for two members, the uracil permease UraA of E. coli [16,18] and the xanthine/uric acid permease UapA of Aspergillus nidulans [17], coupled with extensive mutagenesis studies on UapA [21], the xanthine permease XanQ of E. coli [1,22] and a few other homologs [23,24]. The other subfamily, COG2252 or AzgA-like [25], contains bacterial, fungal and plant permeases for salvageable purines (adenine, guanine, hypoxanthine), which are less well studied with respect to structure-function relationships [7,26].
Despite their importance, membrane transporters in general, and the NAT/NCS2 family in particular, have not been studied as extensively as other categories of proteins, due to the inherent difficulties in experimentation and in accurate prediction of their function [27,28].
Based on a large-scale evolutionary analysis that we performed in this study, we have i) identified in prokaryotes the major evolutionary groups and subgroups, with distinct substrate specificities, ii) identified key motifs for each phylogenetic group and subgroup that are related to substrate specificity, iii) developed a webserver that utilizes all the above information to detect and classify NAT/NCS2 transporters at proteome scale and iv) analyzed with this webserver 9109 prokaryotic and 120 eukaryotic proteomes so as to investigate which evolutionary lineages are rich in these transporters. We expect that this type of analysis and the accompanying computational tool, which are generally lacking for other families of transporters, will facilitate the experimental study of new homologs, provide a practical tool for the assignment of homologs into functionally relevant subgroups and also improve their annotation in the databases.

Development of HMMs and MEME motifs for the family, subfamilies and evolutionary clusters.
All the annotated sequences of the 2A APC superfamily (organized in 18 families) were obtained from TCDB [15]. For each of the 18 families we generated protein alignments with Muscle and Seaview [29,30]; these were manually edited and then used to build one hidden Markov model (HMM) per family with HMMER [31].
Next, 4442 bacterial and 213 archaeal proteomes were downloaded from UniProt (January 2017) [32]. Their protein sequences were scanned with the above 18 HMMs, and 8291 proteins of the NAT/NCS2 family were thus identified and retained for further analysis. Afterwards, close homologs were removed with the Blastclust software, using a cutoff of 70% protein identity over 70% of sequence length. After this step, 1355 NAT/NCS2 sequences were retained.
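The redundancy-removal step can be illustrated with a minimal Python sketch. It is not the Blastclust implementation: it assumes that pairwise identity and coverage values have already been computed (for example from an all-against-all BLAST run), links any two sequences that meet both 70% thresholds, and keeps one representative per single-linkage cluster. The function name, the input tuple layout and the choice of the alphabetically first identifier as representative are assumptions made for illustration only.

```python
def cluster_representatives(seq_ids, pairs, min_identity=70.0, min_coverage=70.0):
    """Single-linkage clustering; returns one representative ID per cluster.

    pairs: iterable of (id_a, id_b, percent_identity, percent_coverage),
           assumed to be precomputed by an external alignment tool.
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in pairs:
        if identity >= min_identity and coverage >= min_coverage:
            ra, rb = find(a), find(b)
            if ra != rb:
                # The smaller ID becomes the root, so the representative
                # of a cluster is its alphabetically first member.
                parent[max(ra, rb)] = min(ra, rb)

    return sorted({find(s) for s in seq_ids})


# Hypothetical example: P1-P2 and P2-P3 pass both cutoffs, P3-P4 does not.
example_pairs = [
    ("P1", "P2", 85.0, 90.0),
    ("P2", "P3", 72.0, 75.0),
    ("P3", "P4", 65.0, 95.0),
]
representatives = cluster_representatives(["P1", "P2", "P3", "P4"], example_pairs)
```

In this toy input, P1, P2 and P3 collapse into one cluster even though P1 and P3 were never compared directly, which is the defining property of single-linkage clustering.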
Subsequently, these sequences were fed to the MEME software [33] so as to identify 14 motifs of length 14-21 or 18-25 amino acids each. Manual inspection of sequences with a very low number of motifs resulted in rejection of 14 sequences, leaving 1341 sequences. These 1341 sequences were scanned again with the 14 MEME motifs, using MAST [33]. Custom Perl scripts were developed to obtain the motif presence/absence for each sequence as a vector of 0 and 1 values, based on detection with MAST (see supplementary folder Custom_scipts). The above vectors were clustered in MATLAB with the Clustergram function (default parameters; commands found in supplementary folder Custom_scipts). This first round of clustering revealed two major evolutionary subfamilies, designated SF1 and SF2 (see supplementary figure S1). The sequences of each subfamily were fed to another round of MEME motif detection with the same parameters as in the first instance. Again, 14 MEME motifs were generated for each subfamily. These were used with the MAST software to identify the MEME motif content of each subfamily, and vectors of motif presence/absence were again generated for each subfamily (their clusters were manually inspected; see supplementary figures S2 and S3). All MEME/MAST results and analyzed sequences are found in the supplementary folder "MEME_MAST_motifs".
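The conversion of MAST detections into presence/absence vectors can be sketched as follows. The original analysis used custom Perl scripts and the MATLAB Clustergram function; the Python below is only an illustrative stand-in. It assumes the MAST output has already been parsed into a mapping from sequence ID to detected motif numbers, and the Hamming distance shown is one simple way to compare two binary vectors, not necessarily the metric applied by Clustergram.

```python
def motif_vectors(mast_hits, n_motifs=14):
    """Build a 0/1 vector of length n_motifs for every sequence.

    mast_hits: dict mapping sequence ID -> iterable of detected motif
               numbers (1-based), assumed to be parsed from MAST output.
    """
    vectors = {}
    for seq_id, motifs in mast_hits.items():
        present = set(motifs)
        vectors[seq_id] = [1 if m in present else 0 for m in range(1, n_motifs + 1)]
    return vectors


def hamming(u, v):
    """Number of motif positions at which two vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# Hypothetical detections for three sequences.
hits = {
    "seqA": [1, 2, 3, 5],
    "seqB": [1, 2, 3, 5, 14],
    "seqC": [7, 8],
}
vectors = motif_vectors(hits)
d_ab = hamming(vectors["seqA"], vectors["seqB"])  # differ only at motif 14
d_ac = hamming(vectors["seqA"], vectors["seqC"])  # no motif in common
```

Sequences with similar motif content end up close to each other, which is what allows a hierarchical clustering of such vectors to separate subfamilies.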
Afterwards, the protein sequences of each subfamily were separately aligned with Muscle and manually edited in Seaview [29,30]. Furthermore, sequences with experimental evidence of substrate specificity (including eukaryotic ones) were added to each subfamily. Phylogenetic trees were generated with the BioNJ method using the Poisson model and 1000 bootstraps. The two generated phylogenetic trees (one for each of the two distinct subfamilies; see supplementary figures S4 and S5) were annotated and visualized in Archaeopteryx and Treedyn [34,35]. Subfamily 1 was organized in six major and four very small clusters. Subfamily 2, which was more homogeneous than subfamily 1, was organized in many small clusters. Hidden Markov models were thus constructed for the NAT/NCS2 family, its two subfamilies and each of the 6 major clusters in subfamily 1. We also generated extra HMMs for several of the small clusters in subfamily 2 that contained sequences with known substrates. In addition, we generated 14 MEME motifs for each subfamily and each of the 6 clusters in subfamily 1.
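For readers unfamiliar with the Poisson model mentioned above, the sketch below computes the standard Poisson-corrected distance, d = -ln(1 - p), where p is the proportion of differing residues between two aligned sequences. This is a generic textbook illustration and not the code used by the tree-building software; skipping every column that contains a gap is an assumption of this example.

```python
import math


def poisson_distance(seq_a, seq_b, gap="-"):
    """Poisson-corrected distance between two aligned protein sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: columns with a gap are ignored
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    p = differing / compared
    if p >= 1.0:
        return math.inf  # the correction is undefined when every site differs
    return -math.log(1.0 - p)


# Hypothetical aligned fragments: 5 ungapped columns, 1 difference (p = 0.2).
d = poisson_distance("MKTA-Y", "MRTAGY")
```

A matrix of such pairwise distances is the kind of input that a distance-based method such as BioNJ takes to build a tree.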
A workflow of how the various HMMs and MEME motifs were generated is found in supplementary figure S12_workflow_diagram. All edited sequence alignments, HMMs and phylogenetic trees (in newick format) are organized in supplementary folder "Sequence_alignments_HMMs_phylogenetic_trees".

Development and Evaluation of the server
All the above HMMs and MEME motifs were incorporated in a webserver, named NAT/NCS2-hound, that can scan protein sequences in FASTA format, identify members of this family and further classify them into the various subfamilies and clusters. The webserver is based on the JHipster application framework [36], which uses the Angular JavaScript framework for the front-end and the Java language with the Spring framework for the back-end. The server is freely available at [37]. The server and instructions for local installation are found in supplementary folder "Server_for_local_installation". The server is also registered at SciCrunch.org with RRID: SCR_016473.
Functional information for the various amino acids was obtained from several mutational studies [1,21,24] and from the structural studies on UraA [16,18] and UapA [17].
We performed an evaluation analysis to assess the effectiveness of the NAT/NCS2-hound server. TCDB-annotated transporters of the 18 families of the APC superfamily were used as bait to obtain best BLAST hits against bacterial reference proteomes downloaded from UniProt. Each best BLAST hit was assigned to the family of its TCDB-annotated bait sequence. These retrieved best BLAST hit sequences constituted the evaluation set. Any of these sequences that had been used to train the HMMs were removed from the evaluation set. Thus, we retained 7799 APC sequences, of which 975 belonged to the NAT/NCS2 family. These were scanned by our server for detection and evolutionary classification. The server demonstrated 100% accuracy (100% sensitivity and 100% specificity) in detecting NAT/NCS2 family members and can further categorize them into the various evolutionary subgroups, and display conserved motifs and relevant functional information/annotation.
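The reported figures follow from the usual confusion-matrix definitions, which can be written out as a short Python sketch. The counts of 7799 evaluated sequences and 975 NAT/NCS2 members come from the text; the zero false positives and zero false negatives are what the reported 100% sensitivity and specificity imply. The function name is illustrative.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)            # detected members / all members
    specificity = tn / (tn + fp)            # rejected non-members / all non-members
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


total_sequences = 7799       # APC sequences in the evaluation set
nat_ncs2_members = 975       # of which belong to the NAT/NCS2 family
non_members = total_sequences - nat_ncs2_members

# 100% sensitivity and specificity imply no missed members and no false hits.
sens, spec, acc = detection_metrics(tp=nat_ncs2_members, fp=0, tn=non_members, fn=0)
```

Because members make up only about one eighth of the evaluation set, reporting sensitivity and specificity separately is more informative than accuracy alone.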
In order to assess the distribution of the NAT/NCS2 family in major taxonomic lineages, 9109 prokaryotic proteomes (downloaded from NCBI in March 2018) and 120 eukaryotic reference proteomes (downloaded from UniProt in March 2018) were scanned by our server. The presence of at least seven MEME motifs was required as a cutoff, to filter out sequence fragments. All results and sequence IDs are found in supplementary tables 1-6.
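The seven-motif cutoff can be expressed as a simple filter. The sketch below is an assumption-laden illustration rather than the server's code: it assumes that motif detections are available per sequence as a list of motif numbers and that repeated occurrences of the same motif count once.

```python
def filter_fragments(hits_by_sequence, min_motifs=7):
    """Keep only sequences with at least min_motifs distinct motifs.

    hits_by_sequence: dict mapping sequence ID -> iterable of detected
                      motif numbers (hypothetical, pre-parsed input).
    """
    return {
        seq_id: sorted(set(motifs))
        for seq_id, motifs in hits_by_sequence.items()
        if len(set(motifs)) >= min_motifs
    }


# Hypothetical input: one near-complete protein and one short fragment.
kept = filter_fragments({
    "full_length": [1, 2, 3, 4, 5, 6, 7, 8],
    "fragment": [1, 2, 2, 3],
})
```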

Results and Discussion
The NAT/NCS2 family is organized in two major subfamilies. An analysis of 1341 proteins, based on the presence/absence of conserved MEME motifs within the NAT/NCS2 family, clearly revealed two distinct major subfamilies (see supplementary figure S1). Previous phylogenetic analyses also revealed these two major subfamilies [7], in accordance with the existence of two COGs, designated COG2233 (Xanthine/Uracil permease) and COG2252 (AzgA-like). Subfamily 1 (COG2233) consisted of 748 sequences and subfamily 2 (COG2252, AzgA-like) consisted of 593 sequences. The members of subfamily 1 display a greater degree of sequence divergence, whereas the members of subfamily 2 constitute a more homogeneous set of sequences (supplementary figures S2-S3).

Distinct Phylogenetic clusters within Subfamily 1.
Further phylogenetic analysis of the more diverse members of Subfamily 1 reveals the clear presence of six major clusters (Clusters 1-6) and four minor clusters (see Figure 1). The incorporation of functionally annotated sequences from all kingdoms of life further helped us understand the substrate specificity profile of each cluster, whenever relevant information for representative homologs was available. The largest and most diverse cluster (Cluster 1) contains sequences that have been annotated to transport xanthine, uric acid, or both. The second largest cluster (Cluster 2) contains sequences that are known to transport uracil, or uracil and thymine. The third largest cluster (Cluster 3) contains the YbbY gene from E. coli, a homolog that is not functionally annotated in the databases, although recent evidence suggests that it transports adenine, guanine and hypoxanthine (Botou and Frillingos, unpublished data). The fourth cluster (Cluster 4) contains functionally characterized sequences from eukaryotes, but also encompasses functionally unknown homologs from Archaea as well as a few Bacteria. None of the other clusters contains any sequence of known function.
Subfamily 2 is more homogeneous and is organized in many small clusters, with small differences among them. For several of those small clusters that contain sequences with known substrates we generated additional HMMs. The phylogenetic trees for each subfamily and for each of the 6 major clusters within subfamily 1 are available for more detailed inspection in the supplementary materials (supplementary figures S4-S11).
Figure 1. Phylogenetic tree of Subfamily 1 of the NAT/NCS2 family. The various phylogenetic clusters are depicted with different colors. Sequence redundancy was removed at a cutoff of 70% protein identity over 70% of sequence length. Well characterized known homologs are indicated with their major substrates in parentheses (U, uracil; T, thymine; A, adenine; G, guanine; HX, hypoxanthine; La, L-ascorbic acid; X, xanthine; UA, uric acid).
A web server for the detection and evolutionary classification of NAT/NCS2 family members.
All the above evolutionary analyses, together with the HMMs and MEME motifs generated for the various subfamilies and evolutionary clusters, have been used to develop a web server for the detection and evolutionary classification of NAT/NCS2 family members, named NAT/NCS2-hound. The webserver is freely available at [37]. This webserver detects and distinguishes this family of transporters from the other 17 homologous families of the APC superfamily. Furthermore, it can categorize these transporters into subfamilies and clusters associated with distinct substrate specificities, based on the large-scale phylogenetic analysis that we performed on prokaryotic proteomes. For each subfamily and cluster, the corresponding set of characteristic signature motifs is detected. In addition, for several key subgroups we have integrated information from published crystal structures and mutational studies to help experts identify key functional amino acids and assess the transport capability of the scanned sequences. Nevertheless, this server does not function as a prediction tool for substrate specificity. The NAT/NCS2-hound server implements for this important family the same principles and computational protocol that were recently developed for another prokaryotic superfamily, the tRNA synthetases [38].
The input for this server is a protein sequence or a proteome file in FASTA format. The webserver displays graphically (see Figure 2) which proteins have been identified as NAT/NCS2, to which subfamily and cluster they belong, and which conserved motifs have been identified on the target proteins. For several key subgroups and motifs, the server further displays information, annotated by our experts, from published crystal structures and mutational studies pointing to key functional amino acids of well-studied representative homologs.
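As an illustration of the expected input, the following Python sketch parses FASTA-formatted text into a mapping from sequence identifier to sequence, which can be useful for checking a proteome file before submission. It is a generic reader written for this example and does not reflect how the server itself handles input; taking the first whitespace-delimited token of the header as the identifier is an assumption, and the example records are made up.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of {identifier: sequence}."""
    records = {}
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            parts = line[1:].split()
            header = parts[0] if parts else ""  # identifier = first token
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Hypothetical two-record input with a wrapped sequence.
example = """>prot1 hypothetical permease
MKTAYIAKQR
QISFVKSH
>prot2
MSLLNV
"""
proteins = read_fasta(example)
```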