The Transporter Classification Database

The Transporter Classification Database (TCDB; http://www.tcdb.org) serves as a common reference point for transport protein research. The database contains more than 10 000 non-redundant proteins that represent all currently recognized families of transmembrane molecular transport systems. Proteins in TCDB are organized in a five-level hierarchical system, where the first two levels are the class and subclass, the next two are the family and subfamily, and the last one is the transport system. Superfamilies that contain multiple families are included as hyperlinks to the five-tier TC hierarchy. TCDB includes proteins from all types of living organisms and is the only transporter classification system that is both universal and recognized by the International Union of Biochemistry and Molecular Biology. It has been expanded by manual curation, contains extensive text descriptions providing structural, functional, mechanistic and evolutionary information, is supported by unique software and is interconnected to many other relevant databases. TCDB is of increasing usefulness to the international scientific community and can serve as a model for the expansion of database technologies. This manuscript describes an update of the database descriptions previously featured in NAR database issues.


INTRODUCTION: THE TC SYSTEM: DESIGN AND RATIONALIZATION
In 1995, Fleischmann et al. (1) published the full genome sequence of a living organism, Haemophilus influenzae, the first time such a feat had been accomplished. This revolutionary event marked the beginning of the genomics era. Because of our long-standing interest in molecular transmembrane transport, members of the Saier laboratory recognized the need for a classification system for transport systems equivalent to the Enzyme Commission (EC) system already in existence for enzymes (2). The EC system classified enzymes strictly on the basis of function, as it was designed before sequence and phylogenetic data were available. Even before the advent of the genomics revolution, it became clear that the EC system was tremendously deficient because it could not accommodate phylogenetic data without restructuring the entire system. Although considered desirable by many, such a restructuring of the EC system has never been achieved.
Even before 1995, our laboratory was conducting phylogenetic analyses of transport proteins [for review, see (3)]. We realized that phylogeny reflects protein structure, function and mechanism, and therefore, is an essential component of any molecular classification system. With a desire to conduct whole genome analyses of transporters, we recognized a need for a universal system of transport protein classification that took cognizance of both function and phylogeny. With this conviction in mind, we designed what is now known as the Transporter Classification (TC) system.
Transporters in the TC Database (TCDB) are classified using a functional/phylogenetic five-tier system (4,5) as follows: N1.L1.N2.N3.N4, where N is a number and L is a letter. N1 is the class; L1 is the subclass; N2 is the family (sometimes actually a superfamily); N3 is the subfamily (or the family in the case of a superfamily); and N4 is the actual transport system. Classes 1-5 are well defined (channels, secondary carriers, primary active transporters, group translocators and transmembrane electron carriers, respectively); classes 6-7 are presently empty, being reserved for yet to be discovered classes; and classes 8 and 9 represent accessory proteins and incompletely characterized proteins, respectively. This system, describing transport systems from all types of living organisms, was formally adopted by the International Union of Biochemistry and Molecular Biology (IUBMB) in June 2001 and has served the international scientific community effectively ever since (6)(7)(8)(9).
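The N1.L1.N2.N3.N4 layout lends itself to mechanical parsing. The following is a minimal Python sketch, not TCDB's own code: the names `TCNumber`, `parse_tc_number` and `TC_CLASSES` are hypothetical, the class labels are taken from the text above, and the assumption that the subclass is a single upper-case letter is ours.

```python
import re
from typing import NamedTuple

# Class labels as listed in the text; classes 6 and 7 are reserved and empty.
TC_CLASSES = {
    1: "channels",
    2: "secondary carriers",
    3: "primary active transporters",
    4: "group translocators",
    5: "transmembrane electron carriers",
    8: "accessory proteins",
    9: "incompletely characterized proteins",
}


class TCNumber(NamedTuple):
    """One TC identifier split into its five tiers (hypothetical helper)."""
    tc_class: int   # N1
    subclass: str   # L1
    family: int     # N2 (family, or sometimes a superfamily)
    subfamily: int  # N3
    system: int     # N4

    def family_id(self) -> str:
        """Return the first three tiers, e.g. '2.A.1'."""
        return f"{self.tc_class}.{self.subclass}.{self.family}"


# Assumption: one digit, one upper-case letter, then three integers.
_TC_PATTERN = re.compile(r"^(\d)\.([A-Z])\.(\d+)\.(\d+)\.(\d+)$")


def parse_tc_number(text: str) -> TCNumber:
    """Parse 'N1.L1.N2.N3.N4' and reject classes that are not in use."""
    match = _TC_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a five-tier TC number: {text!r}")
    n1, l1, n2, n3, n4 = match.groups()
    tc = TCNumber(int(n1), l1, int(n2), int(n3), int(n4))
    if tc.tc_class not in TC_CLASSES:
        raise ValueError(f"class {tc.tc_class} is reserved or undefined")
    return tc
```

For an identifier of the form `2.A.1.1.1`, `parse_tc_number` returns the five tiers separately, and `TC_CLASSES[tc.tc_class]` gives the class label.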

DATABASE CONTENT AND ACCESS
Encoded within the relational database schema is the functional/phylogenetic TC taxonomy (Figure 1). Users can access the information through our intuitive interface, which returns populated HTML to the web browser client (the superficial tier) and allows the data to be viewed at different levels of granularity. Users can enter at the top levels for information about classes and families and descend to the deepest level for information about individual proteins.
Since its last publication in the NAR database issue in 2009 (5), there has been significant change in the database design (schema above). Some basic issues pertaining to data integrity, redundancy and management have led to conversion of the MySQL Table Engine from MyISAM to InnoDB. Perhaps the most important justification for this conversion is the fact that different levels of TC classification have a type of parent-child relationship. A foreign key constraint should allow cascading action when a row (tuple) is inserted/updated/deleted. Thus, all related tables are affected, leaving no orphaned records. Roughly one half of the schema follows the standard relationships between class, subclass, superfamily, family, cluster or subfamily and system, and the other half shows tables of information pertaining to unique UniProt protein accession numbers.
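The cascading behaviour described above can be illustrated with a small, self-contained sketch. Two caveats apply: it uses Python's built-in SQLite driver rather than MySQL/InnoDB, so that it runs without a server, and the table and column names (`family`, `transport_system`, `tcid`, `family_tcid`) are hypothetical and are not taken from the TCDB schema. The `ON DELETE CASCADE` and `ON UPDATE CASCADE` clauses are standard SQL that InnoDB also supports.

```python
import sqlite3
from typing import Tuple

# Hypothetical two-table parent/child schema (not the real TCDB tables).
SCHEMA = """
CREATE TABLE family (
    tcid TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE transport_system (
    tcid        TEXT PRIMARY KEY,
    family_tcid TEXT NOT NULL
        REFERENCES family(tcid) ON DELETE CASCADE ON UPDATE CASCADE
);
"""


def cascade_demo() -> Tuple[int, int]:
    """Delete a parent row and report child-row counts before and after."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO family VALUES ('2.A.1', 'example family')")
    conn.executemany(
        "INSERT INTO transport_system VALUES (?, ?)",
        [("2.A.1.1.1", "2.A.1"), ("2.A.1.1.2", "2.A.1")],
    )
    count_sql = "SELECT COUNT(*) FROM transport_system"
    before = conn.execute(count_sql).fetchone()[0]
    # Removing the parent removes its children, so no orphans remain.
    conn.execute("DELETE FROM family WHERE tcid = '2.A.1'")
    after = conn.execute(count_sql).fetchone()[0]
    conn.close()
    return before, after
```

Without the foreign key constraint, the two child rows would survive the delete and point at a family that no longer exists, which is the orphaned-record problem the engine conversion was meant to prevent.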
The steps involved and basic ideas behind the TCDB Admin interface for curation are the same as above and follow the DB design schema. However, the look and feel of the interface has changed since its update in 2010, along with some new options such as 'View Task Queue' and 'View Staff Logs'. We share our mapping file with different databases, and these files are automatically updated every time a new protein is added to the database.
The entire web interface has been revamped. The new look and feel should be consistent across all major browsers, easier to navigate, URL friendly, and overall, a huge improvement from the previous HTML frame-based web pages. For example, the browse tab for viewing the TC System (http://www.tcdb.org/browse.php) has been entirely redesigned using jQuery. For a more detailed description of the capabilities available to the user, see Wakabayashi et al. (10).
In addition to the search option under the search tab, one can search TCDB from a search box on the main page using single or multiple terms including TC ID#, key word, protein name or abbreviation, organismal source, author name, UniProt accession number, PDB ID number, associated disease, reference, etc. The following details are returned with a protein search, or can be easily accessed following such a search: (i) TC ID#, (ii) reference, (iii) accession number, (iv) protein name, (v) length, (vi) molecular weight, (vii) species, (viii) predicted number of TMSs, (ix) location/topology/orientation and (x) database of interacting proteins (DIPs) and Pfam reference.

Figure 1. Current MySQL schema, displayed using Workbench 6.0 CE and showing the tables currently in TCDB's database architecture. Each line in a table represents a column and displays which datatype (such as int, varchar, text, etc.) can be stored. Ten tables, which are not being used directly by TCDB but that have been used for maintenance tasks are not shown in the diagram: test, lang error, proteinold, tc2acc broke, tc2acc 1, flags, cflags, temp_tms, temp_preds and misc. A table that has a trifork (entity relationships) pointing toward it contains a column with explicit IDs from another table. The tables having no entity relationships are grouped on the left. The diagram contains four layers (left to right, and from top to bottom): the protein layer (green), the family layer (yellow), the ontology layer (blue) and the compounds layer (red).
The user is also given the option of BLASTing or PSI-BLASTing the protein against either the non-redundant National Center for Biotechnology Information (NCBI) database or TCDB (accessed from the sidebar). Additional analysis options, such as predicting the number of TMSs through hydropathy plots, are also available (see below).
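To illustrate the idea behind hydropathy-based TMS prediction, the sketch below computes a sliding-window hydropathy profile and counts hydrophobic stretches. It is not the program used by TCDB. It assumes the widely used Kyte-Doolittle scale with the conventional 19-residue window and 1.6 cutoff, and the function names are hypothetical.

```python
from typing import List

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 19) -> List[float]:
    """Mean hydropathy of every full window along the sequence."""
    sequence = sequence.upper()
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    # A non-standard residue letter raises KeyError here.
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    running = sum(values[:window])
    profile = [running / window]
    for i in range(window, len(values)):
        # Slide by one residue: add the new value, drop the oldest one.
        running += values[i] - values[i - window]
        profile.append(running / window)
    return profile


def count_hydrophobic_segments(profile: List[float],
                               threshold: float = 1.6) -> int:
    """Count maximal runs of consecutive windows above the threshold."""
    count = 0
    inside = False
    for value in profile:
        if value > threshold and not inside:
            count += 1
        inside = value > threshold
    return count
```

Each run above the threshold is a candidate TMS. A production predictor would also weigh segment length, charge distribution and amphipathicity, which this sketch ignores.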
TCDB collaborates with many important databases (see Reference #10 for recent technical improvements), and shares cross-database links with them; these are available on the individual protein pages. Internal hyperlinks connecting references to classes, families and proteins have been updated.

RECENT TECHNICAL IMPROVEMENTS (2011-13)
We have: (1) Incorporated an improved administration page, built-in semi-automatic machine learning tools (11) [...] and SFT2, which use tens of thousands of BLAST bit scores instead of multiple alignments, thus avoiding the pitfalls often encountered when determining the phylogeny of distantly related proteins (31)(32)(33). While SFT1 constructs trees allowing visualization of individual proteins, SFT2 allows depiction of family/subfamily relationships (31)(32)(33). (17) Provided a mechanism for user-generated input.

GROWTH OF THE DATABASE (2010-13)
A file containing the current sequence set is available for download from: http://www.tcdb.org/public/tcdb. About 150 TC families are introduced each year, reflecting the extensive and continual manual curation work being conducted. Figure 2 shows the parallel growth of TCDB protein, family and superfamily compositions from 2010 to 2013. However, it should be noted that each year, several families in Class 9 are moved to classes 1-5 when sufficient information becomes available to allow definition of their mechanisms of action.

SUPERFAMILY ADDITIONS (2011-13)
Analyses (34)(35)(36)(37)(38)(39)(40)(41)(42)(43) have revealed distant relationships between preexisting TC families. These relationships have been integrated into TCDB as a hyperlink, and superfamily relationships are mentioned with hyperlinks in the description of each constituent family. The number of superfamilies that are either new or expanded (marked with superscript 'a' in Table 1) has more than doubled during the last 3 years (Figure 2), and the further expansion of such knowledge continues.

ESTABLISHING HOMOLOGY BETWEEN PROTEINS USING TCDB-RELATED SOFTWARE
Affiliation with a family requires satisfying rigorous statistical criteria of homology. Superfamily status is based on the superfamily principle (44,45), stating that if protein A is homologous to protein B, and protein B is homologous to protein C, then protein A must be homologous to protein C, regardless of the degree of sequence similarity observed between proteins A and C. To avoid the concern of convergent evolution, the minimal length of aligned sequences to establish homology is 60 residues, and the comparison score must be at least 12 standard deviations using the GSAT program [see also Wakabayashi et al. (10)]. As the protein databases grow, this value must be increased (44)(45)(46). It should be noted that homology means 'derived from a common evolutionary origin'. Homology is therefore an absolute term and does not require a specific degree of sequence similarity between any two protein sequences such as sequences A and C discussed above (45).
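The transitive reasoning behind the superfamily principle can be sketched as a grouping problem: every pairwise comparison that meets both criteria links two proteins, and proteins connected through any chain of links fall into one group. The Python sketch below uses a union-find structure for this. It is an illustration under stated assumptions, not the GSAT program or any TCDB software; the function name and the input tuple format are hypothetical, and the two constants are the thresholds quoted above.

```python
from typing import Dict, Iterable, List, Set, Tuple

MIN_ALIGNED_RESIDUES = 60  # minimal aligned length quoted in the text
MIN_SCORE_SD = 12.0        # minimal comparison score, in standard deviations

# Hypothetical input record: (protein_a, protein_b, aligned_length, score_sd)
Comparison = Tuple[str, str, int, float]


def group_by_homology(comparisons: Iterable[Comparison]) -> List[List[str]]:
    """Group proteins linked, directly or transitively, by passing scores."""
    parent: Dict[str, str] = {}

    def find(protein: str) -> str:
        # Register unseen proteins as their own root, then walk to the root.
        parent.setdefault(protein, protein)
        while parent[protein] != protein:
            parent[protein] = parent[parent[protein]]  # path halving
            protein = parent[protein]
        return protein

    for a, b, aligned_length, score_sd in comparisons:
        root_a, root_b = find(a), find(b)
        if aligned_length >= MIN_ALIGNED_RESIDUES and score_sd >= MIN_SCORE_SD:
            parent[root_a] = root_b  # merge the two groups

    groups: Dict[str, Set[str]] = {}
    for protein in list(parent):
        groups.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in groups.values())
```

If A-B and B-C both pass while the direct A-C comparison scores far below the cutoff, A, B and C still end up in one group, which mirrors the statement that homology does not require a specific degree of similarity between A and C.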

OTHER TRANSPORT DATABASES
Only TCDB is comprehensive, including transport systems from all living organisms, and only TCDB has been adopted by the IUBMB. However, several databases have been developed (see Table 2) which represent transporters in restricted groups of organisms, or are restricted to a certain category of transporter: (i) TransportDB (59) contains computerized annotations of transport proteins in organisms with fully sequenced genomes, and classifies them according to TCDB using a semi-automated pipeline.

HARMONIZATION AND FUTURE GOALS
The most important goals we have identified for future development of TCDB include (i) the creation of an ontology for the TCDB database, (ii) improving our integration with Pfam and (iii) streamlining the use of phylogeny and synteny information to provide functional predictions. Some of the new functions will be implemented as links, and some as software. Synteny should probably be implemented as links, because the information is often already available elsewhere (Microbes Online, JGI's intuitive resource IMG, SEED and RegPredict). Pfam may prove more difficult, because many families in Pfam are incomplete or not appropriately arranged in clans. Working with Pfam as we have in the past (69), we plan to improve upon the transport protein section of this database. It is well-known that many families that include domain duplicated transporters do not accurately reflect the domain borders in the way hidden Markov models (HMMs) have been trained (68). Currently, we do not show 'repeat units' in TCDB, even though this information is presented in many of our publications. We will continue to work with Pfam to integrate and coordinate information in both databases in a systematic way (69). Ideally, such a process should be automated or semiautomated.
Another worthwhile goal is to establish the user base so we can serve the needs of the scientific community more effectively. We plan to collect more access statistics to understand the needs of the user community. Google Analytics was installed in 2011, but improvements are required so we can recognize which TCDB features are most used.
One million PubMed abstracts are created every year, and 10% of the 2012 abstracts were not indexed as of May 2013. Other databases that link to TCDB, such as EcoGene (70), manually review literature. 'Transporter' is a MESH term PubMed uses, but there is a 6-month delay to add MESH terms, and sometimes the word 'Transporter' is not obvious from the title. TCDB uses machine learning classifiers, as well as keyword searches which are continuously extracted from TCDB and used as search terms to identify relevant articles. We are considering new ways for users to provide sequence data and information either with or without the use of email; suggestions by email could be used as test sets to evaluate the efficiency of an automated process. We are also considering implementing links for reference, sequence and information input from users. Adding a feature allowing TCDB to be searched as a library of HMMs is also under consideration. Current TCDB users report that the present system of presenting search results is satisfactory, but we constantly strive to improve the database, and suggestions from users are most welcome.
TCDB needs an ontological hierarchical system and a controlled vocabulary. EBI's ChemDB (71) has created a chemical classification system, and we have already set up a prototype which can be accessed from this link: http://www.tcdb.org/ontology/. The substrate text needs to be extracted from the description and then correlated with ChemDB. One system already exists, but due to inconsistencies in the description, it has been difficult to implement. If we could link with gene ontology, TC numbers would be more accessible. Another important area for improvement concerns user access to the most recent entries. Perhaps TCDB should have 'recent releases', such as those of Pfam. Since we already track protein histories, adding this feature would not be difficult. Some basic statistics, where database growth can be followed, are already available at: http://www.tcdb.org/search/index.php. We are currently undertaking the development of standardized workflows to confirm homology results from TCDB's in-house statistical methods, based on structural superimposition and HMM:HMM comparisons. For instance, we use structural superimposition in addition to sequence statistical analyses to identify or confirm structural and evolutionary relationships between members of a superfamily (40). This helps to establish reference points in structural space for homology detection.

Table 1 (fragment). Amino acid/Polyamine/organoCation (APC)^a; Viral envelope glycoprotein (Env)^a. ^a New or recently expanded superfamilies.

CONCLUSION
In 2006, TCDB contained 3000 proteins, classified into 400 families, but in 2013 it exceeded 10 000 proteins in 750 families. The availability of TCDB has allowed major basic research advances including answering fundamental biological questions, determining the routes of evolution taken for the appearance of these proteins, identifying superfamily relationships and allowing structural, functional and mechanistic predictions. Within practical limits, TCDB reflects the current state of our knowledge concerning its constituent parts.

Conflict of interest statement. None declared.