BCL2DB: database of BCL-2 family members and BH3-only proteins

BCL2DB (http://bcl2db.ibcp.fr) is a database designed to integrate data on BCL-2 family members and BH3-only proteins. These proteins control the mitochondrial apoptotic pathway and probably many other cellular processes as well. This large protein group is formed by a family of pro-apoptotic and anti-apoptotic homologs that have phylogenetic relationships with BCL-2, and by a collection of evolutionarily and structurally unrelated proteins characterized by the presence of a region of local sequence similarity with BCL-2, termed the BH3 motif. BCL2DB is rebuilt monthly by an automated procedure that relies on a set of in-house profile HMMs computed from seed reference sequences representative of the various BCL-2 homologs and BH3-only proteins. The BCL2DB entries integrate data from the Ensembl, Ensembl Genomes, European Nucleotide Archive and Protein Data Bank databases and are enriched with specific information such as protein classification into orthology groups and distribution of BH motifs along the sequences. The Web interface allows for easy browsing of the site and fast access to data, as well as sequence analysis with generic and specific tools. BCL2DB provides a helpful and powerful tool to both ‘BCL-2-ologists’ and researchers working in the various fields of physiopathology. Database URL: http://bcl2db.ibcp.fr


Introduction
Two distinct groups of BCL-2-related proteins control the mitochondrial apoptotic pathway and probably other cellular processes as well (1,2). The first group is formed by a family of homologs related to BCL-2 by a common ancestry, and the second group comprises a heterogeneous collection of evolutionarily and structurally unrelated proteins characterized by the presence of a single short stretch of sequence similarity with BCL-2, termed the BH3 motif.
BCL-2 homologous proteins share a similar α-helical bundle fold (the 'BCL-2 domain'), have up to four different BH motifs (BH1-BH4) and can be either anti-apoptotic (e.g. BCL-2 and BCL-xL) or pro-apoptotic (e.g. Bax, Bak and Bid), whereas all of the BH3-only proteins are pro-apoptotic. Moreover, a variety of viral proteins have been found to be structurally similar to BCL-2 with or without obvious sequence similarity (3).
Since the discovery of the bcl-2 gene 30 years ago, intense research in various disciplines has exponentially increased the quantity of data available on the BCL-2 family and BH3-only proteins. Therefore, it is of considerable interest to use bioinformatic tools to (i) understand the various groups of proteins structurally or functionally linked to BCL-2 and their involvement in diseases and (ii) bring all the available information together in a specialized database [for which we have previously developed a prototype (4)]. We recently proposed a novel classification scheme for BCL-2-related proteins, based on phylogenetic information and computational analysis of sequence data (5,6). Here, we describe an enhanced version of the BCL-2 database, a computer-annotated sequence database dedicated to BCL-2 homologous and BH3-only proteins, as well as the integrated Web interface that provides easy and efficient access to the data.

The BCL2DB database
BCL2DB has been available since July 2013. Release 2 comprises 1039 entries, including 880 BCL-2 homologous proteins (655 encoded by metazoan genomes and 225 from viruses) and 159 BH3-only proteins. Based on our new classification scheme, we built an automated workflow to feed BCL2DB. The workflow relies on a set of specific profile HMMs (7) derived from 40 reference protein sequences representative of the various orthologous subgroups present within the BCL-2-like and BH3-only groups. This computational pipeline was able to identify both close and distant homologs of BCL-2 (including viral members) as well as the known repertoire of BH3-only proteins when searching the UniProt Knowledgebase (UniProtKB) (8). The identified sequences are then annotated to provide entries in the European Nucleotide Archive (ENA) (9) EMBL-Bank format, which are loaded into a PostgreSQL relational database management system. Finally, sequence data sets are extracted and multiple sequence alignments are computed together with associated data. BCL2DB is updated on a monthly basis. All the programs of the computational pipeline were written in Java, and SQL was used for database queries.

Identification of BCL-2 homologous sequences and BH3-only sequences
The FindBCL2 program (Figure 1A) identifies the sequences and provides two modes of execution: discovery and production. In the discovery mode, a profile HMM is computed (hmmbuild program of the HMMER package 3.0) for each reference sequence (of individual BCL-2 homologs or BH3-only proteins) from a multiple alignment of its closest homologous sequences, extracted after a BLAST search against UniProtKB with a score threshold tailored for each reference sequence. Each profile HMM is then used to search UniProtKB (hmmsearch program), and an E-value threshold is defined for use during the annotation process to classify the sequences into orthology groups (for BCL-2 homologous proteins) or clusters (for BH3-only proteins). The discovery mode is run periodically to improve the sensitivity of the profiles or when a new sequence is included in the seed set. The production mode is used to generate BCL2DB. The process starts by searching UniProtKB with the profile HMMs that were computed in the discovery mode. Then, for each selected sequence (E-value < 0.1), the Ensembl (10), Ensembl Genomes (11) or ENA entry is retrieved from UniProtKB cross-references or after a BLAST search. UniProtKB sequences corresponding to the same Ensembl or ENA entry are merged into a single entry. Unwanted annotations (i.e. uncertain, of poor quality or not conforming to the vocabulary standards) retrieved from Ensembl/ENA entries are then deleted to create a BCL2DB entry template that will be enriched with standardized data during the annotation procedure.
Figure 1. (A) The results are the profile HMMs and their associated classification E-value thresholds deduced after an HMM search against UniProtKB. The production mode used to generate the BCL2DB entry templates is described in the bottom part. After an hmmsearch on UniProtKB with the computed profile HMMs, the Ensembl or ENA entries are retrieved from cross-references or BLAST searches with nonfragment protein sequences and after removing duplicated sequences. Then, the entries are cleared of unwanted annotations and merged into a single one if they refer to the same Ensembl, Ensembl Genomes or ENA entry. (B) The AnnotateBCL2 process enriches each BCL2DB entry template with annotations from reference sequences, sequence classification information (protein/gene name and orthology group/cluster), location of BH motifs and structural data retrieved from the PDB.
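The selection and merging steps of the production mode can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the FindBCL2 code (which is written in Java): the `Hit` record, the `merge_hits_by_entry` function and all accessions in the usage example are hypothetical, and only the E-value cutoff of 0.1 is taken from the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

EVALUE_CUTOFF = 0.1  # selection cutoff stated in the text (E-value < 0.1)


@dataclass(frozen=True)
class Hit:
    """One profile HMM hit (hypothetical record, for illustration only)."""
    uniprot_acc: str             # UniProtKB accession of the matched sequence
    evalue: float                # E-value reported by the HMM search
    source_entry: Optional[str]  # cross-referenced Ensembl/ENA entry, if known


def merge_hits_by_entry(hits):
    """Keep hits below the cutoff and group them by their source entry."""
    merged = defaultdict(set)
    for hit in hits:
        if hit.evalue >= EVALUE_CUTOFF:
            continue  # not selected
        if hit.source_entry is None:
            continue  # the real pipeline falls back to a BLAST search here
        merged[hit.source_entry].add(hit.uniprot_acc)
    return {entry: sorted(accs) for entry, accs in merged.items()}


# Placeholder accessions and entry names, not real database identifiers.
hits = [
    Hit("ACC_A", 1e-40, "ENTRY_1"),
    Hit("ACC_B", 3e-35, "ENTRY_1"),  # same source entry as ACC_A: merged
    Hit("ACC_C", 0.5, "ENTRY_2"),    # above the cutoff: discarded
    Hit("ACC_D", 1e-5, "ENTRY_3"),
]
templates = merge_hits_by_entry(hits)
print(templates)
```

Each key of `templates` stands for one future BCL2DB entry template, and the associated list holds the UniProtKB sequences that were merged into it.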

Annotation procedure
The annotation procedure (AnnotateBCL2 program; Figure 1B) starts from the entry templates generated for sequences that belong to the group of BCL-2 homologs or BH3-only proteins. The annotation process automatically assigns each identified protein to its closest orthology group or cluster based on a curated gathering threshold that is specific to each profile. Above the threshold, the entry (typically a sequence from a nonmammalian organism) is considered 'unclassified'. Moreover, in-house BH1-4 motif profiles were developed (see below) for use in the computational annotation of BCL2DB sequences to pinpoint the positions of their respective BH region(s). Finally, the Protein Data Bank (PDB) (12) sequences are searched for known structures with the profile HMMs.
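The affiliation rule can be illustrated with a short Python sketch. It assumes that the thresholds are E-value cutoffs (as described for the discovery mode) and that the "closest" group is the one with the lowest E-value among the profiles whose threshold is satisfied; both the function name `classify` and the group names and threshold values below are placeholders, not the curated values used by AnnotateBCL2.

```python
def classify(evalues_by_group, thresholds):
    """Return the closest orthology group or cluster, or 'unclassified'.

    evalues_by_group: E-value of the sequence against each profile HMM.
    thresholds: per-profile E-value threshold (one value per group).
    """
    best_group = None
    best_evalue = None
    for group, evalue in evalues_by_group.items():
        if evalue > thresholds[group]:
            continue  # above this profile's threshold: not eligible
        if best_evalue is None or evalue < best_evalue:
            best_group, best_evalue = group, evalue
    return best_group if best_group is not None else "unclassified"


# Placeholder thresholds, chosen only to exercise the three cases below.
thresholds = {"group_1": 1e-20, "group_2": 1e-15}

clear_member = classify({"group_1": 1e-30, "group_2": 1e-18}, thresholds)
second_choice = classify({"group_1": 1e-10, "group_2": 1e-18}, thresholds)
distant = classify({"group_1": 1e-5, "group_2": 1e-3}, thresholds)
print(clear_member, second_choice, distant)
```

Keeping one threshold per profile, rather than a single global cutoff, lets each orthology group accept a different amount of sequence divergence.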

Entry content
The text format of a BCL2DB entry is an extension of the ENA EMBL-Bank format (14,15).
Figure caption (partial): (C) The multiple sequence alignment computed with MUSCLE and displayed in Clustal W format. The color code used is red, green, black for residues that are conserved, strongly similar, weakly similar and variable in the alignment column, respectively, as defined by Clustal W. Dashes indicate gaps. (D) Residue repertoire computed from the previous alignment with the same color code.
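Because BCL2DB entries extend the EMBL-Bank flat-file layout, they can in principle be read with a generic line-code parser. The Python sketch below relies only on the general EMBL-Bank conventions (a two-letter line code, content starting at column 6, `XX` spacer lines and a `//` terminator); it does not model any BCL2DB-specific extension, and the entry in the usage example is entirely hypothetical.

```python
def parse_flatfile(text):
    """Split EMBL-style flat-file text into entries keyed by line code."""
    entries = []
    current = {}
    for line in text.splitlines():
        if line.startswith("//"):  # entry terminator
            if current:
                entries.append(current)
            current = {}
            continue
        code = line[:2]
        if not code.strip() or code == "XX":
            continue  # sequence data line or spacer line
        current.setdefault(code, []).append(line[5:].strip())
    return entries


# Hypothetical entry, written only to illustrate the layout.
example = """\
ID   EXAMPLE1; SV 1; linear; mRNA; STD; HUM; 720 BP.
XX
AC   EXAMPLE1;
XX
DE   Hypothetical entry used only to illustrate the layout.
XX
FT   CDS             1..720
//
"""
entries = parse_flatfile(example)
print(entries[0]["AC"])
```

Each parsed entry is a dictionary mapping a line code such as `ID`, `AC`, `DE` or `FT` to the list of lines carrying that code, which is enough to locate the feature table where annotations are stored.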

Web interface
BCL2DB is publicly accessible through a Web site, which has been accessed 1556 times by 259 unique users since 26 July 2013 (representing 28 requests per day, excluding internet robots and counting only access to data Web pages). The Web-based interface allows easy browsing of the site and fast access to data through the menu bar and the set of buttons available.

BCL-2 menu
The Nomenclature submenu describes the terminology and classification used in BCL2DB with three protein subgroups (BCL-2 homologous, BH3-only proteins and structurally related proteins) composing the BCL-2 protein group. The Domain and Motif submenus list the clades, orthology groups and reference sequences existing within the three subgroups. For each reference sequence, the recommended protein and gene names, their synonyms, their primary function with regard to apoptosis, their accession number in BCL2DB and a literature citation with a link to PubMed are provided in a table (Figure 2). The Community submenu offers links to other Web resources related to cell death and apoptosis, as well as pointers to the community working in the field.

Data menu
The Data menu provides access to sequence and structure data updated at the time of the database release computation by means of tables available in the Web pages. The tables allow simple and fast access to the data by biologists.
Figure 4. Example of sequence analysis with BCL2DB sequences and the NPS@ server. (A) Results of a blastp search with the BCL-2 protein sequence (P10415) against the BCL2DB protein sequences. A first link (NPSA) is provided to extract the matching sequence from the sequence databank and perform further analyses with the set of tools available in NPS@. The second link on the sequence identifier is provided to view the BCL2DB entry (Figure 2B). The pairwise sequence alignment between the query sequence and the matching sequence can be viewed through the link on the E-value. (B) Partial view of the multiple sequence alignment of BCL2DB BCL-2 proteins in the region of the BH3 motif. The alignment was computed with sequences selected and extracted from the blastp results. The color code used is red, green, black for residues that are conserved, strongly similar, weakly similar and variable in the alignment column, respectively, as defined by Clustal W. Dashes indicate gaps.

Tools menu
The analysis tools provided with the database are categorized either as generic or specialized. The generic analysis tools are available through the NPS@ server (22), our integrated resource for sequence analysis. For instance, BCL2DB nucleotide and protein sequences can be searched with BLAST (23) and selected sequences can be extracted and aligned with Clustal W (24) (Figure 4). The Annotate specialized tool permits users to annotate their own protein sequences with the set of programs used to feed BCL2DB. Users can determine whether their sequence is predicted as belonging to either the BCL-2 homologous or BH3-only group, its classification according to the orthology groups or clusters and its BH motif arrangement. The Annotate main result page contains a summary table showing each input sequence listed with its classification, its name and a link to access the detailed result page (Figure 5). Information displayed in the latter page includes classification, sequence name and the protein annotations as described in the Entry Content paragraph.

BCL2DB menu
General information about the database is provided under this menu. The users have access to (i) the composition of the scientific advisory board (About submenu), (ii) a contact form to send messages to the BCL2DB team, (iii) the help about the Web interface, (iv) the news related to BCL2DB releases and changes, as well as Web site updates, and (v) the usage statistics.
Figure 5. (A) The main result page summarizes the submitted data and offers a table listing each input sequence (here, 18 sequences were uploaded) with its predicted protein name and its BH motif composition. A link to a detailed result page is provided on each sequence identifier when the sequence is annotated as belonging to the BCL-2 protein group. (B) The detailed result page for sequence MySeq02 displays the predicted protein name, the classification, the BH motif composition and the homologous known 3D structures. Numerous links allow the user to (i) download the sequences corresponding to the various annotations, (ii) download a UniProtKB formatted entry of the annotated sequence and (iii) browse the structure entry at the PDB Web site. The submitted sequence is also displayed with colored BH motifs for easy cut-and-paste to other programs.