MobiDB: intrinsically disordered proteins in 2021

Abstract The MobiDB database (URL: https://mobidb.org/) provides predictions and annotations for intrinsically disordered proteins. Here, we report recent developments implemented in MobiDB version 4, regarding the database format, with novel types of annotations and an improved update process. The new website includes a re-designed user interface, a more effective search engine and advanced API for programmatic access. The new database schema gives more flexibility for the users, as well as simplifying the maintenance and updates. In addition, the new entry page provides more visualisation tools including customizable feature viewer and graphs of the residue contact maps. MobiDB v4 annotates the binding modes of disordered proteins, whether they undergo disorder-to-order transitions or remain disordered in the bound state. In addition, disordered regions undergoing liquid-liquid phase separation or post-translational modifications are defined. The integrated information is presented in a simplified interface, which enables faster searches and allows large customized datasets to be downloaded in TSV, Fasta or JSON formats. An alternative advanced interface allows users to drill deeper into features of interest. A new statistics page provides information at database and proteome levels. The new MobiDB version presents state-of-the-art knowledge on disordered proteins and improves data accessibility for both computational and experimental users.


INTRODUCTION
Intrinsically disordered regions (IDRs) of proteins do not adopt a highly populated structure in isolation, but sample a wide range of conformations. IDR-mediated interactions are implicated in a variety of cellular processes from signal transduction and liquid-liquid phase transition (1)(2)(3)(4)(5). IDRs are subjected to extensive pre-and post-translational regulation to modulate protein function in response to cellular stimuli (6)(7)(8)(9). Many functions of IDRs, such as entropic springs, flexible linkers or spacers are directly associated with their structural attributes (10,11).
Proteins containing intrinsically disordered regions are present in a considerable fraction of the proteome of eukaryotic organisms (e.g. 43.6% in humans based on MobiDB-lite predictions). In the human proteome, for example, IDRs are highly enriched in interaction interfaces and post-translational modification sites, which serve as regulatory switches of a variety of biochemical pathways (12). Short linear motifs in the IDRs of viruses lead to functional promiscuity and enable compact viral proteomes to extensively rewire their host cells (13). Given the lack of a requirement for a stable globular fold, interfaces in IDRs also appear to have higher evolutionary plasticity allowing them to proliferate in a proteome by ex nihilo evolution. As a result, IDR interfaces are possibly more widespread than ordered interfaces (14). Despite their central roles in key cellular regulatory processes, only a small fraction of disordered regions have been experimentally characterised (15,16). While difficulties in protein expression, purification and structural characterisation hamper experimental characterization, assigning functional modules to dynamic conformational ensembles presents a technical problem for a database.
D362 Nucleic Acids Research, 2021, Vol. 49, Database issue Manually curated databases, such as DisProt (17,18) and IDEAL (19), collect annotations from the literature and standardise that information. These databases assign the position of IDRs in the sequence, and aim to assemble disorder-related functional annotations. Other databases focus on functional features: DIBS (15) and MFIB (20) collect regions undergoing folding upon binding, and ELM (16) annotates linear motifs. IDRs forming fuzzy complexes are collected in the FuzDB database (21), which also aims to establish links between structural properties and function. Proteins reported to undergo liquid-liquid phase separation (LLPS) typically contain IDRs, which are available in PhasePro (22), PhaSepDB (23) and LLPSDB (24). Altogether, these specialized datasets exhibit a limited overlap, which presents a bottleneck for comprehensive characterisation of the IDRs contained therein (25).
Large-scale identification of IDRs often depends on primary deposition databases of structural data, like the PDB (26), which can be used to indirectly derive disorder information from missing and mobile residues. Although annotation based on this information is less reliable compared to curated resources, it provides a larger set of examples (27,28). A complementary source of information is provided by sequence based predictors, which typically exploit amino acid biases in regions of proteins to identify IDRs (29). Deriving information from these methods requires a critical assessment of the different techniques, which has been implemented in independent blind tests and in the Critical Assessment of Intrinsic protein Disorder (CAID). These efforts, in particular CAID, enable integrating large-scale predicted information on disordered regions into other core data resources such as InterPro (30), UniProtKB (31) and PDBe (26) via MobiDB-lite (32).
In contrast to structural data, functional annotations related to disordered regions are underrepresented in public databases and ontologies (33). This presents a bottleneck for large-scale functional assignment of disordered regions. Functional predictions are in essence limited to regions that fold upon binding, e.g. ANCHOR (34), DISOPRED3 (35) and MoRFCHiBi (66). Recently, prediction of regions that remain disordered upon binding (36) and of regions that exhibit fuzzy binding (37) became available. A new generation of disorder function predictors is also being developed (36) based on the Critical Assessment of Function Annotation challenge (CAFA) (37) and DisProt-based Disorder Ontology (17).
The objective of the new MobiDB (version 4) is to provide stable and sustainable access to data on disordered regions and proteins. As an ELIXIR resource, this is in line with the overall ELIXIR mission and complies with the goals described in the ELIXIR IDP community whitepaper (38). Sustainability has been improved primarily by optimizing the database schema, and by improving the update process and simplifying the data representation formats. The new version incorporates additional curated data from specialized databases and presents novel annotation features including the binding modes derived from processing PDB data as well as an extended list of predictors. It is now possible to download search results and entire genomes with the possibility to specify the format and selected features. A novel statistics page allows the users to compare annotation coverage of different types of annotations and compare the annotation content at the proteome level for all reference proteomes.

Database content
MobiDB data serves both experimental scientists, who are interested in different aspects of disordered protein regions as well as bioinformaticians, who develop softwares for analysis and prediction of disordered regions. In the new version of the database, several new data resources and features were added to increase the coverage and usability of information about protein disorder. Figure 1 gives an overview of the annotations and services provided by MobiDB. All curated disordered regions are mapped across homologs obtained from GeneTree alignments (39) with >80% of sequence similarity and alignment lengths of 10 residues. This expands the curated set of proteins ten fold. Structural and functional properties of disordered regions are based on third party databases and a set of prediction methods.
The full list of software and resources used in MobiDB is provided in Tables 1 and 2. The different types of annotations and predictions are assembled to provide the user a comprehensive view of properties of disordered regions at the residue level. As compared to the previous releases, MobiDB v4 provides annotations on novel functional aspects of disordered proteins, such as binding modes (36), involvement in post-translational modifications and phase separation (3,4,7).
Disorder predictions are provided via the MobiDB-lite software (32) over the entire UniProtKB set of protein sequences. In addition, MobiDB calculates annotations from PDB structures, based on manually curated annotations from specific third party databases and propagating manual curation by homology. MobiDB-lite is also integrated for example into InterProScan (30) which propagates its predictions onto several other EBI resources like UniProtKB, InterPro and PDBe.

Feature classification in MobiDB v4
Different levels of reliability and different features are reported as different and independent annotations. Each MobiDB annotation is identified by an 'evidence-featuresource' triplet which uniquely identifies one level of quality, type and source of annotation. Possible values for 'evidence' are: (i) 'curated', the annotation comes from a manually curated database; (ii) 'derived', the annotation is inferred automatically processing primary data, for example PDB structures are processed to detect missing, mobile and lip residues, (iii) 'homology', annotations are obtained aligning curated data and (iv) 'prediction', annotations are automatically extracted from the protein sequence. The 'feature' element of the triplet represents the type (or flavour) of the annotation, for example 'disorder', 'missing residues', 'binding mode', 'lip', etc. The 'source' element represents the annotation source which can be the name of a software, a third party database or the type of consensus. Possible values are: 'mobidb lite', 'disprot', 'merged', 'th 90', etc. The  full list and explanation of all available triplets is provided on the about page of the website.

Novel functional annotations in MobiDB v4
The main difference in MobiDB v4 as compared to the previous releases of MobiDB is the addition of novel functional aspects related to disordered regions, which are listed below and with a white background on the first block of Figure 1. These contribute to establishing links between conformational ensembles and their biological roles.

Binding modes
Binding modes of disordered regions refer to the conformational transitions of IDRs upon interacting with spe-cific partners. IDRs can undergo disorder-to-order transitions and fold upon the template, or remain disordered (disorder-to-disorder transition) in a partner-bound form (36). Some IDRs exhibit both behaviors, and exhibit context-dependent binding with different partners or cellular conditions (37). These binding modes were assigned from collecting all bound-state experimental evidence from PDB, as defined in (36).
MobiDB implementation identifies inter-chain binding regions and the mono-/multimeric state using the RING software (40). RING generates residue level contactnetworks by identifying all types of non-covalent interactions at atomic level in a protein structure (PDB). The residue interaction network generated by RING is combined with data on disorder derived from missing D364 Nucleic Acids Research, 2021, Vol. 49, Database issue residues. Comparison of disordered residues in free and bound form of the protein is used to define the different binding modes: disorder-to-order, disorder-to-disorder and context-dependent binding.

Linear interacting peptides (LIPs)
Interactions of disordered regions are realized via short peptide motifs (41). These linear motifs can be collected from the sequence within the Eukaryotic Linear Motif database (16) or predicted using automatic methods such as ANCHOR (42). MobiDB provides a different approach, by exploiting a new software, Fast Linear Interacting Peptides Predictor (FLIPPER), which is able to detect linear interacting peptides (LIPs) from protein structures (PDB).

Post-translational modifications
Posttranslational modifications of disordered regions serve as regulatory points of many biochemical pathways (44). Therefore annotating PTM sites is crucial for functional annotation of disordered regions. Data on all types of PTMs are derived from UniProtKB which propagates a small set of well annotated modifications by sequence similarity. Additional phosphorylation sites are provided from the Scop3P database. The Scop3P sites are extracted by large scale re-processing of 36 projects from the PRIDE database (45) with the ionbot approach (https://ionbot.cloud/).

Liquid-liquid phase separation
Regions forming dynamic liquid droplets are a new facet of disordered proteins (46). Recent data indicates that weak interactions involving disordered regions often govern protein phase separation (47). Although the general biological roles of phase separation remain to be elucidated, systematic annotation of regions undergoing LLPS will contribute to elucidating the underlying sequence-codes.
Regions associated with phase separation processes in MobiDB are derived from the PhasePro database (22). This dataset not only assembles proteins, which were reported to phase separate in vitro or in vivo, but also specifies those regions, which mediate this process.

Data generation pipeline and updates
From the technical point of view, the major change is the new format of the entry document in the database. Multiclass annotations, e.g. secondary structure, have been converted to binary classifications by splitting classes into different fields. The grouping of different features on a common field is deprecated since it encodes an arbitrary interpretation of the type of prediction directly in the entry document. Now all predictions appear at the same level and have the same format. It is now possible to integrate new annotations-without refactoring the entry document. This strategy provides a greater flexibility simplifying updates and database maintenance. This enables us to keep MobiDB annotations up to date with UniProtKB releases, with a minimal delay of a few weeks. For new UniProtKB releases MobiDB generates predictions for new sequences by running MobiDB-lite and other tools, updates protein metadata (UniRef IDs, taxonomy, etc.) and removes obsolete entries.

MobiDB website
The MobiDB website has been completely redesigned in order to improve user experience and satisfy both general use and detailed computational analyses. The graphical user interface retrieves data from a public API which is now well documented and will remain stable for several years.

Entry page
The entry page has been completely renewed as compared to MobiDB v3 and is now divided in two views: simple and advanced. The simple view of the entry page offers an overview of the entry, focusing on a few annotations. Switching to the advanced view, the user accesses the full collection of data associated with an entry, with dedicated Nucleic Acids Research, 2021, Vol. 49, Database issue D365 tools for their visualization. In particular, both views revolve around the feature-viewer (48) to organize and dynamically report annotations along the sequence. An interactable network map and a Mol* instance (49) allows an integrated in-depth exploration of residue contacts and structural features.
The feature-viewer combines annotations into a tree of tracks. Parent tracks summarize the information of children tracks by different consensus strategies. Some tracks, e.g. MobiDB-lite tracks are automatically expanded on page load to highlight disordered sub-regions. For each track a squared icon indicates the evidence quality level of the annotation. When a track has children its name gets bold on mouse over and the evidence icon is outlined. On the advanced view, it is possible to customize the feature viewer by hiding specific tracks. By clicking on a region, it is possible to visualize the corresponding annotation in the PDB structure and on the contacts network.

Statistics page
MobiDB v4 has a new statistics page which provides information about all types of annotations for the entire database and for all proteomes separately. The statistics page provides coverage of a given feature both at the database and proteome level. A search bar on the top allows searching for an organism by providing the identifier of the reference proteome or a string which is evaluated against the organism name and kingdom.

Searching and downloading data
The browse page is designed upon the MobiDB server API. MobiDB provides programmatic access to perform a search through a RESTful web service API. A single entry can be retrieved by using UniProtKB identifiers, while database searches can be performed by specifying query fields directly as URL parameters in the HTTP request. Protein metadata (UniProtKb accession, gene, organism, NCBI taxon ID, etc.) can be searched for exact matches, whereas the disorder content and sequence length accept a range of values. In the case of a free text search (free text parameter) the server uses the EBI Proteins API (50) to retrieve proteins containing the input string in the protein name. In that case the result order is provided by the UniProtKB annotation score (https://www.uniprot.org/ help/annotation score). Both from the browse page and from the API it is possible to download the full search result in different formats, TSV, FASTA and JSON. Entire genomes can be searched and downloaded in minutes.

BioSchemas
To increase MobiDB interoperability and findability, key web pages are decorated with BioSchemas markup (https: //bioschemas.org) (51) developed within ELIXIR, the European infrastructure for biological data. BioSchemas extends Schema.org (https://schema.org) for life sciences making the data contained in the database indexable by search engines and other services such as the Google dataset search tool. MobiDB uses DataCatalog and Dataset profiles in the main page, and DataRecord for the entry pages.
The DataRecord includes Protein and SequenceAnnotation profiles. These profiles were recently implemented in Mo-biDB which makes the database, together with DisProt, among the first resources exposing sequence region information using BioSchemas.org.

CONCLUSIONS AND FUTURE WORK
The rapidly accumulating experimental data on the structure and function of disordered protein regions highlights the importance of systematic analysis of protein functions related to protein disorder. MobiDBv4 provides a major improvement in this respect as compared to previous releases by adding descriptions of functional aspects of disorder, such detailed information about different binding modes, including both disorder-to-order and fuzzy binding; posttranslational modifications; and regions associated with phase separation processes. Interpretation of residue interaction networks is also facilitated by visualising contact maps in an interactive manner. The database data and programming infrastructure has also been upgraded, facilitating systematic analysis, searches and updates. The new website provides more comprehensive information of each entry in a convenient format for both experimental and computational users.
MobiDB aims to provide gold standard disorder predictions, learning from the results of the Critical Assessment of Intrinsic protein Disorder (CAID, http://idpcentral.org/ caid) and the Disorder Ontology prediction challenge at the Critical Assessment of protein Function Annotation (CAFA, https://www.biofunctionprediction.org/cafa). These initiatives will further contribute to the improvement of the disorder predictions available in MobiDB. MobiDB predictions are also integrated into several EBI resources including InterPro, UniProtKB and PDBe. MobiDB is a central resource for the IDP community in ELIXIR.
Thanks to the statistics page, for the first time MobiDB allows to answer questions like for example which is the fraction of disordered residues in a given organism, which is the fraction of residues annotated by the majority of predictors or how many proteins have at least one disordered region.
Further developments of the database aim to improve predictions related to protein disorder. On the one hand, the sequence-based predictions of disordered regions have to be standardised and improved, including the predictions of regions, which can undergo liquid-liquid phase separation. These predictions can be facilitated by evolutionary analysis of disordered regions and vice versa prediction of disordered regions can facilitate evolutionary analysis of proteins with low degrees of sequence similarities (52,53).
The other major line of development is functional predictions for disordered regions, including their roles in interaction networks, conformational switches and organisers of liquid droplets. The novel annotations implemented in MobiDB v4 contribute to these efforts. These functional annotations will also be integrated into CAID and CAFA challenges, and will be related to GO classifications. In the present form, MobiDB provides up-to-date knowledge on disordered proteins and regions, which can be widely used