The eukaryotic linear motif resource – 2018 update

Abstract Short linear motifs (SLiMs) are protein binding modules that play major roles in almost all cellular processes. SLiMs are short, often highly degenerate, difficult to characterize and hard to detect. The eukaryotic linear motif (ELM) resource (elm.eu.org) is dedicated to SLiMs. It consists of a manually curated database of over 275 motif classes and over 3000 motif instances, and a pipeline to discover candidate SLiMs in protein sequences. For 15 years, ELM has been one of the major resources for motif research. In this database update, we present the latest additions to the database, including 32 new motif classes, and new features including UniProt and Reactome integration. Finally, to help provide cellular context, we present some biological insights about SLiMs in the cell cycle, as targets for bacterial pathogenicity and in the human kinome.


INTRODUCTION
Short linear motifs (SLiMs) are small functional protein modules that mediate protein-protein interactions and protein sequence modifications (1,2). They play essential roles in almost all cellular processes, including cell signaling, trafficking, protein stability, cell-cycle progression and molecular switching mechanisms (2-5). SLiMs are also increasingly recognized as important in human disease: they contribute to viral pathogenicity (6) and are emerging as major players in cancer, especially the degron class of motifs (7,8).
SLiMs are short degenerate sequences, generally between 3 and 15 amino acids in length, and are typically formed by a few highly conserved residues located between more loosely conserved positions (1). As a result, an individual motif binds with relatively weak affinity, usually in the low micromolar range. However, multiple SLiMs often cooperate to create strong yet dynamic interfaces. They generally occur in intrinsically disordered regions and, in the absence of a binding partner, have no stable three-dimensional structure. Although SLiMs are short and mostly participate in transient interactions, they are essential to a protein's binding specificity and proper functioning. Current estimates suggest there may be on the order of 1 000 000 different SLiMs in the human proteome (9). However, despite their abundance and importance, far fewer have been properly described. The eukaryotic linear motif (ELM) resource is a project dedicated to cataloging, characterizing and identifying these motifs.

THE ELM RESOURCE
The ELM resource is a database and web server focused on SLiMs (elm.eu.org). ELM was first released in 2003 (10), and has grown into one of the most widely used and reliable resources for high quality SLiM annotations, mostly focused on, but not limited to, eukaryotic proteins (11-13). The resource consists of two main components: a manually curated database of SLiM definitions and an exploratory pipeline which uses these definitions to look for putative SLiMs in protein sequences.
The database component of ELM contains manually curated characterizations of over 275 SLiMs contributed by our community of biologists and biocurators. Each SLiM (named 'motif class' in ELM) is defined using a regular expression, a computational syntax that can express complex patterns of letters (here, single-letter amino acid abbreviations). For example, the regular expression ...[ST]P[RK] describes an amino acid sequence that starts with any three amino acids (...), followed by either a serine (S) or a threonine (T), then a proline (P), and ending in an arginine (R) or a lysine (K). Curators annotate each motif class based on experimentally validated motif occurrences (named 'motif instances' in ELM) from the scientific literature. Each motif class annotation is accompanied by a detailed description, links to the original studies and crosslinks to external databases and ontologies including the Gene Ontology (14), the Proteomics Standards Initiative-Molecular Interaction (PSI-MI; (15)), the NCBI taxonomy (16,17) and the Protein Data Bank (PDB; (18)).
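As a minimal sketch of how such a regular expression behaves, the following Python snippet scans a sequence for the pattern ...[ST]P[RK] quoted above. The function name and the toy sequence are made up for illustration, and a lookahead is used so that overlapping matches are also reported; this is not the ELM implementation.

```python
import re

# Pattern quoted in the text: any three residues, then S or T, then P, then R or K.
PATTERN = "...[ST]P[RK]"

def find_motif_matches(sequence, pattern=PATTERN):
    """Return (start, end, matched_text) for every match, with 1-based inclusive positions.

    The pattern is wrapped in a zero-width lookahead so overlapping matches are found too.
    """
    regex = re.compile(f"(?=({pattern}))")
    hits = []
    for m in regex.finditer(sequence):
        matched = m.group(1)
        start = m.start() + 1            # 0-based offset -> 1-based position
        end = m.start() + len(matched)   # inclusive end position
        hits.append((start, end, matched))
    return hits

# Hypothetical toy sequence, not a real protein.
toy_sequence = "MAAGSPKLLDETPRQ"
for hit in find_motif_matches(toy_sequence):
    print(hit)
# (2, 7, 'AAGSPK')
# (9, 14, 'LDETPR')
```

Both hits satisfy the pattern only because of the [ST]P[RK] core; the three leading wildcard positions accept any residue, which illustrates why such degenerate patterns match so frequently.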
The ELM exploration pipeline is used to detect matches to SLiMs in protein sequences. When a user submits a sequence, it is matched against all regular expressions annotated in the ELM database. Since SLiM patterns are short and often highly degenerate, pattern matching alone is likely to generate many false positive predictions. Matches likely to be non-functional are therefore filtered out by applying structure and domain architecture filters based on protein disorder (from GlobPlot (19) and IUPred (20)), protein secondary structure (21) and protein domains (from SMART (22) and Pfam (23)). The result contains putative SLiMs located in disordered regions that are accessible for binding interactions. Each motif occurrence is also given a conservation score that reflects how conserved its sequence is across aligned homologous proteins. The results of the ELM exploration pipeline are a useful starting point for inferring possible functions of a protein and selecting novel candidates for further examination with other bioinformatics resources. For example, the context of motifs in a sequence alignment and other information such as intrinsic disorder prediction and disease mutations can be visualized with ProViz (24). To follow up on interesting individual motifs, a tool such as SLiMSearch (25) can query protein databases, providing a ranking for the protein of interest relative to other proteins containing the motif.
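The idea of filtering pattern matches by structural context can be sketched as follows, under strong simplifying assumptions. Here per-residue disorder scores are supplied by the caller (for example from a disorder predictor), and a match is kept only if its mean score reaches a cutoff. The 0.5 cutoff, the function name and the toy data are assumptions for illustration; the actual ELM filters also use secondary structure and domain information and are not reproduced here.

```python
import re

def matches_in_disorder(sequence, pattern, disorder_scores, cutoff=0.5):
    """Keep regex matches whose mean per-residue disorder score is >= cutoff.

    disorder_scores: one float per residue (hypothetical input from a predictor).
    Returns (start, end, matched_text, mean_score) tuples with 1-based positions.
    """
    if len(disorder_scores) != len(sequence):
        raise ValueError("need exactly one disorder score per residue")
    kept = []
    for m in re.finditer(pattern, sequence):
        scores = disorder_scores[m.start():m.end()]
        if not scores:  # skip empty matches
            continue
        mean_score = sum(scores) / len(scores)
        if mean_score >= cutoff:
            kept.append((m.start() + 1, m.end(), m.group(), mean_score))
    return kept

# Toy example: the first 8 residues are treated as ordered, the rest as disordered.
toy_sequence = "MAAGSPKLLDETPRQ"
toy_scores = [0.2] * 8 + [0.8] * 7
for hit in matches_in_disorder(toy_sequence, "[ST]P[RK]", toy_scores):
    print(hit[:3])
```

In this toy input the SPK match falls in the low-scoring region and is discarded, while the TPR match in the high-scoring region is retained.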

New ELM classes
The main type of data curated in ELM is the motif class. Each motif class consists of the SLiM name and description, its regular expression, and the complete set of motif instances and experimental data used to define the class. Currently there are over 275 motif classes, 32 more than in the last NAR publication in 2016 (13) (Figure 1 and Table 1). Most notable are six variants of the mitogen-activated protein kinase (MAPK) docking D-motifs, together with additions and improvements to cell cycle regulatory motifs, including relevant degrons and kinases such as the Polo-like kinases (Plks). We have tried to be comprehensive for degron motifs (recently reviewed in (8)); the most recent addition is the pLxIS motif, which is involved in the immune response of interferon-regulatory factor IRF3 but has degron-like properties when hijacked by rotavirus (26). Another example of a hijacked motif is the tyrosine-kinase regulating motif EPIYA, which is a common motif mimic used by pathogenic bacteria. In addition, several existing motif entries have been redefined or expanded, including recent updates to the abundant and versatile class of 14-3-3-binding motifs and to the LxCxE motif, which binds the cell cycle checkpoint retinoblastoma protein pRb.

New ELM instances
One of the principal types of data contained in ELM is the motif instance, i.e. an experimentally validated occurrence of a motif class in a protein. As of September 2017, ELM has 3093 instances; 491 new motif instances have been added since the last NAR database publication (13) and many existing entries have been updated (Figure 1). As in previous years, the majority of new motif instances are from human and other animal proteins, although there has been a large increase in the number of viral motif mimics, and we have begun adding instances of bacterial motif mimicry from a systematic review of the literature (Table 2).

NEW FEATURES IN ELM
In this release, we have further integrated ELM with other bioinformatics databases and resources. An important development is that UniProt (27) now includes ELM as a database cross-reference in the 'protein-protein interaction databases' section. We have also updated the experimental evidence codes used in ELM to the latest version of PSI-MI, version 2.5 (15). The most notable changes in PSI-MI are that terms 'GST-pulldown' and 'HIS-pulldown' have each been demerged into a combination of terms: 'glutathione s transferase tag' and 'pull down', and 'his tag' and 'pull down'. We have also integrated ELM with the Reactome pathway database (28), and introduced programmatic access to the ELM exploration pipeline, both of which we describe below in more detail.

Table 1. Since the last NAR database issue publication (13), 32 motif classes have been annotated to the database.
ELM class identifier | Instances | Description
... | ... | Canonical Arg-containing phospho-motif mediating a strong interaction with 14-3-3 proteins.
LIG_14-3-3_ChREBP_3 | 1 | 14-3-3 protein binding to a nonphosphorylated helical peptide in ChREBP is promoted by adenosine monophosphate.
LIG_14-3-3_CterR_2 | 5 | C-terminal Arg-containing phospho-motif mediating a strong interaction with 14-3-3 proteins.
LIG_ANK_PxLPxL_1 | 10 | The consensus PxLPxI/L motif, which can be found in diverse proteins, binds to the ankyrin repeat domains of ANKRA2 and its close paralog RFXANK.
LIG_APCC_ABBA_1 | 11 | Amphipathic motif that is involved in APC/C inhibition by binding of CDH1/CDC20. In metazoan cyclin A, the motif also acts as a degron, enabling the cyclin's degradation in prometaphase.
LIG_APCC_ABBAyCdc20_2 | 2 | Amphipathic motif that binds to yeast Cdc20 and acts as an APC/C degron enabling cyclin Clb5 degradation during mitosis.
LIG_BH_BH3_1 | 19 | The BH3 motif is found in pro-apoptotic proteins and interacts with BH domains of the anti-apoptotic Bcl-2 family members to regulate apoptosis.

Reactome
One way to gain additional insight into the biological processes in which a SLiM may be involved is to examine the cellular pathways that contain proteins with this motif. ELM already has links to pathways contained in the KEGG pathway database (12,29). To extend the cellular network knowledge available in ELM, we have now integrated ELM with another pathway database: Reactome. Reactome is a manually curated, peer-reviewed pathway database (28). Pathways are defined by reactions and the entities participating in them (nucleic acids, proteins, complexes and small molecules), and are supported by literature citations and expert curation. It is now possible to visualize and download all Reactome annotations for proteins available in ELM. Every protein in ELM that has a Reactome annotation now has a link to display a Reactome pathway diagram that highlights where this protein functions. The complete list of Reactome annotations can also be retrieved from the ELM downloads page. Later in this paper, we illustrate how the ELM annotated Reactome data can be used to analyze the motifs involved in the cell cycle.

Table 2. Since the last NAR database issue publication in 2016 (13)
Motif type | Added | Modified
DEG | 1 | 1
CLV | 0 | 1
TRG | 0 | 0
LIG | 19 | 9
MOD | 5 | 2
DOC | 7 | 5

Taxon | Motif instances added | Motif instances modified
Human | 315 | 10
other Animal | 87 | 2
Fungi | 17 | 0
Plant | 10 | 3
Bacteria | 23 | 0
Virus | 39 | 0

The ELM API
The ELM exploration pipeline is a useful tool to predict putative SLiMs in protein sequences. Nevertheless, the graphical user interface is not suitable for automated or large scale analyses. One of the latest updates to the ELM resource has been to include an application programming interface (API) to the ELM exploration pipeline (30). The ELM exploration API allows users to submit either a protein sequence or a UniProt ID to predict which SLiMs might exist in it. The protein sequence is matched against all of the regular expressions annotated in ELM and each motif match is passed through a combination of structural context filters, which help to predict whether the motif is likely to be biologically functional. Motif matches are filtered out of the predicted motifs if they occur in globular domains, transmembrane regions or extracellular regions. The API also reports whether any of the detected motifs are already annotated in ELM, or whether the motif has been annotated in a homolog in ELM. The output is provided as a TSV (tab-separated values) file, which is easy to read and analyze computationally. The API can be accessed using any programming framework that can process HTTP requests, for example wget, curl and the Python 'requests' package. For more information on using the API, usage guidelines and how to interpret the results, see (30) and the documentation at elm.eu.org/api/manual.html.
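The following is a hedged sketch of how such a tab-separated result could be retrieved and parsed from Python using only the standard library (urllib is used here instead of the 'requests' package mentioned above). The URL template, the function names and the column names in the sample are assumptions for illustration; the actual endpoint and output columns should be taken from elm.eu.org/api/manual.html.

```python
import csv
import io
import urllib.parse
import urllib.request

# ASSUMPTION: illustrative URL template only; confirm the real endpoint in the API manual.
ASSUMED_URL_TEMPLATE = "http://elm.eu.org/start_search/{query}.tsv"

def build_query_url(query, template=ASSUMED_URL_TEMPLATE):
    """Insert a UniProt ID or a raw sequence into the (assumed) URL template."""
    return template.format(query=urllib.parse.quote(query, safe=""))

def parse_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]

def fetch_predictions(query, timeout=60):
    """Download and parse predictions for one query (performs a network request)."""
    with urllib.request.urlopen(build_query_url(query), timeout=timeout) as response:
        return parse_tsv(response.read().decode("utf-8"))

# Offline demonstration with a made-up result; the column names are hypothetical.
sample = "motif\tstart\tstop\nEXAMPLE_CLASS_1\t12\t17\n"
for row in parse_tsv(sample):
    print(row["motif"], row["start"], row["stop"])
```

Keeping the parsing step separate from the download makes it possible to test downstream analysis on saved files, and to respect the usage guidelines by fetching each query only once.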

MOTIFS IN BACTERIAL PATHOGENS
Motifs are not unique to eukaryotes; they also exist in bacteria and viruses. It has been known for some time that viruses use motif mimicry to interfere with biological processes of the host cell (6). This behavior is not limited to viruses, but the data for pathogenic bacteria are more limited (31,32). In the latest version of ELM, we report instances from a handful of bacteria that are now known to use motif mimicry for pathogenicity. Among the bacterial proteins with newly added motifs are OspF from Shigella flexneri and SpvC from Salmonella Typhimurium, which use a D-motif to recognize MAPK proteins like ERK, JNK or p38 and irreversibly modify a phosphorylated residue to block downstream MAPK signaling, thus preventing the activation of the immune response (33,34). Enterohaemorrhagic Escherichia coli uses the multivalency of a GBD domain-binding motif to activate up to seven WASP proteins with a single effector protein, EspFU (35,36). The same protein has five tandem PxxP motifs that bind to the SH3 domain of BAIAP2L1/IRTKS with the highest reported affinity for a motif-SH3 complex (500 nM) (35). Finally, the tyrosine-phosphorylated EPIYA motif present in the cellular protein Pragmin is also used by CagA from Helicobacter pylori and LspA1 from Haemophilus ducreyi to recruit CSK and phosphorylate Src-family kinases (37,38), interfering with cell fate and phagocytosis (39,40). Besides its role in pathogenicity, motif mimicry by bacteria also has implications for bacterial oncogenicity, such as the oncogenic potential of H. pylori strains (41).

MOTIFS IN THE CELL CYCLE
One of the new features included in ELM is the set of protein pathway annotations from the Reactome database (28), which can be downloaded from the ELM 'downloads' page. These annotations allow the construction of network diagrams to examine the roles of motifs within any signaling pathway in Reactome. As an example, we have annotated the motifs present in the cell cycle (R-HSA-1640170) (Figure 2A, created with Cytoscape (42)), which consists of 610 proteins, 199 of which have motifs annotated in ELM. In Figure 2B, we highlight the 20 proteins involved in the mitotic cell cycle checkpoint (R-HSA-69618), almost all of which have multiple SLiMs (Figure 2A). Degradation motifs recognized by the APC/C complex, an E3 ubiquitin ligase, as well as LIG_MAD2 motifs, are prominent in these checkpoint proteins. Many proteins involved in the cell cycle contain one or more functionally important linear motifs, and combining SLiM annotation with pathway information will help unravel the roles SLiMs play in the cell.
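A count such as "199 of 610 proteins" (about 33%) can be obtained by joining pathway membership with motif annotations. The sketch below shows one way to do this, assuming the two data sets have already been loaded into plain dictionaries; the function name, the data layout and all identifiers in the toy data are hypothetical and do not reflect the format of the ELM download files.

```python
def motif_coverage(pathway_members, protein_motifs):
    """For each pathway, count member proteins that have at least one annotated motif.

    pathway_members: {pathway_id: set of protein accessions}
    protein_motifs:  {protein accession: list of motif class names}
    Returns {pathway_id: (n_with_motifs, n_total, fraction)}.
    """
    coverage = {}
    for pathway_id, proteins in pathway_members.items():
        annotated = {p for p in proteins if protein_motifs.get(p)}
        total = len(proteins)
        fraction = len(annotated) / total if total else 0.0
        coverage[pathway_id] = (len(annotated), total, fraction)
    return coverage

# Toy data with made-up identifiers.
toy_pathways = {"PATHWAY_X": {"P1", "P2", "P3", "P4"}}
toy_motifs = {"P1": ["MOTIF_A"], "P3": ["MOTIF_A", "MOTIF_B"], "P9": ["MOTIF_C"]}
print(motif_coverage(toy_pathways, toy_motifs))
# {'PATHWAY_X': (2, 4, 0.5)}
```

The same per-protein motif lists can then be attached as node attributes when building a network diagram in a tool such as Cytoscape.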

THE KINOME IN ELM
In this release, we report an expansion of the portion of the human kinome annotated in ELM, including new motif entries for CDKs (not discussed in this article), MAPKs and Plks.
MAPKs form an important part of conserved signaling pathways involved in processes such as cell division, differentiation, growth and apoptosis (43-45). MAPKs are serine/threonine kinases that recognize substrates by the [ST]P motif, and for specificity rely on additional motifs (for example D-motifs) to bring the kinase and its substrate close together for phosphorylation. These docking motifs harbor one or two basic residues, a variable linker segment and usually three hydrophobic amino acids. Interestingly, the motif orientation can be from the N- to C-terminus, where charged residues are followed by linker and hydrophobic residues (for example DOC_MAPK_NFAT4_5, Figure 3A, produced using Chimera (46)), or from the C- to N-terminus, where hydrophobic residues precede the charged amino acids (e.g. Figure 3B).
Plks are central to the cell cycle and are often found restricted to cellular locations involved in mitosis (such as centrosomes, kinetochores and the spindle) (47). Humans have four functional Plks. The C-terminal parts of Plks 1-3 have two polo-box domains that help target and recruit the kinase substrates by recognizing the short consensus sequence (S[ST]) which, when phosphorylated on the second residue, acts as a docking/activation site (48,49). Specificity is conferred by each Plk's specific target motif: Plk1 requires an Asp or Glu two positions before the phosphosite, Plks 2 and 3 require an Asp or Glu either two positions before or after the phosphosite, and Plk4 has a varied motif requirement where hydrophobic residues are strongly favored after the phosphosite consensus sequence.
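To make the sequence requirements described above concrete, the sketch below scans a sequence with a few deliberately simplified patterns that paraphrase the prose: the MAPK phosphosite [ST]P, the polo-box docking consensus S[ST], and an acidic residue two positions before a Ser/Thr phosphosite for Plk1. The pattern names, the assumption that the Plk1 phosphosite can be written as [ST], and the toy sequence are illustrative only; these are not the curated ELM regular expressions, which are more detailed.

```python
import re

# Simplified patterns paraphrasing the text; NOT the curated ELM regular expressions.
SIMPLIFIED_PATTERNS = {
    "MAPK_phosphosite": "[ST]P",       # Ser/Thr followed by Pro
    "polo_box_docking": "S[ST]",       # consensus recognized by polo-box domains
    "Plk1_acidic_minus2": "[DE].[ST]", # Asp/Glu two positions before the phosphosite
}

def scan(sequence, patterns=SIMPLIFIED_PATTERNS):
    """Return {pattern_name: [1-based start positions]}, allowing overlapping matches."""
    results = {}
    for name, pattern in patterns.items():
        regex = re.compile(f"(?=(?:{pattern}))")  # lookahead keeps overlapping hits
        results[name] = [m.start() + 1 for m in regex.finditer(sequence)]
    return results

# Hypothetical toy sequence, not a real substrate.
print(scan("AESTPLDVSQ"))
# {'MAPK_phosphosite': [4], 'polo_box_docking': [3], 'Plk1_acidic_minus2': [2, 7]}
```

Even on a ten-residue toy sequence, several patterns overlap on the same residues, which is one reason that additional docking motifs and structural context are needed to assign a site to a particular kinase.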

CONCLUSION AND FUTURE DIRECTIONS
Every year ELM continues to grow in terms of new content and connectivity to other resources. As more content is added to ELM, we expect both to characterize more motif classes and to improve existing ones. Each addition to the database will allow researchers to uncover new biological insights involving motifs in protein-protein interactions, pathways and networks, and to better understand the roles of SLiMs in disease and pathogenicity. An important aspect of this work will be not only to add new content to the database, but also to review and update the existing content with new discoveries from the scientific literature. We will also continue integrating ELM with existing and emerging bioinformatics resources for SLiM research and protein biology. In parallel, we will further develop the ELM API to facilitate the integration of ELM with other bioinformatics tools and resources. We expect that ELM will continue to be a useful and unique resource for SLiM research and the life science community. Users are also encouraged to visit the 'external links' page (http://elm.eu.org/infos/links.html), which lists many other useful tools and databases for SLiM research such as QSLiMFinder (for motif discovery (50)) and ProViz (for motif exploration (24)). We also welcome any feedback that can help us improve ELM.