Proteome-level assessment of origin, prevalence and function of leucine-aspartic acid (LD) motifs

Abstract
Motivation: Leucine-aspartic acid (LD) motifs are short linear interaction motifs (SLiMs) that link paxillin family proteins to factors controlling cell adhesion, motility and survival. The existence and importance of LD motifs beyond the paxillin family is poorly understood.
Results: To enable a proteome-wide assessment of LD motifs, we developed an active learning based framework (LD motif finder; LDMF) that iteratively integrates computational predictions with experimental validation. Our analysis of the human proteome revealed a dozen new proteins containing LD motifs. We found that LD motif signalling evolved in unicellular eukaryotes more than 800 Myr ago, with paxillin and vinculin as core constituents, and nuclear export signal as a likely source of de novo LD motifs. We show that LD motif proteins form a functionally homogenous group, all being involved in cell morphogenesis and adhesion. This functional focus is recapitulated in cells by GFP-fused LD motifs, suggesting that it is intrinsic to the LD motif sequence, possibly through their effect on binding partners. Our approach elucidated the origin and dynamic adaptations of an ancestral SLiM, and can serve as a guide for the identification of other SLiMs for which only few representatives are known.
Availability and implementation: LDMF is freely available online at www.cbrc.kaust.edu.sa/ldmf; source code is available at https://github.com/tanviralambd/LD/.
Supplementary information: Supplementary data are available at Bioinformatics online.


Proteins with least likely LD motif sequences:
Ral GTPase-activating protein subunit alpha-2 (RALGAPA2). RALGAPA2 is the catalytic α2 subunit of the heterodimeric RalGAP2 complex. The putative LD motif (residues 1519-1528) lies in a poorly ordered region. The RALGAPA2:PYK2 interaction has a PrePPI score of 0.93, and RALGAPA2 shows medium-level co-expression with GIT2, but it failed to show significant binding to GIT1 FAH.
C8orf37 (C8orf37). The 207-amino acid C8orf37 protein is widely expressed, with highest levels in brain and heart, and mutations are associated with ciliopathies and retinal dystrophy (Heon, et al., 2016). The putative LD motif (residues 4-13) is in the disordered N-terminal half of the protein.

Computational Methods
Because only a small number of LD motifs are known, predicting them is an imbalanced-dataset problem, which usually causes difficulties for classification methods. We therefore used a two-phase approach to build the prediction model. In the first phase, we took the known LD motifs as the positive set and the remaining 10-mers extracted from the same proteins as the negative set. As expected, these extracted 10-mers are easily differentiated from the true LD motifs because they do not satisfy the sequence, secondary structure or physicochemical patterns of LD motifs. A model trained on such a trivial negative set may therefore not be practically useful. It nevertheless provides a rough predictor that assigns different weights to sequence, secondary structure and physicochemical patterns. In the second phase, we used this predictor to obtain more difficult negative sets by selecting 10-mers from proteins in the Protein Data Bank (PDB) that satisfy some, but not all, of these patterns according to the first predictor. We then added these new negative sets to the training data for the final predictor. The result is an active learning framework for training an LD motif predictor.

Features that characterise bona fide LD motifs in silico.
To determine features that characterise LD motifs in silico, we first analysed known LD motif-containing proteins using algorithms that predict protein disorder and secondary and tertiary structure. We found that established LD motifs (paxillin family, DLC1 and RoXaN), as well as gelsolin's C-terminal LD-like motif, are located within protein regions predicted to be disordered (Supplementary Fig. 1). Secondary structure prediction assigned a significant α-helix likelihood to those LD motifs, in agreement with structural studies of paxillin LD motifs 1, 2 and 4, DLC1 and gelsolin (Fig. 1C) (Alam, et al., 2014;Hoellerer, et al., 2003;Lorenz, et al., 2008;Nag, et al., 2009;Zacharchenko, et al., 2016) (Supplementary Fig. 1). Bona fide LD motifs are therefore computationally characterized as short α-helical segments within disordered protein regions.

Initial training data set
Our model uses the protein sequence content of data windows of length 10 AA, which we denote as core windows. The core window is shifted along the sequence one residue at a time. Thus, a protein of length L >= 10 AA residues yields L-10+1 candidate core windows, each of which is scanned for a putative LD motif.
A literature survey identified known LD motifs in Paxillin, Leupaxin, PaxB, Hic-5 (Tumbarello, et al., 2002), RoXaN (Vitour, et al., 2004) and DLC1 (Durkin, et al., 2007), and we selected these motifs. This resulted in a set of 18 genuine LD motif windows from six proteins, which we denote as the set of known LD motifs (positive set PS1). All other possible windows of length 10 AA from the remaining regions of these six proteins were selected as the core windows of the initial negative set (NS1), giving 4020 windows. To account for the regions surrounding LD motifs, flanking regions of 20 AA residues on each side of the scanning window were also analysed.
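The window enumeration described above can be sketched as follows. This is a minimal illustration, not the LDMF source code: the function name core_windows is hypothetical, and padding short flanks with '-' follows the gap convention described for the feature extraction.

```python
CORE_LEN = 10   # core window length stated in the text
FLANK_LEN = 20  # flanking residues analysed on each side of the core


def core_windows(seq, core_len=CORE_LEN, flank_len=FLANK_LEN):
    """Yield (start, upstream, core, downstream) for every core window.

    A sequence of length L >= core_len yields L - core_len + 1 windows.
    Flanks that run past a terminus are padded with '-' (gap).
    """
    n_windows = len(seq) - core_len + 1
    for start in range(max(n_windows, 0)):
        end = start + core_len
        upstream = seq[max(0, start - flank_len):start].rjust(flank_len, "-")
        downstream = seq[end:end + flank_len].ljust(flank_len, "-")
        yield start, upstream, seq[start:end], downstream
```

Each yielded tuple describes one 50-position segment (20 + 10 + 20); a sequence shorter than 10 residues yields no window.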

Feature extraction from protein sequences
From the set of 18 aligned windows with their flanking sequences, a position frequency matrix (PFM) was constructed. If the flanking region of a scanning window is shorter than 20 AA (near the N-terminus or C-terminus), the missing positions are filled with a gap ('-'). The PFM was then normalised to produce a position weight matrix (PWM) using a normalisation technique analogous to that of (Bajic, et al., 2003). We considered only the twenty unambiguous IUPAC AA codes (http://www.bioinformatics.org/sms/iupac.html) and the gap ('-') for building the PWMs. We built one PWM from the 10-residue scanning core window (PWMCoreSeq), two PWMs from the flanking regions of 20 residues each (PWMUpSeq, PWMDownSeq), and one additional PWM from the whole segment (upstream flanking region + core window + downstream flanking region) of 50 (20 + 10 + 20) AA residues. During the scanning of protein sequences, we matched the four PWMs with the corresponding window segments to obtain four matching scores (Bajic, et al., 2003). We also took the average of the matching scores from the core PWM (PWMCoreSeq) and the flanking PWMs (PWMUpSeq, PWMDownSeq). Thus, we generated five features for each window. When generating the score from the core PWM (PWMCoreSeq), we used prior knowledge of the properties of bona fide LD motifs (Alam, et al., 2014;Hoellerer, et al., 2003). If there is no acidic residue (Asp or Glu) at position 0 or 6, we assign a score of zero to PWMCoreSeq. Proline tends to break helices; consequently, if the core motif contains two consecutive prolines, we also assign zero to PWMCoreSeq.
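A minimal sketch of the PWM scoring and the two knowledge-based filters is given below. Several points are assumptions rather than the published procedure: the PFM is normalised to column-wise relative frequencies (the exact normalisation of Bajic et al. is not reproduced here), the matching score is the mean per-position weight, positions 0 and 6 are read as 0-based indices within the core, and the acidic-residue rule is read as requiring at least one acidic residue at either position. All function names are hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 unambiguous amino acids plus gap


def build_pwm(aligned):
    """Column-wise relative symbol frequencies (assumed normalisation)."""
    pwm = []
    for pos in range(len(aligned[0])):
        counts = {sym: 0 for sym in ALPHABET}
        for seq in aligned:
            if seq[pos] in counts:  # ambiguous codes are ignored
                counts[seq[pos]] += 1
        pwm.append({sym: c / len(aligned) for sym, c in counts.items()})
    return pwm


def pwm_score(pwm, segment):
    """Mean per-position weight of the segment under the PWM."""
    total = sum(col.get(sym, 0.0) for col, sym in zip(pwm, segment))
    return total / len(pwm)


def core_score(pwm_core, core):
    """Core score, set to zero when a knowledge-based filter fails."""
    # Assumed reading: at least one of positions 0 and 6 must be Asp or Glu.
    has_acidic = core[0] in "DE" or core[6] in "DE"
    if not has_acidic or "PP" in core:  # "PP" = two consecutive prolines
        return 0.0
    return pwm_score(pwm_core, core)
```

The same build_pwm and pwm_score helpers would apply unchanged to the upstream, downstream and full 50-residue segments.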

Feature extraction from secondary structure (SS)
We predicted the secondary structure (SS) of the whole protein using PSIPRED (McGuffin, et al., 2000) against the NR database. Each residue in the 50 AA window (core + flanking regions) was tagged as helix ('H'), coil ('C') or strand ('E'). A gap ('-') was also used for windows near the N- or C-terminus of a protein. From the set of 18 windows that correspond to known LD motifs (with flanking regions), we constructed PFMs based on the SS annotation of the residues, analogously to the previous section. Each PFM was then normalized to a PWM. We built one PWM from the scanning core window (PWMCoreSS), two PWMs from the flanking regions of 20 residues each (PWMUpSS, PWMDownSS), and one additional PWM from the whole segment (upstream flanking region + core window + downstream flanking region) of 50 (20 + 10 + 20) AA residues. Using these PWMs, we generated five features from the SS information in the same manner as explained in the previous section. If the core motif contains no predicted helical residue, we assign zero to the core motif score from PWMCoreSS.

Feature extraction using AAindex
From the Amino Acid Index (AAindex) database (Kawashima, et al., 2008), three physicochemical properties were extracted: hydrophobicity (Backer, et al., 1992), volume, and electric charge (Fauchere, et al., 1988). For each of the 10 residues in a core window, we looked up the AAindex values of these three properties, producing 30 (3*10) features.
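The 30-feature layout can be illustrated with a short sketch. The lookup tables below are toy placeholders and not real AAindex entries; the function name and the property-major ordering of the features are likewise assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy tables for illustration only; these are NOT real AAindex values.
TOY_HYDROPHOBICITY = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}
TOY_VOLUME = {aa: 2.0 * i for i, aa in enumerate(AMINO_ACIDS)}
TOY_CHARGE = {aa: 0.0 for aa in AMINO_ACIDS}
TOY_CHARGE.update({"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0})


def aaindex_features(core, property_tables):
    """Return one value per (property, residue), grouped by property."""
    return [table[res] for table in property_tables for res in core]


features = aaindex_features(
    "ACDEFGHIKL", [TOY_HYDROPHOBICITY, TOY_VOLUME, TOY_CHARGE]
)
```

With three tables and a 10-residue core, the returned list has 3*10 = 30 entries; substituting the actual AAindex indices for the toy tables would give the features described above.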

Model Development
We generated an initial model based on the initial training data. Since this model is based on data derived from only six proteins and contains a very small number (18) of known LD motifs, we extended the training set with hypothetical LD motifs and additional negative data. For this, we used a procedure (explained below) that, among other things, utilizes the initial model and is likely to generate motifs highly similar to known LD motifs. Once the training set was expanded in this way, we retrained a model of the same type as the initial one.

The Initial Model
We extracted five features from primary sequence information, five features from SS information, and 30 features from AAindex for each data window, as discussed previously. We then used a support vector machine (SVM) (Cortes and Vapnik, 1995) with a linear kernel (Shawe-Taylor and Cristianini, 2004) to build a predictive model (M1). We used the 'svmtrain' function of MATLAB 2012b with default parameter settings to build the model (there was no need to optimize the SVM parameters, as the default settings provided excellent performance).

LD Motifs from Homologous Proteins
Because the number of known LD motifs is very limited, we sought to increase it using standard protein-protein BLAST (blastp) hits (Altschul, et al., 1997). We used the six proteins that contain the known LD motifs as blastp queries and selected the complete sequences of the high-scoring BLAST hits (E-value: 1e-7, bit score > 40, against the NR database). We then applied M1 to identify LD motifs in these proteins, which are homologous to the six proteins containing known LD motifs. In this way, we predicted 40 more LD motifs. These 40 additional candidate LD motifs were treated as correct and used for building our final model.

Active Learning Dataset from PDB
We downloaded a culled set (Wang and Dunbrack, 2003) of proteins from the Protein Data Bank (PDB) to enhance our negative dataset. We predicted the SS of each full chain using PSIPRED. We built three independent models from the initial dataset based on the five sequence features (M1seq), the five SS features (M1ss) and the 30 AAindex features (M1aaindex). For each of these models, we used an SVM with a linear kernel and default parameter settings.
We applied M1seq to the culled set to predict windows with LD motifs. These windows formed the set Sseq. Analogously, we generated the sets Sss and Saaindex using M1ss and M1aaindex, respectively. Our hypothesis was that a window that does not belong to the intersection of these three sets is less likely to contain an LD motif, so we included such windows in the negative set. This resulted in 2,279 additional negative data windows used for building the final model.
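The selection of harder negatives reduces to set operations, sketched below. The sketch assumes that the negatives are the windows flagged by at least one of the three models but not by all three, in line with the "some, but not all" description in the Computational Methods overview; the function name and the window identifiers are hypothetical.

```python
def hard_negatives(s_seq, s_ss, s_aaindex):
    """Windows predicted by some, but not all, of the three models."""
    flagged_by_any = s_seq | s_ss | s_aaindex
    flagged_by_all = s_seq & s_ss & s_aaindex
    return flagged_by_any - flagged_by_all


# Hypothetical window identifiers: (PDB chain, 0-based start of core window)
s_seq = {("1abcA", 5), ("1abcA", 40), ("2xyzB", 7)}
s_ss = {("1abcA", 40), ("2xyzB", 7), ("2xyzB", 90)}
s_aaindex = {("2xyzB", 7), ("3defC", 12)}
negatives = hard_negatives(s_seq, s_ss, s_aaindex)
```

In this toy example only ("2xyzB", 7) is flagged by all three models, so it is the one window left out of the negative set.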

The Second Model (M2)
We extracted the features from all (18+40) positive and all (4020+2279) negative data windows in the same fashion as discussed previously, and we used an SVM with a linear kernel to build a predictive model (M2). We used the 'svmtrain' function of MATLAB 2012b with default parameter settings to build the model. This model predicted 13 new LD motifs in the human proteome. We applied a version of 18-fold cross-validation (CV) to assess the model accuracy. We divided the negative set randomly into 18 disjoint subsets. At each step of the CV, we excluded a different subset of the negative data and the window corresponding to one of the 18 known LD motifs. Moreover, from the 40 additional positive windows, we excluded all windows from proteins homologous to the protein containing the excluded known LD motif. This last step avoids dependent data in the training set. The model was then derived from the remaining data as described in the section above and tested on the excluded data.
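The fold construction of this cross-validation scheme can be sketched as follows. The data layout (tuples of window identifier and protein identifier), the function name and the fixed random seed are assumptions made for illustration; training and testing of the SVM itself are omitted.

```python
import random


def cv_folds(known_pos, homolog_pos, negatives, n_folds=18, seed=0):
    """Yield (train_pos, train_neg, test_pos, test_neg) for each fold.

    known_pos:   list of (window_id, protein_id) for the known LD motifs
    homolog_pos: list of (window_id, protein_id of the homologous known protein)
    negatives:   list of negative window identifiers
    """
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    # Random partition of the negatives into n_folds disjoint subsets
    neg_subsets = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, (held_window, held_protein) in enumerate(known_pos):
        test_neg = neg_subsets[i % n_folds]
        held = set(test_neg)
        train_neg = [w for w in shuffled if w not in held]
        train_pos = [w for w, _ in known_pos if w != held_window]
        # Drop homologue-derived positives related to the held-out protein
        train_pos += [w for w, src in homolog_pos if src != held_protein]
        yield train_pos, train_neg, [held_window], test_neg
```

Each of the 18 folds holds out one known motif, one negative subset and every homologue-derived positive linked to the held-out motif's protein.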

The Final Model
We experimentally (in vitro) tested the 13 new LD motifs and found that four of them show a strong binding affinity ("Highly likely" category) towards their binding partners. We therefore added these four motifs to the roster of true LD motifs and built the final model following the same method described above. This final model predicts eight LD motifs: three new LD motifs and five in common with the 13 LD motifs previously predicted by M2. Using the CV approach described in the section above, the final model achieved a sensitivity of over 88.88% and an accuracy of 99.97% (Supplementary Table 1).
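For reference, the evaluation metrics follow the standard definitions sketched below. The counts in the example are toy numbers chosen for illustration and are not the confusion matrix of the final model.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)  # fraction of true motifs recovered
    specificity = tn / (tn + fp)  # fraction of non-motifs rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Toy counts: 10 positives among 1000 windows
sens, spec, acc = binary_metrics(tp=9, fn=1, tn=980, fp=10)
```

With so few positives, accuracy is dominated by the negative class, which is why sensitivity is the more informative figure for a dataset this imbalanced.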

Validation of LDMF using Random Sets
To evaluate the robustness of our final model, we tested it on random sequences generated by the Sequence Manipulation Suite (Stothard, 2000). We generated 1,000 random sequences and applied the model to them. LDMF did not predict any LD motif in these sequences.

Availability
LDMF is available at www.cbrc.kaust.edu.sa/ldmf. For the results reported in this manuscript, we used the NR database for PSIPRED predictions (McGuffin, et al., 2000). For the online LDMF server, however, we used the UniProt database for PSIPRED predictions because of the prohibitive time required to obtain results with the NR database.

Overview and Rationale
For initial high-throughput screening, we used three plate assays: 1) differential scanning fluorimetry (DSF) was chosen as a semi-quantitative label-free binding indicator; 2) a direct anisotropy (DA) assay with labelled candidate peptides was chosen to estimate the interaction affinity; and 3) an anisotropy competition assay (ACA), in which unlabelled candidate peptides compete against fluorescently labelled known LD motifs, was chosen to assess whether the (unlabelled) candidate motifs bind to the same sites as the known LD motifs. For all candidates, we used microscale thermophoresis (MST) with labelled peptides as an orthogonal quantitative method. Isothermal titration calorimetry (ITC) was used as an additional label-free method in selected cases to provide an additional Kd or the binding stoichiometry. Nuclear magnetic resonance (NMR) was used in special cases to map binding sites. Peptide sequences included four to eight flanking residues outside the 10-residue core sequence. These additional residues were chosen based on homology modelling, secondary structure and disorder predictions to include helix-capping residues and residues that might additionally contact the LD motif-binding domains (LDBDs). Peptides were synthesized with and without a FITC-Ahx N-terminal fluorescent label.
Peptide mimics of paxillin LD4, which were used as positive controls, displayed micromolar Kd values for FAT and α-parvin as expected, and competed efficiently against labelled LD4 in ACA (Fig. 3A, Supplementary Fig. 3). Although the presence of LD4 resulted in a significant change in melting temperature (Tm) in DSF with FAT, the Tm change with α-parvin was not significant compared to a negative control (a peptide with the scrambled LD4 sequence). This result led us to include an LD2 peptide as a positive control in DSF.

Differential Scanning Fluorimetry
Experiments were performed in 20 mM HEPES pH 7.5, 150 mM NaCl, 2 mM EDTA, 1 mM TCEP. FAT, α-parvin-CHC and GIT1 were used at a concentration of 10 μM. Protein stability was assessed for each peptide at 100 and 250 μM. SYPRO Orange was used as fluorescent dye at 1x the protein concentration. The samples were heated from 20°C to 95°C at a rate of 0.03°C/s on a LightCycler 480 II RT-PCR from Roche. To estimate the melting temperature (Tm), a generalized sigmoid was fitted by least squares and the inflection point was computed.

Direct Anisotropy Assay
Protein was serially diluted in buffer (20 mM HEPES pH 7.5, 150 mM NaCl, 2 mM EDTA, 2 mM DTT, 0.005% Tween-20) and labelled peptides were added at a final concentration of 0.1 μM. Fluorescence anisotropy was measured on a PHERAstar FS device (BMG Labtech) using a fluorescence polarization module 485/520/520, at room temperature. Fluorescence anisotropy was determined as $r = 1000 \cdot (I_{\parallel} - I_{\perp}) / (I_{\parallel} + 2 I_{\perp})$, where $I_{\parallel}$ and $I_{\perp}$ are the parallel and perpendicular components of the fluorescence intensity excited by parallel polarized light. Data were analysed with Origin software using a logistic fit.
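The anisotropy formula above translates directly into code. The sketch below is a plain restatement of that formula; the function name is hypothetical and the intensities in the example are arbitrary illustrative values, not measured data.

```python
def anisotropy_milli(i_parallel, i_perpendicular):
    """Fluorescence anisotropy scaled by 1000, as defined in the text."""
    numerator = i_parallel - i_perpendicular
    denominator = i_parallel + 2.0 * i_perpendicular
    return 1000.0 * numerator / denominator


# Arbitrary example intensities (same units for both channels)
value = anisotropy_milli(120.0, 90.0)
```

Because the denominator is the total emitted intensity, the result does not depend on the absolute signal level, only on the ratio of the two polarization channels.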

Anisotropy Competition Assay
First, FAT and α-parvin were titrated against FITC-Ahx-labelled LD4 as described for the direct anisotropy assay. Competition for the LD4 binding site of FAT and α-parvin was then assessed as follows: the proteins were kept at a concentration corresponding to the Kd of their interaction with labelled LD4 (10 μM for FAT and 25 μM for α-parvin), in the presence of 0.1 μM labelled LD4. Each unlabelled peptide was then added at 100 and 250 μM. When competing for the binding site, the unlabelled peptide displaces labelled LD4, resulting in a lower anisotropy. All measurements were performed as for the direct anisotropy assay. Values are represented as a ratio to the anisotropy at the point estimated to be the Kd of the protein with labelled LD4.

Isothermal Titration Calorimetry
Proteins were dialysed in ITC buffer (20 mM HEPES pH 7.5, 150 mM NaCl, 1 mM EDTA, 1 mM TCEP). A volume of 1.5 ml of protein solution was placed in the cell at a concentration that depended on the interaction: 50 to 150 μM for FAT and 125 μM for GIT1. Peptides were dissolved in the dialysis buffer to a concentration between 1 and 1.25 mM and placed in the injection syringe. Titrations were performed at 25 °C. As a control, the peptide was titrated into the buffer and the resulting heats were subtracted from the protein-binding curve. ITC was performed either on a Nano ITC (TA Instruments), with data fitted using NanoAnalyze Software, or on an ITC 200 (GE), with data fitted using Origin Software.

Microscale Thermophoresis
Serial dilutions of proteins were prepared starting from 630 μM (GIT1), 560 μM (FAT) or 530 μM (α-parvin) in reaction buffer (20 mM HEPES pH 7.5, 150 mM NaCl, 2 mM EDTA, 2 mM DTT, 0.05% Tween-20). Labelled peptides were added to a final concentration of 0.1 μM. The experiment was performed at 20% LED power and 20, 40 and 60% MST power in standard capillaries (GIT1) and MST Premium Capillaries (FAT and α-parvin) on a Monolith NT.115 device (NanoTemper Technologies) at 25 °C. Thermophoresis and temperature jump data were fitted with the Kd formula derived from the law of mass action, using the provided NT analysis software.
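For a 1:1 interaction, the law of mass action gives a closed-form expression for the fraction of labelled peptide bound at each protein concentration. The sketch below shows this standard quadratic solution; it is an illustration of the binding model, not the implementation in the NT analysis software, and the function name and the example Kd are assumptions.

```python
import math


def fraction_bound(c_protein, c_peptide, kd):
    """Fraction of labelled peptide in complex for 1:1 binding.

    All three arguments share one concentration unit (e.g. uM).
    Solves [PL]^2 - (P + L + Kd)[PL] + P*L = 0 for the physical root.
    """
    b = c_protein + c_peptide + kd
    complex_conc = (b - math.sqrt(b * b - 4.0 * c_protein * c_peptide)) / 2.0
    return complex_conc / c_peptide


# Example: 0.1 uM labelled peptide, hypothetical Kd of 10 uM
half_point = fraction_bound(10.0, 0.1, 10.0)
```

When the labelled peptide concentration is far below the Kd, as with 0.1 μM peptide and micromolar affinities, the fraction bound is close to 0.5 at a protein concentration equal to the Kd.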

Nuclear Magnetic Resonance
Cells were grown in M9 minimal medium containing 15N-labelled ammonium chloride, induced at OD=0.8 with 300 μM IPTG and harvested after overnight incubation at 22 °C. Protein samples were purified, and NMR samples were prepared by dissolving the 15N-labelled protein in a 10% D2O/90% H2O solution at a monomer concentration of 100 μM in a total volume of 500 μL at pH 7.5. LD motif-containing peptides were dissolved in FAT gel filtration buffer (20 mM HEPES pH 7.5, 150 mM NaCl, 2 mM EDTA and 2 mM DTT). 2 μL of 25 mM 2,2-dimethyl-2-silapentane-5-sulfonate (DSS) sodium salt was added as an internal chemical shift reference for 1H at 0 ppm. The samples were stable over the course of the NMR experiments.

Data-driven Molecular Docking
The data-driven HADDOCK 2.1 protocol (van Zundert, et al., 2016) was used to generate models of the FAT:CCDC158 and FAT:LPP complexes. Crystal structures of FAT (1ow8 and 1ow7) were used for the modelling. Initial models of CCDC158 and LPP were built in helical form based on the LD4 peptide. The NMR chemical shift perturbation (CSP) data were used to define the residues potentially involved in binding, known as active residues. Residues 915, 926, 929, 933, 934, 936, 938, 940, 956, 1031, 1032, 1033, 1035, 1036 and 1038 were marked on FAT helix 1/4 as active residues for FAT:CCDC158. Residues 914, 916, 934, 936, 938, 1022, 1027, 1031, 1032 and 1033 were marked on FAT helix 1/4, and residues 948, 955, 956, 957, 959, 962, 963, 964, 991 and 1007 were marked on FAT helix 2/3, as active residues for FAT:LPP. The CSP data were used only to define the binding site and not the binding poses. Structures listed in the output clusters with the best scores were further analysed using PyMol (pymol.org).

Cellular Analyses
Design and preparation of eGFP-coupled tetra LD motifs
eGFP-LD fusion constructs contained an N-terminal eGFP followed by an HRV3C protease recognition site (LEVLFQGP) and then four copies of the same LD motif sequence. LD motifs were separated by glycine-serine-threonine linkers of different lengths to enable multivalent associations with LDBDs: LD-GSGST-LD-GSGSTGSGST-LD-GSGSTGSGSTGSGST-LD. LD sequences were LD4: TRELDELMASLSD; LPP: EIDSLTSILADLESS; EPB41L5: ATDELDALLASLTENLID; C16orf71: EAWDLDDILQSLQGQ. Constructs were synthesized as gBlock (IDT) fragments, separately for the N-terminal eGFP and the C-terminal tetra-LD motif sequences. CPEC cloning (Quan and Tian, 2009) was used to assemble the fragments into the vector, and the resulting constructs were confirmed by sequencing.
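The tetra-LD layout described above can be written out programmatically, which is convenient for checking the designed amino acid sequence. This is a small illustrative sketch covering only the tetra-LD part (the eGFP sequence is not given in the text); the function and constant names are hypothetical.

```python
LINKER_UNIT = "GSGST"        # repeated 1, 2 and 3 times between LD copies
HRV3C_SITE = "LEVLFQGP"      # protease recognition site preceding the motifs

LD_SEQUENCES = {
    "LD4": "TRELDELMASLSD",
    "LPP": "EIDSLTSILADLESS",
    "EPB41L5": "ATDELDALLASLTENLID",
    "C16orf71": "EAWDLDDILQSLQGQ",
}


def tetra_ld(ld):
    """Four copies of one LD motif joined by linkers of 1, 2 and 3 units."""
    parts = [ld]
    for n_units in (1, 2, 3):
        parts.append(LINKER_UNIT * n_units)
        parts.append(ld)
    return "".join(parts)


# Region downstream of eGFP for the LD4 construct
ld4_tail = HRV3C_SITE + tetra_ld(LD_SEQUENCES["LD4"])
```

For a motif of length n, the tetra-LD part is 4n + 30 residues long, since the three linkers contribute 5 + 10 + 15 residues.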

Cell lines, transfection, and antibodies
HeLa cells were cultured in DMEM with 10% FBS and transfected with plasmid DNA using Lipofectamine 3000. For cell spreading and immuno-localization experiments, HeLa cells were plated at low density on fibronectin-coated coverslips, transfected and used for immunofluorescence 24 h later, as previously described (Astro, et al., 2011). For live cell imaging, HeLa cells were plated on fibronectin-coated 6-well plates, transfected with GFP-tagged plasmids, manually scratched and recorded 36 h after transfection. The pAbs against GFP and vinculin and the AlexaFluor 647-conjugated phalloidin were from Thermo Scientific. Fixed cells were observed with the EVOS FL Auto 2 Microscope (Thermo Scientific) using a Plan Apochromat 1.42 NA/60X oil objective (Zeiss).

Morphological analysis and functional assays
The projected cell area, aspect ratio and roundness of transfected HeLa cells spread for 24 hours on fibronectin were measured on thresholded images using ImageJ. For wound healing assays, images were captured with a 10x lens at 60-min intervals for 30 h using an optical microscope (JuLI™ Stage Real-Time Cell History Recorder, NanoEntek) equipped with a high-sensitivity monochrome CCD (Sony sensor 2/3") and an automated x-y-z stage, with a 0.3 NA/10X objective (Olympus). During live imaging, cells were kept at 37°C and 5% CO2 in a cell incubator (Heracell 150i, Thermo Scientific). Migration paths were calculated from the nuclear positions of GFP-positive cells obtained from 4 fields per well using two plugins available for ImageJ (Manual tracking and Chemotaxis tool). The track of each cell was used to measure several migration parameters: the total distance; the Euclidean distance (the length of the line segment between the start and end points of the cell trajectory); the cell velocity; and the directionality (an index of the persistence of the cell movement, given by the ratio between the Euclidean and the total distances). Directionality ranges between 0 and 1, where 1 corresponds to the maximum linearity of the trajectory.
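The migration parameters defined above can be computed from a list of tracked positions as sketched below. This is an illustration of the definitions, not the ImageJ plugin code; the function name is hypothetical, and velocity is assumed here to be the total path length divided by the elapsed time.

```python
import math


def migration_metrics(track, interval_min=60.0):
    """Return (total, euclidean, velocity, directionality) for one cell.

    track: list of (x, y) nuclear positions at successive frames.
    interval_min: time between frames in minutes (60 min in the text).
    """
    total = sum(math.dist(a, b) for a, b in zip(track, track[1:]))
    euclidean = math.dist(track[0], track[-1])
    elapsed = interval_min * (len(track) - 1)
    velocity = total / elapsed if elapsed else 0.0  # assumed definition
    directionality = euclidean / total if total else 0.0
    return total, euclidean, velocity, directionality


straight = migration_metrics([(0, 0), (3, 4), (6, 8)])
returning = migration_metrics([(0, 0), (3, 4), (0, 0)])
```

A cell moving in a straight line has a directionality of 1, whereas a cell that returns to its starting point has a directionality of 0 despite covering the same total distance.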

Supplementary Figure 1. Features that characterise bona fide LD motifs in silico.
For each known LD motif, we present the secondary structure predictions (SS3: three states, namely H: helix, E: beta strand, C: coil; SS8: eight states, namely H: α-helix, G: 3-helix, I: 5-helix, E: extended β ladder, B: β bridge, T: hydrogen bonded turn, S: bend, L: loop), solvent accessibility (ACC; B: buried; M: medium exposed, E: solvent exposed) and disorder (DISO: order [.] and disorder [*]) as predicted by the RaptorX server (Kallberg, et al., 2014). Amino acids are numbered starting 20 positions upstream of the LD motif (unless the LD motif is situated at the N-terminus, which is then taken as number 1).
The motifs suggested by Brown et al. (Brown, et al., 1998) were based on a pattern search with the sequence pattern (L,V)(D,E)X(L,M)(L,M)XXL.
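The sequence pattern (L,V)(D,E)X(L,M)(L,M)XXL corresponds to a simple regular expression, sketched below. The translation into regex syntax and the function name are ours; a lookahead is used so that overlapping occurrences are also reported, and positions are 1-based to match the residue numbering used in the summary of suggested motifs.

```python
import re

# (L,V)(D,E)X(L,M)(L,M)XXL, with X standing for any residue
BROWN_PATTERN = re.compile(r"(?=([LV][DE].[LM][LM]..L))")


def find_pattern(seq):
    """Return (1-based start, matched 8-mer) for every pattern occurrence."""
    return [(m.start() + 1, m.group(1)) for m in BROWN_PATTERN.finditer(seq)]


hits = find_pattern("AALDNLMLELAA")
```

A pattern of this kind matches on sequence alone, which is why structural context (disorder, helicity, solvent accessibility) is needed to judge whether a hit can function as an LD motif.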

Extended Results: 16 out of the 18 suggested LD motifs were predicted to be an integral part of a folded protein domain. In 15 out of these 16 cases, the hydrophobic patch of the suggested LD motif is inaccessible to solvent and hence to ligands. In the one remaining case (LTK), the suggested LD motif is part of the catalytically important αC helix of a protein kinase domain. Thus, unless unlikely large unfolding events occur, these 16 putative motifs cannot function as LD motifs despite containing the correct sequence pattern. For the remaining two of the 18 proteins, the suggested LD motif sequence is located in a flexible region. However, in one case (Eph-2) the putative LD motif is part of a signalling peptide that is cleaved in vivo, and hence an unlikely candidate. Only the remaining LD sequence from chicken tensin was a plausible candidate, being located in an unstructured region and implicated in FAs (Lo, 2004).

Summary of previously suggested LD motifs by Brown et al. (Brown, et al., 1998)
Red shading: this motif is potentially a bona fide LD motif because of its structural characteristics and supporting biological evidence.
Yellow shading: no high-quality model exists, but either low-identity structural homology or other functionality makes an LD motif function unlikely.
Green shading: no 3D model is available, and strong biological grounds to rule out LD motif function are lacking. However, the known biological function speaks against it, and the motif is highly degenerate.

P09104; GAMMA ENOLASE; ENO2
Location in protein: 90-LDNLMLEL-97 Structural Information: 100% Sequence Identity with PDB 2akm. The suggested LD motif is part of the catalytic domain.

P29376; LTK
Location in protein: 556-LDFLMEAL-563 Structural Information: 77% sequence identity with PDB 3ics. The LD motif is situated in the αC helix of the protein kinase domain.
Location in protein: 163-LDDLLVVL-170 Structural Information: 40% sequence identity with PDB 5jjtA. The LD motif is inaccessible in the catalytic region.

P54762; Ephrin type-B receptor 1; EphB1
Location in protein: 3-LDYLLLLL-10 Structural Information: No structure modelling possible for this region. The region is identified as an extracellular signaling peptide (cleaved during maturation) by Phobius (below).
Location in protein: 1361-LDGLLNQL-1368 Structural Information: No homology model possible. The LD motif is found in the coiled-coil STEM region.
[...] 2003). Tensin is also involved in the function of focal adhesions. The LD motif of tensin is located in a disordered region and is predicted to be helical.

P51592; E3 UBIQUITIN-PROTEIN LIGASE; HYD

Supplementary Figure 2.2. Secondary structure predictions (SS3: three states, namely H: helix, E: beta strand, C: coil; SS8: eight states, namely H: α-helix, G: 3-helix, I: 5-helix, E: extended β ladder, B: β bridge, T: hydrogen bonded turn, S: bend, L: loop), solvent accessibility (ACC; B: buried; M: medium exposed, E: solvent exposed) and disorder (DISO: order [.] and disorder [*]) for the non-paxillin motifs suggested by SlimSearch4 (Krystkowiak and Davey, 2017), which was the only algorithm that predicted a reasonable number of LD motif candidates in the human proteome (see Supplementary Table 1). The feature predictions were obtained from the RaptorX server (Kallberg, et al., 2014). The suggested LD motif region is boxed. Amino acids are numbered starting 20 positions upstream of the LD motif (unless the LD motif is situated at the N-terminus, which is then taken as number 1).
According to this analysis, 27/34 of the suggested sequences have secondary structure or order/disorder features that are inconsistent with known LD motifs. Of the remaining seven, four lack the typical amino acid features, in particular the presence of additional acidic charges (GAPD1, F16B1, TENC1, CK072). Hence, only 3/34 motifs remain as plausible candidates (MIAP, SRTD1, AZI1).

Supplementary Figure 3. Binding Assays
Binding assays of known LD motifs and LDMF-proposed LD motifs to FAT, α-parvin and GIT1. ACA: anisotropy competition assay; DA: direct fluorescence anisotropy; MST: microscale thermophoresis; DSF: differential scanning fluorimetry.
The sensitivity, specificity and accuracy stated are based on the performance of the machine-learning model on the test set. We used the known LD motifs to build the machine-learning model. We then tested the performance of the computational model using a leave-one-out cross-validation approach. Given the imbalanced nature of our training data, 'sensitivity' appears to be the most appropriate evaluation metric.
Insulin-like growth factor-binding protein 2 | IGFBP2