Improving sequence-based modeling of protein families using secondary-structure quality assessment

Abstract

Motivation: Modeling the sequence distribution of a protein family from homologous sequence data has recently received considerable attention, in particular for structure and function prediction, as well as for protein design. Notably, direct coupling analysis, a method that infers effective pairwise interactions between residues, was shown to capture important structural constraints and to successfully generate functional protein sequences. Building on this and other graphical models, we introduce a new framework to assess the quality of the secondary structures of generated sequences with respect to reference structures for the family.

Results: We introduce two scoring functions, called Dot Product and Pattern Matching, characterizing how likely the secondary structure of a protein sequence is to match a reference structure. We test these scores on published experimental protein mutagenesis and design datasets, and show improvement in the detection of non-functional sequences. We also show that using these scores helps reject non-functional sequences generated by graphical models (Restricted Boltzmann Machines) learned from homologous sequence alignments.

Availability and implementation: Data and code are available at https://github.com/CyrilMa/ssqa

Supplementary information: Supplementary data are available at Bioinformatics online.


Pattern Matching details
A pattern r is defined as an ordered set of K elements called motifs, r = (r_k = (C_k, d_k))_{1≤k≤K}, where C_k is the motif class (α-helix, β-strand or coil) and d_k is a probability distribution over the motif length, which defines the transition matrix of the model below.
A structure s ∈ {α-helix, β-strand, coil}^n is said to match the pattern r if there exist cut positions (t_k)_{0≤k≤K} such that:

1. t_0 = 0 and t_K = n;
2. ∀k > 0, t_k − t_{k−1} ≥ 1;
3. ∀i such that t_{k−1} ≤ i < t_k, we have s_i = C_k.

For now, we define the distribution d_k from the minimal and maximal lengths of the different classes of secondary structure observed in the sequences available in the PDB. Consider a class of secondary structure C; we denote by M(C) the maximum length and by m(C) the minimum length of a structure of class C observed in the PDB. For each k, d_k is then the uniform distribution over the admissible lengths:

d_k(l) = 1 / (M(C_k) − m(C_k) + 1) if m(C_k) ≤ l ≤ M(C_k), and d_k(l) = 0 otherwise.

Other distributions, based on the frequencies of motif lengths in the PDB, have been tested but yielded lower performance.
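As an illustration, the deterministic matching condition above can be sketched in a few lines of Python (a hypothetical helper, not the paper's code; 'H', 'E', 'C' stand for α-helix, β-strand and coil, and each motif is given here by its class and its minimal and maximal admissible lengths):

```python
def matches(structure, pattern):
    """Return True if `structure` (a string over 'H', 'E', 'C') can be split
    into K consecutive segments, segment k made only of class C_k with a
    length in [m(C_k), M(C_k)]."""
    n = len(structure)
    # reachable = set of cut positions t_k achievable after motif k
    reachable = {0}
    for cls, m, M in pattern:
        nxt = set()
        for t0 in reachable:
            for t1 in range(t0 + m, min(t0 + M, n) + 1):
                if all(structure[i] == cls for i in range(t0, t1)):
                    nxt.add(t1)
        reachable = nxt
    return n in reachable

# e.g. an α-helix of 2-4 residues followed by a coil of 1-3 residues:
pattern = [("H", 2, 4), ("C", 1, 3)]
print(matches("HHHC", pattern))   # True
print(matches("HHHH", pattern))   # False (no trailing coil)
```

Enumerating reachable cut positions motif by motif is exactly the dynamic program that the probabilistic version below generalizes.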
In the following, we denote by R ⊆ {α-helix, β-strand, coil}^n the set of secondary structures that match the pattern r, and we define Match(x, r) as the probability that sequence x has a structure matching r. We explain below how to compute Match(x, r) in time polynomial in the sequence length n.

Likelihood of a pattern
Given the probability matrix of the structure P_x defined previously, we want to assess the probability that this probabilistic structure matches r, denoted by Match(x, r).
It is possible to represent this problem with a Hidden Markov Model (HMM), a statistical model in which a system is modeled as a Markov process Z = (z_k)_k that is not directly observable. In addition to this hidden process, there is an observable process Y = (y_k)_k such that for all k, y_k depends only on z_k. Several objectives can be pursued with such a model: decoding (finding the most likely z), marginalizing (computing p(z_k | y_1, ..., y_k)), etc.
In our case we consider the following hidden Markov model.

Hidden states: z_k = t_k, the end position of motif r_k, with transition probability p(t_k | t_{k−1}) = d_k(t_k − t_{k−1}).

Observation states: y_k ∈ {0, 1}, where y_k = 1 if a motif satisfying the conditions of r_k = (C_k, d_k) was emitted; given t_{k−1} and t_k,

p(y_k = 1 | t_{k−1}, t_k) = ∏_{i = t_{k−1}}^{t_k − 1} P_x(i, C_k).

From the previous definition, we can see that x has a structure s that matches pattern r if and only if ∀k, y_k = 1. We then have:

Match(x, r) = p(y_1 = 1, ..., y_K = 1 | x, r).

To compute this probability, we rely on our Markov chain. We recall here a dynamic-programming way of marginalizing a hidden Markov model, known as the sum-product algorithm (see [13]). For an HMM with observations y_k and hidden states z_k, we define recursively:

α_k(z) = p(y_k | z_k = z) ∑_{z′} p(z_k = z | z_{k−1} = z′) α_{k−1}(z′),

so that after completion p(y_1, ..., y_K) = ∑_z α_K(z). We use the sum-product algorithm with our own hidden Markov model (Figure 1). After rearrangement, this gives, for all k:

α_k(t) = ∑_{t′ < t} d_k(t − t′) (∏_{i = t′}^{t − 1} P_x(i, C_k)) α_{k−1}(t′),

with α_0(t) = 1_{t = 0} and Match(x, r) = α_K(n).
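A minimal sketch of this recursion, assuming P is the n×3 matrix P_x of predicted class probabilities and each motif carries an explicit length distribution d_k (the function and variable names are illustrative, not the paper's code):

```python
import numpy as np

CLASSES = {"H": 0, "E": 1, "C": 2}   # α-helix, β-strand, coil columns of P

def match_probability(P, pattern):
    """Sum-product recursion: alpha[t] accumulates
    p(y_1 = ... = y_k = 1, t_k = t); Match(x, r) is alpha[n] after the last
    motif. P is an n×3 matrix of predicted class probabilities and each
    motif is (class, {length: prob})."""
    n = len(P)
    alpha = np.zeros(n + 1)
    alpha[0] = 1.0                    # alpha_0(t) = 1_{t=0}
    for cls, d in pattern:
        c = CLASSES[cls]
        nxt = np.zeros(n + 1)
        for t0 in range(n + 1):
            if alpha[t0] == 0.0:
                continue
            for l, p_l in d.items():
                t1 = t0 + l
                if t1 > n:
                    continue
                # p(y_k = 1 | t_{k-1}, t_k): every residue of the segment
                # carries class C_k
                emit = float(np.prod(P[t0:t1, c]))
                nxt[t1] += alpha[t0] * p_l * emit
        alpha = nxt
    return alpha[n]

# With a one-hot P encoding the structure "HHHC", the score reduces to the
# probability mass of the length combinations that match exactly:
P = np.zeros((4, 3)); P[0:3, 0] = 1.0; P[3, 2] = 1.0
print(match_probability(P, [("H", {2: 0.5, 3: 0.5}), ("C", {1: 0.5, 2: 0.5})]))  # 0.25
```

The double loop over cut positions and lengths makes the polynomial complexity in n explicit.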

Pattern Inference
For many proteins (in particular those attracting the most attention), structures from which to infer patterns are available online; we retrieve most of them from the Protein Data Bank [3]. For others, we may be required to infer the pattern from the result of our secondary structure prediction.
From now on, we define the distribution d_k based on the distribution of lengths of the different classes of secondary structure observed in the sequences available in the PDB. Consider a class of secondary structure C, and denote by f(C, l) the frequency of length l among structures of class C observed in the PDB. For each k, d_k is then defined as:

d_k(l) = f(C_k, l).

We now expand our hidden Markov model so that the class of each motif is integrated in the hidden state. We also have to adapt the transition probability to this new model (see Figure 2). The adapted model is defined as follows.

Hidden states: z_k = (C_k, t_{k−1}, t_k), where C_k is the class of motif r_k and [t_{k−1}, t_k[ is the interval of residues covered by this motif.
Transition probability: the transition probability is modeled as

p(z_k | z_{k−1}) = p(C_k | C_{k−1}) d_k(t_k − t_{k−1}),

where p(C_k | C_{k−1}) is the probability of a motif of class C_k following a motif of class C_{k−1}, and d_k is the probability distribution defined above.
From a technical point of view, since we do not know a priori the number of motifs, it is necessary to add a final absorbing state S, with p(C_k = S | C_{k−1} = S) = 1 and d(S, t_{k−1}, t_k) = 1_{t_{k−1} = t_k}.
With this formalism we can run the max-product algorithm (or Viterbi algorithm) to find the most likely pattern given the predicted secondary structure.
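A hedged sketch of this max-product inference, under the assumptions stated above: an explicit class-transition table and per-class length distributions; the absorbing state is handled implicitly here by only reading out segmentations that end exactly at position n. All names are illustrative, not the paper's code:

```python
import numpy as np

CLASSES = ["H", "E", "C"]          # α-helix, β-strand, coil
IDX = {c: i for i, c in enumerate(CLASSES)}

def infer_pattern(P, trans, d):
    """P: n×3 matrix of predicted class probabilities; trans[c_prev][c]:
    class-transition probabilities (c_prev = None for the first motif);
    d[c]: length distribution {length: prob} for class c.
    Returns the most likely motif list [(class, start, end), ...]."""
    n = len(P)
    # layers[t][c] = (best log-score of segmenting residues [0, t) with a
    #                 last motif of class c, backpointer (t_prev, c_prev))
    layers = [dict() for _ in range(n + 1)]
    layers[0][None] = (0.0, None)
    for t0 in range(n):
        for c_prev, (s, _) in layers[t0].items():
            for c in CLASSES:
                for l, p_l in d[c].items():
                    t1 = t0 + l
                    if t1 > n:
                        continue
                    # emission: all residues of the segment carry class c
                    emit = float(np.sum(np.log(P[t0:t1, IDX[c]] + 1e-12)))
                    val = s + np.log(trans[c_prev][c] * p_l) + emit
                    if c not in layers[t1] or val > layers[t1][c][0]:
                        layers[t1][c] = (val, (t0, c_prev))
    # backtrack from the best class ending exactly at position n
    c = max(layers[n], key=lambda k: layers[n][k][0])
    t, motifs = n, []
    while t > 0:
        _, (t0, c_prev) = layers[t][c]
        motifs.append((c, t0, t))
        t, c = t0, c_prev
    return motifs[::-1]
```

With a one-hot P encoding "HHHCC" and length distributions allowing a 3-residue helix and a 2-residue coil, the returned pattern is [("H", 0, 3), ("C", 3, 5)].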

Errors Made by the Secondary Structure Predictor
We noticed two kinds of errors made by the algorithm, against which Pattern Matching is robust:

• Border errors: many errors occur at the transition between two structures, because of uncertainty about where one structure begins and the other ends. Though these errors are numerous, as can be seen in Figure 3, they do not affect the performance of Pattern Matching much, as the length of the structure is flexible in the pattern.
• Weak errors: when the structure is incorrectly predicted at a residue, the true label often retains a non-negligible likelihood, i.e. above 0.1 in more than 60% of the cases (see Figure 4). Pattern Matching is robust against these errors, as it can correct them at a low cost in the Match score.

Figure 3: Accuracy (share of labels predicted correctly) as a function of the distance to a transition. Accuracy drops at the boundary between two structures.

Figure 4: Likelihood of the true secondary structure when a wrong structure is predicted. To remove the bias created by border errors, we focus on residues at a distance of at least 2 from the boundary between two structures. Most true labels still retain a fairly high likelihood.

Model Selection for Supervised SSQA with Chorismate Mutases
We experimented with different models from scikit-learn [19] for the supervised combination of the Pattern Matching and Dot Product scores in SSQA. Here we display the AUCs obtained with different models: sklearn.linear_model.LogisticRegression(), sklearn.ensemble.GradientBoostingClassifier(), and sklearn.ensemble.RandomForestClassifier() with n_estimators ∈ {10, 50, 200}. To select the model, we worked only with the training set of natural sequences. We propose two methods to evaluate the models:

• Method 1: we extract a random sample of 20% of the training dataset for validation. We then train the models on the remaining training data and evaluate them (by computing the AUC) on the validation set. We repeat this operation 100 times to lower the variance.
• Method 2: we build 70%-similarity clusters of similar sequences and pick clusters at random, representing around 20% of the training set, for validation. We then train the models on the remaining training data and evaluate them (by computing the AUC) on the validation set. We repeat this operation 100 times to lower the variance.

Overall, we see that Logistic Regression performs slightly worse than or as well as Random Forest with 200 trees when it comes to predicting activity for samples different from those in the training set (with validation sets built from similarity clusters). Random Forest, however, overfits, leading to better performance on random validation samples that are likely to be close to those in the training set.
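The two validation schemes can be sketched with scikit-learn as follows (the features and cluster labels below are synthetic placeholders, not the paper's data; Method 2 corresponds to GroupShuffleSplit over similarity-cluster labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, GroupShuffleSplit
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                   # placeholder features
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
clusters = rng.integers(0, 30, size=300)        # placeholder 70%-similarity clusters

def mean_auc(model, splitter, groups=None):
    """Average validation AUC over the splits produced by `splitter`."""
    aucs = []
    for tr, va in splitter.split(X, y, groups):
        model.fit(X[tr], y[tr])
        aucs.append(roc_auc_score(y[va], model.predict_proba(X[va])[:, 1]))
    return float(np.mean(aucs))

for name, model in [("logistic", LogisticRegression()),
                    ("rf-200", RandomForestClassifier(n_estimators=200, random_state=0))]:
    # Method 1: random 20% validation samples
    random_auc = mean_auc(model, ShuffleSplit(n_splits=20, test_size=0.2, random_state=0))
    # Method 2: hold out whole similarity clusters
    cluster_auc = mean_auc(model, GroupShuffleSplit(n_splits=20, test_size=0.2, random_state=0),
                           groups=clusters)
    print(f"{name}: random={random_auc:.2f} cluster={cluster_auc:.2f}")
```

Holding out whole clusters is what reveals the Random Forest overfitting noted above: random splits leak near-duplicate sequences between training and validation.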

In Figure 5, we plot the ROC curve for both unsupervised and supervised metrics. Unsupervised scores, though they perform less well than supervised scores, achieve some discrimination between active and inactive samples, potentially helping to improve sequence generation. DCA easily discriminates many bad sequences but fails on some of them. Supervised SSQA also discriminates some bad sequences very well, and unsupervised SSQA is able to discriminate inactive sequences, though it performs less well than supervised SSQA. Last panel: violin plot as a function of the generation temperature (0 corresponds to natural sequences); supervised SSQA shows good discrimination at every temperature level.
Improvements of SSQA as a Function of the Secondary Structure in the Beta-lactamase

We considered a single-mutation dataset of sequences from beta-lactamase [4] (UniProt ID: P62593). The activity of each single mutant had been experimentally determined (see Figure 7), and the experimentally determined secondary structure is also available (PDB ID: 4ZJ1). We linearly combined the Dot Product and Pattern Matching features we computed with the DCA energy from [10], and built activity predictors out of these metrics. Below, we compute the balanced accuracy of the activity-prediction task for mutations occurring on each kind of structure.
Balanced accuracy differs from usual accuracy in that it averages the accuracy on positive and negative labels, and is therefore better suited to evaluating imbalanced datasets. With TP (number of true positives), TN (true negatives), FN (false negatives) and FP (false positives), balanced accuracy is given by:

balanced-accuracy = (1/2) (TP / (TP + FN) + TN / (TN + FP)).

In comparison, usual accuracy can be expressed as:

accuracy = (TP + TN) / (TP + TN + FP + FN).

As shown in Figure 8 (with the 3-class structure) and Figure 9 (with the 8-class structure), the predictor based on DCA energy reaches 70% to 75% accuracy on α-helices and coils but fails on β-strands. The SSQA predictor brings significant improvements, in particular on β-strands.

Figure 9: Balanced accuracy for activity prediction on beta-lactamase single mutations, given the 8-class secondary structure (only 6 classes are present in the structure) of the mutated residue. SSQA brings particular improvement on β-strands, where interactions between residues are usually more complex than in α-helices.
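Spelled out in code (sklearn.metrics.balanced_accuracy_score computes the same quantity from predicted labels):

```python
def balanced_accuracy(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)   # accuracy on positive labels
    specificity = tn / (tn + fp)   # accuracy on negative labels
    return 0.5 * (sensitivity + specificity)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# On an imbalanced example (90 negatives, 10 positives, a classifier that
# predicts everything negative), accuracy looks good while balanced
# accuracy exposes the failure on the positive class:
print(accuracy(tp=0, tn=90, fp=0, fn=10))            # 0.9
print(balanced_accuracy(tp=0, tn=90, fp=0, fn=10))   # 0.5
```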
References to Datasets used for "Secondary structure quality assessment on mutational datasets"