A large-scale comparative study on peptide encodings for biomedical classification

Abstract Owing to the great variety of distinct peptide encodings, working on a biomedical classification task at hand is challenging. Researchers have to determine encodings capable of representing underlying patterns as numerical input for the subsequent machine learning. A general guideline is lacking in the literature; thus, we present here the first large-scale comprehensive study investigating the performance of a wide range of encodings on multiple datasets from different biomedical domains. For the sake of completeness, we added additional sequence- and structure-based encodings. In particular, we collected 50 biomedical datasets and defined a fixed parameter space for 48 encoding groups, leading to a total of 397 700 encoded datasets. Our results demonstrate that none of the encodings are superior for all biomedical domains. Nevertheless, some encodings often outperform others, thus reducing the initial encoding selection substantially. Our work enables researchers to objectively compare novel encodings to the state of the art. Our findings pave the way for a more sophisticated encoding optimization, for example, as part of automated machine learning pipelines. The work presented here is implemented as a large-scale, end-to-end workflow designed for easy reproducibility and extensibility. All standardized datasets and results are available for download to comply with FAIR standards.

and evaluated f at a fixed set of points, appending the resulting values to the final feature vector [4,5]. Specifically, we employed the gaussian_kde function provided by the SciPy package [3].
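A minimal sketch of this step, assuming residues are first mapped to numerical values via an amino acid index (the index values and the evaluation grid below are illustrative placeholders, not a real AAindex entry):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative index values (placeholders, not a real AAindex entry)
AAI = {"A": 1.8, "C": 2.5, "G": -0.4, "K": -3.9, "L": 3.8, "S": -0.8}

def kde_encode(seq, grid=None):
    """Fit a Gaussian KDE f on the index-mapped residues and evaluate
    f at a fixed set of grid points."""
    if grid is None:
        grid = np.linspace(-5.0, 5.0, 10)
    values = np.array([AAI[a] for a in seq])
    f = gaussian_kde(values)
    return f(grid)  # fixed-length density vector appended to the features

vec = kde_encode("ACKLGS")
```

Evaluating the density on a fixed grid yields a feature vector of constant length regardless of the sequence length.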

Distance frequency
For each input sequence s_i, the algorithm first replaces each amino acid a_k with its respective chemical group, that is, basic, hydrophobic, or other. Afterwards, s_i is split into three parts: the N-terminal, middle, and C-terminal section [6]. Note that, owing to varying sequence lengths, different split methods are used.
Refer to the original publication for the algorithmic details [6]. For each section, the distances are obtained by counting the number of amino acids between two, for instance, basic amino acids. Afterwards, the values are assigned to a distance class, hence distance frequencies, for each property group. Besides the distance frequencies for each group and section, the amino acid composition as well as the di-peptide composition make up the final feature vector [6].
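The per-section distance counting can be sketched as follows, assuming a simplified residue-to-group table and a plain equal-thirds split (the original uses length-dependent split methods [6]):

```python
from collections import Counter

# Simplified residue-to-group mapping (illustrative; not the exact table from [6])
GROUPS = {"K": "basic", "R": "basic", "H": "basic",
          "A": "hydrophobic", "L": "hydrophobic", "V": "hydrophobic",
          "I": "hydrophobic", "F": "hydrophobic", "M": "hydrophobic"}

def distance_frequencies(section, group):
    """Count the residues lying between consecutive occurrences of
    `group` and tally the resulting distances as classes."""
    pos = [i for i, a in enumerate(section) if GROUPS.get(a, "other") == group]
    return Counter(j - i - 1 for i, j in zip(pos, pos[1:]))

def encode(seq):
    """Distance frequencies per section (N-terminal, middle, C-terminal)
    and per property group."""
    n = len(seq)
    parts = (seq[: n // 3], seq[n // 3 : 2 * n // 3], seq[2 * n // 3 :])
    return [distance_frequencies(p, g)
            for p in parts for g in ("basic", "hydrophobic", "other")]

freqs = encode("KAKLVRHAG")
```

The nine counters (three sections times three groups) would then be binned into fixed distance classes and concatenated with the composition features.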

Electrostatic hull
A further StBE is the electrostatic hull (EH). First, the structure is optimized using the Amber force field and standardized for further processing by employing the PDB2PQR v2.1.1 [7] command-line tool.
Afterwards, the solvent accessible surface (SAS) as well as the electrostatic potential (EP) are calculated by means of the APBS v1.5 [8] package. The coordinates of the EH are then computed based on the SAS. For the final feature vector, only those points of the hull are retained for which an EP has been determined in the previous step. Since the sequence length can vary, a cubic spline interpolation to the median length is conducted afterwards, utilizing the Interpol v1.3.1 package [9]. The general workflow as well as the core algorithm have been adapted from Löchel et al. (2018) [10].
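The length harmonization step can be sketched with SciPy standing in for the Interpol R package named above (a minimal sketch; the potential values below are dummies):

```python
import numpy as np
from scipy.interpolate import interp1d

def resample_cubic(vec, target_len):
    """Cubic-spline interpolation of a per-point potential vector to a
    fixed (e.g., median) length so all feature vectors share one dimension."""
    x_old = np.linspace(0.0, 1.0, len(vec))
    x_new = np.linspace(0.0, 1.0, target_len)
    return interp1d(x_old, vec, kind="cubic")(x_new)

resampled = resample_cubic([0.1, 0.4, 0.2, -0.3, 0.0, 0.5, 0.7], 5)
```

Since the spline passes through the data points, the endpoints of the potential profile are preserved while the interior is resampled.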

Fourier transform
The Fourier transform (FT) SeBE decomposes a continuous-valued input signal into its frequency domain, such that previously unknown patterns might become observable [2]. We leverage this circumstance by computing the discrete FT on ŝ_i. Specifically, a mapping a_k → â_k is obtained from the AAindex database [11]. Nagarajan et al. (2006) applied this encoding to predict antimicrobial activity [12].
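A sketch of the mapping and the discrete FT, assuming an illustrative AAindex-style table and keeping the magnitudes of the leading coefficients as features (both choices are assumptions, not the paper's exact setup):

```python
import numpy as np

# Illustrative AAindex-style mapping a_k -> â_k (placeholder values)
AAI = {"A": 0.62, "C": 0.29, "G": 0.48, "K": -1.5, "L": 1.06, "S": -0.18}

def fourier_encode(seq, n_coeff=4):
    """Discrete Fourier transform of the index-mapped sequence ŝ_i;
    the magnitudes of the leading coefficients serve as features."""
    signal = np.array([AAI[a] for a in seq])
    spectrum = np.fft.rfft(signal)
    return np.abs(spectrum[:n_coeff])

features = fourier_encode("ACKLGSAL")
```

The zeroth coefficient is simply the magnitude of the summed signal; the higher coefficients capture periodic patterns along the sequence.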

Five level di-peptide composition
The five-level di-peptide composition (FLDPC) SeBE [13] is based on five groups, i.e., the highest, high, medium, low, and lowest values of a specific amino acid index [11]. The assignment of an amino acid to a group occurs by employing the k-means clustering algorithm with k = 5. The final feature vector ŝ_i for an input sequence s_i is composed of the sums of the frequencies of all di-peptides mapped to the same group [13].

Five level grouped composition
Exactly like the FLDPC, the five-level grouped composition (FLGC) is based on the five groups obtained from a specific amino acid index [11]. In order to compose the final feature vector, the amino acid composition is calculated for each sequence s i and the frequencies of all amino acids from the same group are added up [13].
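The grouping step shared by FLDPC and FLGC, followed by the grouped composition, can be sketched with a plain 1-D k-means (Lloyd's algorithm) on the index values; the reduced alphabet and index values below are illustrative placeholders:

```python
import numpy as np

# Illustrative index over a reduced alphabet (placeholder values)
AAI = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
       "G": -0.4, "K": -3.9, "L": 3.8, "R": -4.5, "S": -0.8}

def five_groups(index, k=5, iters=50):
    """Cluster the index values into k groups with a plain 1-D k-means
    (Lloyd's algorithm); returns a residue -> group-id mapping."""
    aas = sorted(index)
    x = np.array([index[a] for a in aas])
    centers = np.quantile(x, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):          # skip empty clusters
                centers[j] = x[labels == j].mean()
    return dict(zip(aas, labels))

def flgc(seq, groups, k=5):
    """Five-level grouped composition: summed per-group frequencies."""
    vec = np.zeros(k)
    for a in seq:
        vec[groups[a]] += 1
    return vec / len(seq)

groups = five_groups(AAI)
vec = flgc("ACDKLS", groups)
```

The FLDPC works analogously, except that frequencies of di-peptides, rather than single residues, are summed per group pair.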

N-gram
The n-gram encoding [14] encodes sequences based on the singular value decomposition (SVD). There are several types of this encoding, depending on the grouping of a specific amino acid: the di-peptide or tri-peptide composition (A), the exchange (E), as well as the structural groups (S). The latter encompasses amino acids that have a tendency towards an internal, ambivalent, or external configuration in the three-dimensional conformation of a protein [14], and the E group refers to six amino acid groups, computed from point-accepted mutations (PAM) [14,15]. Furthermore, as the name suggests, two different sizes of n are considered: two (bi-gram) and three (tri-gram).
For the E and S groups, the preprocessing is conducted as follows. First, the cartesian product of the groups, e.g., E × E, is calculated. Next, each amino acid a_k ∈ s_i is mapped to its respective group, leading to â_k. Now we count the occurrences of a bi-gram (â_j, â_k) or tri-gram (â_j, â_k, â_l) and compute the total frequency with respect to the number of all possible combinations c_i [14].
The next step comprises the matrix factorization, hence the SVD, in the form of

X = T S* P.

In particular, X is the encoded dataset D̂_i of size n × m, where n is the number of features and m is the number of sequences. T is a matrix with the left singular vectors of size n × k, S* denotes a diagonal matrix of size k × k, and P refers to a matrix with the right singular vectors of size k × m [14]. Hereinafter, the SVD is employed as a feature reduction method, that is, the input feature space is reduced from an n-dimensional into a k-dimensional space. Thus, the transpose of P, hence P^T, is used subsequently as the final feature matrix [14]. For predicting unknown sequences X_u, the n-gram encoding requires retaining the matrices T and S*. In the case of a prediction, the former and the inverse of the latter are utilized to scale the unknown data X_u into the same feature space:

P_u = X_u^T T (S*)^{-1},

whereby P_u, the encoded matrix, has k columns as well as m_u rows, and X_u^T, the transpose of the non-classified input data, is of size m_u × n.
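The factorization and the projection of unseen data can be sketched with NumPy (random matrices stand in for actual encoded datasets):

```python
import numpy as np

def fit_svd(X, k):
    """Truncated SVD X ≈ T S* P of the n x m training matrix;
    T (n x k) and S* (k x k) are retained to project unseen data."""
    T, s, P = np.linalg.svd(X, full_matrices=False)
    return T[:, :k], np.diag(s[:k]), P[:k, :]

def project(X_u, T, S):
    """P_u = X_u^T · T · (S*)^{-1}: maps unseen sequences (columns of
    X_u) into the k-dimensional training feature space."""
    return X_u.T @ T @ np.linalg.inv(S)

rng = np.random.default_rng(0)
X = rng.random((6, 10))      # n = 6 features, m = 10 sequences
T, S, P = fit_svd(X, k=3)
P_u = project(X, T, S)       # projecting the training data recovers P^T
```

Projecting the training matrix itself recovers exactly the truncated P^T, which confirms that unseen data lands in the same k-dimensional space.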
Quantitative structure-activity relationship
The quantitative structure-activity relationship (QSAR) StBE, an encoding type relating molecular properties of the structure to a certain activity, e.g., antimicrobial efficiency, has been added [2]. In particular, we adopted the QSAR pipeline suggested by Haney et al. (2018) [16]. On each sequence s_i, a sliding-window approach is applied with a window size of k, if |s_i| ≥ k. For each window w_l, we construct a molecule from the respective structure section utilizing RDKit v2020.03.4 (http://www.rdkit.org/).
Note that if |s_i| < k, the complete sequence is used instead. In the present study, we set k = 20.
For each molecule, we used the Mordred v1.2.0 package to calculate all molecular descriptors [17]. For a comprehensive descriptor list, refer to the original publication [17]. If |s_i| < k, the descriptor vector v_l is used as the feature vector as is; otherwise, the column-wise average over all windows is used.
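The window handling and the column-wise averaging can be sketched as follows; a toy descriptor function stands in for the RDKit molecule construction and the Mordred descriptor calculation:

```python
import numpy as np

def toy_descriptors(window):
    """Stand-in for RDKit + Mordred: length plus crude basic/acidic counts."""
    return np.array([len(window),
                     sum(a in "KRH" for a in window),
                     sum(a in "DE" for a in window)], dtype=float)

def qsar_encode(seq, k=20):
    """Sliding-window descriptor averaging: sequences shorter than k use
    their descriptor vector as is; otherwise the column-wise average
    over all windows w_l is used."""
    if len(seq) < k:
        return toy_descriptors(seq)
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return np.mean([toy_descriptors(w) for w in windows], axis=0)

short = qsar_encode("KR")
averaged = qsar_encode("KRDE" * 6)
```

Regardless of the branch taken, the result is one descriptor vector of fixed length per sequence.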

Weighted amino acid composition
The weighted amino acid composition [13] SeBE weights the amino acid composition aac of an amino acid a_i in a sequence s_i with the accompanying amino acid index [11] f: a_i → â_i, that is, aac(a_i) · f(a_i).
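A minimal sketch of this weighting, assuming an illustrative index f (placeholder values, not a real AAindex entry):

```python
from collections import Counter

# Illustrative amino acid index f (placeholder values, not a real AAindex entry)
F = {"A": 0.62, "K": -1.5, "L": 1.06, "S": -0.18}

def weighted_aac(seq, index):
    """Weight the composition aac(a_i) of each amino acid by f(a_i)."""
    counts = Counter(seq)
    return {a: (counts[a] / len(seq)) * index[a] for a in index}

vec = weighted_aac("AAKL", F)
```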

Supplementary Note 3
The final report is composed of three sections, namely Home, Single dataset, and Multiple datasets, which fulfill specific analytical purposes. The first provides a general overview of the study and the data, the second introduces the results for specific datasets, and the third sheds light on the performance across multiple datasets. In general, all visualizations are interactive, hence supporting different mouse events (mouse-over, click, double-click, and scrolling). We used the streamlit v0.70.0 (https://www.streamlit.io/) framework to embed the graphics. Hereinafter, the respective visualizations are described in more detail.

Multiple datasets
The Multiple datasets section includes the Overview, Ranks, Clustering, Embedding, and Time visualizations, in order to investigate the performance across all datasets.
• Ranks. This figure visualizes the encoding performance as ranks across all datasets [18]. The encodings are grouped by SeBEs and StBEs. In addition, the datasets are sorted by imbalance.
The respective groups are visually separated by horizontal and vertical bars.
• Clustering. The result of the automated clustering is shown here. Encoding groups and datasets are arranged according to the hierarchical clustering, further highlighted by row and column dendrograms.
• Embedding. The t-SNE-based embedding of the sequences of the positive class. All datasets are arranged as a 3 × 4 scatter plot matrix. All sub-plots are sorted by cluster area in ascending order. In addition, the menu in the bottom-left allows displaying additional datasets with a larger cluster area.
• Time. The total computation time of all datasets is visualized as a bar plot (top). Moreover, the scatter plot compares the computation time with the dataset size (bottom-left) and a further scatter plot relates the mean sequence length with the overall computing time (bottom-right).

Single dataset
The Single dataset section reports more detailed information about specific datasets and the respective performance of all encodings. It includes the Overview, Metrics, Curves, Similarity, Diversity, Difference, Composition, Correlation, as well as the Time sub-sections.
• Correlation. The dataset correlation is visualized as a circular dendrogram, which aggregates more closely related datasets into the same branches. The relatedness is based on the adjusted RV coefficient.
Encodings from the same group are highlighted in the same color.
• Time. This section deals with the required execution time of every task. In particular, the scatter chart compares the median performance of the encoding groups with their respective calculation time.
The top-left corner shows the groups with the preferred properties, i.e., fast computation and high performance. Moreover, the scatter chart shows the execution time for each step on a log scale. Each meta node type is colored accordingly.

Supplementary Note 4
PseKRAAC encoding
This SeBE takes four parameters, leading to hundreds or even thousands (k in general) of encoded datasets. In order to reduce this vast amount of data, i.e., to find a representative subset of encoded datasets Θ ∩ {D̂_i1, . . . , D̂_ik}, the filtering is conducted as follows: first, the datasets are grouped according to their descriptor type, i.e., for each D̂_ij ∈ Θ of the same descriptor type, the datasets are interpolated to the same dimension using a one-dimensional linear interpolation. The Pearson correlation coefficient R is calculated on the vectorized matrices D̂_ix and D̂_iy, denoted as X and Y, respectively:

R = Σ_i (X_i − X̄)(Y_i − Ȳ) / √( Σ_i (X_i − X̄)² · Σ_i (Y_i − Ȳ)² ),

where n is the length of the vectorized matrices, X_i and Y_i are data points from the respective datasets X or Y, and X̄ as well as Ȳ are the respective means. Next, for each descriptor type and based on the Pearson correlation coefficients from the previous step, a distance matrix m is calculated. m is used as the pre-computed distance matrix for the successive t-distributed Stochastic Neighbor Embedding (t-SNE, default parameters) [20] in order to embed m into a two-dimensional space. Finally, the representative dataset for each of the 19 descriptor types (see Supplementary Table 4) is determined by computing the cluster center by means of the k-Medoids algorithm [21].
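The correlation-to-distance step feeding the pre-computed t-SNE can be sketched as follows (the tiny flattened datasets are toy examples):

```python
import numpy as np

def pearson_distance_matrix(datasets):
    """Pairwise distances 1 - R between vectorized (flattened) encoded
    datasets, usable as a pre-computed distance matrix for t-SNE."""
    vecs = [np.asarray(d, dtype=float).ravel() for d in datasets]
    n = len(vecs)
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            r = np.corrcoef(vecs[i], vecs[j])[0, 1]
            m[i, j] = m[j, i] = 1.0 - r
    return m

d1 = [[1.0, 2.0], [3.0, 4.0]]
d2 = [[2.0, 4.0], [6.0, 8.0]]   # perfectly correlated with d1
m = pearson_distance_matrix([d1, d2])
```

Using 1 − R maps perfectly correlated datasets to distance 0, so highly redundant encodings collapse onto the same region of the embedding.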

Amino acid index correlation
Some amino acid indices (AAI) are highly correlated [11]. Hence, let X be a matrix of size 20 × k, with 20 rows for the corresponding natural amino acids and k columns, one for each AAI. First, we computed the pairwise Pearson correlation coefficient for each column in X, i.e., for each AAI, using Equation 10. Next, we utilized principal component analysis (PCA) [22] to compute the first principal component, that is, to reduce X to a one-dimensional vector X̃. By regarding X̃ as distances, we observed that AAIs with a high separation after PCA also have a high correlation and, conversely, AAIs with a low separation after PCA also have a low correlation. Hence, we only keep those indices with a correlation close to 0.0, i.e., within ±0.3. Finally, only the encoded datasets based on AAIs with low correlation are used for the later benchmark.
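The reduction of X (20 × k) to one score per AAI via the first principal component can be sketched as follows (a random matrix stands in for the real index values):

```python
import numpy as np

def aai_pc_scores(X):
    """One score per AAI: PCA over the k index columns of X, treated as
    samples with 20 features each, keeping the first component only."""
    S = X.T                       # k x 20: one row per AAI
    Sc = S - S.mean(axis=0)       # center the features
    _, _, Vt = np.linalg.svd(Sc, full_matrices=False)
    return Sc @ Vt[0]             # length-k vector, regarded as distances

rng = np.random.default_rng(1)
scores = aai_pc_scores(rng.random((20, 5)))
```

Since the data are centered before projection, the scores sum to zero and their spread reflects the separation of the AAIs along the first component.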

Supplementary Figures
Supplementary Figure 1: Encoding group performance, sorted by biomedical domain and encoding type. Color coding corresponds to the maximum F1-score of a group. The x-axis is organized by sequence- and structure-based encodings. The y-axis is sorted by biomedical application. Groups are separated by gaps.
Supplementary Figure: The higher the diversity value, the higher the similarity (Phi correlation). Conversely, the lower the diversity, the higher the similarity (Disagreement measure). The graphic shows the example of the hiv ddi dataset.
Supplementary Figure 9: Pairwise predicted probabilities and class separation for two datasets. Predicted probabilities for the respective class labels are shown on the x- and y-axes. Encodings are selected with respect to their level of disagreement (div) and cluster quality, depicted as the Davies-Bouldin score (dbs, lower is better). The graphic shows the example of the ace vaxinpad (left) and the hiv ddi dataset (right) for sequence- vs. structure-based encodings (top) and all vs. all encodings (bottom).
Supplementary Figure 10: Pairwise predicted probabilities and class separation. Predicted probabilities for the respective class labels are shown on the x- and y-axes. Encodings are selected with respect to their level of disagreement (div) and cluster quality, depicted as the Davies-Bouldin score (dbs, lower is better). The graphic shows the example of the hiv ddi dataset for sequence- vs. structure-based encodings (top) and all vs. all encodings (bottom). Supplementary