PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer

Abstract Plasmids are mobile genetic elements that carry important accessory genes. Cataloging plasmids is a fundamental step in elucidating their roles in promoting horizontal gene transfer between bacteria. Next-generation sequencing (NGS) is the main source for discovering new plasmids today. However, NGS assembly programs tend to return fragmented contigs, making plasmid detection difficult. This problem is particularly acute for metagenomic assemblies, which contain short contigs of heterogeneous origins. Available tools for plasmid contig detection still suffer from limitations. In particular, alignment-based tools tend to miss diverged plasmids, while learning-based tools often have lower precision. In this work, we develop a plasmid detection tool, PLASMe, that capitalizes on the strengths of both alignment-based and learning-based methods. Closely related plasmids can be easily identified using the alignment component in PLASMe, while diverged plasmids can be predicted using order-specific Transformer models. By encoding plasmid sequences as a language defined on a protein cluster-based token set, the Transformer can learn the importance of proteins and their correlations through positional token embeddings and the attention mechanism. We compared PLASMe and other tools on detecting complete plasmids, plasmid contigs, and contigs assembled from CAMI2 simulated data. PLASMe achieved the highest F1-score. After validating PLASMe on data with known labels, we also tested it on real metagenomic and plasmidome data. Examination of some commonly used marker genes shows that PLASMe exhibits more reliable performance than other tools.


Evaluating similarity between plasmids and chromosomes
During the evolutionary process, plasmids and chromosomes undergo gene transfer, which results in shared similar regions between their sequences. We aligned plasmids with chromosomes using BLASTN and calculated query coverage and identity to demonstrate the gene transfer between them. A query may have several alignment regions S = {s_1, s_2, ..., s_n}, and a region s_i may be aligned to multiple subject sequences. Therefore, we keep only the result with the maximum bit score for each overlapping region. When calculating the coverage, we divide the total length of all alignment regions on the query by the length of the query, as shown in Eqn. 1. When calculating the identity, we compute the weighted identity of the different regions based on their lengths, as shown in Eqn. 2.

coverage = Σ_i length(s_i) / length(query)    (1)

identity = Σ_i identity_maxscore(s_i) × length(s_i) / Σ_i length(s_i)    (2)

Applying nt-BPE on the nucleotide sequences
The basic idea of applying BPE to biological sequences is to count the most frequent nucleotide combinations in the corpus. To build the vocabulary, BPE starts with single nucleotides (A, T, C, G) and gradually expands the vocabulary by adding DNA substrings with high occurrence frequency. BPE iterates this operation to synthesize longer tokens until the target vocabulary size is reached. During encoding with nt-BPE, we cut the plasmids into segments of length 1500 bp to limit the length of the encoded vectors. Then we use the trained BPE model to encode the segments into vectors of 350 tokens. For encoded vectors containing more than 350 tokens, we keep only the first 350 tokens, while for vectors containing fewer than 350 tokens, we pad with zeros until the length reaches 350. We input all the segments from the training database to train the model.
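The coverage and identity computation of Eqns. 1 and 2 can be sketched in Python. This is a minimal per-position illustration, not PLASMe's actual code: the tuple-based alignment format and the helper names are our assumptions.

```python
def merge_best_regions(regions):
    """Resolve overlapping query regions, keeping, for each query position,
    the alignment with the maximum bit score (hypothetical helper).
    regions: list of (qstart, qend, identity, bitscore), positions inclusive."""
    best = {}  # query position -> (bitscore, identity)
    for qstart, qend, ident, score in regions:
        for pos in range(qstart, qend + 1):
            if pos not in best or score > best[pos][0]:
                best[pos] = (score, ident)
    return best

def coverage_and_identity(regions, query_length):
    """Eqn. 1: coverage = total aligned length / query length.
    Eqn. 2: identity = length-weighted mean identity over aligned positions."""
    best = merge_best_regions(regions)
    aligned = len(best)
    if aligned == 0:
        return 0.0, 0.0
    coverage = aligned / query_length
    identity = sum(ident for _, ident in best.values()) / aligned
    return coverage, identity
```

For example, two alignments covering positions 1-100 (identity 90, bit score 200) and 51-150 (identity 80, bit score 100) of a 200 bp query give a coverage of 150/200 and an identity weighted by the 100 bp and 50 bp non-overlapping parts.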
For testing, because each contig will be cut into 1500bp segments, we take a majority vote on the results of all segments to predict whether the contig is a plasmid.
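The segmentation, padding/truncation, and majority-vote steps described above can be sketched as follows (a simplified illustration; the function names are ours and the BPE tokenizer itself is omitted):

```python
def segment_sequence(seq, seg_len=1500):
    """Cut a contig into fixed-length segments (the final partial segment is kept)."""
    return [seq[i:i + seg_len] for i in range(0, len(seq), seg_len)]

def pad_or_truncate(tokens, max_len=350, pad_id=0):
    """Keep only the first 350 tokens; pad shorter vectors with zeros."""
    return tokens[:max_len] + [pad_id] * max(0, max_len - len(tokens))

def majority_vote(segment_predictions):
    """Call the contig a plasmid if more than half of its segments are
    predicted as plasmid (1). Tie handling here is an assumption."""
    return sum(segment_predictions) * 2 > len(segment_predictions)
```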

Thresholds of PC tokens
To obtain the threshold of each PC token, we performed an all-against-all alignment of the proteins inside each PC and aligned all plasmid proteins against chromosome proteins. Then we set the thresholds on the identity. We denote the identities among proteins within PC_i as I_pi, and the identities between proteins in PC_i and chromosome proteins as I_ci. If min(I_pi) ≤ max(I_ci), we discard PC_i; this leaves 145,400 PCs as tokens of the Transformer, accounting for 96.2% of the total PCs. As shown in Eqn. 3, we set a threshold t for each PC.
As shown in Eqn. 3, when PC_i does not hit any chromosome protein, we set the threshold to min(min(I_pi), 80): some small PCs have very high internal protein similarity, with min(I_pi) close to 1, so capping the threshold in this case improves the sensitivity. When min(I_pi) ≥ max(I_ci), we set (min(I_pi) + max(I_ci))/2 as the threshold in order to balance sensitivity and precision. When min(I_pi) < max(I_ci), we use Youden's J index [1] to determine the threshold. This measure combines sensitivity and specificity and is defined as J = sensitivity + specificity − 1. We choose the threshold t that attains J_max = max_t (sensitivity(t) + specificity(t) − 1).
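The three-case threshold rule can be sketched in Python. This is an illustration of the logic as described in the text, not PLASMe's implementation; the Youden sweep over observed identity values is our assumption.

```python
def youden_threshold(plasmid_idents, chrom_idents):
    """Pick the identity cutoff t maximizing J = sensitivity(t) + specificity(t) - 1.
    Here a plasmid protein counts as a true positive if its identity >= t,
    and a chromosome protein as a true negative if its identity < t."""
    best_t, best_j = None, -1.0
    for t in sorted(set(plasmid_idents) | set(chrom_idents)):
        sens = sum(x >= t for x in plasmid_idents) / len(plasmid_idents)
        spec = sum(x < t for x in chrom_idents) / len(chrom_idents)
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t

def pc_threshold(plasmid_idents, chrom_idents):
    """Three-case rule for one PC (Eqn. 3 as described in the text)."""
    if not chrom_idents:  # PC_i does not hit any chromosome protein
        return min(min(plasmid_idents), 80)
    if min(plasmid_idents) >= max(chrom_idents):
        return (min(plasmid_idents) + max(chrom_idents)) / 2
    return youden_threshold(plasmid_idents, chrom_idents)
```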

Evaluation metrics
Following previous works, we calculate precision, recall, and F1-score as the performance metrics:

precision = TP / (TP + FP), recall = TP / (TP + FN), F1-score = 2 × precision × recall / (precision + recall),

where TP, FP, FN are the numbers of true positive, false positive, and false negative predictions, respectively.
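These metrics can be computed directly from the prediction counts (a trivial sketch; the zero-division guards are our addition):

```python
def precision_recall_f1(tp, fp, fn):
    """precision = TP/(TP+FP); recall = TP/(TP+FN);
    F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```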
Comparing the performance and computational costs between order-specific and unified models

Considering the high diversity of plasmids, we trained a model for each plasmid order. To show the advantages of this design, we trained a unified model using the PLSDB training data and compared its performance with the order-specific models on the PLSDB test set and the contig 1K/2K/3K test sets.
To train the unified model, we used all the training sequences from all orders to train a single model. The results are shown in Figure S1.

Figure S1: The performance of order-specific and unified models on different test sets. a) The PLSDB test set. b) Contig 1K test set. c) Contig 2K test set. d) Contig 3K test set.
As shown in Figure S1, the two pipelines yield similar results on the PLSDB/contig datasets, with the unified model about 0.1% lower in F1-score. However, a closer look shows that the two models' performance mainly differs on test sequences that tend to be misclassified. Specifically, we extracted the plasmid contigs that have low similarity with the training set (alignment E-value greater than 10) and the chromosome contigs that can be aligned with training plasmids with coverage/identity above 50%. From the contig 1K, contig 2K, and contig 3K test sets, we built the hard 1K, hard 2K, and hard 3K test sets accordingly. The performance of the order-specific and unified models on these sets is shown in Figure S2.

Figure S2: The performance of order-specific and unified models on the hard-case samples.
On the hard-case datasets, the order-specific models still perform better than the unified model. As a main use of learning-based plasmid detection is to recover the hard cases that tend to be missed by alignment-based methods, we chose the order-specific design. In addition to model performance, another consideration is the computational cost of the two methods. During training, the two methods have almost no difference in computational resource usage and time cost because the training data size is the same: the unified model uses all the data to train a single model, whereas the order-specific approach divides the data into 35 parts and trains one model per subset. On the PLSDB training set, training the order-specific models took 7h20min, while training the unified model took 7h14min. During testing, the two methods still do not differ much in memory usage, but there is a slight difference in running time. For example, when predicting the PLSDB test set, the Transformer in the unified model took 7.6s, while the order-specific models took 40.4s. The extra time comes from the overhead of switching models based on the predicted orders. Since the model is switched at most 34 times during prediction, the order-specific design takes at most 32.8 extra seconds regardless of the number of input sequences.

Comparing the performance on different orders
We used the PLSDB test set and fed the testing contigs to the corresponding order-specific models to evaluate the performance of PLASMe on different orders. The results are shown in Figure S3, which shows that PLASMe performs worse on some orders than on others. Specifically, the differences in precision are relatively small, while the differences in recall are larger. We therefore hypothesize that the sequence similarity between the test set and the training set accounts for this observation, with lower similarity leading to lower recall. To test this hypothesis, we used Dashing [2] to calculate the similarity between the test set and the training set for each order. We plotted precision and recall against average similarity, as shown in Figure S4, coloring only the orders with average similarity below 0.2 (other orders are shown in gray). According to the right panel, orders with lower average similarity tend to have lower recall.

Evaluating the performance in different stages
PLASMe identifies some plasmid contigs by alignment with a stringent threshold and then uses the Transformer to annotate the contigs that BLASTN cannot identify. We first analyzed the model's performance in both the alignment and Transformer stages. Then, we tested a modified PLASMe that uses the alignment only for taxonomic assignment and assigns predictions based solely on the order-specific Transformer models. We used the PLSDB, contig 1K, contig 2K, and contig 3K test sets to evaluate the performance. The results are shown in Figure S5.

Figure S5: The performance in different stages. Alternative: the performance of the alternative design that uses BLASTN to assign taxa and then classifies using the Transformer.
As shown in Figure S5, we examined the model's performance at different stages. We first use BLASTN to identify plasmids and then input the contigs with identity and coverage below 90% to the Transformer for classification. When evaluating the results of the different stages, we used all the sequences for the BLASTN stage, while for the Transformer stage we used the sequences remaining after the BLASTN prediction. On the PLSDB test set, BLASTN has higher precision but lower recall, which is expected, and the Transformer classifies the remaining sequences with a high F1-score. On the contig test sets, as contig length increases, BLASTN's precision remains almost the same while its recall decreases. This is also expected because the stringent alignment thresholds become increasingly difficult to meet for longer sequences. Additionally, longer sequences typically contain more proteins, leading to increased precision and recall for the Transformer.
Furthermore, we conducted a more detailed analysis of the alternative design "Alternative", which assigns taxa using alignment and then predicts plasmids using the Transformer. We observed that when the classification was performed solely by the Transformer, the performance was almost equivalent to that achieved using both alignment and the Transformer. This suggests that the Transformer can produce classification results consistent with alignment on highly similar sequences. To demonstrate that combining alignment with the Transformer gives better results than using the Transformer alone, we compared the performance on the CAMI datasets, which are more complex. The results are shown in Figure S6: combining alignment and the Transformer achieves better results.

The UpSet diagrams of the identified plasmids between different tools on the PLSDB test set
To further compare the results of the different tools on the PLSDB test set, we computed the intersections of their predictions, as shown in Figure S7.

Figure S7: The intersection of the identified plasmids between different tools on the PLSDB test set. We compared PLASMe with alignment-based, learning-based, and hybrid tools, respectively.
PLASMe exhibited higher sensitivity than alignment-based methods, thus we can observe more uniquely identified plasmids. As demonstrated in Figure S7 (b), most contigs are shared among PLASMe, Deeplasmid, and PlasmidVerify, whereas PlasForest identified fewer plasmids. For the comparison with learning-based methods, the largest set is the intersection of all the tools, containing 4598 contigs.
Comparing the performance of PLASMe and Deeplasmid on the PLSDB test set

We further analyzed the meaningful biological differences between PLASMe and Deeplasmid on the PLSDB test set. First, we plotted an UpSet diagram to compare their results, as shown in Figure S8. In Figure S8, we observe that PLASMe and Deeplasmid identified 6,723 and 6,352 contigs as putative plasmids, of which 6,118 contigs were identified by both tools. PLASMe and Deeplasmid identified 605 and 235 distinct contigs, respectively, of which 336 and 135 contigs are plasmids. Thus, the differences between the two tools' identified plasmid contigs mainly lie in these 336 and 135 contigs. We conducted a comparative analysis from three perspectives in Table S1. As shown in Table S1, PLASMe can identify a greater number of short plasmids. We then aligned the identified contigs to the training plasmid sequences and observed that some plasmid contigs uniquely identified by Deeplasmid cannot be aligned to our training data, which explains why these contigs are missed by PLASMe. Moreover, we analyzed the taxa of the uniquely identified plasmids. We found that PLASMe identified more contigs from Enterobacterales, while Deeplasmid identified more contigs from the "other" order, which combines small orders. This is expected because these orders are very small and thus lack sufficient training data for PLASMe. To further compare the biological differences, we also used MOB-suite to predict the relaxase types of the identified plasmids. We found that PLASMe could find more MOB_H, while Deeplasmid could find more MOB_P.
Considering that sequence length affects the performance of the tools, we compared their performance on the identified putative plasmids within different length ranges, as shown in Figure S9. The recall of PLASMe is higher than that of Deeplasmid at all lengths, and the difference is larger for shorter sequences. The precision of PLASMe is slightly lower, but overall PLASMe achieves a higher or similar F1-score at all lengths. Hence, PLASMe outperforms Deeplasmid on short contigs, while both tools exhibit comparable performance on long contigs.

Figure S9: The performance of PLASMe and Deeplasmid on inputs with different sequence lengths ranging from 1000 to 5000 bp, 5000 to 25,000 bp, and 25,000 to 350,000 bp.

5-fold cross-validation on the PLSDB dataset
To further evaluate the robustness of our approach to data partitioning, we performed 5-fold cross-validation on the PLSDB dataset. For each order, we divided the plasmids and chromosomes into five parts each and then performed the 5-fold cross-validation. The result is shown in Figure S10. The performance of PLASMe under 5-fold cross-validation is very stable, indicating that the model's performance is not affected by different splits of training/test data. The overall performance (precision: 0.981±0.005, recall: 0.969±0.002, F-score: 0.974±0.002) is slightly better than that on the PLSDB test set (precision: 0.970, recall: 0.965, F-score: 0.967), which suggests that the date-based data partition method does not lead to an easier test set for the model.

Comparing the performance of Transformer and LSTM on short contigs

Based on the design and applications of the Transformer in other fields, the attention mechanism gives it an essential advantage over LSTM in that the Transformer can better capture long-range contextual relations. Through our experiments, we found that the Transformer also outperforms LSTM on short sequences. To compare the two architectures on short sequences, we trained an LSTM model using the PLSDB training set and compared its performance with the Transformer on the short contig test sets. The LSTM model consists of one bidirectional LSTM layer followed by two fully connected layers with sizes of 64 and 1, respectively. The size of the word embedding is 512, and we set the hidden dimension of the LSTM unit to 2048. We trained the model using a learning rate of 0.001, a batch size of 512, binary cross-entropy loss, and the Adam optimizer. To construct the dataset, we merged the contig 1K, contig 2K, and contig 3K test sets. We only used BLASTN to assign the taxonomy of contigs, and then used the LSTM and the Transformer to predict the labels of query contigs. We plot the precision-recall curves and calculate the areas under the curves. The comparison result is shown in Figure S11.
We can see that Transformer outperforms LSTM slightly. No obvious advantage of using LSTM on short contigs is observed.

Comparing PLASMe and Recycler on the CAMI datasets
In addition to comparing PLASMe with learning-based and alignment-based tools, we also compare it with Recycler, a representative graph-based tool, on the CAMI2 datasets. Unlike PLASMe and other benchmarked tools that identify plasmids from provided contigs, graph-based tools assemble plasmids from a provided assembly graph. We used metaSPAdes to assemble the reads to obtain the assembly graph. Recall and precision were used to evaluate the results, as shown in equations 7 and 8.

Recall = (# of identified reference plasmids) / (# of total reference plasmids)    (7)

Precision = (# of correctly identified plasmid contigs) / (# of identified putative plasmids)    (8)

The recall is defined as the ratio of identified reference plasmids to the total number of plasmid genomes in a sample. To obtain the number of identified reference plasmids, we aligned the identified putative plasmids to the reference plasmids and calculated the fraction of reference plasmids that can be aligned with identity and coverage greater than 80%. Precision is defined as the ratio of correctly identified plasmid contigs to all returned plasmid contigs. The results are shown in Figure S12. PLASMe has higher recall and precision than Recycler on the CAMI datasets. As a graph-based approach, Recycler identifies putative plasmid sequences using circularity and coverage in the assembly graph, which tend to be affected by the sequencing coverage of the plasmids and chromosomes. When plasmids can only be partially assembled, the reliance on topological circularity can lead to low recall. On the other hand, although Recycler can find novel plasmids, inconsistent chromosome coverage in complex samples leads to false positives, as discussed in [3]. Alla et al. noted that graph-based tools are not well suited for metagenomic data because of the typically non-uniform chromosome coverage. In the original Recycler paper, performance worsens as the sample becomes more complex or the coverage decreases. The CAMI dataset contains many species, most of them with low coverage, which may explain Recycler's poor performance on the CAMI datasets.
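The recall computation of Eqn. 7 can be sketched as follows. The mapping-based input format and the function name are our assumptions; PLASMe's evaluation scripts may differ.

```python
def cami_recall(alignments, reference_plasmids, min_ident=80.0, min_cov=80.0):
    """Eqn. 7: fraction of reference plasmids recovered, where a reference
    counts as identified if some putative plasmid aligns to it with both
    identity and coverage above 80%.
    alignments: dict mapping reference id -> list of (identity, coverage) hits."""
    identified = {
        ref for ref, hits in alignments.items()
        if any(i > min_ident and c > min_cov for i, c in hits)
    }
    return len(identified) / len(reference_plasmids)
```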
The intersection between different tools on the real data

To compare the results of the different tools, we computed the intersections of their predictions, as shown in Fig. S13. To show the results more clearly, we divided the tools into two groups for analysis: a group of learning-based methods with higher recall and a group of reference-based and hybrid methods with higher precision. From Fig. S13 (a) and S13 (b), we can see that PLASMe shared the highest number of contigs with the other alignment-based and hybrid tools (60 and 64 in the metagenomic and plasmidome data, respectively). Based on the experimental results on the CAMI datasets, the other four learning-based tools have low precision and introduce more false positives. Although they have relatively large overlaps, as shown in Figure S13 (d), the overlap may contain many contigs from chromosomes.

Figure S13: Intersection of the identified plasmids between different tools on the real metagenomic and plasmidome datasets.

Running time analysis of different tools
To explain the significant differences in runtime among the tools, we briefly outline the steps involved in running each tool. cBar, PlasFlow, and PlasClass use k-mer frequencies to construct feature vectors for machine learning prediction; building a k-mer frequency vector is highly efficient. MOB-recon, PlasForest, and PlasmidFinder rely mainly on alignment to identify plasmids or to construct feature vectors; their runtime is mainly spent on running BLASTN in megablast mode, which is also fast. PlasmidVerify and Platon both use HMMER to check whether contigs contain the protein families in their databases, which runs more slowly and therefore leads to longer runtimes for these tools. Deeplasmid needs to derive multiple features from the sequences to construct its feature vector, including HMMER-generated protein domain alignment features. Additionally, it feeds the input to multiple models and averages the prediction probabilities to minimize the error. Both factors contribute to the long runtime of Deeplasmid.

Method of extracting important tokens

To evaluate the token importance, we extracted the attention distribution of each head by feeding the sequence into the trained model. The attention distribution is calculated and normalized as softmax(QK^T / √d_k), where Q is the query, K is the key, and d_k is the dimension of K. As shown in Fig. S14, the rows in the attention distribution represent the queries and the columns represent the keys. The larger the attention between a query-key pair, the stronger the association between them. To evaluate the importance of tokens in a contig, we summed the attention of all queries for each key; the key with the largest sum was treated as the most important token in the sequence. There are eight heads in the Transformer, and each head may learn a different attention distribution. We therefore determined the most important token for each head and measured a token's importance by dividing the number of times it was rated the most important token by the number of times it appears.

Here we choose the order containing the most plasmids, Enterobacterales, to interpret the Transformer. We inputted 16,432 Enterobacterales plasmids into the Enterobacterales model to evaluate the token importance. We only evaluated tokens that occurred at least 50 times.
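The per-head importance computation can be sketched in plain Python. This is a toy illustration on a raw score matrix; in practice Q and K come from the trained model, and this is not PLASMe's actual code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def most_important_token(scores, d_k, tokens):
    """For one head: attention = row-wise softmax(QK^T / sqrt(d_k)); then sum
    the attention of all queries for each key, and return the token whose key
    column has the largest sum.
    scores: the raw QK^T matrix (list of rows), one column per token."""
    attn = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    col_sums = [sum(row[j] for row in attn) for j in range(len(tokens))]
    return tokens[col_sums.index(max(col_sums))]
```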
The distribution of the average attention score of each token

To evaluate the significance of different tokens to the model, we also plotted the distribution of average attention scores. Some tokens have very low occurrence, meaning that they do not have much impact on the results. Thus, we selected the tokens occurring more than ten times and computed their average attention scores. In the end, we analyzed the attention score distribution of 17,495 tokens, as shown in Figure S15.

Figure S15: The histogram of the average attention scores of 17,495 tokens in Enterobacterales. X-axis: the average attention score. Y-axis: the number of tokens (log10 scale). We used a bin width of 0.005, so some bins contain no tokens.
The distribution aligns with our expectations, as certain PCs are known to have core functions in plasmids, resulting in high attention scores and a decisive effect on plasmid identification. However, the majority of tokens do not have high average attention scores and may occur frequently but only contribute to specific sequences, resulting in a lower average attention score.

Predicting protein functions of the unknown tokens
Although we used multiple databases and tools to annotate the functions of the top 50 important tokens, 27 tokens still could not be annotated with any function. To predict the functions of these unknown tokens, we used DeepFRI, which integrates features from protein sequence and structure. During prediction, we used AlphaFold to predict the structure of each protein and then input the predicted structure into DeepFRI. To ensure the reliability of the results, we only predicted functions for PCs whose proteins are longer than 150 amino acids, as shown in Table S2. We can see that PC_41646 is associated with iron binding. This protein may act as a virulence factor and enhance the pathogenicity of bacteria. For example, in bacteria whose hosts are eukaryotes, iron-binding proteins can help bacteria improve their ability to obtain iron from the environment, a critical factor for bacterial growth [4]. Meanwhile, PC_32278 and PC_15827 are both predicted to be transmembrane-related proteins. These two PCs are likely related to the horizontal gene transfer of plasmids between bacteria.
Annotating the high-similarity regions on the plasmids

To annotate regions on plasmids sharing high similarity with chromosomes, we aligned the complete bacterial and archaeal genomes in RefSeq to the plasmids in PLSDB. Based on the alignments, we defined a high-similarity region as an alignment longer than 300 bp with identity greater than 90%.
During testing, PLASMe parses the alignment results and reports in the output file the alignment length and the start and end positions of the query contigs aligned to high-similarity regions. Users can use these alignment results to determine whether the identified plasmids are likely to be false positives.
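The high-similarity filter (length > 300 bp and identity > 90%) can be sketched as follows; the dict-based alignment record format is a hypothetical stand-in for parsed BLASTN output.

```python
def high_similarity_regions(alignments, min_len=300, min_ident=90.0):
    """Keep alignments to chromosomes that define high-similarity regions:
    alignment length > 300 bp and identity > 90%.
    Each alignment is a dict with keys 'qstart', 'qend', 'length', 'identity'."""
    return [
        (a["qstart"], a["qend"], a["length"], a["identity"])
        for a in alignments
        if a["length"] > min_len and a["identity"] > min_ident
    ]
```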