PhaTYP: predicting the lifestyle for bacteriophages using BERT

Abstract Bacteriophages (or phages), which infect bacteria, have two distinct lifestyles: virulent and temperate. Predicting the lifestyle of phages helps decipher their interactions with their bacterial hosts, aiding phages’ applications in fields such as phage therapy. Because experimental methods for annotating the lifestyle of phages cannot keep pace with the fast accumulation of sequenced phages, computational method for predicting phages’ lifestyles has become an attractive alternative. Despite some promising results, computational lifestyle prediction remains difficult because of the limited known annotations and the sheer amount of sequenced phage contigs assembled from metagenomic data. In particular, most of the existing tools cannot precisely predict phages’ lifestyles for short contigs. In this work, we develop PhaTYP (Phage TYPe prediction tool) to improve the accuracy of lifestyle prediction on short contigs. We design two different training tasks, self-supervised and fine-tuning tasks, to overcome lifestyle prediction difficulties. We rigorously tested and compared PhaTYP with four state-of-the-art methods: DeePhage, PHACTS, PhagePred and BACPHLIP. The experimental results show that PhaTYP outperforms all these methods and achieves more stable performance on short contigs. In addition, we demonstrated the utility of PhaTYP for analyzing the phage lifestyle on human neonates’ gut data. This application shows that PhaTYP is a useful means for studying phages in metagenomic data and helps extend our understanding of microbial communities.


Introduction
Bacteriophages (aka phages) are viruses that infect bacteria.They are widely regarded as the most abundant and diverse entities in the biosphere [1] and play an essential role in various ecosystems [2,3].For example, by lysing the bacterial host, phages can regulate both the composition and function of the microbiome.With the in-depth study of phages, there is accumulating evidence revealing phages' significant impacts on various fields, such as dairy production [4,5], phage therapy [6,7], and disease diagnostics [8,9].
Phages' applications depend on annotations of their lifestyles.There are two types of phages based on their lifestyles: virulent phages and temperate phages.Virulent phages infect bacteria and kill their hosts to release their offsprings.But they don't integrate their genomes into the hosts [10].In contrast, temperate phages integrate their genomes into the host chromosome and copy their genomes together with the host [11].They will maintain this living state, which is also called prophage, until induced by appropriate conditions and enter the lytic cycle to kill their hosts [12,13].The lifestyle of the phages can directly affect their usages.For example, virulent phages are required in phage therapy to kill the antibiotic-resistant bacteria [14,15], while temperate phages can engineer a host's genome [16] and help regulate gene expression and change cell physiology by introducing novel functions [17,18].However, culturing and isolating phages in lab for identifying the lifestyles are usually expensive and time-consuming [19,20], especially for phages infecting anaerobes, such as Clostridioides difficile and Mycobacterium tuberculosis [21,22,23].Metagenomics allows sequencing of uncultured dark matter of the microbial biosphere, which can contain a large number of phages [24].Being able to annotate the lifestyles of phages sequenced from host-associated or natural environments is expected arXiv:2206.09693v1[q-bio.GN] 20 Jun 2022 PHATYP to extend our knowledge about phage composition and their interactions with other microbes.Thus, computational prediction of the phage lifestyles has become an attractive alternative to experimental methods.
There are two main challenges for computational prediction of the lifestyle of phages.First, the number of reference phages with known lifestyle annotations is very limited.According to the latest lifestyle annotation dataset provided by [25], there are 1,290 virulent phages and 577 temperate phages.However, the number of released phages in the RefSeq database is 4,517 in 2021, indicating that over a half of phages have no annotations.An even larger data source for phage is IMG/VR v3 database [26], which contains nearly 2 million uncultivated phage-like genomes in 2021.Because of the limited number of annotated genomes, most phages cannot be classified by sequence match-based methods.Second, as mobile genetic elements, phages usually mobilize host genetic material and incorporate it into their own genomes [27], leading to poor sequence assembly results and incomplete fragments for phages [25].Both the short length of the fragments and the ambiguous regions increase the difficulty of the lifestyle prediction task.

Related work
Given the importance of phages, numerous efforts have been made to computationally predict phages' taxonomic labels [28,29,30] and their hosts [31,32,33,34], and to identify phages from metagenomic data [35,36,36].In addition, there are a handful of tools for phage lifestyle prediction [19,37,25].One type of method uses maker genes to distinguish virulent and temperate phages.For example, integrase and excisionase are two widely accepted marker genes for identifying temperate phages [38].However, only a few genes can be used as marker genes, especially for virulent phages [19].In addition, the fragmented contigs from the metagenomic assembly may not cover such genes, and thus, using a small set of marker genes can lead to a low recall for lifestyle classification.
Instead of relying on handcrafted marker genes, learning-based methods are proposed aiming to automatically learn features from two types of phages' DNA and protein sequences.For example, PHACTS [19] trained a random forest model on protein similarities for virulent and temperate phage classification.BACPHLIP [37] trained another random forest classifier using a set of lysogeny-associated protein domains identified by HMMER3 [39].However, such strategies may not apply to metagenomic data.According to the benchmark results shown in [25], the accuracy of PHACTS decreases on short contigs.For example, PHACTS only achieves an accuracy of ~60% when the phage contigs are below 2kbp.Also, BACPHLIP is only designed for complete phage genomes according to the provided guidelines; it returns errors when short contigs are provided as inputs.Unlike these methods, PhagePred [40] and DeePhage [25] can identify the lifestyle for contigs assembled from metagenomic data.PhagePred calculated the distance between a query and a reference using a learned k-mer frequency-based Markov model.DeePhage [25], which has the best reported performance on lifestyle prediction, can predict contigs as short as 100bp.DeePhage distinguishes the lifestyle of phages by applying a convolutional neural network to learn the motif-related information from DNA sequences.Nevertheless, its best performance on the short contigs is only ~80%.

Overview
In this work, we present a method named PhaTYP to classify the lifestyles of phages.Previous works have shown that the marker genes and protein-protein associations are key features in phage classification and identification [41,28,29,30,34,35].The previous studies also showed that the protein composition and their associations play important roles on phages' lifestyle [42,38].Inspired by these studies, we represent phage contigs using a contextualized embedding model from natural language processing (NLP) for lifestyle classification.Specifically, we adopt Bidirectional Encoder Representations from Transformer (BERT) to learn the protein composition and associations from phage genomes.We evaluated our final model PhaTYP on contigs of different lengths and contigs assembled from real metagenomic data.The benchmark results against the state-of-the-art methods show that PhaTYP not only achieves the highest performance on complete genomes but also improves the accuracy on short contigs by over 10%.

Method
Transformer has emerged as a powerful general-purpose model architecture for representation learning.Because the multi-head mechanism implemented in Transformer can learn the association between tokens, Transformer can be used to generate the semantic representation of the sentences.It outperforms recurrent and convolutional neural networks in several NLP tasks.Inspired by the usage of Transformer in NLP, we propose PhaTYP, which is an eight-layer bidirectional Transformer block based on the original implementation described in [43].In NLP problems, words are the tokens in sentences.We make an analogy between words in NLP and proteins in phage genomes so that we can utilize Transformer to learn protein composition and associations from phage genomes.Although k-mers and motifs have been used as tokens for protein prediction task [44,45], using proteins as tokens can integrate their biological functions into the sentence representation.In addition, the converted sentence is much shorter by using protein-based tokens, making the training much faster with less parameters.Thus, we train PhaTYP on protein-based tokens to separate virulent and temperate phages.
To address the difficulties of classifying incomplete genomes with limited training data, we divide the lifestyle classification into two tasks: a self-supervised learning task (Fig. 1 A) and a fine-tuning task (Fig. 1 B).In the first task, to circumvent the problem that only a limited number of phages have lifestyle annotations, we applied self-supervised learning to learn protein association features from all the phage genomes using Masked Language Model (Masked LM), aiming to recover the original protein from the masked protein sentences.This task allows us to utilize all the phage genomes for training regardless of available lifestyle annotations.In the second task, we will fine-tune the Masked LM on phages with known lifestyle annotations for classification.To ensure that the model can handle short contigs, we apply data augmentation by generating fragments ranging from 100bp to 10,000bp for training.
In the following section, we will first describe how to convert DNA sequences into protein-based sentences.Then, we will briefly introduce the background of eight-layer bidirectional Transformer blocks.Because we adopt a very standard implementation of Transformer, we refer readers to [43] for detailed explanation of the model.Finally, we will show how we train PhaTYP on two different tasks: self-supervised training for Masked LM and the fine-tuning model for lifestyle classification.

Sequence embedding
In order to convert sequences into protein-based sentences, we will first introduce how we construct the token vocabulary.Each token in our model represents a protein cluster containing homologous protein sequences from phages.First, we downloaded all the phages proteins from the RefSeq database.We run an all-against-all similarity search using DIAMOND BLASTP [46] and generate a protein-similarity graph, where the nodes in the graph represent the protein, and the edges connect proteins with significant sequence similarities.The edge weights are the e-values returned from DIAMOND BLASTP.Then, we employ Markov clustering algorithm [47] to group proteins into clusters based on their similarity (e-value).This process resulted in 63,855 protein clusters for constructing the token vocabulary.
We then use the generated vocabulary to convert contigs into protein-based sentences.As shown in Fig. 2, first, Prodigal [48] is adopted for gene finding and translation.Then, DIAMOND BLASTP is applied to measure the similarity between these query proteins and the protein clusters.Each query protein is assigned with the token with the best alignment.Finally, the contigs can be converted into protein-based sentences as shown in Fig. 2. Because the length of the sentences can vary a lot, we follow the design of BERT [49] and choose 300 as the maximum length of the After converting the DNA sequences into protein-based sentences, we will project the 300-dimensional vector into a dense embedding matrix.We employ a learnable embedding layer, which is a neural network, to embed the sentence.There are two main purposes of using the learnable embedding layer.First, the learnable embedding layer will generate a low-dimensional embedding vector than using one-hot encoding as input, which can result in resulting in a R 300×63,855 matrix for each sentence.The deep learning model will suffer from the curse of dimensionality using such a sparse matrix as input [50].Second, As proven in [50], the embedding layer can learn to map associated tokens into similar embedding vectors, and thus assisting the learning process.In addition, as shown in Fig. 2, we embed the position information to represent the position of the token in the sentence, helping the model utilize the sequential information of the sentence.The equations of the embedding layers are listed in Eqn. 1.
E t ∈ R 300×1 represents the token ID sentence and E p ∈ R 300×1 represents the position index vector.W Et ∈ R 63,855×512 and W Ep ∈ R 300×512 are the learnable projection matrices.512 is the default embedding dimension suggested by [43].The function of Embed is a look-up table.Given an ID, it returns the corresponding embedding vector.Thus, the dimension of E t and E p are R 300×512 , which is much smaller than using the one-hot encoding matrix (R 300×63,855 ).We follow the design of BERT [49] and train the embedding layer with the whole model via the end-to-end mode to enlarge the receptive field.Finally, we will apply matrix addition on the two embedded matrices (defined as X in Eqn. 1) and feed it into the eight-layer Transformer structure.

Model structure
The detailed architecture of the Transformer model is shown in Fig. 3.The main function is the multi-head attention mechanism, which can extract association features between tokens [43].The feed-forward network consists of two fully connected layers with a ReLU activation, which is applied to each position separately and identically.To prevent overfitting, residual connections [51], and layer normalization [52] are employed before and after the feed-forward network.The input of the Transformer block is the embedding X in Eqn.After introducing the model structure, we will show how to train PhaTYP on Masked LM and fine-tuning tasks.These two tasks have different but related objective functions.In the following sections, we will first describe the datasets used in each task and then present a detailed explanation of the loss function.

Self-supervised training for Masked LM
Self-supervised learning is a kind of unsupervised learning aiming to learn a general feature expression for downstream tasks.In our design, the downstream task is the lifestyle classification, and the general feature is the protein organizations in phage genomes.Because the number of known lifestyle annotation of phages is far less than the number of known phages in the database, we employ a self-supervised learning method to learn as much prior knowledge as possible.Specifically, we use all the available phage genomes in RefSeq to train the Masked LM task.
Dataset We download all the phage genomes released before 2022 from the NCBI RefSeq database.In total, we have 3,474 phage genomes.To generate more data for training, we cut the complete genomes into fragments with different lengths, including 5kbp, 10kbp, 15kbp, and 20kbp.We randomly sample ten sub-strings from the genomes for each length.Thus, we have 142,434 phage contigs in our dataset.Finally, we will use the method introduced in section sequence embedding to generate protein sentences and embedding as inputs to Transformer.
Masked LM loss The inputs to our self-supervised model are masked sentences, and the aim is to predict the original sentences.For example, as shown in Fig. 1 A, we replace the embedding at the position of protein C with the [MASK] token.Then, we will feed the masked sentence into PhaTYP to predict the marked token.
As described in Section model structure, the shape of output Y will be the same as the input X, which is R 300×512 .Thus, each row i in Y can be interpreted as the latent features of the input token at position i in the sentence, which is a R 1×512 vector.Then, in order to predict the ID of the masked token, Y i is fed into an output layer with SoftMax function over the vocabulary.The loss function is presented in Eqn. 2. ,855 is the learnable projections.The value in Sof tM ax(Y i W d ) ∈ R 1×63,855 represents the prediction probability of each token in the vocabulary.In the training process, 5% words of each sentence will be masked randomly, and we will calculate the cross-entropy loss to update the parameters through backpropagation.

Fine-tuning the model for lifestyle classification
After pre-training the model via the self-supervised learning task, we will then fine-tune PhaTYP to the downstream task: lifestyle classification.

PHATYP
dataset There are two widely used datasets supplied by the previous studies [19,25].In total, we have 1,290 virulent phages and 577 temperate phages.In order to balance the dataset and improve the robustness on short contigs, data augmentation is applied by randomly generating short DNA fragments, ranging from 100bp to 20kbp, from the complete genomes.For each length, we generate 10,000 contigs for temperate and virulent phages, resulting in 160,000 phage contigs in the dataset.Then, all the fragments will be converted into protein sentences and fed into PhaTYP.
Fine-tuning loss Unlike the self-supervised learning task, we use the first row Y 0 ∈ R 1×512 from output Y as latent feature for the lifestyle classification.As shown in Fig. 1 B, we concatenate a fully connected layer to calculate the probability of the input being virulent or temperate.The loss function of the fine-tuning can be found in Eqn. 3 W d ∈ R 512×2 and b p ∈ R 1×2 are the learnable parameters.Then, we will calculate the loss between the real labels and the predictions and update the parameters in the model accordingly.

Model training
In the training process, we employ ten-fold cross-validation for both tasks.The model is trained with 4 GTX 2080Ti GPU units using the Adam optimizer, and we apply a learning rate of 0.001 to update the parameters.Because PhaTYP needs to be trained step-by-step on the two tasks, we first choose the model with the lowest cross-validation loss on self-supervised learning tasks.Then, we fine-tune this pre-trained model on the lifestyle classification task.Finally, we store the parameters of the model that achieve the best performance (AUCROC) in the cross-validation.Both the parameters and the model are available via http://github/KennthShang/PhaTYP/.

Result
In this section, we validate PhaTYPE on both simulated and real sequencing data.We compared PhaTYP against the state-of-the-art methods, including PHACTS [19], DeePhage [25], BACPHLIP [37], and PhagePred [40].Because the authors of BACPHLIP stated that incomplete/partially assembled genomes should not be used as input, we will only evaluate BACPHLIP on complete genomes.It is worth noting that PhagePred and DeePhage allow re-training using customized training data.Thus we re-trained these models with the same training set as PhaTYP for all experiments.PHACTS and BACPHLIP did not provide this function and thus, we used the provided model in all experiments.
Metrics To ensure consistency and a fair comparison, we use the same metrics as the previous works: sensitivity (T P/(T P + F N )), specificity (T N/(T N + F P )), and accuracy ((T P + T N )/(T P + F N + T N + F P )).T P represents the number of correctly classified temperate phages, while T N represents the number of correctly classified virulent phages.Thus, sensitivity and specificity can be interpreted as the recall of classifying temperate and virulent viruses, respectively.In addition, we will present the ROC curve to evaluate the tradeoffs between sensitivity and specificity.

Name Description RefSeq
All the phage genomes released before 2022 from the NCBI RefSeq database.Totally, we have 3,474 phage genomes.This dataset is only used to train the Masked LM task.
Table 1: Detailed information of the dataset used in the experiments.
Datasets As described in Section Self-supervised training for Masked LM and Fine-tuning the model for lifestyle classification, totally we have three datasets for training and validation.In addition, we applied PhaTYP to predict the lifestyle for phage contigs assembled from human neonates' guts.The detailed information of the datasets is listed in Table 1.

Classification performance comparison using ten-fold cross-validation
Performance on the complete genomes We applied ten-fold cross-validation on the combined lifestyle datasets listed in Table 1.When training, we apply the data augmentation method mentioned in section Fine-tuning the model for lifestyle classification on the training set.The phage sequences in the validation set remain complete.The ROC curves derived from the averaged ten folds evaluation from all the methods are shown in Fig. 4. We also show how the self-supervised learning task affects the learning performance by recording the results of training PhaTYP without the self-supervised learning task (without SSL task in Fig. 4).Predicting the lifestyle for complete phages genomes is a relatively easy task.PhaTYP, DeePhage, and BACPHLIP all achieved high accuracy.Nevertheless, the ROC curves in Fig. 4 show that PhaTYP has the best performance.Performance on the test set with low similiairty Cross validation cannot control the similarity between the training and test set.To evaluate whether the learning models can perform well on a harder test set, we constructed a test set containing phage genomes with low similarity with the training set.First, we applied Dashing [54] to estimate the similarity between phages in the lifestyle dataset.Then, we generated a test set by selecting 50 phages with the lowest similarities with their peers from virulent and temperate phages, respectively.In total, there are 100 phages in the test set.The maximum similarity of the virulent phages and temperate phages in the test set with all other phages are 0.049 and 0.052, respectively.The classification results are shown in Table 4 and Fig. 5. Compared to the cross-validation results in Table 2, the similarity of the test genomes affects specificity more than sensitivity, indicating that some virulent phages are misclassified as temperate.This could be caused by lower average similarity for the test virulent phages.In Fig. 5, the value of AUCROC decreases for all methods except BACPHLIP.This is likely caused by the overlap between the test phages and the data used for training BACPHLIP in its provided version.Nevertheless, PhaTYP still has the best results.

Metrics
Performance on the short contigs Considering that metagenomic assembly can produce incomplete phage sequences, it is important to evaluate PhaTYP on phage contigs.Towards this goal, we apply the same method used in [25] to construct the short fragment dataset.Specifically, we generate four groups of fragments with different length ranges, including 100-400bp, 400-800bp, 800bp-1,200bp, and 1,200-1,800bp.These four sets of fragments can cover the length of raw reads and the short contigs from the metagenomic data.We generated 80,000 fragments for each group by random sampling a sub-string from the complete genomes, with 40,000 for temperate and virulent phages, respectively.To remove the potential redundant fragments, we used Dashing [54] to estimate the similarity between fragments.Then, we removed fragments with similarities above 0.8 from the dataset.We applied ten-fold cross-validation to train and validate the performance.Because DeePhage trained four separate models on different length ranges, we followed the same design and trained four PhaTYP models separately.Table 5: Detailed results on the short contigs under the tools' default score cutoff (0.5).
We generate the ROC curve on short contigs to show the tradeoff between FP rate and sensitivity.In each fold, we combine the validation results from the four length ranges.The results are shown in Fig. 6, which shows that our model outperforms the other tools.More detailed results are shown in Table 5.In general, with the increase of the length, the performance of all methods increases.This is expected because longer fragments generally provide more information for prediction.In addition, compared to the results on complete genomes, the self-supervised learning task can help the model achieve higher accuracy on short contigs.A plausible explanation is that pre-training on the self-supervised learning task can help the model learn more generalized embeddings for phage proteins and prevent overfitting.Then, these embeddings with prior knowledge can be leveraged when information for classification is lacking.In addition, we tested whether training contigs with different lengths at the same time influences the performance of PhaTYP.Specifically, we employed the data augmentation methods, which combines the training set of the contigs and their complete genomes to train one PhaTYP model.The comparison between training PhaTYP separately and training with all data at once (PhaTYP augmentation) in Table 5 reveals that training with data augmentation will not affect the PHATYP classification performance.Thus, PhaTYP can predict lifestyles for contigs across different length ranges with one model, which helps save computational resources for users.

Predicting the lifestyle of crAssphages
CrAssphages are an extensive family of tailed bacteriophages discovered through the cross-assembly of human fecal metagenomes [55].Most of them infect Bacteroides and do not integrate into their hosts during replication.Thus, these crAssphages are widely regarded as virulent phages.Despite its ubiquity in human gut, over 80% of the predicted proteins in crAssphage genomes showed no significant similarity to annotated crAssphage protein sequences, hampering their identifications in newly sequenced metagenomic data [56].The low similarity also poses a hard case for current lifestyle classification tools.
In this experiment, we downloaded 33 crAssphages from [53], which are all annotated as virulent phages, and evaluated whether PhaTYP can correctly classify them.First, we remove all the training sequences with high sequence similarity (alignment identity > 50%) from our dataset using BLASTN [57].Then, we retrained PhaTYP, DeePhage, and PhagePred on the 'cleaned' dataset.The results of this experiment are presented in Fig. 7.All the 33 crAssphages can be classified correctly by PhaTYP, further demonstrating its utility.PhagePred ranked the second, showing much improved performance than in the previous experiments.One possible reason is that the k-mer frequency feature adopted by PhagePred is an important feature for crAssphages.

Case study: phage lifestyle analysis for infants
In this section, we apply PhaTYP to investigate early-life viral colonization using the metagenomic data sampled from infant meconium or early stools [53].In the original study, the authors used reference genomes and PHACTS to analyze the phages' lifestyles.Although the similarity search against the reference genomes can be used to annotate the lifestyles, the number of aligned phages is only a small subset of all assembled phage contigs.Then, the authors used PHACTS to predict lifestyle of all the phage contigs.However, according to the previous experiments, the performance of PHACTS is not ideal on short contigs.Thus, we employed PhaTYP to re-investigate this dataset.
The data is sequenced from 20 infants.The stool samples were collected longitudinally at 0-4 days after birth (meconium samples, month 0), month 1, and month 4. Therefore, we have a total of 60 metagenomic samples.In addition, the authors recorded detailed information, including the feeding and delivery type of these infants.Thus, we are able to investigate whether the ages, feeding, and delivery type can affect the composition of phages with different lifestyles.The 60 samples containing the raw reads are public and available via https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA524703; the assembled contigs are available via https://github.com/guanxiangliang/liang2019.
Following the guidelines of [53], we first applied Bowtie2 [58] for reads mapping and then ran BBmap [59] to calculate the reads per million total reads (RPM) for each contig.According to the description in [53], the species were called present if at least ten RPM from one sample aligned to that contig.Therefore, we have 19,943 contigs remained.Second, to ensure consistency, we used the same rules to define phage contigs as in [59]: (1) at least one phage protein per 10kb of the contig and (2) 50% of the predicted open-reading frames (ORFs) are phage ORFs.We used the Prodigal [48] in meta mode to predict the ORFs and aligned them to the phage protein database downloaded from RefSeq.Finally, 2,291 high-confidence phage-like contigs were identified and we ran PhaTYP for lifestyle prediction.Identifying temperate phages containing integrase Because there are no labels for the contigs from the metagenomic data, we rely on the marker genes for labeling the contigs.As we discussed in section Introduction, using marker genes usually has high specificity but low recall.Thus, the contigs labeled using this method have high confidence.
One of the most widely accepted markers, named integrase, is commonly used to identify temperate phages [60,61].Thus, we built an integrase protein database by searching the proteins with the keyword 'integrase' from the phage protein database.Then, we search all 2,291 phage contigs against the integrase database using BLASTX [57].23 contigs having homologous regions (identity > 90%, coverage > 90%, and e-value < 1e-10) were identified out of 2,291 contigs.All these 23 contigs were labeled as temperate phages, and we used them to evaluate PhaTYP and the state-of-the-art tools.
As shown in Fig. 8, both PhaTYP and DeePhage can predict correctly on all 23 contigs.On the other hand, the results show that using the integrase as a marker can only classify ~1% of the metagenomic phages, which remains a low recall for lifestyle prediction.Thus, methods that do not just rely on integrase proteins are in great need of comprehensive prediction.
Comprehensive lifestyle analysis Then, we analyzed all 2,291 phages' lifestyles using PhaTYP.There are three main variables that might affect the dominant lifestyle: the age of the infant, the delivery type, and the feeding type.
Ages First, we group the samples by age and draw violin plot to show the lifestyle distribution in Fig. 9 A. Y-axis represents the percentage of temperate phages in each sample: temperate/(temperate + virulent).We also record the p-value to show whether the two groups differ significantly.Fig. 9 A shows that as the baby grows, the percentage of temperate phages decreases, suggesting that more virulent phages colonize in the infants' gut with the infant's growth.Thus, we further investigate the lifestyle of newly colonized phages for each infant.Specifically, because we have three data samples representing month 0, month 1, and month 4, we can first identify new phage contigs in month 1 by aligning reads of month 0 to the contigs in month 1 via Bowtie2 [58].If a contig in month 1 has no read mapping outputs, this contig is regarded as a newly colonized phage after month 0. Finally, the lifestyle prediction results on these newly colonized phages are shown in Fig. 9 B. The trend and conclusion are the same as Fig. 9 A and the p-value became smaller, indicating that the newly colonized viruses are more likely to be virulent phages.
Our results are consistent with the main conclusions of the original study [53].During the early stage after birth, pioneer bacteria colonize the infant's gut.The prophages induced from these bacteria provide the predominant population.Then, as the infant grows, more phages and bacteria from the environment might reside within human guts too, leading to the change of the lifestyle composition.
Delivery type Because the phage lifestyle on different age groups shows significant differences, we further investigate whether the phage lifestyle composition is influenced by the delivery type.We first group the data by ages.Then, we draw the violin plots for three delivery types in each group.The results are shown in Fig. 10.In group month 0, there is a significant difference between the delivery type and phage lifestyle composition.With the increase of the age, the difference gradually becomes smaller, suggesting that delivery type can influence the initial phage colonization of the infants.According to the explanation in [53], the environmental contamination of different delivery types can affect the gut microbiome at month 0, which might also lead to the difference.
Feeding type Because we only have feeding types for infants at month 1 and month 4, we show the two groups' violin plots in Fig. 11.An interesting finding is that the feeding type does not show much difference at month 1.But with the growth of infants, the proportion of temperate phages in the infants with formula milk decreases more rapidly than those with the mixed feeding type.One possible reason is that breastfeeding can reduce the chance of infection by microbes [62,63,64] and thus, leading to relatively stable environments for the infants with access to breast milk.

Discussion
In this work, we propose a method named PhaTYP for phages' lifestyle prediction.As shown in the experiments, PhaTYP outperforms the available lifestyle prediction methods.We design two different tasks, a self-supervised learning task for learning protein association and a fine-tuning task for lifestyle prediction, to train the model so that PhaTYP achieves more stable performance on short contigs than other methods.In addition, we show an application of using PhaTYP to investigate the early viral colonization in human neonates.The prediction results of PhaTYP yielded several insights, most of which are consistent with recently published studies, suggesting that PhaTYP is a powerful means for studying and analyzing phage composition in metagenomic data.
Although PhaTYP has greatly improved phages' lifestyle prediction, we have several aims to optimize PhaTYP in our future work.One possible extension is to employ the knowledge distillation to reduce the parameter size of PhaTYP.
Because of the stacking of transformer blocks in PhaTYP, it requires large computational resources (four 24Gb GPU units) to accelerate the training process.Distilling the knowledge in a neural network can provide a light-version model with the same performance for users who wish to re-train the model.Another extension is to incorporate phage identification [35] in our software so that users can analyze phages' lifestyles in metagenomic data directly.
In conclusion, based on our tests on both simulated and real sequencing data, PhaTYP achieves the most accurate lifestyle prediction among available tools.PhaTYP can be applied to various metagenomic data for analyzing virus compositions.

Data Availability
All data and codes used for this study are available online or upon request to the authors.The source code of PhaTYP is available via https://github.com/KennthShang/PhaTYP.

Conflict of Interest
There is no competing interest.

Funding
City University of Hong Kong (Project 9678241 and 7005453) and the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA).

Figure 1 :
Figure 1: Two training tasks for PhaTYP using BERT.A: the self-supervised learning task.The input is the masked sentence and the output is the predicted token at the masked position.All phage genomes in RefSeq database to train a Mask LM model.B: the fine-tuning task for lifestyle prediction.The pre-trained model is fine-tuned using phages with known lifestyle annotations.The inputs of the model are protein-based sentences and the outputs are the probabilities of two lifestyle classes: virulent and temperate.

PHATYPFigure 2 :
Figure 2: Sequence embedding method in PhaTYP.The block in "Protein Sentence" represents the ID of the proteinbased token.PCx: a protein cluster x. [CLS]: start token.[SEP]: separation token.E [P Cx] : the embedded vector for protein cluster PCx."+": vector addition

Figure 3 :
Figure 3: The Transformer block in PhaTYP.There are three main units in the transformer block: feed-forward network, residual connections, and multi-head attention mechanism.

Figure 4 :
Figure 4: The ROC curve comparison on the complete phage genomes.The value shown in the legend is the AUCROC score."without SSL task": training PhaTYP without the self-supervised learning task.

Figure 5 :
Figure 5: The ROC curve comparison on low similarity test set.The value shown in the legend is AUCROC score."without SSL task": training PhaTYP without self-supervised learning task.

Figure 6 :
Figure 6: The ROC curve comparison on the short contigs.The value shown in the legend is AUCROC score."without SSL task": training PhaTYP without self-supervised learning task.PhaTYP has the best performance.

Figure 7 :
Figure 7: The classification results on the 33 virulent crAssphages.PhaTYP can correctly predict all the crAssphages as being virulent.

Figure 8 :
Figure 8: The classification results on the contigs that have homologous regions with integrase proteins.Both PhaTYP and DeePhage can correctly predict all the contigs as being temperate.

PHATYPFigure 9 :
Figure 9: Violin plot at different months.(A) Predictions on all 2,291 phages.(B) Predictions on newly colonized phages.Y-axis: the percentage of temperate phages in each sample.

Figure 10 :
Figure 10: Violin plot in different delivery type: C-Section with labor (CS(w/)L), C-Section without labor (CS(w/o)L), and spontaneous vaginal delivery (SVD).To control variables, we group the samples according to the ages.The Y-axis represents the percentage of temperate phages in each sample.

Figure 11 :
Figure 11: Violin plot in different feeding types.Formula: the infants were fed with formula milk.Mixed: infants were fed with both formula and breast milk.To control variables, we group the samples according to the ages.The Y-axis represents the percentage of temperate phages in each sample.

Table 2 :
Detailed results on the complete genomes under each tools' default score cutoffs (0.5).

Table 2 .
The results reveal that PhaTYP can identify temperate phages with higher specificity than other tools.The ablation study also shows that training with the self-supervised learning task can improve classification accuracy.

Table 3 :
The elapsed time to make predictions for the ten-fold cross-validation.All the methods are run on Intel ® Xeon ® Gold 6258R CPU and 2080Ti GPU.PHATYP we also recorded the running time of the five methods on the ten-fold cross-validation in Table3.PhaTYP is not the fastest tool because ∼85% running time is used to run DIAMOND BLASTP as described in Section Sequence embedding.DeePhage requires less running time because it only uses k-mer features, while other methods also incorporate alignment features for lifestyle prediction.

Table 4 :
The performance comparison on the low similarity test set under the tools' default score cutoff (0.5).