Bacteriophage classification for assembled contigs using graph convolutional network

Abstract Motivation Bacteriophages (aka phages), which mainly infect bacteria, play key roles in the biology of microbes. As the most abundant biological entities on the planet, the number of discovered phages is only the tip of the iceberg. Recently, many new phages have been revealed using high-throughput sequencing, particularly metagenomic sequencing. Compared to the fast accumulation of phage-like sequences, there is a serious lag in taxonomic classification of phages. High diversity, abundance and limited known phages pose great challenges for taxonomic analysis. In particular, alignment-based tools have difficulty in classifying fast accumulating contigs assembled from metagenomic data. Results In this work, we present a novel semi-supervised learning model, named PhaGCN, to conduct taxonomic classification for phage contigs. In this learning model, we construct a knowledge graph by combining the DNA sequence features learned by convolutional neural network and protein sequence similarity gained from gene-sharing network. Then we apply graph convolutional network to utilize both the labeled and unlabeled samples in training to enhance the learning ability. We tested PhaGCN on both simulated and real sequencing data. The results clearly show that our method competes favorably against available phage classification tools. Availability and implementation The source code of PhaGCN is available via: https://github.com/KennthShang/PhaGCN.


Introduction
Bacteriophages (or phages), which mainly infect bacteria, are among the most common and diverse biological entities in the biosphere [1]. They regulate the actions of the ecosystem through killing, metabolic reprogramming, or gene transfer [2,3]. As a major agent of horizontal gene transfer between bacteria, phages can change the virulence of bacteria and indirectly cause human diseases. There are active studies that use phages for applications such as phage therapy [4], disease diagnostics [5,6], and antimicrobial drug discovery [7].
Despite important functions of phages, our understanding of them is still very limited. Metagenomic sequencing, which allows us to obtain total genomic DNA directly from host-associated and environmental samples, has contributed significantly to new phage discovery [8,9,10,11]. In particular, metagenomic sequencing allows sequencing of uncultured dark matter of the microbial biosphere, which can contain a large amount of phages [12]. The advancements of high-throughput sequencing, assembly, and contig scaffolding have led to phage-like contigs or genomes from different types of samples. According to the RefSeq database supported by the National Center for Biotechnology information (NCBI) , the number of identified phages changed from 1,468 in 2015 to 3,852 in 2020 in the RefSeq database, which is more than twice of increase. Despite the increase, known phages is just the tip of the iceberg of the virome on the planet [13]. How to automatically and accurately mine phages and assign their taxonomic groups from vast amount of sequencing data remains a challenging problem.
There are two specific challenges for phage classification. First, the phages with known taxa are very limited. The International Committee on Taxonomy of Viruses (ICTV) is responsible for the official virus taxonomy and organizes viruses in order, family, subfamily, genera and species. Current ICTV classification procedures cannot catch up with new phage discovery. For example, one of the phage order named Caudovirales has 3,691 reference genomes. However, there are more than 1,800 new Caudovirales sequences found in 2020 that are unclassified into families. Limited labeled genomes pose challenges for both alignment-based and learning-based classification. Second, many phages in different taxa can share protein homologs, which adds ambiguity for alignment-based taxonomic classification. For example, more than 7,616 (~10%) proteins in all annotated phage proteins are shared by phages in different families under Caudovirales. In addition, more than 18,970 (~27%) pairs of highly similar proteins (E-value of BLASTP result < 10 −50 ) are encoded by phages in different families. Therefore, using homology search alone can return ambiguous classification.
In this work, we present a method that automates taxonomic classification for contigs, which are the outputs of assembly. Although taxonomic classification can be conducted on both reads and contigs [14], recombination in viruses can make read-level taxonomic classification difficult. In addition, more distinctive features can be derived from contigs and thus can lead to improved classification accuracy. Current metagenomic assembly tools, such as MEGAHIT [15], have been extensively tested and can produce quality contigs from complex datasets. Thus, our tool accepts contigs as input. In order to address the aforementioned challenges, we developed a semi-supervised learning framework that incorporated the automatically learned features for each contig via a CNN, the protein sequence similarity, and the gene-sharing features between contigs/genomes. Both the unlabeled and labeled sequences were utilized for training in a graph convolutional neural network (GCN). We will demonstrate that the features from the unlabeled sequences (contigs) improve the learning ability and accuracy for phage classification. Below we summarize related work for phage classification.

Related work
Many attempts have been made for phage taxonomic classification. They can be roughly divided into two groups: alignment-based [16,17,18] and learning-based [19,20,21,22]. Alignment-based methods utilize either nucleotidelevel or protein-level homology search between query contigs and reference genomes for assigning the taxon for the query. ClassiPhage [18] and Phage Orthologous Groups (POGs) [16] are two representative alignment-based phage classification tools. POGs extract taxon-specific marker genes and align query sequences against the marker genes using BLASTP. If there are statistical significant alignment for the contigs, the label of the best-aligned marker gene will be assigned to the contigs. ClassiPhage builds a profile Hidden Markov Model (pHMM) for each phage taxonomic group and apply HMM-based alignment for classification. There are two limitations with alignment-based method. First, as genes or proteins can be shared by different taxa, alignment-based method may lead to ambiguous label assignment or return a label with a higher rank using the lowest common ancestor (LCA) in the phylogenetic tree. Second, as phages are highly abundant and diverse, alignment-based methods are not able to assign taxa for new species that harbor novel proteins or lack quality alignments with the references. For example, under the Caudovirales order, 187,006 proteins are named as hypothetical proteins without known family labels. 13,382 proteins from phages released in 2020 do not have BLASTP results with the phages released before 2020. Thus, using only sequence similarity cannot provide ideal resolution.
There are a number of learning-based tools for microbe classification such as the Naïve Bayes classifier [23] and CNN [24]. They use either manually derived or automatically learned features to predict taxonomic labels for bacteria or RNA viruses. The most relevant learning-based tool to phage classification is vConTACT 2.0 [22], which applies a graph clustering algorithm to assign labels for unknown contigs. In order to leverage gene organization conservation for phage classification, vConTACT utilizes a clustering algorithm to construct a gene-sharing network [21,22]. If the reference genomes and contigs are in the same cluster, the labels of the reference genomes will be assigned to those contigs. While these gene-sharing network methods present satisfactory performance on classification of complete genomes, the classification accuracy decreases as the length of the contigs becomes shorter. The decreased performance stems from the fact that short contigs do not contain many proteins and thus do not lead to valid edges in the gene-sharing network. As a result, the clustering algorithms fail to group contigs and reference genomes in the same cluster. Then, no labels will be assigned to these contigs.
Given the enormous diversity of phages and the sheer amount of unlabeled phages, we formulate the phage classification problem as a semi-supervised learning problem. We choose the GCN as our learning model and combine the strength of both the alignment-based and the learning-based methods. First, we utilize DIAMOND-derived sequence similarities between contigs and references [25] to improve the edge construction process in the gene-sharing network. Second, to handle the situation that short contigs lack gene organization-related features, a CNN-based model is adopted to encode nucleotide information from the sequence. The GCN model allows us to utilize features from both labeled and unlabeled samples in training and thus lead to more accurate and sensitive phage classification. We compared our tool (named PhaGCN) with three state-of-the-art models specifically designed for phage classification: Phage Orthologous Groups (POG) [16], vConTACT 2.0 [22], and ClassiPhage [18]. The experimental results demonstrated that PhaGCN outperforms other popular methods. Figure 1: The pipeline of PhaGCN. A1: cut the contigs into 2kbp segments. A2: feature learning from the inputs using CNN. A3: construct nodes using encoded vectors. B1: contig translation using 6 reading frames. B2: filter short translations (12 amino acids). B3: align contigs against reference database using the DIAMOND BLASTP command. B4: choose the best translated frame for the BLASTP result. B5: use the BLASTP result to construct protein clusters. B6 and B7: define edges based on the sum of the E weight and P weight . C1: construct the knowledge graph for GCN.

Methods
Semi-supervised learning is a machine learning approach that combines a small amount of labeled data with a large amount of unlabeled data during training. The main purpose of using the unlabeled data is to utilize their conjunction information with the labeled data to improve the classification accuracy. Because the number of reference (labeled) phage genomes is small and new (unlabeled) phage contigs are increasing quickly, we formulate the phage classification problem as a semi-supervised learning problem.
One of the semi-supervised learning approaches, named GCN, is based on deep learning. The basic idea of GCN is to apply a convolutional layer on a graph to utilize the features on non-Euclidean structure [26]. The purpose of the graph convolutional layer is to automatically learn the topological features from the knowledge graph. Then, unlabeled samples/nodes can be represented as the weighted sum of their neighbor samples/nodes features. In biological data analyses, there exist many non-Euclidean structures such as protein topology graph on the supersecondary structure, gene-sharing network, and diseases-gene relationship graph. GCN is expected to render high classification performance by employing the structural data. For example, GCN has shown promising results in finding the relationship between long non-coding RNAs and diseases [27,28]. In phage classification, different phage genomes and contigs can share genes or proteins, which can be encoded in the graph of GCN. In addition, the nodes in GCN can embed automatically learned feature from nucleotide sequences. During the training, convolution is conducted for each node and its neighbors defined by the graph. The learned topological features will then be applied for classifying samples without labels.
The input to our GCN model is a knowledge graph. There are two key components in the knowledge graph: node encoding and edge construction. The node is a numerical vector learned from contigs using a CNN. The edge encodes features from both the sequence similarity and the organization of genes. Fig. 1 contains the major components for node and edge construction. To encode a sequence using a node, a pre-trained convolutional neural network (CNN) is adopted to capture features from the input DNA sequence (A1-A3). The CNN model is trained to convert proximate substrings into vectors of high similarity. The edge construction consists of several steps. We employ a greedy search algorithm to find the best BLASTP results (E-value less than 1e-5) between the 6-frame translations of the contigs and the database (B1-B4). Then the Markov clustering algorithm (MCL) is applied to generate protein clusters from the BLASTP result (B5) [22]. Based on the results of BLASTP (sequence similarity) and MCL (shared proteins), we define the edges between sequences (contigs and reference genomes) using two metrics: P weight and E weight (B6-B7). By combining the node's features and edges (C1), we construct the knowledge graph and feed it to the GCN to classify new phage contigs.

Using CNN to encode input sequences in the knowledge graph
CNN can automatically learn motif-related features for sequence classification [29,30]. Although CNN can be directly applied to phage classification, our experiments will show that using CNN alone cannot render the best classification performance. Thus, we only train the CNN for encoding input contigs.
As shown in Fig. 2, there are two slightly different network structures in the CNN for "train mode" and "encoding mode", respectively. In the train mode ( Fig. 2(A)), we use the reference database to train the CNN model. In the encoding mode ( Fig. 2(B)), the output of the first dense layer in the pre-trained CNN will be used to encode sequences into numerical vectors.
Train mode: Because the CNN model can only handle fixed length input, all the inputs will be cut into 2kbp segments with user-specified stride value (default 50). The segment has the same label as the underlying genome according to the ICTV taxonomic classification.
The CNN model contains three different parts: embedding layer, convolutional layer, and dense layer. The embedding layer is used to convert the DNA sequence into numerical inputs for convolution. There are two major methods for the embedding layer: one-hot embedding and skip-gram embedding [31]. As shown in our previous work of using CNN for classifying RNA viruses [24], the skip-gram based embedding can improve CNN's learning ability. Thus, in this work, we implemented a skip-gram embedding layer that can map proximate k-mers into highly similar vectors. We trained the embedding layer using k-mers and their neighboring (proximate) k-mers so that the embedding layer can learn their adjacent relationship. Specifically, in order to train the embedding layer, we use a 3-mer at position i as input and 3-mers located at i + j as output, where −m ≤ j ≤ m. m is the hyperparameter that can be specified for the skip-gram model. We employ 100 hidden units in the embedding layer to encode the 3-mers and the output of the embedded vector has 100 dimensions. Thus, each 2kbp segment is converted into embedded matrix M ∈ R 2,000×100 .
Then, the embedded matrix M will be fed into the convolutional layer. Eq. 1 is the convolution function. b is the bias term; d 1 and d 2 are the filter sizes. Since the embedded vector has 100 dimensions, defines a 2D window size of d 1 × d 2 of the embedded matrix M . ReLU is the activation function. The convolutional filters w conv contain n conv 2D matrices and w j conv is the j-th filter. We applied filters repeatedly to each possible window of the input embedded matrix to produce a feature map. Then the dense layer is applied to compress the features captured by the convolutional layer as shown in Eq. 2 and Eq. 3. First, max-pooling (Eq. 2) is applied to the feature map to capture the most useful information from the convolutional layer. Second, we use two dense layers with ReLU activation function (Eq. 3) to learn and compress the feature map. H (l) is the feature map in hidden layer l and w (l) is the weight parameters in the l-th hidden layer. Since we only has two dense layers, l ∈ {0, 1}. Finally, the SoftMax function (Eq. 4) is adopted to generate the prediction. As shown in Fig. 2, in the train mode, we employ . R1,R2,R3 are the reference genomes used for training. C1,C2,C3 are the contigs that need to be encoded. In train mode, sequences will be fed to CNN to update parameters during back propagation. In encoding mode, The pre-trained CNN will be used to encode the sequences into numerical vectors. Then these vectors will be adopted as node features in the knowledge graph.
CrossEntropyLoss to calculate the error between prediction and real label and backpropagate the loss to update the parameters in the model. The detailed parameters are listed in our Github repository.
Encoding mode: After training the CNN model, we utilize the pre-trained parameters to convert contigs into numerical vectors. The main difference in the encoding mode is that we only use the output of the first dense layer as the learned feature rather than using the SoftMax function for prediction. Eq. 5 shows the equation to convert x (an input 2kbp segment) into the output of the first dense layer. If a contig is cut into multiple segments of length 2kbp, we will conduct vector addition for all the segments' outputs and divide it by the number of segments. Thus, contigs of different lengths are always converted into vectors of the same size (determined by the units of the dense layer, default 512).

Construction of the edge in GCN
The edges connect nodes that are likely in the same taxonomic group. We define the edge by incorporating both the number of shared protein clusters and also the average protein similarity between two sequences. Intuitively, if two sequences share a large number of common protein clusters with high similarity, they tend to belong to the same taxa. In order to quantify the significance of two sequences sharing c common proteins, we first define protein clusters. A pair of proteins from two sequences is called a shared protein if they are in the same protein cluster.

Construction of the protein cluster
We follow the idea in [21,22] to construct protein clusters. We start by extracting proteins from all sequences. For the genomes in the reference database, proteins are downloaded from NCBI RefSeq. For the input contigs, DNA sequences are translated into amino acid sequences using 6 reading frames. We employ DIAMOND to conduct all-against-all pairwise alignment between contigs' 6-frame translations and proteins encoded by the genomes. If there are multiple alignments for different reading frames of a contig, only the best frame is kept. Then we create a weighted graph where the node is a protein sequence in the contig or genome and the edge represents an alignment with E-value less than a threshold. The edge weight is the E-value. Then protein clusters are subsequently identified using the Markov clustering algorithm (MCL). Finally, clusters that contain at least two proteins will be kept.

Definition of the edges
The edge is defined by computing two metrics: P weight and E weight . P weight is adopted to calculate the expected number of sequences sharing at least an observed number of common proteins (i.e. c proteins). Following vConTACT [21], by assuming that each of the n protein clusters has the same chance to be chosen, we compute the probability that any two sequences containing a and b protein clusters share at least c clusters in Eq. 6. Eq. 7 then computes the expected number of sequence pairs with at least c common proteins out of N 2 sequence pairs, where N is the number of sequences (contigs and reference genomes). With increase of c, P in Eq. 6 becomes small enough to return a positive P weight .
While P weight is used to evaluate whether two sequences share a significant number of common proteins, E weight is adopted to calculate the sequence similarity using alignments' E-values.
Edges in the knowledge graph: The final edge in the knowledge graph is defined based on the sum of P weight and E weight . An edge is defined when the sum is above a threshold τ , which is 1 by default (Eq. 9). It connects two sequences with enough common proteins of high similarity. Usually, as long contigs share more proteins with the reference genome database, P weight tends to big enough for creating edges between the long contigs and the knowledge graph. However, short contigs have fewer shared proteins and thus we use E weight to examine whether the shared proteins have significant similarities with references for creating an edge. If a contig has no edge connecting to the knowledge graph, PhaGCN will not output a prediction. Only contigs in the knowledge graph will be fed to the GCN for training and prediction.

The GCN model
After constructing the knowledge graph, we train a GCN to assign labels for all unlabeled contigs.
The basic concept of graph convolutional layer is shown in Eq. 10. Suppose we have N sequences (nodes) in the knowledge graph. G is the R N ×N adjacency matrix of the knowledge graph and I N is an R N ×N identity matrix. G is calculated withG = G + I N .D is the R N ×N diagonal matrix calculated with D ij = jG ij . H (l) is the feature map in the l-th hidden layer; H (0) is the node feature matrix; and W (l) is a matrix of weight parameters. After the graph convolutional layer, we apply a dense layer and use the SoftMax function to calculate the output matrix Out ∈ R N ×n label (Eq. 11). Because we have two graph convolutional layers and one dense layer in our model, l ∈ {0, 1}. The output dimension n label is decided by the number of classes in the database. As shown in Fig. 3, only the labeled samples will be used to calculate the loss in the training process. We adopt L2 loss to calculate the error between prediction and the labeled samples and back propagate the loss to update the weight parameters. After training the GCN model, we freeze the parameters and use the SoftMax value of the unlabeled samples to assign their labels in the test mode. The detailed parameters are listed in our Github repository. Both feature from labeled (green, red, and blue color) and unlabled samples (grey color) will be used in training and prediction. In train mode, only labeled samples will be utilize to calculate the loss and update the parameters. In test mode, the SoftMax function will be applied to generate prediction for unlabeled samples.

Data and performance metrics
We demonstrate the performance of PhaGCN on classifying contigs in families under Caudovirales, which is an order containing the majority of known phages from RefSeq (95.8% of total phage reference genomes). We downloaded the Caudovirales reference genomes from the NCBI RefSeq database. As shown in Table 1, there are 3,639 genomes from 8 different families. As the lower ranks contain few genomes in each group, we focus on family-level classification in the experiment.

Data and experiment design
PhaGCN was tested on both simulated and real sequencing data. For the simulated data, we applied two different methods to generate contigs with known labels: 1. randomly sample contigs from the reference genome; 2. simulate reads with ART-Illumina [32] and run MEGAHIT [15] to assemble contigs. After validating PhaGCN on simulated data with known ground truth, we downloaded two real sequencing datasets from NCBI SRA and evaluated PhaGCN on assembled contigs. As phages are highly abundant in marine environment samples [12], we tested PhaGCN on virus-like contigs from 71 metagenomic data sets that are sequenced from oyster. We recorded macro-accuracy, macro-recall, and macro-precision for each experiment (Eq.12, Eq. 13, and Eq. 14) when the ground truth can be derived. N class is the total number of classes. T P is the True positive, T N is the True negative, and F N is the false negative. Acc i is the accuarcy of class i. Except for CNN, each tool can output either a family label or no label at all (no prediction). For each class, if its positive samples have no predictions, they are counted as F N . If its negative samples have no predictions, they are counted as T N . As macro-average will compute each metric independently for each class and then take the average, these metrics treat all classes equally.
The main purpose of PhaGCN is to classify new phages that do not have reference genomes in the training data. When the training and testing data share common genomes, high accuracy may be attributed to memorization rather than learning. Thus, in all the experiments conducted using PhaGCN, we use genome-masking, meaning that the genomes in the testing data will be removed from the training data so that they do not share any genomes. The test contigs thus represent novel phages.
We compare our results with three representative and widely used pipelines: vConTACT 2.0, Phage Orthologous Groups (POGs), and ClassiPhage. In addition, as CNN itself can conduct classification, we also compared with the CNN model we trained for PhaGCN.

Experiments using simulated contigs
In this experiment, we randomly chose 20 genomes from each family as the testing species. The remaining genomes were used as the training set. Each contig is generated using a random start position and a random length within each range. For each of the length range, we generated 10 contigs. Thus we have 1,600 contigs for testing. Finally, we repeated this experiment for three times and recorded the average performance of the three experiments. In total, for each length range, 4,800 contigs from 480 genomes were tested. We also recorded the results of using complete genomes as the test data. To have a fair comparison, we applied the same method to construct the training set (or reference database) and the test set for vConTACT 2.0, POGs, and ClassiPhage. Fig. 4 shows that PhaGCN outperforms other state-of-the-art tools across different length range. With the increase of contig length, the performance of all pipelines increases. This is expected because longer contigs contain more proteins, which can lead to better classification performance. We also evaluated the classification performance of only using the CNN model in PhaGCN. Both the recall and precision of using GCN is better than using only CNN, showing that the knowledge graph enhances the learning ability of the model. In addition, the classification performance of PhaGCN is stable with the change of the contig length, making it useful for classifying short contigs. Although vConTACT 2.0 can assign the correct labels to most of the predicted contigs as shown in Fig. 5, it only generated predictions for a small number of the contigs in two families (Podoviridae and Siphoviridae). For families without any prediction, the precision is 0. Thus, the macro-precision of vConTACT is small.

Experiments using simulated reads
In this experiment, we downloaded all newly released Escherichia coli phages under Caudovirales in 2020 from NCBI RefSeq and used them as the testing species (a total of 99 species are downloaded). And we downloaded all Caudovirales phages released in 2019 from NCBI RefSeq and used them as the training set. Consequently, these newly released Escherichia coli phages can be treated as unknown phages for our model. Then we applied ART-Illumina to simulate reads from the testing sequences. The parameters used for generating reads are -p, -l 150, -ss HS25, -f 20, -m  200 and -s 10. The output contains 150bp paired-end reads simulated under HiSeq 2500. We mixed all the simulated reads in one dataset and run MEGAHIT to assemble them. In order to quantify the performance of different tools, we determine the correct label of contigs by aligning them against the test genomes using BLAST in glocal mode. Only contigs of length above 2kbp with taxon-specific alignment results and query coverage > 85% will be kept. As a result, a total of 301 contigs were used as input for comparison.
As shown in Fig. 6 and Fig. 7, PhaGCN outperforms other tools across different length, which is consistent with the conclusion in Section 3.2. We also find that when the length of the contigs becomes shorter ([2kbp, 4kbp]), PhaGCN can still achieve over 80% accuracy. Although the reads simulated by ART-Illumina contain sequencing error, the performance of PhaGCN still achieves high accuracy (100% for contigs over 8kbp). Because we only tested newly released Escherichia coli phages in 2020, the classification performance of all methods were slightly better than the results in Section 3.2.

Running time comparison
The most resource demanding component in PhaGCN is the sequence alignment. We used it to produce the protein clusters by conducting pairwise alignments between contigs and reference sequences. PhaGCN produces protein clusters for each set of input contigs by assuming that they may contain novel proteins. In addition, alignment is also conducted for defining the edge in the knowledge graph.    Table 4: Contigs assembled by MEGAHIT and their family labels derived from their alignments against the species in Table 3.
In this experiment, we searched for real sequencing data that contain Caudovirales at NCBI SRA and downloaded two datasets, SRR12949983 and SRR13132427. Then we used MEGAHIT to assemble reads into contigs on these two datasets separately. To quantify the performance of phage classification on these two datasets, we used the provided read-level taxonomic analysis by NCBI SRA as the ground truth. The phages in the two datasets provided by NCBI SRA are listed in Table 3. We used the same method introduced in Section 3.3 to label the contigs and removed all these genomes in Table 3 from the reference database before training PhaGCN. The contigs in the test data are listed in Table 4.
We compared the phage labels assigned by PhaGCN, vConTACT, POGs, and ClassiPhage on the assembled contigs. Because the real sequencing data might contain bacteria sequences, we run DeepVirFinder [33] to reject the contigs that belong to bacteria. As shown in Fig. 8, there is only one bacterial contig in each dataset. All other contigs (43 contigs) were fed into the four tools for classification. Fig. 8 reveals that PhaGCN achieves better classification performance than vConTACT, POGs, and ClassiPhage. PhaGCN has 100% accuracy in both datasets. Many contigs could not be classified by vConTACT 2.0 (unlabeled in Fig.  8). POGs and ClassiPhage are able to assign labels for all contigs but a number of them have wrong family labels.
In the dataset SRR121949983, there is one contig lacking ground truth because the contig cannot be aligned to any reference genome in Table 3. By extending the reference database to the Nucleotide Collection (nr/nt) database, BLAST shows that this contig belongs to a phage in Siphoviridae. We input this contig to the four tools and only PhaGCN assigned the correct label. It is worth noting that the reference genome is not in the training data of PhaGCN. The performance of alignment-based approaches heavily relies on the reference database while PhaGCN can learn the features for classifying new contigs.
Furthermore, we showed the composition of these two datasets before and after using PhaGCN in Fig. 9. The results revealed that our model can greatly improve the composition analysis for the dark matter. Thus, PhaGCN can benefit metagenomic analysis.

Phage classification in contigs produced from oyster metagenomic data
After validating PhaGCN on simulated datasets and two real sequencing datasets, we applied PhaGCN to contigs assembled from metagenomic data of oyster samples. The samples were collected by the co-author Dr. Jiang between April, 2016 to July, 2019 from various sites along the coast of South China Sea. Metagenomic sequencing was conducted from samples in the gill, visceral mass, and mantle tissues of oyster using viral-like particle enrichment and protocols in [34,35]. There are about 2.5 billions of raw reads from 71 libraries. Podoviridae Siphoviridae unclassified  vConTACT  0  0  0  0  0  50  116  102  22698  PhaGCN  150  1727  301  74  32  5880  2682  6173  5767   Table 5: Prediction results of PhaGCN and vConTACT for contigs produced from the oyster metagenomic data.

Ackermannviridae Autographiviridae Demerecviridae Drexlerviridae Herelleviridae Myoviridae
After applying standard quality control and MEGAHIT with the default setting, there are about 3,375,091 contigs of length above 500bp. After removing contigs that can be aligned to bacteria, archaea, eukaryota, we kept 22,966 contigs with length above 4,000bp as input to PhaGCN. Of them, 17,199 contigs can be assigned to Caudovirales by PhaGCN.
When users apply composition analysis for their samples, precision is important for generating valid hypothesis. Because vConTACT 2.0 has higher accuracy for what they can predict than ClassiPhage and POGs based on our experiments, we compared PhaGCN with vConTACT 2.0 in this experiment and summarized the results in Table 5.
Although we don't have the ground truth for this large-scale metagenomic sequencing data, the numbers of predicted contigs are consistent with the results of experiments on simulated and real sequencing datasets. The contigs vConTACT 2.0 can classify are significantly less than PhaGCN.~74.8% contigs are predicted by PhaGCN while~1.1% contigs are predicted by vConTACT 2.0. The contigs predicted by vConTACT is a subset of PhaGCN. We found that the clustering algorithm in vConTACT failed to group contigs and reference genomes in the same cluster. There exist many clusters containing only unlabeled contigs and thus, no labels will be assigned to these contigs. Also, PhaGCN can identify more families from the dataset. Because the classification accuracy of PhaGCN is more than 92% when the length of contigs is over 4kbp, the result can provide useful family-level composition analysis for the oyster metagenomic data.

Extension of PhaGCN
As Caudovirales is the order with the most number of sequenced phages from RefSeq, we validated PhaGCN on classifying families in this order. But PhaGCN can be conveniently extended to other taxa. We extended PhaGCN by adding families that contain at least 10 genomes. Using this criterion, three families (Rudiviridae, Microviridae, and Inoviridae) were added. We used the leave-one-genome-out method to choose testing genomes from these three families and use the same method introduced in Section 3.2 to generate contigs. The classification results in Fig. 10(A) show that PhaGCN can correctly classify almost all of the contigs from the extended families. Using knowledge graph enables PhaGCN to detect targeted phage families, which is useful for applications where only some phages are of interest. Specifically, phages that are not in the training families usually won't form edges with the nodes in the graph and thus will not be mis-classified into Caudovirales. We validated the detection ability of PhaGCN by testing whether PhaGCN can reject contigs that do not belong to Caudovirales. We downloaded 4,218 phage genomes that do not belong to Caudovirales from RefSeq according to the ICTV taxonomic affiliation information. For each genome, we apply the same method introduced in Section 3.2 to generate 10 contigs for each of them. Thus, a total of 42,180 contigs were tested. As shown in Fig. 10(B), only 3 of them are accepted (predicted) by PhaGCN. This experiment demonstrates that PhaGCN can be applied for targeted phage detection.

Discussion
As shown in the experiments, the performance of alignment-based approaches, such as POGs and BLAST, heavily relies on the reference database. The ambiguous hits or lack of reference genomes for highly divergent or novel phages can decrease the classification accuracy. Existing learning-based tools like vConTACT 2.0 cannot achieve good performance on short contigs. In this work, we demonstrate that PhaGCN can render better performance for novel phage classification. The major improvement of our method stems from combined strength of the reference-based model and the learning-based model using the knowledge graph: the nodes contain automatically learned features from nucleotide sequences and the edges are created by protein-based alignment. Then the semi-supervised GCN is applied on the knowledge graph to utilize both labeled and unlabeled data for training.
Although PhaGCN has greatly improved phage contig classification, we have several goals to optimize or extend PhaGCN in our future work. First, we simplified the edge weight computation by assuming that all protein clusters can be chosen with the same probability. We will investigate whether incorporating protein cluster size and the cluster's entropy in label distribution can render more accurate edges. Second, PhaGCN can reject non-relevant phages with high accuracy. Thus we will extend it to phage detection and compare it with DeepVirFinder. Third, although we have demonstrated that PhaGCN can be easily extended to more families, it is still hard to predict classes with only a few training samples (less than 10). We will incorporate relevant learning methods to improve the classification accuracy for small families. Finally, we will explore whether we can incorporate bacteria in our knowledge network for phage host detection. This can be used to further validate the classification results on the oyster metagenomic data.

Funding
This work was supported by Hong Kong RGC GRF 9042828 and HKIDS (9360163) and NSF of China (31972847).