Abstract

Motivation

Although the set of currently known viruses has been steadily expanding, only a tiny fraction of the Earth’s virome has been sequenced so far. Shotgun metagenomic sequencing provides an opportunity to reveal novel viruses but faces the computational challenge of identifying viral genomes that are often difficult to detect in metagenomic assemblies.

Results

We describe MetaviralSPAdes, a tool for identifying viral genomes in metagenomic assembly graphs that is based on analyzing variations in the coverage depth between viruses and bacterial chromosomes. We benchmarked MetaviralSPAdes on diverse metagenomic datasets, verified our predictions using a set of virus-specific Hidden Markov Models and demonstrated that it improves on the state-of-the-art viral identification pipelines.

Availability and implementation

MetaviralSPAdes includes ViralAssembly, ViralVerify and ViralComplete modules that are available as standalone packages: https://github.com/ablab/spades/tree/metaviral_publication, https://github.com/ablab/viralVerify/ and https://github.com/ablab/viralComplete/.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

In the last few years, metagenomic sequencing greatly expanded our knowledge of the Earth’s virome (Paez-Espino et al., 2016; Roux et al., 2016). However, since extracting complete sequences of viral genomes from metagenomic assemblies remains challenging, many viruses evade identification even though metagenomic datasets contain reads sampled from these viruses (Dutilh et al., 2014).

Previous studies, aimed at the discovery of novel viruses, often focused on viral contigs in metagenomic assemblies and thus missed an opportunity to sequence complete viral genomes by switching from the contig-based to the assembly graph-based analysis. Since a recent study (Roux et al., 2017) reported that MetaSPAdes (Nurk et al., 2017) resulted in the most contiguous viral assemblies, we extended MetaSPAdes into MetaviralSPAdes that attempts to sequence complete viral genomes rather than fragmented viral contigs.

Identifying viral genomes in metagenomic datasets is not unlike identifying plasmids, since both viruses and plasmids form small subgraphs of the metagenomic assembly graphs. However, in contrast to plasmid sequencing, for which multiple plasmid identification tools have been developed (Antipov et al., 2016, 2019; Pellow et al., 2020; Rozov et al., 2017), there is still no specialized viral assembler. MetaviralSPAdes modifies various steps of the MetaplasmidSPAdes tool (Antipov et al., 2019) to make it applicable to viral sequencing. Below we describe the MetaviralSPAdes pipeline and apply it to virus discovery in diverse metagenomic datasets.

2 Materials and methods

The MetaviralSPAdes pipeline consists of three independent steps: ViralAssembly for finding putative viral subgraphs in a metagenomic assembly graph and generating contigs in these graphs, ViralVerify for checking whether the resulting contigs have viral origin, and ViralComplete for checking whether these contigs represent complete viral genomes.

2.1 Assembling viral sequences (viralAssembly)

To assemble viral sequences, MetaviralSPAdes modifies approaches implemented in MetaSPAdes (Nurk et al., 2017) and MetaplasmidSPAdes (Antipov et al., 2019). First, it uses MetaSPAdes to construct the assembly graph. Since various viral strains are often highly variable (Shapiro and Putonti, 2018), and since we focus on species-level viral assembly, ViralAssembly modifies the bulge removal procedure as compared to MetaSPAdes (Nurk et al., 2017). Specifically, it collapses long and similar (with respect to the edit distance) parallel edges in the assembly graph that are shorter than maxBulgeSize (the default value 1000 nucleotides) and that differ from each other by less than maxDivergence (the default value 0.2). The divergence between two sequences is defined as the edit distance between them divided by the length of the shorter sequence.
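The divergence criterion above can be illustrated with a minimal sketch. It assumes that two parallel edges are given as plain nucleotide strings and that "shorter than maxBulgeSize" applies to the longer of the two edges; the function names are hypothetical and are not taken from the ViralAssembly code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def divergence(a: str, b: str) -> float:
    """Edit distance divided by the length of the shorter sequence."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    return edit_distance(a, b) / min(len(a), len(b))


def is_collapsible_bulge(a, b, max_bulge_size=1000, max_divergence=0.2):
    """True if both parallel edges are short enough and similar enough.

    Assumption: the size limit is checked against the longer edge.
    """
    return (max(len(a), len(b)) < max_bulge_size
            and divergence(a, b) < max_divergence)


# Two 10-nt edges that differ by one substitution.
print(divergence("ACGTACGTAC", "ACGTACCTAC"))  # 0.1
```

With the default thresholds, this pair would be collapsed, whereas two unrelated edges (divergence close to 1) would be kept as separate paths.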

Since the vast majority of plasmids are circular, MetaplasmidSPAdes is based on identifying high-coverage cycles in the assembly graph, i.e. cycles with coverage by reads exceeding the coverage of neighboring edges in the assembly graph. In contrast, since many viruses are linear [50% of DNA viruses in the RefSeq (O’Leary et al., 2016) database], MetaviralSPAdes searches for both high coverage cycles and high coverage paths that start from a vertex of in-degree 0 (source vertex) and end in a vertex of out-degree 0 (sink vertex) of the assembly graph. We classify such a path as long if its length exceeds a threshold Length (the default value 1000 nucleotides) and high coverage if its coverage exceeds a threshold Coverage (the default value 5×). Long high-coverage paths represent putative sequences of linear viruses.
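As a rough illustration of the path criterion, the sketch below enumerates simple source-to-sink paths in a toy directed graph and keeps those that are long and high coverage. Several details are assumptions made for the example rather than the ViralAssembly implementation: edges are (start, end, length, coverage) tuples, path length is the plain sum of edge lengths (ignoring k-mer overlaps), and path coverage is the length-weighted mean of edge coverages.

```python
from collections import defaultdict


def source_to_sink_paths(edges):
    """Yield simple paths (lists of edge indices) from sources to sinks.

    `edges` is a list of (start_vertex, end_vertex, length, coverage) tuples.
    """
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    vertices = set()
    for idx, (u, v, _length, _cov) in enumerate(edges):
        out_edges[u].append(idx)
        in_degree[v] += 1
        vertices.update((u, v))

    def extend(vertex, path, seen):
        outgoing = out_edges.get(vertex, [])
        if not outgoing:              # out-degree 0: a sink vertex
            yield list(path)
            return
        for idx in outgoing:
            nxt = edges[idx][1]
            if nxt in seen:           # keep paths simple
                continue
            path.append(idx)
            seen.add(nxt)
            yield from extend(nxt, path, seen)
            path.pop()
            seen.remove(nxt)

    # In-degree 0: a source vertex.
    for source in sorted(v for v in vertices if in_degree[v] == 0):
        yield from extend(source, [], {source})


def path_stats(edges, path):
    """Total length and length-weighted mean coverage (an assumption)."""
    total_length = sum(edges[i][2] for i in path)
    mean_cov = sum(edges[i][2] * edges[i][3] for i in path) / total_length
    return total_length, mean_cov


def putative_linear_viruses(edges, min_length=1000, min_coverage=5.0):
    """Paths that are both long (> Length) and high coverage (> Coverage)."""
    hits = []
    for path in source_to_sink_paths(edges):
        length, cov = path_stats(edges, path)
        if length > min_length and cov > min_coverage:
            hits.append(path)
    return hits
```

In a toy graph with one 1400-nt path at roughly 10× coverage, one 400-nt path and one 2000-nt path at 2× coverage, only the first path would be reported.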

Linear DNA viruses that have terminal repeats (Casjens and Gilcrease, 2009; Deng et al., 2012) form small subgraphs (rather than isolated paths) in the assembly graphs. Sequences of 377 out of 2584 linear DNA viruses in the RefSeq database have terminal repeats with a length exceeding the typical length of k-mers used for constructing assembly graphs (the default value k = 55 for MetaSPAdes). One hundred and sixty-eight of these viruses can be represented as sequences ARBR̄ or RAR̄, where R is a terminal repeat of length >55 bp and R̄ is its reverse complement (Supplementary Fig. S1).

To identify linear viruses with terminal repeats, MetaviralSPAdes considers small connected components (i.e. components with at most five edges) of the assembly graph. We refer to a path in a graph as a post-man tour if it visits each edge of the graph and the total length of its unique edges (i.e. edges that are visited just once) exceeds half of the total path length. Since there exists a unique post-man tour in the components of these two types, ViralAssembly outputs the complete sequence (instead of the corresponding subgraph) in such components.
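The post-man tour condition can be checked with a few lines of code. The sketch below is only an illustration under stated assumptions: a tour is given as a list of edge identifiers, a repeat and its reverse complement share one identifier (they correspond to the same double-stranded edge), and the caller has already verified that consecutive edges are adjacent in the graph. The function name is hypothetical.

```python
from collections import Counter


def is_postman_tour(tour, edge_lengths):
    """Check the two conditions from the definition above.

    `tour` is a list of edge identifiers in traversal order and
    `edge_lengths` maps every edge of the component to its length.
    """
    visits = Counter(tour)
    if set(visits) != set(edge_lengths):   # must visit each edge of the graph
        return False
    total_length = sum(edge_lengths[e] for e in tour)
    unique_length = sum(edge_lengths[e] for e, n in visits.items() if n == 1)
    return unique_length > total_length / 2


# A 5000-nt unique segment A flanked by a 100-nt repeat R traversed twice.
print(is_postman_tour(["R", "A", "R"], {"R": 100, "A": 5000}))  # True
```

A component in which the repeat dominates the tour length (for example a 3000-nt repeat around a 1000-nt unique segment) fails the second condition.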

To give users an option to examine both complete viral sequences (identified based on analyzing small subgraphs of the MetaSPAdes assembly graphs) and partial viral sequences (corresponding to MetaSPAdes contigs), the ViralAssembly output is combined with the regular MetaSPAdes output.

2.2 Identifying viral contigs (viralVerify)

The ViralVerify module checks whether contigs found by ViralAssembly indeed represent viruses. Popular virus identification tools, such as the HMM-based VirSorter (Roux et al., 2015) and the k-mer-based VirFinder (Ren et al., 2017), have limitations: they sometimes confuse phages with plasmids and are rather conservative, thus missing putative novel viruses. We thus developed the ViralVerify tool, which examines the gene content of a contig and classifies it as viral/bacterial/uncertain using a Naive Bayesian classifier. It can be used as a standalone tool to predict contigs of viral origin in any assembled metagenome.

The ViralVerify step in MetaviralSPAdes is designed similarly to the PlasmidVerify step in MetaplasmidSPAdes (Antipov et al., 2019). To construct a set of viral HMMs, we selected all 10 544 viruses from the RefSeq database and split them into the training and validation datasets (7381 and 3163 viruses, respectively). We predicted genes with Prodigal v2.6.3 (Hyatt et al., 2010) and ran hmmsearch (part of HMMER 3.1b2, http://hmmer.org/) for all HMMs from the Pfam-A database v. 30.0 (El-Gebali et al., 2018). Afterwards, we counted the frequencies of matches to the training dataset and used them to train a Naive Bayesian classifier (Friedman et al., 2001), along with frequencies from the non-ViralDatabase (the combined PlasmidDatabase and non-PlasmidDatabase from Antipov et al., 2019). Supplementary Table S1 lists the HMM frequencies in the training dataset.

Given a contig, ViralVerify predicts genes in this contig using Prodigal in the metagenomic mode, runs hmmsearch on the predicted proteins and calculates the score as the ratio of log probabilities. If the absolute value of the score is less than scoreThreshold (the default value is 3), the contig is classified as ‘uncertain’; otherwise it is classified as ‘viral’ (score > scoreThreshold) or ‘bacterial’ (score < −scoreThreshold).
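The three-way decision rule can be written down directly. This is a minimal sketch, not the ViralVerify code: it assumes that the score has already been computed, that the ‘bacterial’ label applies to scores below the negated threshold (which is what the absolute-value rule implies), and that a score exactly on the boundary is assigned to the confident class, a case the text leaves open.

```python
def classify_contig(score: float, score_threshold: float = 3.0) -> str:
    """Map a classifier score to 'viral', 'bacterial' or 'uncertain'."""
    if abs(score) < score_threshold:
        return "uncertain"
    # Boundary scores (|score| == threshold) fall through to a confident
    # label here; this tie-breaking choice is an assumption.
    return "viral" if score > 0 else "bacterial"


for s in (7.2, 1.5, -0.4, -12.0):
    print(s, classify_contig(s))
```

Raising the threshold moves more contigs into the ‘uncertain’ category, which trades recall of confident calls for fewer misclassifications.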

To help analyze the rapidly growing amount of novel data, we have added a script that allows users to construct their own training database from a set of viral, chromosomal and plasmid contigs, as well as a custom HMM database.

2.3 Viral completeness verification (viralComplete)

If a newly constructed viral contig is complete and belongs to a known family of viruses, then its gene content is likely to be similar to the gene content of a known virus. We thus compute the ‘similarity’ of a given contig (based on the Naive Bayesian classifier) to each known virus from the RefSeq database, and check whether the most similar known virus has a length similar to the viral contig length. This comparison includes the following steps:

  • Predict genes and proteins in a given contig using Prodigal.

  • Match each predicted protein P against all N viral proteins from the RefSeq database using BLAST (with e-value cutoff = 1e-6) and define number(P) as the number of viral proteins matching P. We say that a virus V matches a protein P if one of the proteins in this virus matches P.

  • If a virus V matches a protein P, we define Prob(V|P) = 1/number(P) − ϵ (the default value ϵ = 1e-6). If a virus V does not match P, we define Prob(V|P) = ϵ·number(P)/(N − number(P)). Thus, each virus that matches P is assigned the same (large) probability Prob(V|P) and each virus that does not match P is assigned the same small probability.

  • If a given contig has predicted proteins P1, P2, …, Pk, we assume that they all are pairwise conditionally independent and define Prob(V|P1, P2, …, Pk) as the product of Prob(V|Pi) over i = 1, …, k. A most probable virus V* is defined as a virus maximizing this probability.

  • Check whether the given contig and the virus V* have similar lengths, i.e. if the length of V* falls in the range (0.9·length(contig),length(contig)/0.9).
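The scoring steps above can be sketched as follows. The sketch starts from already computed BLAST results and makes several assumptions that are not stated in the text: each predicted protein is summarized as a pair (set of matching viruses, number(P)), the matching probability is taken as 1/number(P) minus ϵ and the non-matching probability as ϵ·number(P)/(N minus number(P)) so that the two cases sum to one, proteins without any hit are skipped, and probabilities are multiplied in log space to avoid underflow. All function names are hypothetical.

```python
import math


def prob_virus_given_protein(matches, number_p, n_total, eps=1e-6):
    """Prob(V|P) for a virus that does or does not match protein P."""
    if matches:
        return 1.0 / number_p - eps
    return eps * number_p / (n_total - number_p)


def most_probable_virus(protein_hits, n_total, eps=1e-6):
    """Return the virus maximizing the product of Prob(V|Pi), or None.

    `protein_hits` holds one (matching_viruses, number_p) pair per protein.
    """
    # Assumption: proteins with no viral hits carry no information.
    informative = [(vs, n) for vs, n in protein_hits if n > 0]
    candidates = set().union(*(vs for vs, _ in informative))
    best, best_log_prob = None, -math.inf
    for virus in sorted(candidates):
        log_prob = sum(
            math.log(prob_virus_given_protein(virus in vs, n, n_total, eps))
            for vs, n in informative
        )
        if log_prob > best_log_prob:
            best, best_log_prob = virus, log_prob
    return best


def has_similar_length(contig_length, virus_length, ratio=0.9):
    """True if the virus length lies in (0.9 * contig, contig / 0.9)."""
    return ratio * contig_length < virus_length < contig_length / ratio
```

For example, with three proteins whose hits are ({"phiX", "T4"}, 2), ({"T4"}, 1) and ({"T4", "lambda"}, 3), the virus labeled "T4" matches every protein and is selected; a 168 000-nt contig would then be called complete against a 168 903-nt reference, while a 50 000-nt contig would not. The virus labels and lengths here are illustrative inputs only.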

3 Results

3.1 Datasets

We used both simulated metagenomes and real metagenomes/metaviromes to benchmark MetaviralSPAdes:

3.1.1 Simulated metagenomes

We simulated five metagenomic datasets using CAMISIM (Fritz et al., 2019). For each metagenome, 15 bacterial and 15 viral genomes were drawn from the test datasets. The abundance distribution followed the log-normal distribution with μ = 1 and σ=0.5. The total abundance of the viral genomes was set to be 10 times higher than the abundance of the microbial genomes, to model high abundances of viruses in real metagenomic datasets (Supplementary Table S2).

3.1.2 Real metagenomes

We selected 18 diverse metagenomic datasets described in Supplementary Table S3 to benchmark MetaviralSPAdes. Two out of these eighteen datasets represent metavirome datasets originating from marine samples that were size-selected for viruses. Additionally, we used sequences of known origin from the RefSeq database for benchmarking ViralVerify and ViralComplete. As the true negative test datasets, we used the PlasmidDatabase dataset (2387 plasmids) and 9890 randomly selected fragments from non-PlasmidContigs dataset (80 681 chromosome fragments) described in Antipov et al. (2019). Since VirSorter and VirFinder are designed for DNA viruses, for a fair comparison, we selected only double-stranded DNA viruses (total 1368) from the ViralVerify validation dataset as the true positive test dataset. Additionally, we trained and benchmarked ViralVerify on small RNA viruses (Supplementary Table S7). The same true positive test dataset was used to benchmark ViralComplete.

3.2 ViralAssembly benchmarking

Since there are still no specialized assembly tools that identify viral genomes in metagenomic datasets, we compared ViralAssembly against MetaSPAdes on 18 real datasets described in Supplementary Table S3. We analyzed only complete (i.e. circular contigs or linear contigs starting in sources and ending in sinks) and high-coverage (>5×) sequences for benchmarking (ViralAssembly and MetaSPAdes report the same set of partial contigs). We used three different contig length cutoffs (0.5 kb, 3 kb and 10 kb) and checked the viral origin of the contigs using ViralVerify. Since the contig number may reflect an increase in the number of fragmented contigs (rather than complete viruses), we also checked the completeness of the predicted viral contigs using ViralComplete. ViralAssembly outperformed MetaSPAdes in the number of assembled viral contigs on 12 out of 18 real datasets (Supplementary Table S6).

3.3 viralVerify benchmarking

We benchmarked ViralVerify, VirSorter (Roux et al., 2015), VirFinder (Ren et al., 2017) and VirMine (Garretto et al., 2019) on 1368 dsDNA viruses from the RefSeq database as the true positive dataset, and on two true negative datasets: plasmids from the RefSeq database and a set of 10 kb-long randomly selected fragments of bacterial chromosomes, chosen to mimic the real output of a metagenomic assembly (Table 1). Researchers are usually interested in viruses distantly related to known ones, or in the contigs of unknown origin, referred to as the ‘dark matter’ of a metagenome. We thus took into account contigs that cannot be attributed with certainty to viruses or bacterial chromosomes but deserve manual inspection (category 3 in the VirSorter output, the ‘Uncertain’ category in the ViralVerify output and the ‘Unknown’ category in the VirMine output). Since BLASTN and BLASTX (Altschul et al., 1990) are often used as virus detection tools, we also included them in our benchmarking analysis. Although BLASTX and ViralVerify showed similar results, BLASTX is two orders of magnitude slower than ViralVerify. For the true negative dataset (9890 chromosomal fragments), the running times of ViralVerify and BLASTX were 279 and 36 364 min, respectively. Also, since we randomly split the entire dataset into the training and testing datasets, the training dataset is likely to contain viruses from the same taxonomic groups as the testing dataset. To test the performance in the more difficult case of novel taxonomic groups, we excluded the entire Podoviridae family from the training dataset and compared ViralVerify and BLASTX results on the members of this family. Afterward, to compare performance at a higher taxonomic level, we trained the classifier on the Caudovirales order (tailed bacteriophages) and compared ViralVerify with BLASTX on the non-Caudovirales phages. Supplementary Table S8 illustrates that ViralVerify improves on BLASTX in this more difficult test, likely because the HMM-based approach is more sensitive than the local alignment approach.

Table 1.

Benchmarking various viral detection approaches

Method                            True positive,    True negative,    True negative,
                                  dsDNA viruses     10k chunks        plasmids
Total                             1368              9890              2387
VirSorter (1–2 categories)        758 (55.4%)       151 (1.5%)        441 (18.5%)
VirSorter (1–2–3 categories)      766 (56%)         200 (2.0%)        677 (28.4%)
VirFinder                         866 (63.3%)       117 (1.1%)        205 (8.6%)
ViralVerify (Virus only)          1277 (93.3%)      118 (1.2%)        79 (3.3%)
ViralVerify (Virus + Uncertain)   1319 (96.4%)      245 (2.5%)        149 (6.2%)
VirMine (Virus only)              1176 (85.9%)      39 (0.4%)         14 (0.6%)
VirMine (Virus + Unknown)         1229 (89.8%)      42 (0.4%)         15 (0.6%)
BLASTN                            1069 (78.1%)      42 (0.4%)         8 (0.3%)
BLASTX                            1258 (91.9%)      55 (0.6%)         17 (0.7%)

Notes: Results of the viral detection benchmarking on the true positive and the true negative test datasets. The numbers and percentages represent sequences identified as viral. VirFinder was launched with the score at least 0.7 and P-value below 0.05, BLASTN and BLASTX were launched against the database from the ViralVerify training dataset, with the E-value threshold 0.001 (top hit was selected).

Additionally, we benchmarked VirSorter, VirFinder, VirMine and ViralVerify on simulated metagenomes. Since some bacterial chromosomes in this simulation may carry prophages, we needed to separate contigs that belong to reference viruses from those of prophage origin.

To identify contigs of viral origin (true positives), we compared them with the reference viruses using minimap2 (Li, 2018) and considered sequences with nucleotide identity exceeding 95% as viral. To account for possible prophage sequences in the reference chromosomes, we aligned all contigs that were unaligned in the previous step against the viral RefSeq database. All contigs that mapped to any virus not used for simulation with identity above 80% and span above 50% were removed from the comparison (Supplementary Table S2).

Although all tools except VirMine showed similar precision, ViralVerify improved on all tools in terms of recall. The low precision of all tools except VirMine (below 64%) can be explained by the many identified prophage sequences that are absent from the viral RefSeq database. Figure 1 and Supplementary Table S4 illustrate that the performance of all tools improves as contig length increases.
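The evaluation on simulated data can be summarized in a short sketch. It assumes that alignment identities and spans are already available as fractions (for example, parsed from minimap2 output), that a contig counts as "unaligned" when it has no hit above 95% identity to a simulated reference virus, and that removed contigs are ignored when counting; the function names and the label strings are hypothetical.

```python
def label_contig(ref_identity, other_identity=None, other_span=None):
    """Ground-truth label for one contig of a simulated metagenome.

    ref_identity: best identity to a virus used in the simulation, or None.
    other_identity, other_span: best hit to a RefSeq virus that was not
    used in the simulation, or None if there is no such hit.
    """
    if ref_identity is not None and ref_identity > 0.95:
        return "viral"
    if (other_identity is not None and other_span is not None
            and other_identity > 0.80 and other_span > 0.50):
        return "excluded"          # likely prophage: removed from comparison
    return "non-viral"


def precision_recall(labels, predicted_viral):
    """Precision and recall over contigs that were not excluded."""
    tp = fp = fn = 0
    for label, predicted in zip(labels, predicted_viral):
        if label == "excluded":
            continue
        if predicted and label == "viral":
            tp += 1
        elif predicted:
            fp += 1
        elif label == "viral":
            fn += 1
    # Report 0.0 when a denominator is zero (a convention chosen here).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Excluding likely prophage contigs from both counts avoids penalizing a tool for calling a sequence viral when its true origin cannot be decided from the simulation alone.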

Fig. 1.

Average precision and recall on five simulated datasets for ViralVerify (blue), VirFinder (violet), VirMine (red) and VirSorter (green). Precision and recall were calculated separately for contigs longer than 0.5 kb, 3 kb and 10 kb. (Color version of this figure is available at Bioinformatics online.)

We also analyzed the MetaSPAdes assemblies of the 18 real metagenomic samples. We compared ViralVerify with VirSorter, VirFinder and the results of BLAST alignment to viral RefSeq and to the metagenomic viral contigs (mVCs) from Paez-Espino et al. (2016). Although the ground truth in this computational experiment is unknown, VirFinder and ViralVerify predicted significantly more sequences than VirSorter for most samples (Supplementary Table S5).

3.4 viralComplete benchmarking

To benchmark ViralComplete, we randomly split the test dataset of 1368 dsDNA viruses from RefSeq into two equal parts and cut each genome in one of these parts into a fragment of size x% of its original length, resulting in a true negative dataset (x is selected uniformly at random between 10 and 90). ViralComplete shows 12.1% completeness (83 out of 684 viral fragments) for the true negative dataset (fragmented viruses) and 86.8% completeness (594 out of 684 complete viruses) for the true positive dataset (Table 2).
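A benchmark of this kind could be set up along the following lines. This is a sketch under assumptions, not the script used for the paper: genomes are plain strings, one fragment is cut per genome, and the fragment start position is drawn uniformly at random (the text specifies only the fragment size). The function names are hypothetical.

```python
import random


def split_in_half(items, rng):
    """Randomly split a list into two parts of (nearly) equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def random_fragment(genome, rng, min_frac=0.10, max_frac=0.90):
    """Cut one fragment covering 10-90% of the genome length."""
    frac = rng.uniform(min_frac, max_frac)
    frag_len = max(1, int(len(genome) * frac))
    # Assumption: the start position is uniform over all valid offsets.
    start = rng.randint(0, len(genome) - frag_len)
    return genome[start:start + frag_len]


rng = random.Random(0)                      # fixed seed for reproducibility
genomes = ["ACGT" * 250 for _ in range(1368)]
to_fragment, kept_complete = split_in_half(genomes, rng)
true_negatives = [random_fragment(g, rng) for g in to_fragment]
print(len(true_negatives), len(kept_complete))  # 684 684
```

Fixing the random seed makes the split and the fragment boundaries reproducible across runs.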

Table 2.

Benchmarking ViralComplete on the true positive and the true negative datasets

               684 fragmented     684 complete
               dsDNA viruses      dsDNA viruses
Complete       83 (12.1%)         594 (86.8%)
Partial        601 (87.9%)        90 (13.2%)

3.5 Exploring novel viruses assembled by MetaviralSPAdes

We checked whether some of the assembled viral contigs represent crAssphages, widespread and abundant phages in the human microbiome that nevertheless evaded all virus detection tools until recently (Dutilh et al., 2014). Yutin et al. (2018) revealed a previously unknown family of crAssphage-like viruses, represented in many genomic and metagenomic databases as misclassified bacterial contigs or uncultured viruses. These crAssphage contigs avoided detection because over 80% of the predicted proteins in these contigs showed no significant similarity to known protein sequences. Also, even though the length of the previously known crAssphage genomes is 90–100 kbp, these contigs were significantly shorter, likely representing incomplete phage genomes. However, based on a conserved gene content, Yutin et al. (2018) identified a distinct crAssphage group and a group of similar crAssphage-like viruses.

MetaviralSPAdes assembled seven complete or near-complete phages from the crAssphage family, including members of the crAssphage group, in various metagenomes (Supplementary Table S3). Supplementary Figure S2 presents a phylogenetic tree of major capsid proteins of the fully assembled viruses from the crAssphage family.

4 Discussion

We demonstrated that MetaviralSPAdes improves identification of complete viruses from metagenomic datasets. Our analysis of newly sequenced phages from the crAssphage family illustrates that MetaviralSPAdes has a potential to transform metagenomics-based assembly of novel viruses from a challenging task into a routine procedure.

However, many viruses in metagenomic samples still remain unassembled (or only partially assembled), indicating that we may be close to reaching the limits of viral sequencing using short-read technologies. Kolmogorov et al. (2019) recently demonstrated that long-read technologies recover more viruses from metagenomic datasets than short-read technologies. However, the accuracy of viral sequences recovered from long-read metagenomic datasets is often inferior, especially for viral genomes with coverage below 30×. We thus argue that integration of long-read and short-read metagenomic datasets is a promising approach for recovering new viruses.

Acknowledgement

The authors wish to thank Natalia Yutin and Eugene Koonin for helpful comments and discussion.

Funding

This work was supported by the Russian Science Foundation (grant 19-14-00172) and Saint Petersburg State University (grant 51555639). PAP was supported by NSF/MCB-BSF grant 1715911.

Conflict of Interest: none declared.

References

Altschul S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

Antipov D. et al. (2016) plasmidSPAdes: assembling plasmids from whole genome sequencing data. Bioinformatics, 32, 3380.

Antipov D. et al. (2019) Plasmid detection and assembly in genomic and metagenomic data sets. Genome Res., 29, 961–968.

Casjens S.R., Gilcrease E.B. (2009) Determining DNA packaging strategy by analysis of the termini of the chromosomes in tailed-bacteriophage virions. Methods Mol. Biol., 502, 91–111.

Deng Z. et al. (2012) Telomeres and viruses: common themes of genome maintenance. Front. Oncol., 2, 201.

Dutilh B.E. et al. (2014) A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat. Commun., 5, 4498.

El-Gebali S. et al. (2018) The Pfam protein families database in 2019. Nucleic Acids Res., 47, D427–D432.

Friedman J. et al. (2001) The Elements of Statistical Learning, Vol. 1. Springer Series in Statistics, New York.

Fritz A. et al. (2019) CAMISIM: simulating metagenomes and microbial communities. Microbiome, 7, 17.

Garretto A. et al. (2019) virMine: automated detection of viral sequences from complex metagenomic samples. PeerJ, 7, e6695.

Hyatt D. et al. (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11, 119.

Kolmogorov M. et al. (2019) metaFlye: scalable long-read metagenome assembly using repeat graphs. bioRxiv, 637637.

Li H. (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34, 3094–3100.

Nurk S. et al. (2017) metaSPAdes: a new versatile metagenomic assembler. Genome Res., 27, 824–834.

O’Leary N.A. et al. (2016) Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res., 44, D733–D745.

Paez-Espino D. et al. (2016) Uncovering Earth’s virome. Nature, 536, 425–430.

Pellow D. et al. (2020) SCAPP: an algorithm for improved plasmid assembly in metagenomes. bioRxiv.

Ren J. et al. (2017) VirFinder: a novel k-mer-based tool for identifying viral sequences from assembled metagenomic data. Microbiome, 5, 69.

Roux S. et al. (2015) VirSorter: mining viral signal from microbial genomic data. PeerJ, 3, e985.

Roux S. et al.; Tara Oceans Coordinators. (2016) Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature, 537, 689–693.

Roux S. et al. (2017) Benchmarking viromics: an in silico evaluation of metagenome-enabled estimates of viral community composition and diversity. PeerJ, 5, e3817.

Rozov R. et al. (2017) Recycler: an algorithm for detecting plasmids from de novo assembly graphs. Bioinformatics, 33, 475–482.

Shapiro J.W., Putonti C. (2018) Gene co-occurrence networks reflect bacteriophage ecology and evolution. MBio, 9, e01870–e01917.

Yutin N. et al. (2018) Discovery of an expansive bacteriophage family that includes the most abundant viruses from the human gut. Nat. Microbiol., 3, 38–46.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Peter Robinson