-
PDF
- Split View
-
Views
-
Cite
Cite
Sergey Knyazev, Lauren Hughes, Pavel Skums, Alexander Zelikovsky, Epidemiological data analysis of viral quasispecies in the next-generation sequencing era, Briefings in Bioinformatics, Volume 22, Issue 1, January 2021, Pages 96–108, https://doi.org/10.1093/bib/bbaa101
- Share Icon Share
Abstract
The unprecedented coverage offered by next-generation sequencing (NGS) technology has facilitated the assessment of the population complexity of intra-host RNA viral populations at an unprecedented level of detail. Consequently, analysis of NGS datasets could be used to extract and infer crucial epidemiological and biomedical information on the levels of both infected individuals and susceptible populations, thus enabling the development of more effective prevention strategies and antiviral therapeutics. Such information includes drug resistance, infection stage, transmission clusters and structures of transmission networks. However, NGS data require sophisticated analysis dealing with millions of error-prone short reads per patient. Prior to the NGS era, epidemiological and phylogenetic analyses were geared toward Sanger sequencing technology; now, they must be redesigned to handle the large-scale NGS datasets and properly model the evolution of heterogeneous rapidly mutating viral populations. Additionally, dedicated epidemiological surveillance systems require big data analytics to handle millions of reads obtained from thousands of patients for rapid outbreak investigation and management. We survey bioinformatics tools analyzing NGS data for (i) characterization of intra-host viral population complexity including single nucleotide variant and haplotype calling; (ii) downstream epidemiological analysis and inference of drug-resistant mutations, age of infection and linkage between patients; and (iii) data collection and analytics in surveillance systems for fast response and control of outbreaks.
Introduction
Due to error-prone replication, RNA viruses mutate at rates estimated to be as high as |${10}^{-3}$| substitutions per nucleotide per replication cycle [1]. Since mutations are generally well tolerated, such viruses exist in infected hosts as ‘quasispecies’—a term used by virologists to describe populations of closely related genomic variants [2–5]. Genetic heterogeneity of viral quasispecies has major biological implications, contributing to the efficiency of virus transmission, tissue tropism, virulence, disease progression and the emergence of drug/vaccine-resistant variants [6–10].
With the advent of next-generation sequencing (NGS) technologies, molecular epidemiology and virology are undergoing a fundamental transformation that promises to revolutionize our approach to epidemiological data analysis, disease prevention and treatment [11–14]. NGS has already shown its potential to advance epidemiological practices and it is steadily moving into clinical practices. There are numerous examples of successful applications of NGS for studying viruses such as coronavirus [15], influenza [16–21], HIV [22–27], hepatitis [28–32], Ebola [33, 34], Zika [35] and other viruses [36].
NGS allows sequencing with the unprecedented coverage, which is crucial for characterizing intra-host viral population complexity. However, inferring and analyzing the viral population from NGS data are computationally challenging and require specialized, highly sophisticated computational tools [37]. Even for NGS technologies offering very deep coverage, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors. Additionally, low intra-host viral diversity complicates assembling whole-genome sequences that are necessary for the unique identification of viral haplotypes. Therefore, the analysis of heterogeneous virus populations was complemented by technological developments.
The viral population reconstructed from NGS data can be further used for the detection of drug resistance in the patients’ samples as well as the age of infection. The importance of this detection is constantly growing [38], especially for influenza [39], hepatitis C virus (HCV) [40] and HIV [41, 42], because of the high prevalence of these diseases in the population. As for HIV, there is an additional problem. Since HIV has no cure, its treatment can only slow down its progression, and the development of drug resistance creates the risk of losing a drug forever as a treatment option for the patient. This is further complicated by the increasing longevity of HIV patients and the prevalence of the disease among the general population. Since viruses exist as a swarm of haplotypes, it is crucial to detect minority drug-resistant populations.
The haplotypes inferred from NGS data can also be very effective for outbreak investigation. Millions of viral variants that are carried in the samples of thousands of infected individuals can be analyzed with the help of NGS. Molecular data collected from densely sampled outbreaks in large high-risk communities are of particular interest since it allows for the first time to study the evolution of heterogeneous intra-host viral populations within a single evolutionary space under frequent transmissions between hosts [43–45]. The growing knowledge about social network structures and progress in the development of methods for the collection of large volumes of socio-behavioral and geographic data gives us new information about the conditions of disease spread [46–48]. The availability of such large-scale datasets provides a new opportunity to implement massive molecular surveillance and forecasting of viral diseases [49–55]. Deployment of massive molecular surveillance programs intends to facilitate our understanding of virus evolution, which may enable the development of more effective public health intervention strategies. To be effective, molecular surveillance and forecasting should analyze unprecedented amounts of heterogeneous biomedical data. This requires extensive computational methods for processing, integrating and analyzing big data, i.e. both epidemiological and molecular. In addition, this requires new mathematical models that allow for describing, understanding and predicting complex multidimensional-linear disease dynamics.
The remainder of the review will discuss the pipeline of software tools for primary and secondary NGS data analysis constituting a sequencing-based molecular surveillance system (see Figure 1). The primary NGS data analysis consists of error correction, consensus assembly/selection, read alignment and inference of intra-host viral population including single nucleotide variant calling and haplotype reconstruction. The secondary NGS data analysis includes intra-host analysis such as detection of drug resistance and estimations of the age of infection as well as inter-host analysis such as outbreak detection and investigation. Finally, we review existing molecular surveillance systems that integrate all the above analyses.

A molecular surveillance pipeline for software tools for primary and secondary viral NGS data analysis.
Primary analysis of viral NGS data
Primary analysis can be partitioned into two major steps: (i) basic primary analysis which starts with error correction followed by identification of the consensus sequence and read mapping and (ii) characterization of the intra-host viral population complexity by calling SNVs and haplotype variants in the viral sample.
Basic primary analysis
The error correction of viral sequencing reads is a notoriously difficult task. The standard error correction tools tuned to correct reads from a human genome do not perform well for viral genomes since viral haplotypes differ only slightly between themselves [56]. There are several error-correction tools that have been proposed specifically to handle viral sequencing samples [57–59]. A Bayesian probabilistic clustering approach [57] integrates error correction with SNV and haplotype calling, while KEC [58] is a k-mer counting-based approach that identifies erroneous k-mers by analyzing the distributions of k-mer frequencies. A more sophisticated random forest classifier MultiRes [59] can be used to distinguish between erroneous and rare k-mers.
Identification of the consensus sequence can be either picked from existing reference genomes or de novo assembled to avoid reference biases. The reference-based identification of the consensus relies on the existence of closely related genomic sequences. NGS reads are aligned to the reference sequence with a significant number of mismatches. To avoid reference biases, the aligned reads are used for updating each position of the reference genome with the base most frequent in reads and re-aligning reads to the consensus [60, 61]. The drawback of this approach is that selecting the reference genome is not a well-formalized procedure.
De novo assemblers are based on de Bruijn graphs such as VICUNA and overlap graphs such as SAVAGE [26, 62–65]. SAVAGE constructs an overlap graph with vertices representing reads and/or contigs and edges connecting two reads/contigs belonging to the same haplotypic sequence. Statistically, well-calibrated groups of reads/contigs are then efficiently used for reconstruction of the individual haplotypes from this overlap graph. SAVAGE has an additional advantage over VICUNA since it builds multiple haplotype contigs rather than a single consensus. De novo assemblers require much higher memory and time resources than reference-based identification of the consensus.
A recent tool, SHIVER [66], combines the reference-based and de novo approaches by using both reads and contigs assembled from those reads for HIV sequencing. Contigs are compared with the existing references, wherein some are spliced and some are removed as contaminants. After the closest existing reference is identified, it is updated to the consensus by well-mapped reads that do not match contaminants.
Single nucleotide variant calling
The natural advantage of NGS versus Sanger sequencing is its ability to identify low-frequency mutations (i.e. <20%) that are particularly relevant in the context of drug resistance [67–69]. The main challenge for SNV calling is to distinguish between sequencing errors and low-frequency true SNVs. All existing methods apply a particular error model to estimate the probability that an observed mismatch with the consensus is an error and qualify it as an SNV if this probability is low enough.
Below, we briefly describe widely known tools [37] and recently developed tools. VarScan [70] reports SNVs that are deeply covered by the reads with high quality. A similar approach with improved codon-based filtration is introduced by VirVarSeq [71] of SNV. The method LoFreq [72] derives sequencing error probability from a Phred-scaled quality value and optimizes estimation of P-value. V-Phaser [73, 74] introduces a basic primary analysis and error model, which takes into account the simultaneous occurrence of pairs of SNV in the same reads. V-Phaser 2 [74] specifies this model for Illumina reads. Pairs of mutations are explored by CoVaMa [75] using a linkage disequilibrium model. An accurate analysis of linked SNV pairs independent of error rate is proposed by CliqueSNV [76], which also contains an efficient implementation of the SNV-pair analysis. ViVan [77] and ViVaMBC [78] are based on maximum likelihood models. MinVar [79] and SiNPle [80] utilize the Poisson–Binomial distribution and Bayesian model respectively. Validation of MinVar on Illumina Miseq samples and shows that SNVs with the frequency of at least 5% are reliably identified without introducing false positives. PASeq [81] and Hydra Web [22] are web-based publicly available tools that are thoroughly tested for identifying mutations with frequencies 20% and 5%. Interestingly, SNV calling for viral data is very similar to somatic mutation calling and the quality of algorithms for both problems can be compared [80].
Table 1 describes the list of tools analyzing viral NGS data for SNV calling. For each tool, we specify the SNV detection method and whether it requires a reference.
SNV calling tools . | Year . | System . | De novo/Ref based . | Pair-end reads . | SNV detection method . | Tool availability . |
---|---|---|---|---|---|---|
VarScan [70] | 2009 | Java | Ref | + | Read coverage | http://varscan.sourceforge.net/ |
LoFreq [72] | 2012 | Linux | Ref | + | Poisson binomial distribution | https://csb5.github.io/lofreq/ |
Vphaser [73] | 2012 | Linux | Ref | − | Bernoulli phasing model | https://www.broadinstitute.org/viral-genomics/v-phaser |
Vphaser2 [74] | 2013 | Linux | Ref | + | Bernoulli phasing model | https://www.broadinstitute.org/viral-genomics/v-phaser-2 |
ViVan [77] | 2015 | — | Ref | + | Maximum likelihood | http://www.vivanbioinfo.org |
ViVaMBC [78] | 2015 | R | Ref | + | Maximum likelihood | https://sourceforge.net/projects/vivambc/ |
VirVarSeq [71] | 2015 | Linux | Ref | + | Codon-level quality filtration | https://sourceforge.net/projects/virtools/?source=directory |
CoVaMa [75] | 2015 | Python | Ref | + | Linkage disequilibrium | https://sourceforge.net/projects/covama/ |
MinVar [79] | 2017 | Python | Ref | + | Poisson binomial distribution | http://git.io/minvar |
MultiRes [59] | 2017 | Linux | De novo | + | Frame-based model | https://github.com/raunaq-m/MultiRes |
CliqueSNV [76] | 2018 | Java | Ref | + | Linkage of SNV pairs | https://github.com/vtsyvina/CliqueSNV |
SiNPle [80] | 2019 | Linux | Ref | + | Bayesian model | https://mallorn.pirbright.ac.uk:4443/gitlab/drcyber/SiNPle |
PASeq | Web | https://paseq.org/ | ||||
Hydra Web | Web | https://hydra.canada.ca/pages/home?lang=en-CA | ||||
SmartGen | Web | https://www.smartgene.com/mod_hiv.html |
SNV calling tools . | Year . | System . | De novo/Ref based . | Pair-end reads . | SNV detection method . | Tool availability . |
---|---|---|---|---|---|---|
VarScan [70] | 2009 | Java | Ref | + | Read coverage | http://varscan.sourceforge.net/ |
LoFreq [72] | 2012 | Linux | Ref | + | Poisson binomial distribution | https://csb5.github.io/lofreq/ |
Vphaser [73] | 2012 | Linux | Ref | − | Bernoulli phasing model | https://www.broadinstitute.org/viral-genomics/v-phaser |
Vphaser2 [74] | 2013 | Linux | Ref | + | Bernoulli phasing model | https://www.broadinstitute.org/viral-genomics/v-phaser-2 |
ViVan [77] | 2015 | — | Ref | + | Maximum likelihood | http://www.vivanbioinfo.org |
ViVaMBC [78] | 2015 | R | Ref | + | Maximum likelihood | https://sourceforge.net/projects/vivambc/ |
VirVarSeq [71] | 2015 | Linux | Ref | + | Codon-level quality filtration | https://sourceforge.net/projects/virtools/?source=directory |
CoVaMa [75] | 2015 | Python | Ref | + | Linkage disequilibrium | https://sourceforge.net/projects/covama/ |
MinVar [79] | 2017 | Python | Ref | + | Poisson binomial distribution | http://git.io/minvar |
MultiRes [59] | 2017 | Linux | De novo | + | Frame-based model | https://github.com/raunaq-m/MultiRes |
CliqueSNV [76] | 2018 | Java | Ref | + | Linkage of SNV pairs | https://github.com/vtsyvina/CliqueSNV |
SiNPle [80] | 2019 | Linux | Ref | + | Bayesian model | https://mallorn.pirbright.ac.uk:4443/gitlab/drcyber/SiNPle |
PASeq | Web | https://paseq.org/ | ||||
Hydra Web | Web | https://hydra.canada.ca/pages/home?lang=en-CA | ||||
SmartGen | Web | https://www.smartgene.com/mod_hiv.html |
SNV calling tools . | Year . | System . | De novo/Ref based . | Pair-end reads . | SNV detection method . | Tool availability . |
---|---|---|---|---|---|---|
VarScan [70] | 2009 | Java | Ref | + | Read coverage | http://varscan.sourceforge.net/ |
LoFreq [72] | 2012 | Linux | Ref | + | Poisson binomial distribution | https://csb5.github.io/lofreq/ |
Vphaser [73] | 2012 | Linux | Ref | − | Bernoulli phasing model | https://www.broadinstitute.org/viral-genomics/v-phaser |
Vphaser2 [74] | 2013 | Linux | Ref | + | Bernoulli phasing model | https://www.broadinstitute.org/viral-genomics/v-phaser-2 |
ViVan [77] | 2015 | — | Ref | + | Maximum likelihood | http://www.vivanbioinfo.org |
ViVaMBC [78] | 2015 | R | Ref | + | Maximum likelihood | https://sourceforge.net/projects/vivambc/ |
VirVarSeq [71] | 2015 | Linux | Ref | + | Codon-level quality filtration | https://sourceforge.net/projects/virtools/?source=directory |
CoVaMa [75] | 2015 | Python | Ref | + | Linkage disequilibrium | https://sourceforge.net/projects/covama/ |
MinVar [79] | 2017 | Python | Ref | + | Poisson binomial distribution | http://git.io/minvar |
MultiRes [59] | 2017 | Linux | De novo | + | Frame-based model | https://github.com/raunaq-m/MultiRes |
CliqueSNV [76] | 2018 | Java | Ref | + | Linkage of SNV pairs | https://github.com/vtsyvina/CliqueSNV |
SiNPle [80] | 2019 | Linux | Ref | + | Bayesian model | https://mallorn.pirbright.ac.uk:4443/gitlab/drcyber/SiNPle |
PASeq | Web | https://paseq.org/ | ||||
Hydra Web | Web | https://hydra.canada.ca/pages/home?lang=en-CA | ||||
SmartGen | Web | https://www.smartgene.com/mod_hiv.html |
SNV calling tools . | Year . | System . | De novo/Ref based . | Pair-end reads . | SNV detection method . | Tool availability . |
---|---|---|---|---|---|---|
VarScan [70] | 2009 | Java | Ref | + | Read coverage | http://varscan.sourceforge.net/ |
LoFreq [72] | 2012 | Linux | Ref | + | Poisson binomial distribution | https://csb5.github.io/lofreq/ |
Vphaser [73] | 2012 | Linux | Ref | − | Bernoulli phasing model | https://www.broadinstitute.org/viral-genomics/v-phaser |
Vphaser2 [74] | 2013 | Linux | Ref | + | Bernoulli phasing model | https://www.broadinstitute.org/viral-genomics/v-phaser-2 |
ViVan [77] | 2015 | — | Ref | + | Maximum likelihood | http://www.vivanbioinfo.org |
ViVaMBC [78] | 2015 | R | Ref | + | Maximum likelihood | https://sourceforge.net/projects/vivambc/ |
VirVarSeq [71] | 2015 | Linux | Ref | + | Codon-level quality filtration | https://sourceforge.net/projects/virtools/?source=directory |
CoVaMa [75] | 2015 | Python | Ref | + | Linkage disequilibrium | https://sourceforge.net/projects/covama/ |
MinVar [79] | 2017 | Python | Ref | + | Poisson binomial distribution | http://git.io/minvar |
MultiRes [59] | 2017 | Linux | De novo | + | Frame-based model | https://github.com/raunaq-m/MultiRes |
CliqueSNV [76] | 2018 | Java | Ref | + | Linkage of SNV pairs | https://github.com/vtsyvina/CliqueSNV |
SiNPle [80] | 2019 | Linux | Ref | + | Bayesian model | https://mallorn.pirbright.ac.uk:4443/gitlab/drcyber/SiNPle |
PASeq | Web | https://paseq.org/ | ||||
Hydra Web | Web | https://hydra.canada.ca/pages/home?lang=en-CA | ||||
SmartGen | Web | https://www.smartgene.com/mod_hiv.html |
Viral haplotype variant calling
Rather than determining variation in a single position, the haplotype calling is required to find the haplotypes spanning the entire viral genome or amplicons of special interest. The haplotypes and their frequencies are more informative than SNVs for detecting drug resistance that can non-linearly depend on accumulated SNVs. Haplotypes are also used for significantly more accurate detection of transmission clusters and outbreak sources.
Note that haplotype frequency reconstruction is considered to be a simpler problem as soon as haplotypes are inferred. The expectation–maximization algorithm based on the estimation of the probability that a given read has been emitted by a given haplotype has been shown to be sufficiently reliable with accuracy growing with the sequencing depth [60, 82].
The first haplotype reconstruction tools were read-graph based with vertices corresponding to reference-mapped reads and edges connecting reads that agree on their overlap [83, 84]. Many tools followed this idea [60, 82, 85–92] significantly improving the quality of reconstruction [37, 93]. But all these tools usually are not fast enough to handle recently available multi-million read data sets.
Probabilistic modeling of the sequencing process and/or viral haplotype generation [94–98] was shown to be an attractive alternative to the read-graph approach. The most successful tool among probabilistic tools is PredictHaplo [96] that exhibits high specificity and can reconstruct haplotypes with frequency over 10%. Hierarchical-clustering of reads (especially long PacBio reads) has been suggested in [99], and recent methods, aBayesQR [100], combined probabilistic modeling with clustering making the Bayesian approach computationally tractable.
Novel scalable tools handling millions of reads and improving over existing tools are actively developed in multiple labs. CliqueSNV [76] efficiently recognizes groups of linked SNVs and constructs an SNV graph, where SNVs are nodes and edges connect linked SNVs. It can assemble close viral haplotypes with frequencies as low as 0.1% from Illumina and PacBio reads.
It is necessary to separately note de novo haplotype callers, i.e. tools that de novo assemble multiple distinct haplotypes rather than a consensus. Currently, there exist three de novo assemblers MLEHaplo [98], SAVAGE [65] and PEHaplo [92]. The advantage of these tools is that they do not introduce reference biases.
Recently, 12 NGS haplotype callers were tested using viral populations simulated under realistic evolutionary dynamics but without error simulation [101]. In contrast to other simulations, the number of haplotypes was very large (216-1,185) and each frequency was small (<7%). Under such stressful conditions, PreditHaplo and CliqueSNV showed certain advantages over other reference-based methods and PEHaplo among de novo assemblers. It is also very important to distinguish low-frequency haplotypes from similar high-frequency haplotypes coexisting in the same intra-host viral population. Therefore, it is critical to validate haplotype reconstruction tools on benchmarks containing such pairs of similar haplotypes.
Table 2 describes the list of tools analyzing viral NGS data for haplotype calling. For each tool, we specify (i) whether it is a de novo method or requires a reference, (ii) sequencing error handling, (iii) the method for haplotype assembly and (iv) the method for haplotype frequency estimation.
Haplotyping tools . | Year . | System . | De novo/Ref based . | Pair-end reads . | Sequencing error handling . | Haplotype assembly method . | Haplotype frequency estimation method . | Output sequences . | Tool availability . |
---|---|---|---|---|---|---|---|---|---|
Shorah [82] | 2011 | Linux | Ref | + | Probabilistic clustering | Minimal path cover | EM | Full haplotypes | https://github.com/cbg-ethz/shorah |
ViSpA [60] | 2011 | Linux | Ref | − | Binomial model | Max-bandwidth path | EM | Full haplotypes | http://alan.cs.gsu.edu/NGS/?q=content/vispa |
QColors [86] | 2012 | — | De novo | − | — | Overlap graph + Conflict graph | — | Full haplotypes | — |
QuRe [87] | 2012 | Java | Ref | + | Poison model | Multinomial distribution matching | Read coverage | Full haplotypes | https://sourceforge.net/projects/qure/ |
bioa [85] | 2012 | Linux | Ref | − | k-mer-based error correction | Maximum Bandwidth Path | Fork balancing | Full haplotypes | http://alan.cs.gsu.edu/vira/index.html |
Vicuna [63] | 2012 | Linux | De novo | + | Read count | — | — | Consensus + contigs | https://www.broadinstitute.org/viral-genomics/vicuna |
QuasiRecomb [95] | 2013 | Linux | Ref | + | Hidden Markov model | Hidden Markov model | Hidden Markov model | Full haplotypes | https://github.com/cbg-ethz/QuasiRecomb |
Vira (AmpMCF) [88] | 2013 | Linux | Ref | − | — | Multicommodity flows | Normalized flow size | Full haplotypes | http://alan.cs.gsu.edu/vira/index.html |
ShotMCF [88] | 2013 | JAVA | Ref | − | Binomial model | Max-bandwidth path + Multicommodity flows | EM + normalized flow size | Full haplotypes | http://alan.cs.gsu.edu/NGS/?q=content/shotmcf |
BAsE-Seq [61] | 2014 | − | Ref | + | Poisson binomial distribution model | Clustering of reads by SNVs | Read coverage | Full haplotypes | — |
VGA [90] | 2014 | Linux | Ref | + | Requires high-fidelity sequencing protocol | Min-graph coloring | EM | Full haplotypes | http://genetics.cs.ucla.edu/vga/ |
HaploClique [89] | 2014 | Linux | Ref | + | — | Max-clique enumeration | Normalized read count | Full haplotypes | https://github.com/cbg-ethz/haploclique |
PredictHaplo [96] | 2014 | Linux | Ref | + | Dirichlet Process Mixture Model | Dirichlet Process Mixture Model | Dirichlet Process Mixture Model | Full haplotypes | https://bmda.dmi.unibas.ch/software.html |
IVA [64] | 2015 | Linux | De novo | − | Read count | — | — | Contigs | https://sanger-pathogens.github.io/iva/ |
MLEHaplo [98] | 2015 | Linux | De novo | + | — | Maximum likelihood | — | Full haplotypes | https://github.com/raunaq-m/MLEHaplo |
ViQuaS [91] | 2015 | Linux | Ref | + | Chimeric error correction | Multinomial distribution matching | Read count | Full haplotypes | https://sourceforge.net/projects/viquas/ |
SAVAGE [65] | 2017 | Linux | De novo | + | Overlap fuzzy matching error correction | Enumerating cliques in overlap graph | EM | Contigs | https://bitbucket.org/jbaaijens/savage/ |
aBayesQR [100] | 2017 | Linux | Ref | + | Cluster coverage by reads | Bayesian inference | Bayesian inference | Full haplotypes | https://github.com/SoYeonA/aBayesQR |
RegressHaplo [97] | 2017 | R | Ref | + | — | Penalized regression | Penalized regression | Full haplotypes | https://github.com/SLeviyang/RegressHaplo |
2SNV [99] | 2017 | Java | Ref | − | Linkage of SNV pairs | Hierarchical clustering of reads by SNVs | EM | Full haplotypes | http://alan.cs.gsu.edu/NGS/?q=content/2snv |
PEHaplo [92] | 2018 | Linux | De novo | + | Overlap error correction | Path finding in overlap graph | — | Contigs | https://github.com/chjiao/PEHaplo |
Shiver [66] | 2018 | Linux | De novo + ref | + | BLAST database match | — | — | Consensus | https://github.com/ChrisHIV/shiver |
CliqueSNV [76] | 2018 | JAVA | Ref | + | Linkage of SNV pairs | Clique enumeration and merging | EM | Full haplotypes | https://github.com/vtsyvina/CliqueSNV |
Haplotyping tools . | Year . | System . | De novo/Ref based . | Pair-end reads . | Sequencing error handling . | Haplotype assembly method . | Haplotype frequency estimation method . | Output sequences . | Tool availability . |
---|---|---|---|---|---|---|---|---|---|
Shorah [82] | 2011 | Linux | Ref | + | Probabilistic clustering | Minimal path cover | EM | Full haplotypes | https://github.com/cbg-ethz/shorah |
ViSpA [60] | 2011 | Linux | Ref | − | Binomial model | Max-bandwidth path | EM | Full haplotypes | http://alan.cs.gsu.edu/NGS/?q=content/vispa |
QColors [86] | 2012 | — | De novo | − | — | Overlap graph + Conflict graph | — | Full haplotypes | — |
QuRe [87] | 2012 | Java | Ref | + | Poison model | Multinomial distribution matching | Read coverage | Full haplotypes | https://sourceforge.net/projects/qure/ |
bioa [85] | 2012 | Linux | Ref | − | k-mer-based error correction | Maximum Bandwidth Path | Fork balancing | Full haplotypes | http://alan.cs.gsu.edu/vira/index.html |
Vicuna [63] | 2012 | Linux | De novo | + | Read count | — | — | Consensus + contigs | https://www.broadinstitute.org/viral-genomics/vicuna |
QuasiRecomb [95] | 2013 | Linux | Ref | + | Hidden Markov model | Hidden Markov model | Hidden Markov model | Full haplotypes | https://github.com/cbg-ethz/QuasiRecomb |
Vira (AmpMCF) [88] | 2013 | Linux | Ref | − | — | Multicommodity flows | Normalized flow size | Full haplotypes | http://alan.cs.gsu.edu/vira/index.html |
ShotMCF [88] | 2013 | JAVA | Ref | − | Binomial model | Max-bandwidth path + Multicommodity flows | EM + normalized flow size | Full haplotypes | http://alan.cs.gsu.edu/NGS/?q=content/shotmcf |
BAsE-Seq [61] | 2014 | − | Ref | + | Poisson binomial distribution model | Clustering of reads by SNVs | Read coverage | Full haplotypes | — |
VGA [90] | 2014 | Linux | Ref | + | Requires high-fidelity sequencing protocol | Min-graph coloring | EM | Full haplotypes | http://genetics.cs.ucla.edu/vga/ |
HaploClique [89] | 2014 | Linux | Ref | + | — | Max-clique enumeration | Normalized read count | Full haplotypes | https://github.com/cbg-ethz/haploclique |
PredictHaplo [96] | 2014 | Linux | Ref | + | Dirichlet Process Mixture Model | Dirichlet Process Mixture Model | Dirichlet Process Mixture Model | Full haplotypes | https://bmda.dmi.unibas.ch/software.html |
IVA [64] | 2015 | Linux | De novo | − | Read count | — | — | Contigs | https://sanger-pathogens.github.io/iva/ |
MLEHaplo [98] | 2015 | Linux | De novo | + | — | Maximum likelihood | — | Full haplotypes | https://github.com/raunaq-m/MLEHaplo |
ViQuaS [91] | 2015 | Linux | Ref | + | Chimeric error correction | Multinomial distribution matching | Read count | Full haplotypes | https://sourceforge.net/projects/viquas/ |
SAVAGE [65] | 2017 | Linux | De novo | + | Overlap fuzzy matching error correction | Enumerating cliques in overlap graph | EM | Contigs | https://bitbucket.org/jbaaijens/savage/ |
aBayesQR [100] | 2017 | Linux | Ref | + | Cluster coverage by reads | Bayesian inference | Bayesian inference | Full haplotypes | https://github.com/SoYeonA/aBayesQR |
RegressHaplo [97] | 2017 | R | Ref | + | — | Penalized regression | Penalized regression | Full haplotypes | https://github.com/SLeviyang/RegressHaplo |
2SNV [99] | 2017 | Java | Ref | − | Linkage of SNV pairs | Hierarchical clustering of reads by SNVs | EM | Full haplotypes | http://alan.cs.gsu.edu/NGS/?q=content/2snv |
PEHaplo [92] | 2018 | Linux | De novo | + | Overlap error correction | Path finding in overlap graph | — | Contigs | https://github.com/chjiao/PEHaplo |
Shiver [66] | 2018 | Linux | De novo + ref | + | BLAST database match | — | — | Consensus | https://github.com/ChrisHIV/shiver |
CliqueSNV [76] | 2018 | JAVA | Ref | + | Linkage of SNV pairs | Clique enumeration and merging | EM | Full haplotypes | https://github.com/vtsyvina/CliqueSNV |
Haplotyping tools . | Year . | System . | De novo/Ref based . | Pair-end reads . | Sequencing error handling . | Haplotype assembly method . | Haplotype frequency estimation method . | Output sequences . | Tool availability . |
---|---|---|---|---|---|---|---|---|---|
Shorah [82] | 2011 | Linux | Ref | + | Probabilistic clustering | Minimal path cover | EM | Full haplotypes | https://github.com/cbg-ethz/shorah |
ViSpA [60] | 2011 | Linux | Ref | − | Binomial model | Max-bandwidth path | EM | Full haplotypes | http://alan.cs.gsu.edu/NGS/?q=content/vispa |
QColors [86] | 2012 | — | De novo | − | — | Overlap graph + Conflict graph | — | Full haplotypes | — |
QuRe [87] | 2012 | Java | Ref | + | Poison model | Multinomial distribution matching | Read coverage | Full haplotypes | https://sourceforge.net/projects/qure/ |
bioa [85] | 2012 | Linux | Ref | − | k-mer-based error correction | Maximum Bandwidth Path | Fork balancing | Full haplotypes | http://alan.cs.gsu.edu/vira/index.html |
Vicuna [63] | 2012 | Linux | De novo | + | Read count | — | — | Consensus + contigs | https://www.broadinstitute.org/viral-genomics/vicuna |
QuasiRecomb [95] | 2013 | Linux | Ref | + | Hidden Markov model | Hidden Markov model | Hidden Markov model | Full haplotypes | https://github.com/cbg-ethz/QuasiRecomb |
Vira (AmpMCF) [88] | 2013 | Linux | Ref | − | — | Multicommodity flows | Normalized flow size | Full haplotypes | http://alan.cs.gsu.edu/vira/index.html |
ShotMCF [88] | 2013 | JAVA | Ref | − | Binomial model | Max-bandwidth path + Multicommodity flows | EM + normalized flow size | Full haplotypes | http://alan.cs.gsu.edu/NGS/?q=content/shotmcf |
BAsE-Seq [61] | 2014 | − | Ref | + | Poisson binomial distribution model | Clustering of reads by SNVs | Read coverage | Full haplotypes | — |
VGA [90] | 2014 | Linux | Ref | + | Requires high-fidelity sequencing protocol | Min-graph coloring | EM | Full haplotypes | http://genetics.cs.ucla.edu/vga/ |
HaploClique [89] | 2014 | Linux | Ref | + | — | Max-clique enumeration | Normalized read count | Full haplotypes | https://github.com/cbg-ethz/haploclique |
PredictHaplo [96] | 2014 | Linux | Ref | + | Dirichlet Process Mixture Model | Dirichlet Process Mixture Model | Dirichlet Process Mixture Model | Full haplotypes | https://bmda.dmi.unibas.ch/software.html |
IVA [64] | 2015 | Linux | De novo | − | Read count | — | — | Contigs | https://sanger-pathogens.github.io/iva/ |
MLEHaplo [98] | 2015 | Linux | De novo | + | — | Maximum likelihood | — | Full haplotypes | https://github.com/raunaq-m/MLEHaplo |
ViQuaS [91] | 2015 | Linux | Ref | + | Chimeric error correction | Multinomial distribution matching | Read count | Full haplotypes | https://sourceforge.net/projects/viquas/ |
SAVAGE [65] | 2017 | Linux | De novo | + | Overlap fuzzy matching error correction | Enumerating cliques in overlap graph | EM | Contigs | https://bitbucket.org/jbaaijens/savage/ |
aBayesQR [100] | 2017 | Linux | Ref | + | Cluster coverage by reads | Bayesian inference | Bayesian inference | Full haplotypes | https://github.com/SoYeonA/aBayesQR |
RegressHaplo [97] | 2017 | R | Ref | + | — | Penalized regression | Penalized regression | Full haplotypes | https://github.com/SLeviyang/RegressHaplo |
2SNV [99] | 2017 | Java | Ref | − | Linkage of SNV pairs | Hierarchical clustering of reads by SNVs | EM | Full haplotypes | http://alan.cs.gsu.edu/NGS/?q=content/2snv |
PEHaplo [92] | 2018 | Linux | De novo | + | Overlap error correction | Path finding in overlap graph | — | Contigs | https://github.com/chjiao/PEHaplo |
Shiver [66] | 2018 | Linux | De novo + ref | + | BLAST database match | — | — | Consensus | https://github.com/ChrisHIV/shiver |
CliqueSNV [76] | 2018 | JAVA | Ref | + | Linkage of SNV pairs | Clique enumeration and merging | EM | Full haplotypes | https://github.com/vtsyvina/CliqueSNV |
Haplotyping tools . | Year . | System . | De novo/Ref based . | Pair-end reads . | Sequencing error handling . | Haplotype assembly method . | Haplotype frequency estimation method . | Output sequences . | Tool availability . |
---|---|---|---|---|---|---|---|---|---|
Shorah [82] | 2011 | Linux | Ref | + | Probabilistic clustering | Minimal path cover | EM | Full haplotypes | https://github.com/cbg-ethz/shorah |
ViSpA [60] | 2011 | Linux | Ref | − | Binomial model | Max-bandwidth path | EM | Full haplotypes | http://alan.cs.gsu.edu/NGS/?q=content/vispa |
QColors [86] | 2012 | — | De novo | − | — | Overlap graph + Conflict graph | — | Full haplotypes | — |
QuRe [87] | 2012 | Java | Ref | + | Poison model | Multinomial distribution matching | Read coverage | Full haplotypes | https://sourceforge.net/projects/qure/ |
bioa [85] | 2012 | Linux | Ref | − | k-mer-based error correction | Maximum Bandwidth Path | Fork balancing | Full haplotypes | http://alan.cs.gsu.edu/vira/index.html |
Vicuna [63] | 2012 | Linux | De novo | + | Read count | — | — | Consensus + contigs | https://www.broadinstitute.org/viral-genomics/vicuna |
QuasiRecomb [95] | 2013 | Linux | Ref | + | Hidden Markov model | Hidden Markov model | Hidden Markov model | Full haplotypes | https://github.com/cbg-ethz/QuasiRecomb |
Vira (AmpMCF) [88] | 2013 | Linux | Ref | − | — | Multicommodity flows | Normalized flow size | Full haplotypes | http://alan.cs.gsu.edu/vira/index.html |
ShotMCF [88] | 2013 | JAVA | Ref | − | Binomial model | Max-bandwidth path + Multicommodity flows | EM + normalized flow size | Full haplotypes | http://alan.cs.gsu.edu/NGS/?q=content/shotmcf |
BAsE-Seq [61] | 2014 | − | Ref | + | Poisson binomial distribution model | Clustering of reads by SNVs | Read coverage | Full haplotypes | — |
VGA [90] | 2014 | Linux | Ref | + | Requires high-fidelity sequencing protocol | Min-graph coloring | EM | Full haplotypes | http://genetics.cs.ucla.edu/vga/ |
HaploClique [89] | 2014 | Linux | Ref | + | — | Max-clique enumeration | Normalized read count | Full haplotypes | https://github.com/cbg-ethz/haploclique |
PredictHaplo [96] | 2014 | Linux | Ref | + | Dirichlet Process Mixture Model | Dirichlet Process Mixture Model | Dirichlet Process Mixture Model | Full haplotypes | https://bmda.dmi.unibas.ch/software.html |
IVA [64] | 2015 | Linux | De novo | − | Read count | — | — | Contigs | https://sanger-pathogens.github.io/iva/ |
MLEHaplo [98] | 2015 | Linux | De novo | + | — | Maximum likelihood | — | Full haplotypes | https://github.com/raunaq-m/MLEHaplo |
ViQuaS [91] | 2015 | Linux | Ref | + | Chimeric error correction | Multinomial distribution matching | Read count | Full haplotypes | https://sourceforge.net/projects/viquas/ |
SAVAGE [65] | 2017 | Linux | De novo | + | Overlap fuzzy matching error correction | Enumerating cliques in overlap graph | EM | Contigs | https://bitbucket.org/jbaaijens/savage/ |
aBayesQR [100] | 2017 | Linux | Ref | + | Cluster coverage by reads | Bayesian inference | Bayesian inference | Full haplotypes | https://github.com/SoYeonA/aBayesQR |
RegressHaplo [97] | 2017 | R | Ref | + | — | Penalized regression | Penalized regression | Full haplotypes | https://github.com/SLeviyang/RegressHaplo |
2SNV [99] | 2017 | Java | Ref | − | Linkage of SNV pairs | Hierarchical clustering of reads by SNVs | EM | Full haplotypes | http://alan.cs.gsu.edu/NGS/?q=content/2snv |
PEHaplo [92] | 2018 | Linux | De novo | + | Overlap error correction | Path finding in overlap graph | — | Contigs | https://github.com/chjiao/PEHaplo |
Shiver [66] | 2018 | Linux | De novo + ref | + | BLAST database match | — | — | Consensus | https://github.com/ChrisHIV/shiver |
CliqueSNV [76] | 2018 | JAVA | Ref | + | Linkage of SNV pairs | Clique enumeration and merging | EM | Full haplotypes | https://github.com/vtsyvina/CliqueSNV |
Secondary analysis of viral NGS data
Secondary NGS analysis addresses three tasks: (i) predicting of drug resistance that takes SNV and haplotypes obtained during primary analysis and determine whether they are drug-resistant or not; (ii) determining the recency of the infection, i.e. predicting the moment in the past when patient was infected; (iii) outbreak investigation, i.e. determining the borders of outbreak, finding the source of infection and reconstruction of infection spread paths.
Predicting drug resistance
Certain haplotypes and mutations that are found during the primary NGS should be analyzed for drug resistance. This is especially important for viruses such as HIV [102], HCV [103], influenza [39] and others [104]. For HIV, the detection of drug resistance is especially relevant since HIV patients have to adhere to a treatment for the span of their lives. If a patient develops HIV drug resistance, they will be required to switch to a different line of treatment, and these treatments may be less studied and of a higher risk to the patient’s health. Additionally, the number of drug-resistant mutations in the patient is constantly growing as well as the number of drug-resistant patients in the outbreak [105]. This makes the task of tracking HIV drug resistance a more onerous one [106].
Detection of drug resistance is typically associated with matching genome mutations with the efficiency of drugs [104]. Usually, different mutations have different resistance power and often mutations work collectively [107], so the process of finding correlations between mutations and drug resistance is non-linear [108]. The comprehensive overview of computational approaches to drug-resistant HIV mutations can be found in [109]. Most of the tools are aimed at Sanger sequencing data since NGS data has only been accumulating for a short period of time. Sanger sequencing allows the detection of mutations with frequencies >25% which has low benefits for the clinical application [110, 111]. NGS increases the sensitivity and lowers the frequency threshold up to 1–5% [112].
There are two main challenges in the detection of drug resistance that depends on the results of primary NGS data analysis. They are connected with the accuracy of detecting minority mutations and haplotypes. The first problem is that if there is a minor drug-resistant mutation, the haplotypes with this mutation will have an advantage over other haplotypes dealing with drug pressure. As a result, these drug-resistant haplotypes will begin to dominate over time [102, 113]. The second problem is that drug resistance is connected with haplotypes rather than with the mutations themselves, but haplotypes are harder to detect and so the drug resistance analysis can be significantly improved with more sensitive haplotyping tools [114].
Currently, tools for detecting drug resistance are modeled to handle Sanger sequencing data accumulated in designated databases [109]. The limitation of Sanger data is that only the major haplotype and SNVs with frequency at least 20% can be reconstructed. This hurts the performance of the most efficient drug resistance prediction tools that are based on machine-learning [31, 114–118]. Such tools would rather take into account all patient’s haplotypes [114, 119] to overcome Sanger sequencing limitations by generating all possible haplotypes with given SNVs, e.g. 10 SNVs make |${2}^{-10}$| = 1024 different haplotypes.
The number of HIV patients sequenced with NGS is beginning to grow very fast. Since NGS can detect rare SNPs and haplotypes, drug resistance can be predicted more accurately [107, 109]. We expect that the number of NGS samples to train these models will grow much faster after the Food and Drug Administration authorizes the first NGS test for detecting HIV-1 drug resistance mutations [120]. Recent clinical studies showed up to 2.7-fold improvement for detecting drug resistance with utilizing NGS data [69, 121–126] to antiretroviral therapy such as zidovudine (see Table 3). Zidovudine was designed to target the conserved domain of retroviral transcriptase. Mutations of amino acids localized at hydrophilic regions may result in conformation change of tertiary structure and block the targeted sites of zidovudine. Combining the evolutionary analytics with conformation dynamics of the retroviral transcriptase can potentially help to develop novel drugs. Therefore, it is critical to develop appropriate statistical models of the evolutionary dynamics of HIV retroviral transcriptase. One of the promising approaches to take into account the HIV protease 3D structure is based on Voronoi diagrams [114].
Detection of drug-resistant mutations in clinical studies: NGS versus Sanger sequencing
Study . | Patients group . | Patients number . | Collection date . | Region . | DRM detection: NGS/Sanger (fold) . |
---|---|---|---|---|---|
Metzner et al. [121] | Acute patients | 49 | 1999–2003 | Germany | 2.0 |
Fisher et al. [122] | Infants after PMTCT failure | 15 | 2006–2009 | South Africa | 2.5 |
Alidjinou et al. [123] | ART-naive patients | 48 | 2013–2015 | France | 2.7 |
Tzou et al. [69] | Undisclosed | 177 | 2001–2016 | Undisclosed | 1.2 |
Fokam et al. [124] | Vertically infected children | 18 | 2015 | Cameroon | 1.7 |
Derache et al. [126] | ART-naive patients | 1148 | 2012–2016 | South Africa | 1.4 |
Derache et al. [125] | Patients failing first line ART | 1287 | 2012–2016 | South Africa | 2.0 |
Study . | Patients group . | Patients number . | Collection date . | Region . | DRM detection: NGS/Sanger (fold) . |
---|---|---|---|---|---|
Metzner et al. [121] | Acute patients | 49 | 1999–2003 | Germany | 2.0 |
Fisher et al. [122] | Infants after PMTCT failure | 15 | 2006–2009 | South Africa | 2.5 |
Alidjinou et al. [123] | ART-naive patients | 48 | 2013–2015 | France | 2.7 |
Tzou et al. [69] | Undisclosed | 177 | 2001–2016 | Undisclosed | 1.2 |
Fokam et al. [124] | Vertically infected children | 18 | 2015 | Cameroon | 1.7 |
Derache et al. [126] | ART-naive patients | 1148 | 2012–2016 | South Africa | 1.4 |
Derache et al. [125] | Patients failing first line ART | 1287 | 2012–2016 | South Africa | 2.0 |
Detection of drug-resistant mutations in clinical studies: NGS versus Sanger sequencing
Study . | Patients group . | Patients number . | Collection date . | Region . | DRM detection: NGS/Sanger (fold) . |
---|---|---|---|---|---|
Metzner et al. [121] | Acute patients | 49 | 1999–2003 | Germany | 2.0 |
Fisher et al. [122] | Infants after PMTCT failure | 15 | 2006–2009 | South Africa | 2.5 |
Alidjinou et al. [123] | ART-naive patients | 48 | 2013–2015 | France | 2.7 |
Tzou et al. [69] | Undisclosed | 177 | 2001–2016 | Undisclosed | 1.2 |
Fokam et al. [124] | Vertically infected children | 18 | 2015 | Cameroon | 1.7 |
Derache et al. [126] | ART-naive patients | 1148 | 2012–2016 | South Africa | 1.4 |
Derache et al. [125] | Patients failing first line ART | 1287 | 2012–2016 | South Africa | 2.0 |
Study . | Patients group . | Patients number . | Collection date . | Region . | DRM detection: NGS/Sanger (fold) . |
---|---|---|---|---|---|
Metzner et al. [121] | Acute patients | 49 | 1999–2003 | Germany | 2.0 |
Fisher et al. [122] | Infants after PMTCT failure | 15 | 2006–2009 | South Africa | 2.5 |
Alidjinou et al. [123] | ART-naive patients | 48 | 2013–2015 | France | 2.7 |
Tzou et al. [69] | Undisclosed | 177 | 2001–2016 | Undisclosed | 1.2 |
Fokam et al. [124] | Vertically infected children | 18 | 2015 | Cameroon | 1.7 |
Derache et al. [126] | ART-naive patients | 1148 | 2012–2016 | South Africa | 1.4 |
Derache et al. [125] | Patients failing first line ART | 1287 | 2012–2016 | South Africa | 2.0 |
Estimating infection recency
Over 80% of untreated cases of HCV infection becomes chronic. This impedes the timely diagnosis of the disease, due to the fact that the infection often does not manifest any clinical symptoms in its early stages. Currently, there are no diagnostic assays to determine the stage of HCV infection. Therefore, distinguishing recently infected patients from chronically infected patients using computational methods would be highly advantageous for both personalized therapeutic purposes and for epidemiological surveillance, e.g. for detection of incident HCV cases. Similarly, detection of the age of HIV infection is crucial for HIV-1 surveillance and the understanding of viral pathogenesis [127].
Measuring the time since infection using genomic data has recently been addressed in several studies [127–131]. The simpler version of this problem is infection staging, i.e. distinguishing between recent and chronic infections using viral sequences sampled by NGS. A number of methods establish an age or stage of HIV or HCV infection using various measures of the population structure [127–131]. An underlying assumption of such methods is that intra-host viral evolution is associated with continuous genetic diversification. This results in the existence of a correlation between genetic heterogeneity of quasispecies and the age of quasispecies, which allows for the use of properly calibrated diversity measures as age markers.
Recently, groups of comprehensive features accounting for population diversity, population genetics, topological, information-theoretical and physico-chemical properties of quasispecies populations were integrated using sophisticated machine-learning-based techniques [130, 132]. These methods take into account recent observations in the evolution of viruses, such as HCV, resulting in a gradual intra-host adaptation that is accompanied by a decrease in heterogeneity and an increase in negative selection [30, 133–135].
Outbreak investigation
Detection and investigation of viral outbreaks are the primary epidemiological tasks. Historically, epidemiological investigations have been based on in-field surveys of epidemiological settings and interviews with persons potentially involved in pathogen spread. However, such methods are time- and labor-consuming and the data obtained are prone to various socio-behavioral biases. Analysis of viral genomic data provides alternative unbiased machinery for outbreak investigations and quantification of major factors responsible for disease spread [136].
It should be noted that in the recent decade, the rich variety of tools for inferring epidemiological parameters has been developed within the field of viral phylodynamics [137, 138]. In addition, there are a plethora of methods for outbreak investigations that combine various types of genomic and epidemiological data [139–145]. Despite being highly effective in many settings, these tools are currently not intended for application to NGS data and usually do not support calculations with extremely large genomic datasets. Therefore in this article, we concentrate on tools specifically designed to handle heterogeneous intra-host viral populations using NGS.
The primary task in the outbreak investigation is the detection of transmission clusters. The main challenge, here, is the development and implementation of evolutionary distance measures between intra-host viral populations that reflect the epidemiological relations between the hosts. These distances can be efficiently calculated and combined with a broad variety of clustering techniques and phylogenetic and network-based methods [46, 146]. Distances between consensus sequences that are still often used for epidemiological investigations provide only very coarse estimates of evolutionary distances and lose significant signal encoded in quasispecies structure. In particular, outbreak distances between viral variants from certain hosts can be comparable or even higher than distances between variants from different hosts. For example, for HIV-1, the recommended inter-host threshold for detecting transmission clusters in pol region is in a range of 0.5–1.5% [136], while the nucleotide genetic variability inside hosts can be as high as 5% [147].
Analysis of quasispecies populations reconstructed from NGS data drastically improves the estimation of evolutionary distances. Pioneering NGS-based study for HCV outbreak investigations [148] proposed to measure the distance between samples as the distance between the closest pair of haplotypes from different samples. Even this simple method has been shown to significantly outperform the consensus-based approach [148]. Similar techniques have been applied to HIV [50]. Despite the simplicity of the metric, its calculation is challenging for extremely large NGS datasets, since its naive implementation requires a pairwise comparison of sequences from all pairs of patients. To address this challenge, several filtering techniques have been proposed [149, 150]. In consecutive studies [43, 44, 131, 151], more sophisticated distance measures for quasispecies populations have been proposed. In particular, Melnyk et al. [151] avoid reconstruction of haplotypes and/or phylogenetic trees by utilizing k-mer-based approach. Specifically, each viral sample is represented by a corresponding k-mer distribution, the distance between pairs of k-mers is computed over a single de Bruijn graph of all k-mers, and the distance between populations is identified with the earth mover’s distance (EMD) between two k-mer distributions.
The next step of the bioinformatics pipeline for epidemiological analysis is an investigation of viral transmissions inside each transmission cluster. It includes a prediction of possible transmission directions, detection of the source or ‘superspreader’ of an outbreak and inference of transmission networks indicating who infected whom. QUENTIN [43] and VOICE [44] estimate the distance between quasispecies populations as the analogue of a cover for a Markov-type model of viral evolution and choose the direction of transmission from a sample A to sample B based on minimum evolution principle, i.e. if it requires less evolution time than the time for evolving from sample A to B. In Romero-Severson et al. [151], it is proposed to identify the transmission directions by phylogenetic analysis and detection of paraphyletic, polyphyletic and monophyletic relations between sampled intra-host variants from different hosts. This idea has been further developed and implemented in Phyloscanner [152].
Both QUENTIN and Phyloscanner also allow reconstructing viral transmission networks. QUENTIN does it via Bayesian inference and Markov chain Monte Carlo sampling, with the likelihood of a transmission network being defined using general properties of social networks relevant to the infection dissemination. Phyloscanner relies on a maximum-parsimony approach and assigns ancestral hosts to internal nodes of a viral phylogeny containing quasispecies populations from different hosts by minimizing the number of transmission events while taking into account possible contaminations, multiple infections and presence of unsampled hosts.
Before determining the source of the outbreak, it is critical to decide whether the source of the outbreak is present among sequenced samples [151]. Finding the source of an outbreak is quite important for outbreak disruption. The papers [43, 44, 151] validated their approaches on Centers for Disease Control and Prevention (CDC) data for HCV outbreaks with the known sources and showed that the source prediction accuracy is ~90%. But before determining the source of the outbreak, it is critical to decide whether the source of the outbreak is present among sequenced samples [151]. This problem is quite difficult and has been addressed for the first time in [151].
Table 4 describes the list of tools analyzing viral NGS data for outbreak investigation including identification of (i) transmission clusters, (ii) transmission direction, (iii) source of infection, (iv) presence of source and (v) transmission network. For each tool, we indicate which of five tasks are addressed by which tool.
Tool . | Year . | System . | Algorithm . | Transmission clusters . | Transmission direction . | Transmission network . | Source of infection . | Presence of source . | Tool availability . |
---|---|---|---|---|---|---|---|---|---|
MinDist [148] | 2016 | — | Distance based | + | − | − | + | − | — |
RED [44] | 2017 | Matlab | Clustering | + | + | − | + | − | https://bitbucket.org/osaofgsu/red |
VOICE [44] | 2017 | Linux | Simulation based | + | + | − | + | − | https://bitbucket.org/osaofgsu/voicerep |
PhyloScanner [152] | 2017 | Linux | Phylogeny | + | + | + | + | − | https://github.com/BDI-pathogens/phyloscanner |
Quentin [43] | 2017 | Matlab | Simulation based | + | + | + | + | − | https://github.com/skumsp/QUENTIN |
Signature-sj [150] | 2018 | Java | k-mers | + | − | − | − | − | https://github.com/vtsyvina/signature-sj |
k-mer EMD [151] | 2019 | Linux | k-mer based distance | + | + | − | + | + | |
https://github.com/amelnyk34/kemd |
Tool . | Year . | System . | Algorithm . | Transmission clusters . | Transmission direction . | Transmission network . | Source of infection . | Presence of source . | Tool availability . |
---|---|---|---|---|---|---|---|---|---|
MinDist [148] | 2016 | — | Distance based | + | − | − | + | − | — |
RED [44] | 2017 | Matlab | Clustering | + | + | − | + | − | https://bitbucket.org/osaofgsu/red |
VOICE [44] | 2017 | Linux | Simulation based | + | + | − | + | − | https://bitbucket.org/osaofgsu/voicerep |
PhyloScanner [152] | 2017 | Linux | Phylogeny | + | + | + | + | − | https://github.com/BDI-pathogens/phyloscanner |
Quentin [43] | 2017 | Matlab | Simulation based | + | + | + | + | − | https://github.com/skumsp/QUENTIN |
Signature-sj [150] | 2018 | Java | k-mers | + | − | − | − | − | https://github.com/vtsyvina/signature-sj |
k-mer EMD [151] | 2019 | Linux | k-mer based distance | + | + | − | + | + | |
https://github.com/amelnyk34/kemd |
Tool . | Year . | System . | Algorithm . | Transmission clusters . | Transmission direction . | Transmission network . | Source of infection . | Presence of source . | Tool availability . |
---|---|---|---|---|---|---|---|---|---|
MinDist [148] | 2016 | — | Distance based | + | − | − | + | − | — |
RED [44] | 2017 | Matlab | Clustering | + | + | − | + | − | https://bitbucket.org/osaofgsu/red |
VOICE [44] | 2017 | Linux | Simulation based | + | + | − | + | − | https://bitbucket.org/osaofgsu/voicerep |
PhyloScanner [152] | 2017 | Linux | Phylogeny | + | + | + | + | − | https://github.com/BDI-pathogens/phyloscanner |
Quentin [43] | 2017 | Matlab | Simulation based | + | + | + | + | − | https://github.com/skumsp/QUENTIN |
Signature-sj [150] | 2018 | Java | k-mers | + | − | − | − | − | https://github.com/vtsyvina/signature-sj |
k-mer EMD [151] | 2019 | Linux | k-mer based distance | + | + | − | + | + | |
https://github.com/amelnyk34/kemd |
Tool . | Year . | System . | Algorithm . | Transmission clusters . | Transmission direction . | Transmission network . | Source of infection . | Presence of source . | Tool availability . |
---|---|---|---|---|---|---|---|---|---|
MinDist [148] | 2016 | — | Distance based | + | − | − | + | − | — |
RED [44] | 2017 | Matlab | Clustering | + | + | − | + | − | https://bitbucket.org/osaofgsu/red |
VOICE [44] | 2017 | Linux | Simulation based | + | + | − | + | − | https://bitbucket.org/osaofgsu/voicerep |
PhyloScanner [152] | 2017 | Linux | Phylogeny | + | + | + | + | − | https://github.com/BDI-pathogens/phyloscanner |
Quentin [43] | 2017 | Matlab | Simulation based | + | + | + | + | − | https://github.com/skumsp/QUENTIN |
Signature-sj [150] | 2018 | Java | k-mers | + | − | − | − | − | https://github.com/vtsyvina/signature-sj |
k-mer EMD [151] | 2019 | Linux | k-mer based distance | + | + | − | + | + | |
https://github.com/amelnyk34/kemd |
Molecular surveillance systems and databases
The advent of NGS technologies makes possible, for the first time, the deployment of molecular epidemiological surveillance systems that are intended to analyze and infer the dynamics of epidemics and outbreaks in real or almost real time using computational analysis of viral genomic data [50, 51]. Such systems are characterized by a broad bioinformatics functionality including the processing of raw sequencing data, sequence alignment, phylogeny or network construction, transmission history inference and visualization. The number of computational molecular surveillance systems is currently being developed and deployed. One of the widely cited systems is Nextstrain [153] that allows for phylodynamics analysis and interactive visualization of the evolution of a variety of pathogens. The Nextstrain incorporates several computational tools for alignment, phylogenetic inference, reconstruction, dating and geographic localization of transmission events. However, currently, a toolkit of Nextstrain is not intended for the analysis of NGS data and intra-host viral populations, although its open-source architecture makes possible incorporation of such methods in the future. The library of tools for viral epidemiological data analysis developed and maintained by the R Epidemics Consortium [154] also should be mentioned. It includes R statistical packages for handling, visualizing and analyzing outbreak data, but has similar limitations.
Two surveillance systems that support NGS data are specifically tailored for HIV and viral hepatitis and are recommended and/or maintained by the CDC. These systems are HIV-Trace [50] and Global Hepatitis Outbreak Surveillance Technology (GHOST) [51], and they are based on high-throughput bioinformatics pipelines for genetic relatedness analysis. They allow estimates of genetic distances between intra-host populations sampled from HIV-infected individuals, use these distances to detect possible transmission linkages between the individuals, reconstruct and visualize transmission clusters and genetic relatedness networks. Both systems can work with haplotypes obtained from NGS data and are scalable for extremely large datasets produced by Illumina MiSeq and other sequencing platforms. In particular, GHOST employs several efficient k-mer-based filtering techniques for viral sequence similarity queries, which allow for the elimination of an exhaustive comparison of all pairs of viral haplotypes and allow processing of NGS data from a given HCV outbreak in minutes [150].
Another important issue is the creation of curated databases that contain both genomic and epidemiological data and can be used for the validation of new computational molecular epidemiology tools. Some previously published papers [43, 44] provide links to datasets that can be used for these purposes, but, to the best of our knowledge, large systematically curated collections of such datasets are yet to be created. In this context, Pangea HIV consortium efforts on curated analysis for HIV outbreaks in the African region [52] are very important. At this moment, they maintain a collection of >18 000 HIV NGS samples that can be used for outbreak investigations and data-driven design of prevention strategies.
Conclusions
The NGS extracts quantitatively and qualitatively more information from patients’ viral samples than the Sanger sequencing. But the extraction of this information requires sophisticated algorithms and software tools. In this article, we have reviewed bioinformatics methods and tools for NGS data analysis in viral epidemiology, which can be partitioned into the following three categories (see Figure 1):
Primary sequencing data analysis that consists of main strain reconstruction, read alignment and characterization of intra-host viral population structure including SNV and haplotype calling.
Secondary sequencing data analysis that employs reconstructed viral populations for predicting drug resistance, estimating recency of infection and outbreak investigation, including transmission cluster detection and identification of transmission direction and outbreak sources.
Molecular surveillance systems that provide a software environment for combined primary and secondary analysis of viral NGS data in real time.
In summary, NGS-based characterization of intra-host viral population structures is advanced enough and is getting ready to be used in epidemiological and clinical studies. This claim is supported by the number of recently published studies that use quasispecies analysis for outbreak investigation and transmission inference [49, 155, 156]. Inferred intra-host viral population structure can facilitate accurate answers to essential epidemiological questions about drug resistance, recency of infection, transmission clusters and outbreak sources. Future NGS-based surveillance systems should employ big data analytics to combine enormous amounts of sequencing and epidemiological data for the timely detection of outbreaks and the design of efficient public health intervention strategies.
Analysis of intra-host viral populations sampled by NGS was shown to provide important epidemiological and clinical information.
Genetic characterization of intra-host viral populations offers a new framework for studies on drug resistance, identification of transmission clusters, sources of infection in outbreaks and time of infection inception.
Application of molecular data generated by NGS in combination with epidemiological information is a key to future improvement in public health surveillance.
Funding
This work has been partially supported by National Institute of Health Grant R01.
Sergey Knyazev is a PhD student in computer science at Georgia State University, Atlanta, GA, USA. He received the MS degree in applied mathematics at Saint Petersburg Academic University, Saint Petersburg, Russia. He develops methods for analyzing viral genomic sequencing data.
Lauren Hughes received her BA degree in English from the University of Georgia, Athens, GA, USA. She is currently pursuing her BS degree in mathematics and computer science and an MS degree in geosciences at Georgia State University, Atlanta, GA, USA.
Pavel Skums received a PhD degree in computer science at Belarusian State University, Belarus in 2007. In 2010–16, he was a research fellow in the Centers for Disease Control and Prevention, and in 2016, he joined Georgia State University, Atlanta, GA, USA as an assistant professor.
Alexander Zelikovsky received a PhD degree in computer science at Belarusian State University, Belarus in 1989. He joined Georgia State University, Atlanta, GA, USA in 1999, where he is currently a distinguished university professor.