Abstract

Motivation

Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to the lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies.

Results

We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state-of-the-art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications.

Conclusions

DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straightforward, as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects.

Availability and implementation

DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The rapid development of affordable DNA sequencing technologies has led to an explosion in the number of prokaryotic genomes assembled from metagenomes (Almeida et al., 2019; Pasolli et al., 2019). Genomes are typically assembled from short DNA sequences using assemblers such as MEGAHIT (MH; Li et al., 2015) and MetaSPAdes (MS; Nurk et al., 2017). These assemblers receive as input a collection of sequenced DNA reads originating from a large number of organisms, and aim to group them into overlapping sections of DNA. This process results in a contig, a consensus region of DNA of variable length. A metagenome-assembled genome (MAG) is a collection of such contigs that are estimated to have originated from the same strain. However, evaluating the quality of assembled contigs remains a difficult problem. In contig regions with high coverage, i.e. where a large number of reads overlap the given region of the contig, strong consensus among reads is an indicator of quality, but how to aggregate such signals into a quality score for a full contig remains an open question.

Most methods that assess the quality of assemblies require the availability of reference genomes: a set of curated genomes that the assemblies can be compared against. Such methods include metaQUAST (Mikheenko et al., 2016), which maps contigs to one or more references to infer extensive misassemblies such as inversions, relocations and interspecies translocations. Nonetheless, many newly discovered MAGs represent novel organisms that are only distantly related to organisms in the genomic databases, which often makes selecting a valid reference impossible. For instance, sometimes the closest known relative belongs to a different bacterial phylum, which implies very deep evolutionary divergence (Pasolli et al., 2019). Conversely, because bacterial species can include strains with large genomic differences, true genomic variation among strains can be misinterpreted as misassembly.

The assembly of full or partial genomes from metagenomes would thus benefit from methods without reference requirements, but few such methods are available. The most widely used method is CheckM (Parks et al., 2015), which provides two sub-scores commonly used for quality evaluations: completeness and contamination. These measures are based on counts of specific loci, and thus do not evaluate genome assembly at the level of individual contigs. Still, large-scale MAG datasets such as Almeida et al. (2019) and Pasolli et al. (2019) use thresholds for completeness and contamination to screen for assembly quality.

Existing reference-free methods for evaluating contig misassembly are ALE (Clark et al., 2013), SuRankCo (Kuhring et al., 2015), REAPR (Hunt et al., 2013) and VALET (Olson et al., 2019). ALE makes probabilistic assumptions about the genomes and provides a likelihood score for each nucleotide in an assembled contig, but does not provide a quality estimate at the contig level. Moreover, it is tailored for genomic rather than metagenomic assemblies. SuRankCo makes fewer assumptions about the genomes and the assemblers by applying a random forest classifier to a set of hand-crafted features to predict a quality estimate for the input contig. REAPR maps reads to contigs and calculates an ensemble score from various per-base metrics such as coverage and proper mapping of read pairs. VALET is similar to REAPR and uses a combination of multiple metrics based on reads mapped to contigs. Crucially, ALE, SuRankCo and REAPR are no longer maintained, and compatibility issues hamper their use. In addition, the accuracy of these methods has never been evaluated on complex metagenome assemblies.

1.1 Contributions

Here, we introduce DeepMAsED (mased: a Middle English term for ‘misled’; Deep Metagenome Assembly Error Detection), a machine learning system for quality evaluation of metagenomic assemblies at the contig level. The goal of DeepMAsED is to predict the extensive misassembly label from metaQUAST without the availability of reference genomes. The contributions of our article can be summarized as follows.

  • We propose a data generation pipeline for training and for rigorous evaluation. Given an initial set of known genomes, we generate a dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ represents a raw contig sequence augmented with additional features, and $y_i$ is a binary target indicating whether metaQUAST labels the corresponding contig as misassembled.

  • We introduce DeepMAsED, a deep convolutional network for assessing assembly quality at the contig level trained on raw sequence and read-count features which makes no explicit assumptions about the genomes or the assemblers. We compare against and significantly outperform ALE and VALET on the generated datasets. Moreover, the weights from the fully trained system require a modest 1.3 MB of storage. The challenging evaluation procedure should ensure that DeepMAsED generalizes to a large variety of research projects.

  • We evaluate DeepMAsED on a subset of two large-scale assembly benchmarks from Almeida et al. (2019) and Pasolli et al. (2019). DeepMAsED estimates that 1% of contigs are misassembled in both Almeida et al. (2019) and Pasolli et al. (2019). Moreover, we show that CheckM scores, which were used for assessing assembly quality in those studies, are poor indicators of DeepMAsED predictions.

2 Materials and methods

2.1 Input features and target

A contig is a sequence of nucleotides $s \in \{A, C, T, G\}^L$, where the length $L$ is variable. At each position $j \in \{1, \dots, L\}$ in the sequence, we include features from two categories: raw sequence and read-count features.

Raw sequence features are a one-hot encoding of the nucleotide at position $j$: a four-dimensional binary vector $o_j$ equal to one at the position of the nucleotide.

Read-count features are obtained by mapping the reads back to the assembled contigs using Bowtie2 (Langmead and Salzberg, 2012). Let $C_j$ be the coverage at position $j$, that is, the number of reads that map to position $j$ during the assembly process. Among these reads, let $n_{A,j}$, $n_{T,j}$, $n_{C,j}$ and $n_{G,j}$ be the count of each nucleotide, respectively, with $n_{A,j} + n_{T,j} + n_{C,j} + n_{G,j} = C_j$. Moreover, let $d_j$ be the number of discordant reads and $v_j$ the number of nucleotide variants at position $j$. Then the read-count features correspond to the vector

$$r_j = (C_j,\; n_{A,j},\; n_{T,j},\; n_{C,j},\; n_{G,j},\; d_j,\; v_j) \in \mathbb{R}^7.$$

The vector $x_j = (o_j, r_j)$ of size $p = 11$, obtained by concatenating the raw sequence and read-count features, is the input feature vector for DeepMAsED at position $j \in \{1, \dots, L\}$. It combines three sources of information about the contig: the raw sequence, the coverage at the position and the agreement among the corresponding reads.

As an example, assume that the nucleotide at position $j$ in an assembled contig is A. Moreover, assume that ten reads mapped to this position, six with an A and four with a T, and that one read was discordant. The number of nucleotide variants then equals two, and the feature vector would be

$$x_j = (1,\, 0,\, 0,\, 0,\; 10,\, 6,\, 4,\, 0,\, 0,\; 1,\, 2).$$
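For concreteness, the assembly of this 11-dimensional vector can be sketched in a few lines of Python. This is a minimal illustration rather than the DeepMAsED feature extraction code, and the one-hot and count orderings are assumptions chosen to be consistent with the definitions above.

```python
import numpy as np

ONEHOT_ORDER = "ACTG"  # assumed one-hot order, matching the alphabet above
COUNT_ORDER = "ATCG"   # read-count order n_A, n_T, n_C, n_G as in the text

def position_features(contig_base, read_bases, n_discordant, n_variants):
    """Build the p = 11 feature vector for one contig position:
    4 one-hot dims + (C_j, n_A, n_T, n_C, n_G, d_j, v_j)."""
    one_hot = np.zeros(4)
    one_hot[ONEHOT_ORDER.index(contig_base)] = 1.0
    coverage = sum(read_bases.values())  # C_j: reads mapped to this position
    counts = [read_bases.get(b, 0) for b in COUNT_ORDER]
    return np.concatenate([one_hot, [coverage, *counts, n_discordant, n_variants]])

# Worked example from the text: contig base A; 10 mapped reads
# (6 A's, 4 T's); 1 discordant read; 2 nucleotide variants.
x_j = position_features("A", {"A": 6, "T": 4}, n_discordant=1, n_variants=2)
print(x_j)  # [ 1.  0.  0.  0. 10.  6.  4.  0.  0.  1.  2.]
```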

2.1.1 Fixed input size and normalization

The convolutional network requires a fixed input contig length, which we set to $L = 10\,000$. Shorter contigs are padded with zeroes, while longer contigs are split into non-overlapping chunks of size $L$; when the last chunk is not full, it is kept and padded with zeroes. An input contig is therefore represented by a matrix $x$ of shape $L \times p$, where $L = 10\,000$ and $p = 11$. To compute an aggregate score for contigs longer than $L$, we compute a score $M(x^{(j)})$ for each non-overlapping sub-sequence $j$ of size $L$, and return the maximum of these scores.
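A minimal sketch of this chunk, pad and max-aggregate scheme follows; `score_fn` is a hypothetical stand-in for the trained model applied to a single window.

```python
import numpy as np

L, P = 10_000, 11  # fixed window length and per-position feature dimension

def chunk_contig(x):
    """Split a (length, P) contig matrix into non-overlapping (L, P) windows,
    zero-padding the final (or only) window up to length L."""
    n_chunks = max(1, -(-x.shape[0] // L))  # ceiling division
    padded = np.zeros((n_chunks * L, P), dtype=x.dtype)
    padded[: x.shape[0]] = x
    return padded.reshape(n_chunks, L, P)

def contig_score(x, score_fn):
    """Aggregate per-window scores into a single contig score via the max."""
    return max(score_fn(chunk) for chunk in chunk_contig(x))
```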

We normalize each non-binary dimension in $x$ as follows. The training dataset $D$ is a tensor $X$ of dimension $n \times L \times p$, where $n$ is the number of contigs available in the training set. We reshape $X$ into an $(n \cdot L) \times p$ matrix by pooling the position dimension, and compute the mean and standard deviation for each of the $p - 4$ read-count features. We then standardize the corresponding dimensions to have mean zero and variance one. The means and SDs computed on the training set are stored and reused at test time.
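The normalization step can be sketched as follows (a minimal illustration; the fitted means and SDs are then reused verbatim on test data):

```python
import numpy as np

N_ONEHOT = 4  # first four dims are the binary one-hot encoding, left untouched

def fit_normalizer(X):
    """X: (n, L, p) training tensor. Pool the n and L dimensions and compute
    mean/SD for each of the p - 4 read-count features."""
    flat = X.reshape(-1, X.shape[-1])           # (n * L, p)
    mu = flat[:, N_ONEHOT:].mean(axis=0)
    sd = flat[:, N_ONEHOT:].std(axis=0) + 1e-8  # guard against zero variance
    return mu, sd

def apply_normalizer(X, mu, sd):
    """Standardize the read-count dims to mean zero and unit variance."""
    X = X.astype(float).copy()
    X[..., N_ONEHOT:] = (X[..., N_ONEHOT:] - mu) / sd
    return X
```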

2.2 Generating the synthetic datasets

Training and evaluating DeepMAsED requires a dataset of contigs, read counts and metaQUAST quality labels {(xi,yi)}i=1n. The data generation pipeline consists of (i) selecting genomes and simulating genome abundances in several metagenomes, (ii) generating reads from these metagenomes in accordance with genome abundances, (iii) assembling these reads using standard metagenome assemblers, (iv) comparing the output of the assembly (contigs) to the ground truth genomes in order to get ground truth misassembly classifications and (v) mapping the reads to each contig in order to generate features at each contig position (Fig. 1).

Fig. 1. Summary of our data generation pipeline. Given an initial set of $N$ genomes, this process results in a dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is an assembled contig with features described in Section 2.1, and $y_i$ is the corresponding misassembly label from metaQUAST.

2.2.1 Create metagenomes

None of the genomes overlapped between the train and test sets. The genomes are randomly selected species representatives from the Genome Taxonomy Database (Parks et al., 2018) after filtering out genomes with a CheckM-estimated completeness of < 50% or contamination of > 5%, although most genomes far exceeded these thresholds (Supplementary Fig. S1). The genomes comprise both Bacteria and Archaea, with 40 phyla represented. The corresponding NCBI accession, taxonomy and CheckM values for each genome are provided in the Supplementary Material.

We utilized MGSIM (Youngblut, 2019) to create several metagenomic communities in which the abundance of each genome is randomly sampled in order to prevent over-fitting to a particular abundance distribution. The training and test sets each contain 30 replicate metagenomes. We sample taxon abundances for the $N$ training genomes from a lognormal distribution, $\pi_1, \dots, \pi_N \sim \mathrm{Lognormal}(\mu, \sigma)$ with $\mu = 5$ and $\sigma = 2$, and randomly permute the obtained abundances in each metagenome.
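This abundance simulation can be sketched with numpy as a stand-in for the MGSIM step; normalizing to relative abundances is our assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def simulate_abundances(n_genomes, n_metagenomes, mu=5.0, sigma=2.0):
    """Draw lognormal taxon abundances, then independently permute the values
    across genomes for each replicate metagenome."""
    base = rng.lognormal(mean=mu, sigma=sigma, size=n_genomes)
    rel = base / base.sum()  # assumed: convert to relative abundances
    return np.stack([rng.permutation(rel) for _ in range(n_metagenomes)])

profiles = simulate_abundances(n_genomes=1000, n_metagenomes=30)  # (30, 1000)
```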

2.2.2 Generate reads

MGSIM utilizes ART (v2016.06.05; Huang et al., 2012) for read simulation. We simulated $10^6$ Illumina HiSeq 2500 150 base-pair (bp) paired-end reads per metagenome, with the ART-defined default error distribution. This sequencing depth produces a realistic sample coverage of 0.3–0.4 per sample, as estimated by Nonpareil (Rodriguez et al., 2018).

2.2.3 Assemble genomes from metagenomes

For each obtained metagenome, we provided the simulated reads as input to one of two widely used assemblers: MEGAHIT (v1.2.7; Li et al., 2015) and MetaSPAdes (v3.13.1; Nurk et al., 2017). The output of the assembly process is contigs of variable length. We filtered out all contigs shorter than 1000 bp.

2.2.4 Identify true misassemblies

The ground truth genomes used for metagenome simulation serve as references for identifying misassemblies via metaQUAST (v5.0.2; Mikheenko et al., 2016). The extensive misassembly label from metaQUAST is used as the target for DeepMAsED. This umbrella term covers intra-genome inversions, translocations and relocations, as well as interspecies translocations, in the generated data.

2.2.5 Compute features

We used Bowtie2 (v2.3.5; Langmead and Salzberg, 2012) to map the simulated reads back to the metagenome-assembled contigs in order to obtain the read-count features described in Section 2.1. We used samtools v1.9 (Li et al., 2009) to extract coverage and nucleotide variants information from each BAM file generated from the Bowtie2 mapping.
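As an illustration, the per-position nucleotide counts and coverage could equivalently be pulled from the BAM files with pysam; this is a hedged sketch rather than the pipeline's actual extraction code, and the discordant-read and variant tallies would require additional logic not shown.

```python
import numpy as np
import pysam  # requires a sorted, indexed BAM from the Bowtie2 mapping

def per_position_counts(bam_path, contig, contig_length):
    """Return (4, L) nucleotide counts and (L,) coverage for one contig.
    pysam's count_coverage yields four arrays in A, C, G, T order."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        a, c, g, t = bam.count_coverage(contig, 0, contig_length)
    counts = np.array([a, c, g, t])
    coverage = counts.sum(axis=0)  # C_j at each position
    return counts, coverage
```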

2.3 Model and optimization

DeepMAsED is a function $M : \mathbb{R}^{L \times p} \to [0, 1]$ providing a score $M(x)$ representing the confidence that a contig $x$ is misassembled. We denote by $y \in \{0, 1\}$ the label of the corresponding contig, where $y = 1$ if $x$ is extensively misassembled according to metaQUAST. The detailed architecture for DeepMAsED can be found in Supplementary Table S2.

First, the input contig and associated count features are passed through a convolutional feature extractor $E$, whose output is a feature vector $h = E(x)$. The hidden vector $h$ is then passed through fully connected layers, resulting in a scalar $F(h) \in \mathbb{R}$. Finally, we compute the output score $M(x) = \sigma(F(h)) \in [0, 1]$, where $\sigma$ is the sigmoid function.

The model is trained to minimize the binary cross-entropy loss

$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \log M(x_i) + (1 - y_i) \log\big(1 - M(x_i)\big) \Big].$$

The loss function is optimized using Adam (Kingma and Ba, 2015) with mini-batches of size 64. We use batch-normalization (Ioffe and Szegedy, 2015) between convolutional layers and dropout (Srivastava et al., 2014) with rate 0.5 between fully connected layers.
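The exact architecture is specified in Supplementary Table S2. As a hedged illustration only, the following Keras stand-in wires together the components described above (convolutional extractor, batch-norm between convolutions, dropout-regularized fully connected layers, sigmoid output, Adam with binary cross-entropy); the layer counts and sizes are placeholders, not the published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

L, P = 10_000, 11

model = tf.keras.Sequential([
    layers.Input(shape=(L, P)),
    # Convolutional feature extractor E, batch-norm between conv layers
    layers.Conv1D(16, kernel_size=5, activation="relu"),
    layers.BatchNormalization(),
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.BatchNormalization(),
    layers.GlobalMaxPooling1D(),          # h = E(x)
    # Fully connected layers F with dropout, then sigmoid: M(x) = sigma(F(h))
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X_train, y_train, batch_size=64, epochs=5)
```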

Given the strong class imbalance, we select average precision as the validation metric, which corresponds to the area under the precision-recall curve. To encourage generalization to diverse data distributions, we train the model using 5-fold cross-validation over metagenomes as follows. Given the 30 metagenomes available for training, we successively reserve 6 of them for computing the average precision and use the remaining 24 for training. This ensures that the validation data are drawn from metagenomes with different original abundances relative to the training data. Each of the five resulting models is trained for 5 epochs over the data. The model is jointly trained on contigs assembled from MEGAHIT and MetaSPAdes, except for the evaluations listed in Supplementary Table S6. We define test data as a hold-out of previously unobserved data that were not used for training or cross-validation.
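This metagenome-wise scheme can be sketched with scikit-learn's GroupKFold; `build_model` is a hypothetical factory returning a freshly initialized network.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GroupKFold

def crossval_ap(X, y, metagenome_ids, build_model, epochs=5, batch_size=64):
    """5-fold CV with whole metagenomes as groups (24 train / 6 validation),
    scored by average precision to account for the class imbalance."""
    scores = []
    for tr, va in GroupKFold(n_splits=5).split(X, y, groups=metagenome_ids):
        model = build_model()
        model.fit(X[tr], y[tr], epochs=epochs, batch_size=batch_size, verbose=0)
        scores.append(average_precision_score(y[va], model.predict(X[va]).ravel()))
    return float(np.mean(scores))
```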

The parameters leading to the highest average cross-validation precision are selected and provided in the Supplementary Material. Finally, a model with the chosen parameters is trained on all the available training data from both assemblers for 10 epochs over the data.

2.4 Interpretability

Given their black-box nature, the interpretability of neural networks remains an open area of research (Gilpin et al., 2018). A number of attempts have been made to provide feature importance scores for inputs to the network, i.e. to quantify the impact that individual input features have on the prediction. We provide feature importance for DeepMAsED using DeepLIFT (Shrikumar et al., 2017). Given an input contig, we use 10 reference sequences generated with the di-nucleotide shuffling function from DeepLIFT.
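The reference generation can be approximated as below. Note this is a deliberately crude stand-in: DeepLIFT ships a proper dinucleotide shuffle that preserves overlapping 2-mer counts, whereas this sketch merely permutes non-overlapping 2-mers.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def dinuc_shuffle_crude(seq):
    """Permute a sequence in non-overlapping 2-mers. NOTE: a true dinucleotide
    shuffle (as in DeepLIFT) preserves overlapping dinucleotide counts exactly;
    this simplified version does not."""
    pairs = [seq[i:i + 2] for i in range(0, len(seq) - 1, 2)]
    rng.shuffle(pairs)
    tail = seq[-1] if len(seq) % 2 else ""
    return "".join(pairs) + tail

# Ten shuffled reference sequences for one contig, as in Section 2.4
references = [dinuc_shuffle_crude("ACGTGATTACAGGCTT") for _ in range(10)]
```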

2.5 Comparison to competing methods

Few methods are available for evaluating the quality of MAGs directly at the contig level without one or more reference genomes. ALE (Clark et al., 2013), SuRankCo (Kuhring et al., 2015) and REAPR (Hunt et al., 2013) are no longer actively maintained. Moreover, SuRankCo and REAPR could not be used at the time of this work due to software incompatibilities, so we could not compare against either. Therefore, we restricted our evaluations to ALE and VALET (Olson et al., 2019).

ALE attributes a likelihood score to each individual position in the assembled sequence, not to a whole contig, and consists of four sub-scores: depth, place, insert and kMer log-likelihoods. The tool makes strong probabilistic assumptions about the contig data, such as independence of the errors quantified by each sub-score and species-specific tetra-nucleotide frequencies. In order to aggregate ALE scores into a contig score, we proceed similarly to Kuhring et al. (2015). We select a threshold for each sub-score and count the number of positions in the contig whose likelihood value is smaller than the threshold. The total count is normalized by contig length and the number of sub-scores to obtain a score $s \in [0, 1]$ for a given contig. The values of the four thresholds are obtained via parameter search on the training data: for each assembler and each of the 30 generated metagenomes, we compute the average precision given the input thresholds. We then compute the mean average precision over all 30 metagenomes, resulting in average precision scores $AP_{MH}$ and $AP_{MS}$ for MEGAHIT and MetaSPAdes, respectively, for each combination of thresholds. Since DeepMAsED is trained on contigs from both assemblers, we then compute for each threshold combination the weighted average

$$AP = \frac{N_{MH} \cdot AP_{MH} + N_{MS} \cdot AP_{MS}}{N_{MH} + N_{MS}},$$

where $N_{MH}$ and $N_{MS}$ are the total numbers of contigs from each assembler. We select the thresholds leading to the highest average precision in this fashion. The thresholds considered can be found in Supplementary Table S3.
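A sketch of this aggregation and of the contig-count-weighted average precision, directly following the definitions above:

```python
import numpy as np

def ale_contig_score(subscores, thresholds):
    """subscores: (4, length) per-position log-likelihoods for the four ALE
    sub-scores; thresholds: length-4 vector. Returns the fraction of
    (sub-score, position) entries below threshold, i.e. s in [0, 1]."""
    subscores = np.asarray(subscores, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)[:, None]
    return float((subscores < thresholds).mean())

def weighted_ap(ap_mh, ap_ms, n_mh, n_ms):
    """Contig-count-weighted average precision across the two assemblers."""
    return (n_mh * ap_mh + n_ms * ap_ms) / (n_mh + n_ms)
```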

We evaluated VALET using the same BAM files as used for DeepMAsED. All contigs listed as misassemblies in the summary tables generated by VALET were considered misassemblies during evaluation. Initial assessments revealed that VALET accuracy was extremely poor (Supplementary Fig. S8), and we thus excluded it from further evaluations.

Since CheckM (Parks et al., 2015) is the most widely used measure of MAG assembly quality, we compare our predictions with both CheckM-estimated completeness and contamination on data from Almeida et al. (2019) and Pasolli et al. (2019).

3 Results

3.1 A comparison of train and test datasets

We generated a total of two training sets (‘N1000’ and ‘IS_N1000’) and three test sets (‘N100’, ‘N1000’ and ‘IS_N1000’), with the number indicating the number of reference genomes, and ‘IS_’ denoting that intra-species (IS) diversity was included (up to five genomes per species versus only one species representative for all other sets). These sets were chosen to assess how the number of genomes and the degree of genome relatedness influence the accuracy of DeepMAsED. The average nucleotide identity distributions confirm that the IS sets have more closely related genomes than the other sets (Supplementary Figs S2 and S3). CheckM-estimated completeness and contamination of the reference genomes did not substantially differ among the sets (Supplementary Fig. S1). Thirty replicate metagenomes were simulated from each genome set, and each metagenome was assembled with both MEGAHIT and MetaSPAdes. The distribution of misassembly types identified by metaQUAST differed slightly between sets, with the N100 set containing relatively more relocations and translocations, and the IS sets containing more inter-species translocations (Fig. 2 and Supplementary Fig. S5). These results show that an increase in the number of genomes and/or genome relatedness increases inter-species translocation misassemblies.

Fig. 2. Distribution of errors found by metaQUAST (Mikheenko et al., 2016) in the N1000 training set (top) and N100 test set (bottom). The labels ‘inversion’, ‘transloc’, ‘reloc’ and ‘inter_transloc’ stand for inversion, translocation, relocation and inter-genome translocation, respectively, as defined by metaQUAST.

3.2 DeepMAsED generalizes to new metagenomes and outperforms ALE

DeepMAsED generalizes to the test data, regardless of the number of genomes in the test set or the amount of IS diversity (Fig. 3, Supplementary Fig. S7 and Table S4). Importantly, the genomes from the training and test sets do not overlap, suggesting that DeepMAsED generalizes well to novel genomes. Indeed, detailed examination shows that DeepMAsED generalization is robust to increasing taxonomic novelty (Supplementary Fig. S10 and Table S4). The same is true with regard to genome quality, as defined by CheckM estimates of completeness and contamination (Supplementary Fig. S11). DeepMAsED accuracy does decrease as contig length increases beyond the 10 kb window used for our convolutional neural network (CNN) model (Supplementary Fig. S8); however, contigs > 10 kb make up < 1% of contigs in each assembly, so a 50% recall is still achievable. Contigs > 10 kb have fewer inter-genome translocation misassemblies than shorter contigs, which may explain the loss of accuracy (Supplementary Fig. S4 and Table S6). By default, DeepMAsED is trained on contigs generated from both assemblers on each dataset; still, we find that DeepMAsED generalization is rather robust to the assembler used, although training on contigs from one assembler and predicting on the other did noticeably reduce accuracy relative to either same-assembler training/testing or joint training on contigs from both assemblers (Supplementary Table S6).

Fig. 3. Precision-recall curves for the N1000 training set (top) and N100 test set (bottom). N100 is the most challenging of the synthetic test datasets. The average precision score is computed for all methods and for both the MH and MS parts of the dataset. DeepMAsED generalizes to the test data and significantly outperforms ALE and random prediction. As neural network training is stochastic, five DeepMAsED models with the same architecture were trained on the whole N1000 training dataset; mean ± SD are shown.

DeepMAsED achieves a substantially higher precision relative to ALE for both assemblers at any recall threshold below 90% (Fig. 3 and Supplementary Fig. S7). For instance, on the N100 test data, DeepMAsED precision is 90% at 25% recall, while ALE only achieves close to 10% precision at the same recall. Both methods significantly outperform random prediction, which has low precision due to the class imbalance. Although achieving high precision at high recall remains an open problem, DeepMAsED provides a conservative method to identify misassembled contigs.

3.3 DeepMAsED predicts a substantial number of misassemblies in benchmark datasets

We apply DeepMAsED to two recent large-scale metagenome assembly studies, Almeida et al. (2019) and Pasolli et al. (2019), in which 92 143 and 154 723 MAGs, respectively, were generated from the assembly of thousands of human gut metagenome samples. Both studies utilized CheckM estimations of MAG completeness and contamination. Given the very large numbers of metagenomes used in Almeida et al. (2019) and Pasolli et al. (2019), we sub-sampled the datasets to 1438 and 509 MAGs assembled from 133 and 51 metagenomes, respectively.

We select two thresholds for DeepMAsED predictions: a conservative threshold achieving 25% recall and a flexible threshold achieving 50% recall on the IS_N1000 test data. IS_N1000 was chosen because it is the closest to a real-world scenario. These thresholds achieve precisions of 90% and 77%, respectively (Supplementary Fig. S7, right).

The flexible estimate finds 685 misassembled contigs in Almeida et al. (2019) and 224 in Pasolli et al. (2019) (Table 1), representing 1% misassembled contigs in both datasets. ALE identifies far more misassembled contigs, but its false positive rate is likely high, given the low precision achieved on the simulated test data (Supplementary Fig. S7).

Table 1. Number of contigs classified as misassembled

Method                 Conservative, n (%)   Flexible, n (%)
DeepMAsED (Almeida)    17 (0.02)             685 (1.0)
ALE (Almeida)          3392 (4.8)            37 312 (53)
DeepMAsED (Pasolli)    15 (0.07)             224 (1.1)
ALE (Pasolli)          1589 (7.5)            12 821 (61)

Notes: The conservative and flexible estimates correspond to recalls of 25% and 50%, respectively, on the IS_N1000 test set. At these recalls, DeepMAsED achieves precisions of 90% and 77%, respectively, whereas ALE achieves 19% and 13%. Only contigs with predictions from both DeepMAsED and ALE were used, resulting in 21 039 contigs for Pasolli et al. (2019) and 70 854 for Almeida et al. (2019). Values in parentheses are percentages of all contigs used. DeepMAsED predicts 1% of misassembled contigs at 50% recall.

3.4 CheckM scores are poor proxies for DeepMAsED predictions

Given that CheckM has become the de facto standard for determining MAG quality (usually based on cutoffs of > 50% completeness and < 5% contamination), we sought to determine how well CheckM quality estimates correlate with the prevalence of contig misassemblies, as identified by DeepMAsED. Both contamination and completeness scores from CheckM are poor indicators of DeepMAsED scores (Fig. 4 shows contigs from Almeida et al., 2019; scores for contigs from Pasolli et al., 2019 are provided in Supplementary Fig. S12).

Fig. 4. CheckM contamination (left) and completeness (right) for each contig in Almeida et al. (2019) against the corresponding DeepMAsED score. Points in red are classified as misassembled with a flexible estimate at 50% recall on the IS_N1000 test dataset. Both CheckM completeness and contamination are poor proxies for DeepMAsED quality predictions. Given the large number of contigs, contigs in green were sub-sampled to 2000 points to reduce overplotting.

3.5 DeepLIFT highlights critical features for high misassembly scores

We use DeepLIFT to identify features in contigs that contribute substantially to misassembly scores, and we show that scores spike mainly in two scenarios: (i) locations with low coverage and an increased number of nucleotide variants (Fig. 5, top) and (ii) locations with high coverage but also a high number of nucleotide variants (Fig. 5, bottom). Further examples can be found in Supplementary Figures S13 and S14. These results suggest that DeepLIFT, when used with DeepMAsED, can help to interpret why contigs are classified as misassemblies and to identify specific contig regions where misassemblies have likely occurred.

Fig. 5. Visualization of the DeepLIFT (feature importance) output on two contigs from Almeida et al. (2019) and Pasolli et al. (2019), NODE_6593_length_2879_cov_3.819051 and NODE_129_length_79118_cov_13.2449, displaying the effect of different input locations on the predicted score. Labels on the left of each plot denote the sections of the IGV contig view (‘Pos’ = position, ‘DL’ = DeepLIFT activations, ‘Cov’ = read coverage with each read shown as a horizontal bar, ‘Genes’ = genes depicted as blue arrows). The top figure displays high activations in zones of low coverage, whereas the bottom figure shows high activation in a zone with high coverage and a large number of nucleotide variants (multi-colored vertical lines in the coverage section of the plot).

3.6 Performance

We evaluate DeepMAsED using an NVIDIA K80 GPU. We benchmark performance by computing the average time per mini-batch over 100 mini-batches of size 64 on the training data. The average time per mini-batch is 0.084 ± 0.00 s, or 42 418 contigs of size 10 000 per minute. Using only an Intel Xeon E5-2680 v4 CPU at 2.40 GHz, evaluation is slower, at 0.75 ± 0.05 s per mini-batch, or 4587 contigs per minute.

The final model has 102 627 parameters, requiring a modest 1.3 MB of storage.

4 Discussion

4.1 DeepMAsED generalizes to a large number of research projects

Our findings suggest that DeepMAsED generalizes to a large variety of genomes and metagenome compositions for two widely used assemblers. In order for researchers to deploy DeepMAsED on their assemblies, input features must be computed. This requires mapping reads to contigs, but this is a standard step in other downstream applications such as estimating MAG relative abundances. Moreover, we provide code to compute the input features from raw BAM files, along with weight matrices from the fully trained system, which allows researchers to easily assess the quality of their assembled contigs. Misassembly classification is rapid and should scale to millions of contigs when using a GPU.

We also provide code to generate large amounts of realistic data in a straight-forward manner. This allows researchers to train and evaluate new models from scratch with larger or more specific reference genome datasets (e.g. using biome-specific microbes or virus genomes).

Although the focus of our work is metagenome assemblies, one could utilize our modeling framework for evaluating genome misassemblies. The model would likely need to be trained on a de novo genome assembly dataset, given that the misassembly error distribution would likely differ from metagenome assemblies.

4.2 Future steps toward improving performance

A first potential direction for improving DeepMAsED performance is increasing the amount of training data. This could be achieved in several ways such as increasing the number of genomes, the number of simulated metagenomes, and/or the corresponding sequencing depth. More metagenome assemblers could be included to increase generalization, but few show the scalability, accuracy, and popularity of MEGAHIT and MetaSPAdes (Wang et al., 2019).

In another direction, DeepMAsED could be trained on long read or hybrid short-long read metagenome assemblies. Especially for assemblies solely utilizing error-prone long reads (e.g. Oxford Nanopore reads), the misassembly error profile may substantially differ from the short read assemblies evaluated in this work (Nicholls et al., 2019).

Performance could potentially be improved by changing how long contigs are evaluated, which currently are batched in non-overlapping windows of fixed size. A sliding window may improve performance, but possibly at a high computational cost. Also, we note that the trained model we provide should only be used with contigs longer than 1000 bp, as shorter contigs were filtered during training. Recurrent neural networks may also be a promising solution, given that they do not require a fixed input size.

We could also enrich the features present at each location in the input contigs. One possibility is adding gene annotations, with the assumption that misassemblies could interfere with gene calling (i.e. no annotation) or produce poor annotations (e.g. genes of unknown function). This option adds significant overhead, since gene calling and annotation are computationally expensive. Another possibility is adding taxonomy information at each position by majority voting from a sliding window taxonomy, or using the entropy of the taxonomy sliding window. High entropy could signal a rapidly changing taxonomy in the region, which may be a result of misassembly generated chimerism. The main drawback of taxonomy features, in addition to computational costs, is the quality of the reference taxonomies themselves. This is particularly relevant in the context of novel genome assemblies, which can lack close relatives in reference taxonomies. One could also add ALE features as inputs to DeepMAsED, but this would add strong assumptions that may not hold for novel genomes.

Finally, we could improve DeepMAsED by providing more fine-grained predictions for the type of misassembly present (e.g. inversion or relocation) or identifying the location of the misassembly directly. This would be of particular value for long contigs in which a large portion of the contig may still be correctly assembled.

5 Conclusion

DeepMAsED is a generally applicable tool for evaluating metagenome assembly accuracy. It demonstrates that deep learning is a tractable solution for reference-free identification of metagenome-assembled contig misassemblies and provides a benchmark to improve upon. While there is room for improving the accuracy of DeepMAsED, we note that, regardless of the misassembly classification threshold used, the researcher can use DeepLIFT, ALE, metaQUAST or other tools to investigate the subset of contigs tentatively identified as misassemblies by DeepMAsED. This hybrid automated-manual curation approach should greatly reduce the workload relative to fine-grained checking of all contigs individually.

Funding

This work was supported by the Max Planck Society and by the Max Planck ETH Center for Learning Systems.

Acknowledgements

We thank Guillermo Luque for helpful discussions on this work and three anonymous reviewers whose comments have greatly improved this manuscript.

Conflict of Interest: none declared.

References

Almeida A. et al. (2019) A new genomic blueprint of the human gut microbiota. Nature, 568, 499–504.

Clark S. et al. (2013) ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics, 29, 435–443.

Gilpin L. et al. (2018) Explaining explanations: an overview of interpretability of machine learning. In: IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pp. 80–89. IEEE.

Huang W. et al. (2012) ART: a next-generation sequencing read simulator. Bioinformatics, 28, 593–594.

Hunt M. et al. (2013) REAPR: a universal tool for genome assembly evaluation. Genome Biol., 14, R47.

Ioffe S., Szegedy C. (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning (ICML), Vol. 37, pp. 448–456.

Kingma D., Ba J. (2015) Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR).

Kuhring M. et al. (2015) SuRankCo: supervised ranking of contigs in de novo assemblies. BMC Bioinformatics, 16, 240.

Langmead B., Salzberg S. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.

Li H. et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics, 25, 2078–2079.

Li D. et al. (2015) MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 31, 1674–1676.

Mikheenko A. et al. (2016) MetaQUAST: evaluation of metagenome assemblies. Bioinformatics, 32, 1088–1090.

Nicholls S. et al. (2019) Ultra-deep, long-read nanopore sequencing of mock microbial community standards. Gigascience, 8.

Nurk S. et al. (2017) metaSPAdes: a new versatile metagenomic assembler. Genome Res., 27, 824–834.

Olson N.D. et al. (2019) Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes. Brief. Bioinform., 20, 1140–1150.

Parks D. et al. (2015) CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res., 25, 1043–1055.

Parks D. et al. (2018) A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol., 36, 996–1004.

Pasolli E. et al. (2019) Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell, 176, 649–662.

Rodriguez L. et al. (2018) Nonpareil 3: fast estimation of metagenomic coverage and sequence diversity. mSystems, 3, e00039-18.

Shrikumar A. et al. (2017) Learning important features through propagating activation differences. In: Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 3145–3153. JMLR.org.

Srivastava N. et al. (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15, 1929–1958.

Wang Z. et al. (2019) Assessment of metagenomic assemblers based on hybrid reads of real and simulated metagenomic sequences. Brief. Bioinform.

Youngblut N. (2019) MGSIM. https://github.com/nick-youngblut/MGSIM. doi: 10.5281/zenodo.3696891.

Author notes

The authors wish it to be known that, in their opinion, Olga Mineeva and Mateo Rojas-Carulla should be regarded as Joint First Authors.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Pier Luigi Martelli