Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage

Genome assemblies that are accurate, complete and contiguous are essential for identifying important structural and functional elements of genomes and for identifying genetic variation. Nevertheless, most recent genome assemblies remain incomplete and fragmented. While long molecule sequencing promises to deliver more complete genome assemblies with fewer gaps, concerns about error rates, low yields, stringent DNA requirements and uncertainty about best practices may discourage many investigators from adopting this technology. Here, in conjunction with the platinum standard Drosophila melanogaster reference genome, we analyze recently published long molecule sequencing data to identify what governs completeness and contiguity of genome assemblies. We also present a hybrid meta-assembly approach that achieves remarkable assembly contiguity for both Drosophila and human assemblies with only modest long molecule sequencing coverage. Our results motivate a set of preliminary best practices for obtaining accurate and contiguous assemblies, a ‘missing manual’ that guides key decisions in building high quality de novo genome assemblies, from DNA isolation to polishing the assembly.


Outline of the quickmerge algorithm:
1) Read delta file and store the start and ends of MUMs. Convert the delta format into tsv format and write it as a file called "aln_summary.tsv".
2) Check whether the queries are completely inside the reference sequences.
3) Check whether the query alignment is at the 5' or 3' end of the reference. 4) Do the same as 3 but check if the alignment is at the 5' or 3' end of the query. 5) Check if the query is on the forward or reverse strand (the reference is always on forward strand). 6) Calculate the total length of alignment for a reference and query pair. 7) Calculate the length of overlapping but non-aligning sequence in a reference-query alignment. 8) Calculate the length of sequence overhangboth for reference as well as query in an alignment. 9) Find the anchor/seed alignments based on the length and HCO cut-off. 10) Find the path through the alignment graph that represent the longest contiguous path through the HCO nodes. Reference contigs meeting the HCO and length threshold are used as initiation points for paths and extended in both 5' and 3' directions if an adjacent HCO node is available. 11) Generate the sequence corresponding to the path found in step 10, using contig sequences from reference and query assemblies.

Downsampling
We used three different downsampling schemes on the D. melanogaster data: first, we randomly downsampled the data by drawing a random set of SMRTcells of data from the entire set of 42 SMRTcells; second, from those datasets, we downsampled the longest 50% and 75% of the reads. Finally, we downsampled the D. melanogaster data to match the read length distributions of PacBio reads from a pilot Drosophila pseudoobscura genome project that was produced using a standard protocol without aggressive size selection (generously made available by Stephen Richards). We used the lowess function in R with a smoother span (f) of 1/5 to generate curves representing the distribution of read lengths in the D. melanogaster and D. pseudoobscura datasets, then assigned a probability to each read length defined as the quotient of the melanogaster distribution and the pseudoobscura distribution at that read length. Reads were then randomly removed from the D. melanogaster dataset according to the assigned probabilities. This method was used for all coverage up to 53X. Thus, we generated a read set that relatively closely resembles the read length distribution of the original D.
pseudoobscura data, but is made up of D. melanogaster sequence data, allowing for a comparison of assembly quality with regard to read length without differences in the genomes of the two species as a confounding factor. The lowess function resulted in a slightly oversmoothed distribution such that samples drawn from it were slightly longer than in D.
pseudoobscura. Consequently, assemblies from these reads should be slightly better than if they exhibited the (shorter) distribution for D. pseudoobscura. As such, this choice is conservative and, if anything, underestimates the importance of size selection.
Additionally, we downsampled based on read quality to test the effect of read quality on assembly contiguity. We used a custom script to separate the entire 42 SMRTcell ISO1 dataset into two halves. One half contained the 50% of all reads with the lowest average base quality, while the other half contained the 50% of all reads with the highest average base quality. Cutoffs were chosen for individual 100bp length bins, so the resulting datasets maintained the length distribution of the original data. We also generated a dataset containing 50% of the data that consisted of randomly chosen reads (to preserve the quality distribution of the original data).

PacBio only assembly
The spec file for all PacBio only assembly was as follows -