Common workflow language (CWL)-based software pipeline for de novo genome assembly from long- and short-read data

ABSTRACT Background Here, we created an automated pipeline for the de novoassembly of genomes from Pacific Biosciences long-read and Illumina short-read data using common workflow language (CWL). To evaluate the performance of this pipeline, we assembled the nuclear genomes of the eukaryotes Caenorhabditis elegans (∼100 Mb), Drosophila melanogaster (∼138 Mb), and Plasmodium falciparum (∼23 Mb) directly from publicly accessible nucleotide sequence datasets and assessed the quality of the assemblies against curated reference genomes. Findings We showed a dependency of the accuracy of assembly on sequencing technology and GC content and repeatedly achieved assemblies that meet the high standards set by the National Human Genome Research Institute, being applicable to gene prediction and subsequent genomic analyses. Conclusions This CWL pipeline overcomes current challenges of achieving repeatability and reproducibility of assembly results and offers a platform for the re-use of the workflow and the integration of diverse datasets. This workflow is publicly available via GitHub (https://github.com/vetscience/Assemblosis) and is currently applicable to the assembly of haploid and diploid genomes of eukaryotes.


Background
The assembly of genomes to chromosomal contiguity for many eukaryotic organisms has turned out to be a daunting task but has been achieved, for instance, for Homo sapiens, Mus musculus, Caenorhabditis elegans, Drosophila melanogaster, and Plasmodium falciparum [1][2][3][4][5][6]. The reference genomes of these organisms now meet the quality requirements set by the National Human Genome Research Institute (NHGRI-NIH) [7], namely, that the accuracy of the assembled nucleotides is at least 99.99% (≤1 nucleotide error over 10,000 bp), decontaminated contigs (each >30 kb) are ordered to form chromosomes, the sizes of gaps between any two contigs have been estimated, and the completeness of each chromosome is ≥95%.
For the first completed genome assemblies (i.e., C. elegans and H. sapiens), effective but costly and time-consuming bacterial artificial chromosome-based Sanger sequencing approaches were used [1,3]. The use of less expensive, second-generation sequencing technologies [8,9], such as Illumina [10], led to a rapid expansion in the number of draft genome assemblies for a range of metazoan organisms [11]. However, due to the inability to resolve repetitive DNA regions using short nucleotide read (50-300 bp) datasets [12], draft genomes are typically incomplete, fragmented, and contain mis-assembled regions, all of which constrain gene predictions and any subsequent genomic anal-yses [9,13]. Nonetheless, novel draft genomes have opened up exciting new avenues for research on many non-model organisms, including parasites [14][15][16][17][18][19][20]. Some of these parasites cause neglected tropical diseases (NTDs), collectively representing a burden ≥1% of disability-adjusted life years per annum worldwide, with a related annual cost of anthelmintic treatment estimated at $3 billion [21]. In addition, resistance to anthelmintic drugs, used in mass drug administration, is a looming threat [22][23][24][25]. For these reasons, there is an imperative to advance genomic and systems biological research of these pathogens in order to gain a deep understanding of areas such as parasite biology, parasite-host interactions, disease, and drug resistance. The availability of high-quality genome assemblies is, thus, of utmost importance and could expedite the identification of novel drug targets and the design of advanced interventions (anthelmintics and vaccines) and diagnostic systems for the improved control of NTDs.
To enhance assembly quality, the use of long genomic reads (<100 kb in length) produced using third-generation sequencing technologies allows the resolution of long repeat regions and substantially reduces fragmentation [8]. With the use of scaffolding technologies, such as Hi-C [26] and BioNano [27,28], the gap toward achieving high-quality de novo genome assemblies is closing [29]. The most prominent third-generation sequencing platforms currently available are the Pacific Biosciences (PacBio) single-molecule, real-time sequencer (RS) [30][31][32] and the in silico nanopore-based MinION and GridION systems from Oxford Nanopore [33]. The error rates in sequences generated using these technologies are ∼ 13% and 5%-40%, respectively [34,35], and ∼15% for 1D and ∼5% for 1D 2 for the latest 2016 Nanopore R9 chemistry, such that substantial sequencing depth is required to resolve sequencing errors [36]. Genomes assembled from sequence data from these platforms typically exhibit high numbers of indels. Depending on sequencing depth, it is common to employ accurate short-read data to validate or resolve inaccuracies in such genomes using a process referred to as "polishing" [29,36,37]. The quality and completeness of genome assemblies can be affected by quality and yield of DNA isolated from organisms, such as parasites, and challenges associated with extracting nucleic acids from them [38,39]. DNA quantity is often limited because of the small size of some parasites and a need to isolate DNA from multiple organisms rather than one. There are often challenges in acquiring material from patients in distant locations, the cost of transport of such materials to a laboratory, and complications relating to microbial contamination, DNA degradation and nicking, co-purification of contaminating constituents, such as carbohydrates and lipids [38][39][40], and/or unique aspects, such as chromosomal diminution, in some parasites [41]. Clearly, the quality and amount of DNA have a major impact on completeness of a final genome assembly, irrespective of the sequencing technology employed.
A suitable computing environment and software tools are essential for producing a high-quality genome assembly. Such tools have dependencies on one another, particularly in terms of running order and software versions, and often require custom scripts for the integration of tools. Therefore, a substantial amount of time and effort is often required to complete a new assembly from scratch. Recently, issues surrounding the repeatability and reproducibility of results and reusability of datasets have been emphasized as being critical for scientific research [42][43][44][45], which have been neglected in some fields. Results are (i) repeatable, if the same findings are achieved multiple times using the same data [43]; (ii) reproducible, if the same findings are achieved multiple times using reproduced data [43]; and (iii) reusable, if new results are achieved using new data [42]. There is clear evidence that the repeatability of experiments that use software tools in published, peer-reviewed literature and the reusability of software for new experiments are challenging and/or error-prone [43,46]; it is thus of prime importance to tackle these pertinent issues.
One possible approach would be to employ frameworks, such as SnakeMake [47], Ruffus/Rubra [48], Toil [49], and Rabix [50], or to use the common workflow language (CWL) [51] for workflows [52]. Each of these frameworks can be used to build bioinformatics pipelines, to execute complex tasks through the integration of software tools and the control of execution flow, in order to maximize the use of available computer resources and to ensure the repeatability of an experiment and reusability of a task. For instance, SnakeMake has been used in multiple workflows relating to RNA sequencing analyses [53], and Rubra is used in workflows such as RedDog [54] to infer single-nucleotide polymorphism (SNP) datasets derived from bacterial populations for subsequent phylogenetic analyses. By contrast, CWL defines a specification and offers a reference implementation, instead of providing a complete framework. The major advantage of CWL is its capacity to implement this specification for different computing environments and/or workflow frameworks, and CWL is already available in Toil and Rabix. To automate software installation, CWL supports "pull action" of Docker containers [46] and has beta-implementation for the integration of Bioconda (bioinformatics software package channel) [55]. Docker supports operating system virtualization [46] and has the capacity to form customized "containers" through the installation of particular software components. These containers can be deployed to different platforms, thereby conferring cross-platform portability [46]. Bioconda relies on the universal package manager Conda [56] to build binary software packages for Linux, MacOS, and Windows operating systems, to manage dependencies among software components within these packages, and to install packages locally into an isolated environment [55]. Although BioConda provides Docker containers for individual versions of a software tool to achieve high repeatability, built-in stochasticity of distinct versions has the potential to effect repeatability. CWL can use both Docker and Bioconda to install and run defined versions of software tools without manual intervention. Despite a growing interest in CWL, this framework has not yet gained the popularity that it deserves.
Here, employing CWL v1.0, we established an entirely novel, automated genome assembly pipeline [57] that integrates software tools and data from multiple sequencing platforms. This pipeline achieves repeatable and reproducible high-quality genome assemblies for metazoan organisms using PacBio sequence data, followed by "polishing" with Illumina short-read data. The pipeline resolves the dependencies among software packages via well-defined, versioned software packages that are automatically installed and executed at each step in the workflow, as required. This genome assembly pipeline should be broadly applicable in the biological and biomedical sciences.

CWL assembly pipeline
The pipeline executes the programs integrated into the bioinformatics workflow (Fig. 1). First, PacBio reads from HDF5 formatted files that were converted to FASTA formatted files using the program Dextractor. These raw reads were then corrected using multiple rounds of read overlapping [58] and trimmed (e.g., re- Figure 1: Diagram illustrates an automated common workflow language (CWL)-based genome assembly pipeline for PacBio long-read and Illumina short-read data. PacBio reads are first pre-processed and then used for assembly and long-read polishing. Illumina reads are cleaned and used to further polish the long-read assembly. Finally, haplotypes are merged in the repeat-masked, polished assembly. While the workflow is running, dependent software tools are automatically deployed from Bioconda package channel and DockerHub container repository. The code for the workflow and the Dockerfiles for the docker containers are stored in a GitHub code-repository. moval of hairpin adapters and chimeric sequences) [36] using the program Canu. Subsequently, reads from potential contaminants (such as viruses, bacteria, and/or other microbes) were removed using the program Centrifuge, and remaining reads were assembled employing the program Canu. Using the program Arrow, PacBio raw reads were then employed to polish the assembly; further polishing was done with Illumina reads using the program Pilon. For polishing, Illumina reads were cleaned using the program Trimmomatic, mapped to the Arrow-polished assembly using the program Bowtie2, and sorted using the program SAMtools. For haplotype removal from the resultant assembly, custom repeat regions were inferred using the program RepeatModeler. The assembly was then masked employing inferred custom repeats, known transposons, and inferred tandem repeats using the program RepeatMasker. Finally, the program HaploMerger2 was used to identify and then remove the duplicated haplotypes from the masked Pilon-polished assembly, resulting in the final de novo-assembled diploid genome. Docker containers used in the pipeline were deposited to DockerHub [46] and automatically deployed using the software udocker. Required software tools were automatically fetched from Bioconda and installed into the target compute environment.

Pipeline assemblies
Using the CWL assembly pipeline, the reference genomes of C. elegans, D. melanogaster, and P. falciparum were each re-assembled from publicly available PacBio and Illumina datasets (Table 1). Quality metrics were calculated for the resultant assemblies at each phase of the pipeline, i.e., Canu contigs, Arrow-polished contigs, Pilon-polished contigs, and haplo-merged contigs (Ta-bles 2-4). For P. falciparum with haploid DNA [59], the Pilonpolished contigs represented the final assembly.

Completeness and contiguity
The final assembly for P. falciparum (23.4 Mb; GC content of 19.33%; no gaps and no unresolved nucleotides) represented complete chromosomes (n = 14) and a complete apicoplast genome (Table 4a). When aligned to the reference (23.3 Mb; GC content of 19.34%; no gaps and no unresolved nucleotides), the assembly had 15.3 kb, 193.3 kb, and 89.4 kb of "missing, duplicated, and compressed reference bases," respectively (Table 4a). In terms of contiguity and completeness, the Arrow-polished assembly (23.4 Mb) was no different from the Pilon-polished one (Table 4a).
The haplo-merged assembly of D. melanogaster resulted in 61 contigs (N50 = 13.3 Mb; L50 = 4; L90 = 10; GC content of 42.2%; no gaps or unknown nucleotides) and was markedly smaller (129.7  Tables 2, 3, and 4a). In comparison to pure Canu assemblies, Arrow mpolishing increased the number of complete BUSCO orthologs from 147 to 148, 952 to 969, and 1,637 to 1,653, respectively, and reduced the fragmented BUSCO orthologs in C. elegans from 21 to 10 and D. melanogaster from 17 to 2 (Tables 2-4). Pilon polishing did not change the total number of BUSCO orthologs detected but did reduce the number of fragmented orthologs by two for C. elegans (Tables 2, 3, and 4a).

Vembar assembly for P. falciparum
When compared with the reference assembly, the Vembar assembly resulted in 1,233 nucleotide mis-matches, 546 large indels, and 31,261 small indels (Table 4b). For the Arrow-polished Vembar assembly, the number of nucleotide differences increased slightly (n = 1,396), but the number of large (n = 213) and small (n = 9,391) indels was substantially reduced (Table 4b). The comparison of the Arrow-polished pipeline assembly to the Vembar assembly resulted in a modest number of nucleotide mismatches (n = 458 bp), but in a high number of large (n = 338) and small (n = 28,473) indels (Table 4c). For the Pilon-polished pipeline assembly, the numbers were similar (n = 443 mis-matches; n = 336 large indels; n = 28,490 small indels) when compared with the Vembar assembly (Table 4c). However, the numbers of nucleotide differences (n = 368) and large (n = 154) and small (n = 3901) indels were small when the Arrow-polished Vembar assembly and the Arrow-polished pipeline assembly were compared (Table 4c). Both the Vembar assembly and Arrow-polished pipeline assembly shared 8,947 indels and 2,007 nucleotide differences in the same locations in the reference genome. For the Vembar assembly, it was 7.7-fold more likely for indels to be observed in non-coding (27,619 indels/10,987,349 bp) than in coding regions (4,172 indels/12,282,956 bp) (Tables 1 and 4a). The numbers of BUSCO orthologs detected were 142, 147, and 147

Indel correlations
For P. falciparum, the genomic locations with indels correlated positively with positions of nucleotide differences, repeat regions, and gaps in mapping coverage and correlated negatively with coding regions, GC content, and Illumina-mapping coverage (Fig. 2). Although not as pronounced, a similar pattern was observed in both C. elegans and D. melanogaster (Fig. 2). None of the assemblies showed a clear distinction in correlation between PacBio sequencing depth and coding and/or repeat regions (Fig. 2). Telomeric regions, being at the ends of the chromosomes of P. falciparum, were clearly visible based on an abundance of repeats and a lack of coding sequences (Fig. 2).

Discussion
In the present study, we demonstrate unequivocally that CWL is a language to clearly describe a workflow and develop a fully automated pipeline with capacities to parallelize its execution, to define dependencies to the order of execution, and to automatically install versioned software packages. Therefore, CWL offers a practical and convenient way for researchers to obtain repeatable and reproducible results from bioinformatics experiments for subsequent scientific publications. This language is highly suited to different compute environments for integration, the reuse of diverse datasets, and repeating or reproducing results reported from previous experiments (using CWL) published in the peer-reviewed literature. Current reference implementation of CWL does not scale to distributed compute systems but is usable on servers configured with multiple central processing units (CPUs). For the present assembly workflow, the use of soft- ware tools directly from Bioconda was preferred [62], and Docker containers were only created for custom scripts or if a tool was not available in Bioconda or dysfunctional. For instance, it was not possible to use RepeatModeler via Bioconda because the latest RepeatLibrary from RepBase could not be installed in that version. The integration of RepeatModeler with a Docker container resolved this issue. Thus, CWL allows an efficient integration of alternative tools and extensions, such as assemblers and new scaffolding tools. Despite the successful creation of the present assembly workflow, CWL v1.0 has some limitations. The essential feature of container integration currently supports only Docker containers and, thus, can pose a serious security risk in a multi-user computing environment, such as high-performance computing (HPC) systems [63][64][65]. The container processes are spawned from a root-owned Docker daemon and, consequently, executed as a root, thus escaping policies to the privileged usage of resources and controls [64,65], which may lead to "container escape attacks" [63]. For example, knowing that Docker daemon communicates either using a Unix-or TCP-socket and that the Unix socket typically has root: docker (user: group) rights, users who belong to the docker-group are granted root rights to resources such as file systems, communication protocols, and mounting, thereby exposing the environment to malicious and/or accidental mis-uses [63]. The possible case of daemon communicating via a TCP socket would allow misuse from out-side of the server through an internet connection, if not appropriately configured [63]. The distribution of Docker images, for instance, from DockerHub, has the potential to lead to the distribution of malicious Dockerfiles through a compromised GitHub account [63]. The latter issue can be prevented by uploading docker images directly to DockerHub or by disabling the update-link between GitHub and DockerHub. CWL implementation addresses the security issue related to root rights by enforcing the user and group identifiers to those of the current user in Docker execution. However, a security risk still remains, because Docker containers can be used in non-CWL contexts and, therefore, should not be installed into a multi-user HPC environment. This security issue can be addressed in CWL by extending support to containers, such as the open source effort called Singularity [65], or by using an alternative Docker implementation, such as rootless udocker, which was shown to be successful in the present study.
In addition to security aspects, minor issues relating to the use of CWL were encountered. For instance, CWL enforces readonly access to the file system inside a Docker container, thereby creating unnecessary complexity when using some tools, such as SmrtLink. Specifically, in SmrtLink, the creation of reference genomes in the file system is hardcoded. Therefore, it would be advisable for CWL to allow the user to pre-define directories with write-access inside the container. The latter restriction does not exist when udocker is used, leading to a compatibility issue. Re- Figure 2: Correlation diagrams of indels are illustrated for one chromosome of each reference genome reassembled. The columns represent Caenorhabditis elegans, Drosophila melanogaster, and Plasmodium falciparum, from left to right. The y-axis on left side represents the data to correlate with indels (gray bars and smoothened black line), whereas red bars and blue bars on the right side represent positive and negative correlations, respectively. Clearly, the regions around indels correlate with those around nucleotide differences, repeat regions, non-coding non-repeat regions, and gaps in Illumina coverage. In contrast, regions around GC content, coding regions, and Illumina coverage correlate negatively to those around indels. As expected, due to lack of context bias, PacBio coverage does not show clear correlation to indels and have only few low coverage regions in these chromosomes. The correlation patterns for C. elegans and D. melanogaster follow those of P. falciparum, although they are not as conspicuous.
garding the workflow definition, the order of execution relies on the resultant data from the previous step to be consumed in the next one, sometimes enforcing workarounds, such as "expression tool" for file indexing; therefore, alternative methods are needed to address these dependencies. Finally, support for alternative workflow paths would facilitate the creation of versatile and adaptive workflows.
Using the present CWL-based assembly workflow, all three genome assemblies were completed successfully. Metrics from the evaluation methods Quast and GAGE were used to compare the CWL-based assemblies to respective, high-quality reference genomes (Tables 2-4a; Fig. 2). To avoid false reports on misassemblies, particularly those caused by transposons, key parameters were set at twice the minimum read length of 6 kb [66] for the aligned sequences and 99.5% for the alignment accuracy. For Quast metrics, these parameter settings linked events, such as transposon insertion and deletion, to local mis-assemblies instead of relocations or translocations. In addition, it needs to be acknowledged that some degree of built-in stochasticity in the programs is to be expected, such that resultant assemblies might differ slightly when the workflow is repeated.
The assembly of the smallest genome (23 Mb; P. falciparum) using a PacBio sequence coverage of 225 (Table 1) achieved chromosomal contiguity and also yielded the whole apicoplast genome. The circular nature of the apicoplast genome was not recognized by the program Canu and, thus, needed processing with the program Circlator [67] to circularize it. For the P. falciparum datasets used herein, DNA was derived from infected human erythrocytes [59], which likely predominantly contained (haploid) merozoites from an in vitro culture; thus, the program Haplomerger2 was not applied to the assembly. The original laboratory strain 3D7 of P. falciparum was isolated from a patient in the Netherlands in 1987 [68] and is maintained and propagated by continuous in vitro culture [69]. Using MicroArray technologies, employing a coverage of 76% for the coding and 41% for the non-coding regions, Bopp and coworkers [70] demonstrated that the genome of P. falciparum was relatively stable, showing only 58 small nucleotide variants the parental 3D7 clone relative to the 3D7 reference genome published in 2002 [6]. Mutation and structural variation rates were estimated at 1.7 × 10 −9 and 4.7 × 10 −6 per nucleotide per generation, respectively [70]. Therefore, minor deviations from the reference genome were expected in the present pipeline assemblies.
The Quast metrics for the Arrow-polished pipeline against the Vembar assembly (i.e., polished using the program Arrow) showed only one mis-assembly and nine local mis-assemblies, and the number of nucleotide mis-matches (n = 458; 1.96 per 100 kb) was comparable with an estimated nucleotide accuracy of 99.999% [59]. However, the number of indels (n = 28,775; 123 per 100 kb) raised some questions. From the correlation diagrams, using the reference assembly, it was evident that indels correlated positively to AT-rich non-coding regions and negatively to less AT-rich coding regions (Fig. 2). This information suggests that AT-rich regions are vulnerable to indels, supported by a likelihood of 18.0-fold to observe indels in non-coding rather than in coding regions for the Arrow-polished pipeline assembly, and 7.7-fold for the Vembar assembly (polished using the program Quiver [71], the predecessor of the program Arrow). To further clarify this aspect, we showed that both assemblies shared a substantial number of indels (n = 8,947) and nucleotide differences (n = 2,007) in the exact same locations in the reference genome, therefore, suggesting that discrepancies might represent accumulated mutation events as a consequence of continuous in vitro culture of P. falciparum. The comparison of these assemblies to the reference genome revealed slightly less nucleotide differences (n = 1,233; 5.35 per 100 kb) and more indels (n = 31,807; 138 per 100 kb) in the Vembar assembly than in the pipeline assembly (n = 1,242, i.e., 5.36 per 100 kb for nucleotide differences, and n = 9,409, i.e., 40.59 per 100 kb for indels), suggesting a better compliance of the latter assembly with the reference genome. Interestingly, the Arrow-polished Vembar assembly resulted in a reduced number of indels with respect to both the reference genome (n = 9,604; 41.56 per 100 kb) and the Arrow-polished pipeline assembly (n = 4,055; 17.37 in 100 kb). Taken together, this information suggests a difference in the efficiency of polishing between the Quiver-polished Vembar assembly and the Arrow-polished pipeline assembly. This difference is likely due to the use of corrected reads for the polishing of the Vembar assembly, as raw reads were used for the Arrowpolished pipeline assembly. This insight suggests that substantial sequencing depth (≥100) of raw reads is beneficial compared with a limited depth of corrected reads. This observation supports the assumption in which high sequencing depth results in increased accuracy in a consensus sequence due to the elimination of erroneous base calls (random error rate of 11%, no sequence context bias) from PacBio data [72]. Indeed, PacBiocoverage of mapped raw reads shows neither a clear correlation pattern for coding nor for non-coding regions (Fig. 2), supporting the assumed absence of a sequence context bias and the proposal for the use of raw reads for polishing.
The N2 strain of C. elegans was originally collected in 1951 near Bristol, England [73], and was propagated in culture for about 300 to 2,000 generations from 1951 to 1969 [73] before cryogenic preservation was applied for storage. The use of this strain around the world is likely to be associated with phenotypic differences in the worm among laboratories linked to genetic change over time [73]. For D. melanogaster, the iso-1 laboratory strain [74] used for reference genome assembly was sequenced from libraries in 1990, 1998, and 1999, and differences among sequences assembled from these libraries were detected during the creation of a third version of the reference assembly [75]. Based on this information, mutation events are expected to be detected in both reference genomes of both of these model organisms. The vulnerability to indels in Pilon-polished pipeline assemblies is reflected in likelihoods of 10.9-fold to encounter indels in non-coding rather than coding regions in C. elegans, and 10.3-fold in D. melanogaster, similar to 20.4-fold for P. falciparum. For C. elegans and D. melanogaster, the correlation patterns for indels in coding vs non-coding regions resemble those for P. falciparum, although they are less conspicuous (cf. Fig. 2).
As expected, Illumina read-coverage gaps correlate positively to indels, which correlate negatively to coding and positively to non-coding regions (cf. Fig. 2), indicating low read-coverage in non-coding regions and suggesting low resolution of AT-rich sequences. These findings suggest that Pilon-based polishing is more efficient in coding than in non-coding regions. This aspect was demonstrated for the Vembar assembly of P. falciparum data by a greater reduction in indel number in coding regions (n = 4,172 to 1,748; ratio: 2.38) than in non-coding regions (n = 27,619 to 22,403; ratio: 1.23). In addition, for C. elegans, the Pilon-polished assembly had similarly reduced indel numbers in coding regions (n = 1,104 to 177; ratio: 6.24) compared with non-coding regions (n = 21,499 to 5,889; ratio: 3.65) in the Arrow-polished assembly. However, Pilon-based polishing altered only slightly the numbers of indels in the pipeline assemblies for P. falciparum and D. melanogaster. This is likely due to the high coverage of PacBio raw data for P. falciparum (n = 225x) and D. melanogaster (n = 109x) in comparison to C. elegans (n = 47x), supporting the beneficial effect of substantial sequencing coverage of PacBio data on observed indels [36]. Neither Arrow-nor Pilon-polishing had a major effect on nucleotide mis-matches in any of the three assembled genomes; for the pipeline assemblies (Canu, Arrow-polished, Pilon-polished, and HaploMerger2-merged), C. elegans had between 13,869 and 15,355 mis-matches, D. melanogaster had between 4,909 and 8,441, and P. falciparum had between 1,242 and 2,237 mis-matches. A putative dependency of indels and nucleotide differences on gene predictions was reflected in the BUSCO results, in which an increase in the number of complete BUSCO orthologs was recorded following Arrow polishing for C. elegans (n = 954 to 969), D. melanogaster (n = 1,637 to 1,653), and P. falciparum (n = 147 to 148). This pattern was reflected also in the numbers of affected mRNA/conceptually translated protein sequences, i.e., 2,877/2,858 to 969/948, 2,660/2,640 to 123/105 and 711/704 to 420/418, respectively. Pilon-polishing improved the BUSCO result only for C. elegans (n = 969 to 971).
Combined with the observed lack of sequence context bias for PacBio data in correlation diagrams (Fig. 2), the likelihood of encountering indels in coding vs non-coding regions (for all three organisms) strongly supported the existence of mutation events, as expected based on the origins and culturing conditions/environments/techniques used for each of these model organisms. These observations demonstrate a challenge to accurately assemble AT-rich regions.
In terms of reference quality, the completeness of the genomes of C. elegans (97.0%) and P. falciparum (99.6%) is clearly >95%, but D. melanogaster (91.5%) was incomplete. The latter finding is likely due to a substantial interspersed repeat content in the Pilon-polished assembly for D. melanogaster (28.8%) compared with that of the reference genome (19.0%) and this content's influence on the performance of the program Hap-loMerger2. The number of mis-assemblies reduced substantially (from 136 to 63), as did the predicted size of the genome (from 158.0 to 129.7 Mb) and its completeness (98.1% to 91.5%). For D. melanogaster, the high interspersed repeat content is likely due to the use of pooled male iso1 flies (n = 1950) for the original DNA extraction for sequencing [61], and HaploMerger2 has likely compressed the interspersed repeat content (14.6%) to less than that of the reference (19.0%). For C. elegans, the increase in observed translocations (from 1 to 5), following the application of HaploMerger2, suggests an impaired detection of haplotypic sequences. For these reasons, being able to use sequence reads in HaploMerger2 might help create more confident results and could support the assembly of polyploid genomes, such as that of the parasitic nematode Haemonchus contortus [76].
For C. elegans and D. melanogaster, contigs did not represent complete chromosomes, which emphasizes the need for scaffolding technologies, such as Hi-C and/or BioNano. Limited amounts of sub-optimal quality DNA from invertebrates, including parasites [38][39][40], can often lead to fragmented DNA, ultimately resulting in gaps in assembled sequences [9]. Therefore, the role of scaffolding technologies is of critical importance to achieve chromosomal contiguity. The program BUSCO, conventionally used to assess the completeness of genome assemblies, was utilized here to evaluate gene completeness of the present assemblies in relation to the reference genomes. For P. falciparum, gene completeness (68.4% to 68.8%) was low compared with C. elegans (98.8%) and D. melanogaster (99.6%). This low value for P. falciparum is misleading, as it relates to an inadequate representation in BUSCO of data for protistan taxa, which are closely related to P. falciparum. For the pipeline assemblies of both C. elegans and P. falciparum, the gene completeness was slightly better than that of respective reference genomes. The requirement for an accuracy of ≥99.99% [7] is somewhat debatable for de novo assemblies produced using the present CWL pipeline, because the number of accumulated mutation events (over time) is not known. The highest accuracy (>99.99%) was achieved for coding regions vis-à-vis non-coding regions (>99.9%; <99.99%) ( Tables 2-4). For P. falciparum, the numbers of mis-assemblies (n = 2) and local mis-assemblies (n = 47) in the Pilon-polished pipeline assembly vs the reference assembly was low; while some of these mis-assemblies are genuine, others might be "false positives" caused by repetitive regions or mitotic, homologous recombination events occurring in cell culture. For C. elegans and D. melanogaster, the numbers of mis-assemblies (n = 58, n = 63, respectively) and local mis-assemblies (n = 696, n = 313, respectively) were clearly higher than those in P. falciparum. The runtimes required to assemble genomes depend largely on genome size, amount of genomic data, and the characteristics of the genome, such as GC and repeat contents. Therefore, the runtime does not always follow the size of the genome. Here, runtimes were 424, 1,537, and 6,501 CPU hours for the genomes of P. falciparum, C. elegans, and D. melanogaster, respectively. The re-spective calendar time is dependent on the server configuration, such as the number of CPUs, and the pipeline can be readily expanded to HPC clusters in the future. The RAM usage peaked at 132.1 GB for all three assemblies when the program Centrifuge loaded NCBI NT database into heap memory.

Conclusions
Our aim in this study was to produce and evaluate the capacity of CWL to define a repeatable, reproducible, and reusable bioinformatics workflow for genome assembly. This pipeline was assessed for the de novo assembly of eukaryotic genomes of ∼23-138 Mb employing PacBio long-read and Illumina short-read data. It has also been used to assemble genomes of ∼300 Mb in shorter run times than for the D. melanogaster genome (138 Mb), using similar data coverage, which indicates that it will be applicable to larger genomes. Clearly, CWL achieved our aim, and using high-quality DNA with high sequencing depth, the present pipeline produced near reference-quality assemblies using PacBio data alone. However, when PacBio sequencing depth was moderate, such as for C. elegans, the use of additional shortread data (in this case, Illumina) during "polishing" gained increased relevance. In pursuit of chromosomal completeness, the fragmentation remaining within the de novo assembled genomes of C. elegans and D. melanogaster, and the known challenges associated with acquiring high-quality DNA from some invertebrates will likely benefit from the integration of data obtained via Hi-C and BioNano scaffolding technologies. Clearly, CWL supports the integration of additional software tools, including those required for scaffolding. To further improve versatility, security, and the use of CWL in multi-user HPC systems, CWL will likely support alternative paths and secure containers in informatics workflows.
Using this CWL pipeline, differences from the reference genome, including possible insertion/deletion events, were more prevalent in non-coding than coding regions. This finding contrasts with the expected lack of sequence context bias of PacBio data, such that it is not clear to what extent these indels and/or other differences represent mutations resulting from evolutionary processes or assembly errors and how they might impact on inferred gene structure and function. Clearly, further research is required to address such issues. Taken together, the results of this study show that this newly developed automated CWL workflow delivers genome assemblies of the high quality expected by NHGRI-NIH and the scientific community, to underpin confident gene predictions and ensuing postgenomic analyses in many areas, including functional genomics, population genomics, evolutionary biology, drug and vaccine discovery, and drug resistance.

Assembly quality
To assess accuracy and nucleotide differences, resultant de novo assemblies were compared with the respective reference assemblies using the program Quast v4.6.3 (QUAST, RRID:SCR 001228) [88] employing both embedded scripts for GAGE [89] and the program MUMMER v3.23 [90]. Within the program Quast, parameters -min-identity = 99.5% and -extensive-mis-size = 12 000 (twice the minimum required read-length of 6,000 bp) were used to minimize false reports of mis-assemblies from repetitive DNA sequences, such as translocations, relocations, and inversions. For translocations, the flanking regions of a sequence align to different chromosomes; for relocations, the flanking regions align >12 kb further apart from one another than expected, or overlap by the same length within the same chromosome; for inversions, the flanking regions align to opposite strands of the same chromosome [88]. Recorded were also local misassemblies of 85 bp < apart/overlap < 12 kb on the same strand and chromosome; large indels of >5 and ≤85 bp; and small indels of ≤5 bp [88,91]. Custom scripts [92] were created to count indels and nucleotide mis-matches in both coding and noncoding regions. These scripts used the reference assemblies, ref-erence gene models in GFF format, and SNP files produced by the program Quast. Co-locations of indels and nucleotide differences between an assembly and a reference genome were calculated using the scripts "colocation.sh." The program BUSCO v3 (BUSCO, RRID:SCR 015008) [93] was employed to establish presence/absence of expected eukaryotic core genes in each taxonomic lineage as well as the completeness of each assembly. The BUSCO lineage designations "nematode," "insect," and "protist" were used for C. elegans, D. melanogaster, and P. falciparum, respectively. A workflow was included to produce all relevant assembly metrics [92]. Mitochondrial and apicoplast sequences were manually identified and removed prior to calculating these metrics for the (i) Canu, (ii) Arrow-polished, (iii) Pilon-polished, and (iv) HaploMerger2-merged pipeline assemblies.

Correlation of indels to assembly features
To illustrate the relationship of indels to features in a reference assembly, correlation diagrams were generated for the length of each reference chromosome. To achieve this, (i) observed indels and nucleotide differences, coverage, and gaps of coverage for mapped PacBio and Illumina reads were positioned to the reference chromosomes. Then, (ii) coding regions, predicted repeat regions, and remaining non-coding regions were identified in the same chromosomes. For features in (i) and (ii), nucleotide counts matching each feature were summed up along the chromosome for each 100-1,000 bp-sliding window at 50-500 bp-steps. Resultant counts were then used to calculate the average correlation for 200 consecutive counts for a pair of features in 50-500 bp steps spanning 10-100 kb, resulting in a correlation vector for each chromosome. Correlations were calculated using the R programming language [94], and the vectors were illustrated using the R package ggplot2 (ggplot2, RRID:SCR 014601) [95].

Availability of supporting data
Output assemblies, BUSCO results, and snapshots of the code are available from the GigaScience GigaDB repository [96], alongside an Object Bundle of the workflow [97].