Improving the annotation of the Heterorhabditis bacteriophora genome

Abstract Background Genome assembly and annotation remain exacting tasks. As the tools available for these tasks improve, it is useful to return to data produced with earlier techniques to assess their credibility and correctness. The entomopathogenic nematode Heterorhabditis bacteriophora is widely used to control insect pests in horticulture. The genome sequence for this species was reported to encode an unusually high proportion of unique proteins and a paucity of secreted proteins compared to other related nematodes. Findings We revisited the H. bacteriophora genome assembly and gene predictions to determine whether these unusual characteristics were biological or methodological in origin. We mapped an independent resequencing dataset to the genome and used the blobtools pipeline to identify potential contaminants. While present (0.2% of the genome span, 0.4% of predicted proteins), assembly contamination was not significant. Conclusions Re-prediction of the gene set using BRAKER1 and published transcriptome data generated a predicted proteome that was very different from the published one. The new gene set had a much reduced complement of unique proteins, better completeness values that were in line with other related species’ genomes, and an increased number of proteins predicted to be secreted. It is thus likely that methodological issues drove the apparent uniqueness of the initial H. bacteriophora genome annotation and that similar contamination and misannotation issues affect other published genome assemblies.


Introduction
The sequencing and annotation of a species' genome is often but the first step in exploiting these data for comprehensive biological understanding. As with all scientific endeavour, genome sequencing technologies and the bioinformatics toolkits available for assembly and annotation are being continually improved. It should come as no surprise therefore that first estimates of genome sequences and descriptions of the genes they contain can be improved.
For example, the genome of the nematode Caenorhabditis elegans was the first animal genome to be sequenced [1]. The genome sequence and annotations have been updated many times since, as further exploration of this model organism revealed errors in original predictions, such that today, with release WS260 (http://www.wormbase.org/) [2], very few of the 19099 protein coding genes announced in the original publication [1] retain their original structure and sequence. The richness of the annotation of C. elegans is driven by the size of the research community that uses this model species. However for most species, where the community using the genome data is small or less-well funded, initial genome sequences and gene predictions are not usually updated.
Heterorhabditis bacteriophora is an entomopathogenic nematode which maintains a mutualistic association with the bacterium Photorhabdus luminescens. Unlike many other parasitic nematodes, it is amenable to in vitro culture [3] and is therefore of interest not only to evolutionary and molecular biologists investigating parasitic and symbiotic systems, but also to those concerned with the biological control of insect pests [4,5]. P. luminescens colonises the anterior intestine of the free-living infective juvenile stage (IJ). IJs are attracted to insect prey by chemical signals [6,7]. On contacting a host, the IJs invade the insect's haemocoel and actively regurgitate P. luminescens into the haemolymph. The bacterial infection rapidly kills the insect, and H. bacteriophora grow and reproduce within the cadaver. After 2-3 cycles

Page 4
Heterorhabditis bacteriophora reannotation of replication, the nematode progeny develop into IJs, sequester P. luminescens and seek out new insect hosts.
Axenic H. bacteriophora IJs are unable to develop past the L1 stage [8] , and H. bacteriophora may depend on P. luminescens for secondary metabolite provision [9,10]. Mutation of the global post-transcriptional regulator Hfq in P. luminescens reduced the bacterium's secondary metabolite production and led to failed nematode development, despite the bacterium maintaining virulence against host (Galleria mellonella) larvae [11]. Together these symbionts are efficient killers of pest (and other) insects, and understanding of the molecular mechanisms of host killing could lead to new insecticides.
H. bacteriophora was selected by the National Human Genome Research Initiative as a sequencing target [12]. Genomic DNA from axenic cultures of the inbred strain H. bacteriophora TTO1 was sequenced using Roche 454 technology and a high quality 77 Mb draft genome assembly produced [13]. This assembly was predicted (using JIGSAW [14] ) to encode 21250 proteins. Almost half of these putative proteins had no significant similarity to entries in the GenBank non-redundant protein database, suggesting an explosion of novelty in this nematode. The predicted H. bacteriophora proteome had fewer orthologues of Kyoto Encyclopedia of Genes and Genomes loci in the majority of metabolic categories than nine other nematodes. H. bacteriophora was also predicted to have a relative paucity of secreted proteins compared to free-living nematodes, postulated to reflect a reliance on P.
luminescens for secreted effectors [13]. The 5.7 Mb genome of P. luminescens has also been sequenced [15]. The H. bacteriophora proteome had fewer shared orthologues when clustered and compared to other rhabditine (Clade V) nematodes (including Caenorhabditis elegans and the many animal parasites of the Strongylomorpha) [16].
In preliminary analyses we noted that while the genome sequence itself had high completeness scores when assessed with the Core Eukaryote Gene Mapping Approach (CEGMA) [17]  If these unusual characteristics reflect a truly divergent proteome, the novel proteins in H. bacteriophora may be crucial in its particular symbiotic and parasitic relationships, and of great interest to development of improved strains for horticulture. However, it is also possible that contamination of the published assembly or annotation artefacts underpin these unusual features. We re-examined the H. bacteriophora genome and gene predictions, and used more recent tools to re-predict protein coding genes from the validated assembly.
As the BRAKER1 predictions were demonstrably better than the original ones, we explored whether some of the unusual characteristics of the published protein set, in particular the level of novelty and the proportion of secreted proteins, were supported by the BRAKER1 protein set.

No evidence for substantial contamination of the H. bacteriophora genome assembly
We used BlobTools [20] to assess the published genome sequence [13] for potential contamination. The raw read data from the published assembly was not available on the trace archive or short read archive (SRA). We thus utilised new Illumina short-read resequencing data generated from strain G2a1223, an inbred derivative of H. bacteriophora strain "Gebre", isolated by Adler Dillman in Moldova. G2a1223 has about 1 single-nucleotide change per ~2000 nucleotides compared to the originally-sequenced TT01 strain. G2a1223 was grown in culture on the non-colonising bacterium Photorhabdus temperata. The majority of these data (96.3% of the reads) mapped as pairs to the assembly, suggesting completeness of the published assembly with respect to the new raw read data. In addition, 99.96% of the published assembly had at least 10-fold coverage from the new raw reads.
The assembly was explored using a taxon-annotated GC-coverage plot, with coverage taken from the new Illumina data and sequence similarity from the NCBI nucleotide database (nt) ( Figure 1). H. bacteriophora was excluded from the database search used to annotate the scaffolds to exclude self hits from the published assembly. All large scaffolds clustered congruently with respect to read coverage and CG content. A few (57) scaffolds had best BLASTn matches to phyla other than Nematoda (Table 1). A small amount (5 kb) of likely remaining P. luminescens contamination was noted. We identified 100 kb of the genome of a strain of the common culture contaminant bacterium Stenotrophomonas maltophilia [21].
Contamination of the assembly with S. maltophilia was acknowledged [13] but removal of scaffolds before annotation was not discussed. Two high-coverage scaffolds that derived from the H. bacteriophora mitochondrial genome were annotated as "undefined Eukaryota"

Page 7
Heterorhabditis bacteriophora reannotation because of taxonomic misclassification in the NCBI nt database. Many scaffolds with coverages close to that of the expected nuclear genome had best matches to two unexpected sources: the platyhelminths Echinostoma caproni and Dicrocoelium dendriticum, and several hymenopteran arthropods. Inspection of these matches showed that they were due to high sequence similarity to a family of H. bacteriophora mariner-like transposons [22] and thus these were classified as bona fide nematode nuclear sequences. A group of scaffolds contained what appears to be a H. bacteriophora nuclear repeat with highest similarity to histone H3.3 sequences from Diptera and Hymenoptera. The remaining scaffolds had lowscoring nucleotide matches to a variety of chordate, chytrid and arthropod sequences from deeply conserved genes (tubulin, kinases), but had coverages similar to other nuclear sequences.
Scaffolds with average coverage of less than 10-fold were removed from the assembly (35 scaffolds spanning 132949 bases, 0.2% of the total span; see Supplementary File 1). This removed all scaffolds aligning to S. maltophilia and to Photorhabdus spp. (104 kb). The origins of the additional 28 kb were not investigated. In the published annotation [13], 76 genes were predicted from these scaffolds.

Page 8
Heterorhabditis bacteriophora reannotation   Heterorhabditis bacteriophora reannotation parameter for orthologue finding. The BRAKER1/soft-masked gene set contained a substantially higher percentage of complete, and lower percentage of fragmented BUSCO genes than the published set (Table 2). Two H. bacteriophora transcriptome datasets, publicly available Roche 454 data and Sanger expressed sequence tags, were mapped to the published and BRAKER1/soft-masked transcriptomes to assess gene set completeness. This suggested that the BRAKER1/soft-masked transcriptome predictions were more complete than the original ( Table 2).
Nearly half (9893/20964; 47.2%) of the published proteins were reported to have no significant matches in the NCBI non-redundant protein database (nr) [13]. This surprising result could be due to a paucity of data from species closely related to H. bacteriophora in the NCBI nr database at the time of the search, or inclusion of poor protein predictions in the published set, or both. Targeted investigation of these 9893 orphan proteins here was not possible due to inconsistencies in gene naming in the publically available files. The published and BRAKER1/soft-masked proteomes were compared to the Uniref90 database [24], using DIAMOND v0.9.5 [25] with an expectation value cut-off of 1e -5 . In the published proteome, 8962 proteins (42.7%) had no significant matches in Uniref90. Thus a relatively poorly populated database was not the main driver for the high number of orphan proteins reported in the published proteome. In the BRAKER1/soft-masked proteome, only 2889 proteins (18.3%) had no hits in the Uniref90 database (Table 2). Heterorhabditis bacteriophora reannotation formed H. bacteriophora-specific orthogroups. Orthology analysis including only the BRAKER/soft-masked protein set predicted 1112 H. bacteriophora singletons (7.1% of the proteome) with 167 proteins in H. bacteriophora-specific orthogroups ( Figure 2D). In comparison, when the orthology analysis included the BRAKER1/soft-masked predictions there were 1858 C. elegans singletons (9.2% of the C. elegans proteome). Very few universal, single copy orthologues were defined in either analysis. Exploring "fuzzy-1-to-1" orthogroups (where true 1-to-1 orthology was found for greater than 75% of the 24 species -i.e. 18 or more species), the published protein predictions had more missing fuzzy-1-to-1 orthologues than did the BRAKER1/soft-masked predictions ( Table 2). In the clustering that    ** The list of strict one-to-one orthologues was augmented with protein clusters where 75% of species had single copy representatives ("fuzzy-1-to-1" orthologues identified by KinFin).

Page 16
Heterorhabditis bacteriophora reannotation A supermatrix maximum likelihood phylogeny was generated from the fuzzy-1-1 orthologues in the clustering that included both H. bacteriophora proteomes (Figure 3; see Supplementary File 7). The phylogeny, rooted with Pristionchus spp., shows the H.
bacteriophora proteomes as sisters. However the BRAKER1/soft-masked proteome has a shorter branch length to Heterorhabditis' most recent common ancestor with other Clade V nematodes, suggesting that the published proteome includes uniquely divergent sequences.

Page 17
Heterorhabditis bacteriophora reannotation  Page 18 Heterorhabditis bacteriophora reannotation The secretome of H. bacteriophora has been of particular interest as it may contain proteins involved in symbiotic interactions with P. luminescens, and proteins crucial to invasion and survival within the insect haemocoel. In the original publication, only 603 proteins (2.8% of the proteome) were predicted to be secreted [13]. This proportion is much lower than in free living nematodes such as C. elegans and it was postulated that H. bacteriophora relies on P. luminescens for secreted effectors [13]. The signal peptide detection method used in the original analyses was not described [13]. We used SignalP

Discussion
Assembly of, and genefinding in, new genomes is a challenging task, and especially so in larger genomes and those phylogenetically distant from any previously analysed exemplar.
When applied de novo to datasets from extremely well-assembled and well-annotated model species, even the best methods fail to recover fully contiguous assemblies and yield predicted gene sets that have poor correspondence with the known truth [27]. A major issue with primary assemblies and gene sets arises when exceptional findings are taken at face value, and used to assert exceptional biology in a target species [28]. Where these exceptions are in fact the result of methodological failings, the scientific record, including the public databases, becomes contaminated. At best, erroneous assertions can be quickly checked and corrected, but at worst they can mislead and inhibit subsequent work.
A second concern arises from the recognition that while no method can currently produce perfect assemblies and perfect gene sets from raw data, analyses using the same toolsets will resemble each other and reflect the successes and failings of the particulars of the algorithms employed. However, when comparing genome assemblies and gene sets produced by different pipelines, it may be that the disparity in output generated by different pipelines dominates any signal from biology. Genomes assembled and annotated with the same tools will look more similar, and in a pool of assemblies and protein sets the one species that used a variant process will be flagged as exceptional. Again, the model organisms show the way: as new data and new scrutiny is added to the genome, better and better analyses are available. With additional analysis, and additional independent data, genome and gene predictions can be improved markedly for any species [29].
Here we examined the "outlier" whole-genome protein predictions from the entomopathogenic nematode H. bacteriophora [13]. The original publication noted that the number of novel proteins (those restricted to H. bacteriophora) was particularly large, while the number of secreted proteins was rather small, and suggested that these genome features might be a result of evolution to the species' novel lifestyle (which includes an essential symbiosis with the bacterium P. luminescens). Overall we found that while the published genome sequence had a small amount of bacterial contamination, and a small number of "nematode" genes were predicted from these contaminants, the assembly itself to BRAKER1, the resulting gene set has much improved numerical and biological scores. In particular we note that the biological completeness of the predicted gene set now matches that of the genome sequence from which it was derived ( Table 2).

AG 3') introns. While most genomes have a low proportion of non-canonical introns
(usually approximately 0.5% of all introns), some species have markedly higher proportions [19]. The high proportion found initially in H. bacteriophora could perhaps have been taken as a warning that the prediction set was of concern. We note that gene predictors can be set to disallow any predictions that require non-canonical splicing, and many published genomes have zero non-canonical introns. These gene prediction sets are likely to categorically miss true non-canonically spliced genes.
The new BRAKER1 gene prediction set had many fewer species-unique genes (7.1%) than did the original (42.7%) when compared to 23 other related nematodes. We regard this reduction in novelty as indicative of a better prediction, as, for example, C. elegans, the bestannotated nematode genome, had only 9.2% of species unique genes in our analysis. Having a large proportion of orphan proteins is not unique to the published H. bacteriophora predictions. Nearly half (47%) of the gene predictions in Pristionchus pacificus were reported to have no homologues in fifteen other nematode species [30]. Evaluation of proteomic and transcriptomic evidence, as well as patterns of synonymous and non-synonymous substitution, suggested that as many as 42-81% of these genes were in fact expressed [31].
Therefore the high proportion of orphan genes in H. bacteriophora is not prima facie evidence of poor gene predictions. Expanded transcriptomic and comparative data are needed to build on the work we have presented in affirming the true H. bacteriophora gene set.
Biological pest control agents may become increasingly important for ensuring crop protection in the future [32]. A number of factors currently limit the commercial applicability of H. bacteriophora, including their short shelf life, susceptibility to environmental stress and limited insect tropism [12,33]. Accurate genome annotation will assist in the Page 22 Heterorhabditis bacteriophora reannotation analysis of H. bacteriophora, facilitating the exploration of genes involved in its parasitic and symbiotic interactions, and supporting genetic manipulation to enhance its utility as a biological control agent.

Contaminant screening and Removal of Low Coverage Scaffolds
The assembly scaffolds were aligned to the NCBI nucleotide (nt) database, release 204, using Nucleotide-Nucleotide BLAST v2.6.0+ (RRID:SCR_008419) in megablast mode, with an evalue cut off of 1e -25 and a culling limit of 2 [38]. H. bacteriophora hits were excluded from the search using a list of all H. bacteriophora associated gene identifiers downloaded from NCBI GenBank nucleotide database, release 219. Raw, paired-end Illumina reads from the re-sequencing project were mapped against the assembly, as paired, using Burrows-Wheeler Aligner (BWA) v0.7.15 (RRID:SCR_010910) in mem mode with default options [39]. The
Blobtools v0.9.19 [20] was used to create taxon annotated GC-coverage plots for the published assembly, using the Nucleotide-Nucleotide BLAST and raw read mapping results.
Scaffolds that did not have Nematoda as a top BLAST hit at the phylum level were identified, and the species-level top BLAST hit, length of scaffold, and scaffold mean base coverage were extracted from the Blobology output. Scaffolds with a mean base coverage of <10x were identified from the output of the Blobology pipeline and removed from the assembly.
A list of excluded scaffolds is available in Supplementary File 1.

Generation of BRAKER1 Gene Predictions
Before annotation the published assembly was soft masked for known Nematoda repeats from the RepeatMasker Library v4.0.6 using RepeatMasker v4.0.6 (RRID:SCR_012954) [41] with default options. The two publicly available Roche 454 RNA-seq data files were adaptor and quality-trimmed using BBDuk v36.92 (unpublished toolkit from Joint Genome Institute, n.d.). Reads below an average quality of 10 or shorter than 25 nucleotides were discarded.
Regions with average quality below 20 were trimmed. The cleaned reads were mapped to the soft masked assembly using STAR v2.5 (RRID:SCR_005622) with default options [42,43]. The soft masked assembly was annotated with BRAKER1 [23] with guidance from the mapping output from STAR. An identical annotation method was applied to a hard masked version of the assembly. Hard masking was for known Nematoda repeats from the RepeatMasker Library v4.0.6 using RepeatMasker v4.0.6 with default options. The published and BRAKER1 proteomes were compared using DIAMOND v0.9.5 [25] in BLASTP mode to the Uniref90 database (release 03/2017) [24] with an expectation value cut-off of 1e -5 and no limit on the number of target sequences. Hits to H. bacteriophora proteins were removed using its TaxonID.

Gene Prediction Statistics
Gene-level statistical summaries were calculated including only the longest isoforms of the BRAKER1 gene predictions. The longest isoform for each gene in the BRAKER1 H.
bacteriophora annotation was identified from the general feature format file, and then selected from the protein FASTA files. The general feature format file (GFF) for the published gene predictions did not contain any isoforms and was analysed in its entirety.
Gene features, extracted from the GFF files, were assessed for overlap using bedtools v2.26 (RRID:SCR_006646) in intersect mode [45]. Only genes on the same strand were considered to be overlapping. To calculate the number of identical proteins shared between the published and BRAKER1 proteomes non-redundant protein fasta files were generated using cd-hit v4.6.1 (RRID:SCR_007105) [46] for the BRAKER1 and published predictions. Heterorhabditis bacteriophora reannotation each gene, and for all proteomes (except the BRAKER1/soft-masked H. bacteriophora protein set), proteins less than 30 amino-acids in length were excluded before clustering.
For the H. bacteriophora BRAKER1/soft-masked protein set, proteins less than 30 aminoacids (SF5.2) were removed manually from the orthofinder clustering statistics after clustering. None of these proteins seeded new clusters and are therefore will not have influenced the clustering results. Kinfin v0.9 [47], was used with default settings to identify true and fuzzy 1-to-1 orthologues, and their associated species specific statistics. Fuzzy 1-to-1 orthologues are true 1-to-1 orthologues for greater than 75% of the species clustered.
For the clustering analysis presented in Supplementary File 3, the BRAKER1/soft masked and published proteomes were clustered simultaneously to the 23 other Clade V nematode proteomes, and singletons, and species-specific clusters were excluded.

Phylogenetic Analyses
Both H. bacteriophora proteomes were clustered simultaneously with the 23 Clade V nematode proteomes into orthologous groups using Orthofinder v1.0 [26]. The fuzzy 1-to-1 orthologues were extracted and processed using GNU parallel [49]. They were aligned using MAFFT v7.267 (RRID:SCR_011811) [50], and the alignments trimmed with NOISY A zipped archive (11.2 Mb) of the supermatrix alignment and the phylogenetic trees produced for the the analyses of the Heterorhabditis bacteriophora proteomes. The archive contains the following files: