Companion: a web server for annotation and analysis of parasite genomes

Currently available sequencing technologies enable quick and economical sequencing of many new eukaryotic parasite (apicomplexan or kinetoplastid) species or strains. Compared to SNP calling approaches, de novo assembly of these genomes enables researchers to additionally determine insertion, deletion and recombination events as well as to detect complex sequence diversity, such as that seen in variable multigene families. However, there are currently no automated eukaryotic annotation pipelines offering the required range of results to facilitate such analyses. A suitable pipeline needs to perform evidence-supported gene finding as well as functional annotation and pseudogene detection, up to the generation of output ready to be submitted to a public database. Moreover, no current tool includes quick yet informative comparative analyses and a first-pass visualization of both annotation and analysis results. To address these needs we have developed the Companion web server (http://companion.sanger.ac.uk), which provides parasite genome annotation as a service using a reference-based approach. We demonstrate the use and performance of Companion by annotating two Leishmania and Plasmodium genomes as typical parasite cases and evaluate the results against manually annotated references.


PSEUDOGENE DETECTION
The pseudogene assignment approach is similar to the one used by PseudoPipe (1), but uses more modern and efficient tools to achieve its goal. Reference protein sequences are aligned to the target DNA sequence, allowing for frame shifts. This is done using LAST (2) after preprocessing the input sequences with tantan (3) to reduce the amount of unspecific matching due to repeats.
Of all proteins aligning to the same target locus, the one of median length is selected, to avoid excessively long or short homologous matches from species with potentially different levels of curation (match C in the example above). The selected match is then reconciled with existing gene models: (a) if no gene is predicted at the locus, the match is kept and labeled as a gene or pseudogene; matches that are (b) contained in or (c) identical to existing genes are not considered potential pseudogenes; (d) short genes overlapping the frame-shifted match are subsumed by a pseudogene if they agree with the frames of the pseudogenic exons they overlap; and (e) genes for which a protein match suggests a possible 5' or 3' extension are converted to pseudogenes if the extended region exceeds a given proportion of the gene length.
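The median-length selection step can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the `ProteinMatch` record, its field names, the use of aligned length in amino acids, and the tie-breaking rule for an even number of matches are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinMatch:
    """One reference protein aligned to a target locus (hypothetical record)."""
    label: str
    species: str
    length: int  # aligned length in amino acids (assumed unit)


def select_median_match(matches):
    """Return the match of median length among those hitting one locus."""
    if not matches:
        raise ValueError("no matches at this locus")
    ordered = sorted(matches, key=lambda m: m.length)
    # With an even number of matches there are two middle elements;
    # taking the shorter one is an arbitrary choice made for this sketch.
    return ordered[(len(ordered) - 1) // 2]


# Illustrative data only: one very short and one very long outlier.
locus_matches = [
    ProteinMatch("m1", "species 1", 95),
    ProteinMatch("m2", "species 2", 298),
    ProteinMatch("m3", "species 3", 310),
    ProteinMatch("m4", "species 4", 325),
    ProteinMatch("m5", "species 5", 910),
]
chosen = select_median_match(locus_matches)
print(chosen.label, chosen.length)
```

Choosing the median rather than the longest or best-scoring match keeps a single over-extended or truncated reference model from determining the boundaries of the candidate pseudogene.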

GENE MODEL POST-PROCESSING
Some parasite species show uncommon characteristics requiring special attention during gene finding. For example, trypanosomatids such as Trypanosoma and Leishmania show polycistronic transcription (4), a process in which groups of genes on the same strand are transcribed together as polycistronic transcription units (PTUs), separated by strand switch regions (SSRs). To eliminate spurious gene predictions on the strand opposite to that of the PTU in question, we have added an optional post-processing step ('PTU smoothing'). It can be described as follows: Let s(g) of a gene g be 1 if g is located on the forward strand, and -1 if g is located on the reverse strand. Each gene g_i of n total genes in a pseudochromosome can then be assigned a neighborhood score N(g_i, r) with neighborhood radius r (default r = 6), computed from the strand values s of the genes within r positions of g_i. Hence every gene in a forward strand neighborhood is assigned a positive score, and a gene in a reverse strand neighborhood is assigned a negative score, with the score of genes in SSRs approaching zero from either side. All genes showing disagreement between their strand annotation and their neighborhood score, i.e. genes for which sgn(N(g_i, r)) ≠ s(g_i), are removed.
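A minimal sketch of PTU smoothing is given below. It assumes that the neighborhood score of a gene is the plain sum of the strand values (+1 for forward, -1 for reverse) of all genes lying within the neighborhood radius in gene order, the gene itself included. That scoring rule, and the function names, are assumptions of this example and may differ from the pipeline's exact definition.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def neighborhood_score(strands, i, r=6):
    """Sum of strand values within r genes of gene i (assumed scoring rule)."""
    lo = max(0, i - r)
    hi = min(len(strands), i + r + 1)
    return sum(strands[lo:hi])


def ptu_smooth(strands, r=6):
    """Return the indices of genes whose strand agrees with their neighborhood.

    `strands` lists +1/-1 per gene, ordered along one pseudochromosome.
    """
    return [
        i
        for i, s in enumerate(strands)
        if sign(neighborhood_score(strands, i, r)) == s
    ]


# A single reverse-strand prediction inside a forward-strand PTU is removed.
with_outlier = [1] * 8 + [-1] + [1] * 8
kept = ptu_smooth(with_outlier)
print(8 in kept)

# A genuine strand switch between two PTUs is left untouched.
strand_switch = [1] * 10 + [-1] * 10
print(len(ptu_smooth(strand_switch)) == len(strand_switch))
```

Under this assumed rule, scores near a strand switch shrink towards zero but keep the sign of the local PTU, so genes flanking the switch are retained while isolated opposite-strand predictions are dropped.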

SANITIZATION AND VALIDATION
To minimize the impact of low quality or erroneous gene models on the submission process, we have implemented multiple sanitizing steps at various points in the pipeline, making sure that, for example, features annotated as genes have intact protein translations, do not intersect with gaps, and have consistent coordinates across the transcript and CDS features. Problematic features are automatically cleaned up as required. Finally, an HTML report is generated, listing all potentially remaining problematic features as well as the nature of their requirement violations. Such a report helps a curator to quickly assess the amount of manual editing that may be required, and serves as a checklist for the curation process.
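The kind of checks described above can be illustrated with a short sketch. It is not the pipeline's validation code: the function names, the coordinate representation as inclusive (start, end) pairs, the use of the standard genetic code, and the wording of the violation messages are all assumptions chosen for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (an assumption:
# the appropriate translation table depends on the organism and organelle).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds_seq):
    """Translate a coding sequence; unknown codons become 'X'."""
    seq = cds_seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def find_violations(cds_seq, transcript, cds, gaps):
    """List requirement violations for one gene model.

    transcript, cds: inclusive (start, end) coordinates on the sequence.
    gaps: list of inclusive (start, end) gap coordinates.
    """
    problems = []
    if len(cds_seq) % 3 != 0:
        problems.append("CDS length is not a multiple of three")
    if "*" in translate(cds_seq)[:-1]:
        problems.append("internal stop codon in translation")
    if not (transcript[0] <= cds[0] <= cds[1] <= transcript[1]):
        problems.append("CDS not contained in transcript")
    if any(start <= transcript[1] and transcript[0] <= end for start, end in gaps):
        problems.append("transcript intersects a gap")
    return problems


print(find_violations("ATGAAATAA", (100, 200), (110, 118), [(300, 350)]))
print(find_violations("ATGTAAAAATAA", (100, 200), (90, 101), [(150, 160)]))
```

Returning a list of messages rather than a single pass/fail flag mirrors the idea of a report that names each violated requirement for the curator.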

STAND-ALONE VERSION
The stand-alone version can be run efficiently on powerful desktop machines as well as large compute clusters (LSF, SGE, SLURM, …). As Companion depends on a multitude of third-party software packages, we also provide a Docker container 1 encapsulating all external dependencies, including software binaries and most of the freely redistributable data files 2. If the Docker container is used, the only additional software requirements for Linux users are the Nextflow workflow engine 3 and the Java 7 JRE, both of which are easy to install on modern distributions. Users of non-Linux platforms such as Mac OS X also require the boot2docker virtual machine 4. The use of a Docker container comes with only a minimal speed tradeoff compared to a customized installation (5), and its use as a reference execution environment also ensures that results stay reproducible (6, 7) as long as the same container is used across invocations.
It should be noted that such stand-alone use requires preparation and preprocessing of a reference data set, which may include additional organisms besides the one used as the annotation reference. These organisms can be organized into subgroups; for a kinetoplastid reference, for example, the Trypanosoma and Leishmania genera can be kept separate while orthology information is still used across all species. Reference annotations and sequences can be imported from appropriately formatted FASTA, GAF and GFF3 files. For the species available as references on the web server, the pre-processed reference data sets are also available for download 5 to use with the stand-alone tool. We also provide brief online documentation of the reference preparation process on the Companion GitHub wiki 6.

EVALUATION MEASURES FOR ANNOTATION COMPARISON
To assess agreement between prediction and reference annotations, we employ the notions of 'partial' matches, for which two features have to overlap to produce a match, and 'complete' matches, requiring exactly identical coordinates on the annotated sequence. The number of partial and complete matches is used to compute sensitivity and specificity values on the gene, transcript and CDS/exon levels. We also calculated nucleotide and amino acid level specificity and sensitivity to provide a more fine-grained assessment of accuracy.
A custom implementation was used to calculate these values based on GFF3 files, considering both non-coding and coding genes. As a second source of results, we used the existing ParsEval software (8), which gave similar results but only considers protein-coding genes. On the other hand, ParsEval also calculates other common measures such as the matching coefficient, annotation edit distance and F1 score (9,10). These are also given in Supplementary Tables 1 and 2 for all comparison runs.
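The partial and complete match notions can be made concrete with a small sketch. It is not the custom implementation mentioned above. It assumes features are given as (seqid, start, end, strand) tuples with inclusive coordinates, that a partial match requires the same sequence and strand, and that sensitivity is the fraction of reference features matched while specificity is the fraction of predicted features matched (the convention common in gene prediction benchmarks).

```python
def overlaps(a, b):
    """True if two (seqid, start, end, strand) features overlap."""
    return a[0] == b[0] and a[3] == b[3] and a[1] <= b[2] and b[1] <= a[2]


def count_matched(features, others, complete):
    """Count features that have a match among `others`."""
    if complete:
        other_set = set(others)
        return sum(1 for f in features if f in other_set)
    return sum(1 for f in features if any(overlaps(f, o) for o in others))


def sensitivity_specificity(predicted, reference, complete=False):
    """Return (sensitivity, specificity) for one feature level."""
    if not predicted or not reference:
        raise ValueError("both feature sets must be non-empty")
    sensitivity = count_matched(reference, predicted, complete) / len(reference)
    specificity = count_matched(predicted, reference, complete) / len(predicted)
    return sensitivity, specificity


# Illustrative coordinates only.
reference = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 600, "-"),
]
predicted = [
    ("chr1", 100, 200, "+"),  # identical to a reference gene
    ("chr1", 350, 450, "+"),  # overlaps a reference gene
    ("chr1", 700, 800, "+"),  # no counterpart
    ("chr1", 900, 950, "+"),  # no counterpart
]
print(sensitivity_specificity(predicted, reference))
print(sensitivity_specificity(predicted, reference, complete=True))
```

The all-pairs overlap test is quadratic in the number of features; for whole-genome annotations a sorted sweep or interval index would be the usual replacement.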

GENE COORDINATE DEVIATIONS
To determine the nature and extent of the imperfect gene model predictions in the Leishmania donovani use case, we plotted the deviation of the predicted start and end positions of each gene from the respective positions in the manually annotated reference. Most of the differences are found in the start positions (n=565), with far fewer in the end positions (n=140). There was no convincing evidence for a preference towards either elongation or shortening of gene models upstream. Only in infrequent extreme cases did the difference extend beyond 1 kb in either direction.
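How such deviations might be computed is sketched below, under stated assumptions: genes are (start, end, strand) tuples with start <= end in sequence coordinates, predicted and reference genes have already been paired, and a negative value means the predicted position lies upstream of the reference position in the gene's own orientation. The sign convention and the example coordinates are illustrative, not taken from the analysis itself.

```python
def deviations(predicted, reference):
    """Return (start_deviation, end_deviation) in gene orientation.

    Negative values: predicted position is upstream of the reference one.
    """
    p_start, p_end, strand = predicted
    r_start, r_end, _ = reference
    if strand == "+":
        return p_start - r_start, p_end - r_end
    # On the reverse strand the gene start is the higher coordinate,
    # and upstream means towards higher coordinates.
    return r_end - p_end, r_start - p_start


# Illustrative (predicted, reference) pairs.
pairs = [
    ((100, 500, "+"), (130, 500, "+")),      # start predicted 30 bp upstream
    ((1000, 2000, "-"), (1000, 1940, "-")),  # start predicted 60 bp upstream
    ((300, 720, "+"), (300, 700, "+")),      # end predicted 20 bp downstream
    ((50, 90, "+"), (50, 90, "+")),          # identical model
]
devs = [deviations(p, r) for p, r in pairs]
start_diffs = sum(1 for s, _ in devs if s != 0)
end_diffs = sum(1 for _, e in devs if e != 0)
print(devs)
print(start_diffs, end_diffs)
```

Expressing deviations in gene orientation rather than raw coordinates lets forward and reverse strand genes be pooled in one plot of upstream versus downstream shifts.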

SOFTWARE FOUNDATIONS
To perform efficient and robust processing of genomic sequences and annotations, Companion is built, at its lowest level, on the software infrastructure provided by the GenomeTools toolkit (11). This includes the stand-alone executable gt as well as custom scripts written in the Lua programming language (12) using the GenomeTools application programming interface (API).

5 ftp://ftp.sanger.ac.uk/pub/project/pathogens/companion
6 https://github.com/sanger-pathogens/companion/wiki/Preparing-reference-data-sets