Gotree/Goalign: toolkit and Go API to facilitate the development of phylogenetic workflows

Abstract Phylogenetics is nowadays at the center of numerous studies in many fields, ranging from comparative genomics to molecular epidemiology. However, phylogenetic analysis workflows are usually complex and difficult to implement, as they are often composed of many small, reccuring, but important data manipulations steps. Among these, we can find file reformatting, sequence renaming, tree re-rooting, tree comparison, bootstrap support computation, etc. These are often performed by custom scripts or by several heterogeneous tools, which may be error prone, uneasy to maintain and produce results that are challenging to reproduce. For all these reasons, the development and reuse of phylogenetic workflows is often a complex task. We identified many operations that are part of most phylogenetic analyses, and implemented them in a toolkit called Gotree/Goalign. The Gotree/Goalign toolkit implements more than 120 user-friendly commands and an API dedicated to multiple sequence alignment and phylogenetic tree manipulations. It is developed in Go, which makes executables easily installable, integrable in workflow environments, and parallelizable when possible. Moreover, Go is a compiled language, which accelerates computations compared to interpreted languages. This toolkit is freely available on most platforms (Linux, MacOS and Windows) and most architectures (amd64, i386) on GitHub at https://github.com/evolbioinfo/gotree, Bioconda and DockerHub.


INTRODUCTION
Increase in computer power and development of bioinformatics methods that handle very large datasets make it possible to perform phylogenetic analyses at an unprecedented scale. For example, it is common to perform phylogenetic studies involving several thousand trees or several thousand taxa (e.g. (1) or (2)). Such studies often present and run complex pipelines involving many steps and several tools and scripts, which makes them hard to describe and share. For example, the current COVID-19 pandemics and the diverse SARS-CoV-2 data analysis workflows that flourish demonstrate the need to manipulate sequences, alignments, and phylogenetic trees in an easy and automated way (e.g. https://github.com/roblanf/sarscov2phylo/). Figure 1 displays an example of a phylogenetic workflow inspired from (3) and described in details in the Results section. The main objective of this workflow is to analyse a phylogenomic dataset to infer a species tree and compare it to a reference tree. It contains 19 steps among which the majority is not constituted by usual computer intensive tasks (e.g. tree inference and multiple sequence alignment), but rather by alignment and tree manipulations such as downloading, renaming, reformating, comparing, rerooting, annotating, etc. These tasks are (i) repetitive (found in many similar workflows), (ii) tedious to implement (many ways to do it) and (iii) error prone, which makes this workflow difficult to implement, describe and reproduce.
That is why we think a common framework was needed to help implement such analyses in a concise and reproducible way.
Several tools, either in the form of command line or APIs, already exist to manipulate phylogenetic trees and/or alignments. For example, Newick-utilities (10) is a command line tool dedicated to phylogenetic tree manipulation in newick format. ETE toolkit (11) is a Python API for the analysis and visualization of trees. Ape (12) is a package dedicated to the analyses of phylogenetics and evolution in the Gray boxes represent processes performed using Goalign, and Gray boxes with blue contour represent processes performed using Gotree. Processes are linked by lines if results of the upstream process are needed for the execution of the downstream process. The steps represented by boxes surrounded by the dashed line are executed for each gene in parallel. The steps are the following: 1) match RefSeq, NCBI and HGNC (4) gene identifiers; 2) get OrthoDB (5) identifiers of orthologous groups corresponding to HGNC identifiers; 3) download sequences of each group; 4) get species name of each sequence (from OrthoDB id); 5) rename the sequences using the species names; 6) clean the sequences (e.g. removing special characters); 7) align the sequences (MAFFT (6)); 8) concatenate all alignments in a single large alignment; 9) clean the alignment (BMGE (7)); 10) reformat the alignment into Phylip format; 11) infer the phylogenetic tree (IQ-TREE (8)); 12) download the NCBI taxonomy in Newick format; 13) keep only the 25 species of interest from NCBI taxonomy; 14) change species names that differ between OrthoDB and NCBI taxonomy; 15-16) Reroot NCBI taxonomy and inferred tree; 17) compare both trees in terms of common bi-partitions; 18) annotate inferred tree with NCBI taxonomy clades; 19) upload the annotated tree to iTOL (9). R language. Buddy Suite (13) is another python framework (command line) to manipulate phylogenetic trees and alignments. BIO++ (14) is a C++ library for the analysis of sequences and trees. Lastly, Phyx (15) is a command line tool written in C++ that performs phylogenetic analyses on trees and sequences. Some of these tools (e.g. ape, ETE toolkit, BIO++) are mainly developer oriented and may be difficult to use for non-programmer phylogenetic analysts who are not familiar with R, python or C++. Others are command line only (e.g. Newick-utilities), and may be of limited use for developers who want to use part of them in their own programs. They are all implemented either in C, C++, Python or R. To our knowledge, no phylogenetic tree toolkit currently exists, which mix alignment and tree manipulation, in a command line environment and via an API, and proposes a large diversity of commands. In particular, none exists for the Go programming language for which several bioinformatics libraries and tools already exist such as biogo (16) and Vcfanno (17).
In this context, we developed Gotree/Goalign, a toolkit dedicated to burdensome and repetitive phylogenetic tasks, which (i) consists of two user-friendly executables, gotree and goalign, integrating state of the art phylogenetic commands and requiring no programming skills to use, (ii) provides a set of chainable commands that are integrable in workflows (e.g. Nextflow (18) or Snakemake (19)), (iii) is straightforward to install via static binaries available for most platforms and architectures, and (iv) provides a public API accessible to developers wanting to manipulate phylogenetic trees and multiple sequence alignments in Go.

Alignment manipulation
Goalign implements more than 60 commands to manipulate sequences and multiple sequence alignments. A subset of these commands is given in Table 1.
The most obvious (and maybe the most widely used) commands reformat an input sequence file into any supported format, which is essential to integrate different tools in the same analysis workflow. The supported input and output formats are: Fasta, Phylip, Clustal, Nexus, and tool specific formats (e.g. TNT and PaML input formats). It is worth noting that Goalign supports compressed input files (.gz, .bz and .xz) and remote files through http(s).
A second kind of commands is designed to extract summary statistics about input alignments, such as the length, the number of sequences, the frequency of each character, the list of sequence names, etc.

Tree manipulation
Gotree implements more than 60 commands dedicated to the manipulation of phylogenetic trees (see Table 1 for a list of a few of them).
As for Goalign, Gotree reformatting commands are the first to be used in phylogenetic workflows and constitute the glue between all phylogenetic tools, which often support heterogeneous, specific formats. Gotree supports the following input and output formats: Newick, Nexus and PhyloXML. Moreover, Gotree supports compressed input files (.gz), and remote files from any urls or from dedicated servers (iTOL (9) and Treebase (29), See Supp. Text 1 for examples of such commands).
The second kind of commands implemented in Gotree produces summary statistics about input phylogenetic trees, such as the number of tips, number of branches, average branch length and branch support, sum of branch lengths, number of cherries, tree balance indices (Colless, Sackin), and patristic distance matrix.
A third kind of commands modifies the input tree, e.g. modifying branch lengths and supports, collapsing branches by length or support, removing lists of tips, rerooting the tree in different ways, etc.
Other commands generate new trees or new data on the trees, such as bootstrap support computations (Felsenstein's Bootstrap Proportions (30) and Transfer Bootstrap Expectation (31)) or random tree generation using different models.
The last kind of commands compares trees, for example comparing topologies (bipartition distance), or comparing tip names.
Put together, these commands provide the user with a large panel of possibilities adapted to many different situations, a few of which are described in the Results section.

Implementation of Gotree/Goalign
The Gotree/Goalign toolkit is implemented in the Go (https://golang.org/) programming language, and thus takes advantage of (i) being available on major operating systems (Windows, Linux, MacOS), (ii) the richness of the Go standard library and the massive amount of available packages (http, hashmaps, compression, regex, channels, concurrency, gonum, etc.), (iii) the distributed nature of Go packages, providing phylogenetic packages easily accessible to any developers.
It consists of two executables gotree and goalign providing all the commands in the same spirit as git or docker.
All the commands are implemented in the Unix mindset, having one atomic command per task, reading data from the standard input and writing results to the standard output when possible.

Integration in workflows
The Gotree/Goalign toolkit has been developed to be easily integrated in phylogenetic workflows, using main workflow managers such as Nextflow (18) or Snakemake (19). Two main characteristics facilitate this behavior.
First of all, Gotree/Goalign consists of a large set of command line tools that can be easily chained, and that avoid manual interventions as much as possible. This feature makes them easily usable in workflows as all commands are atomic, easily citable, describable and reproducible.
Moreover, they are made available through three channels that are compatible with the diversity of use in workflow managers, i.e., multi-plateform executables, Docker and Singularity containers, and Bioconda packages.

RESULTS: USE CASE IN PHYLOGENOMICS
In this use case, we analyze a phylogenomic dataset inspired from (3), in which the authors analyze a set of 1730 genes from primates. They infer the species tree either from individual gene trees using ASTRAL III (32) or from gene concatenation using maximum likelihood. Our use case is inspired from the concatenation study, inferring a species tree from available groups of primate orthologous proteins in OrthoDB (5) after having mapped the RefSeq identifiers of the 1730 analyzed genes to their OrthoDB counterpart.
The workflow is displayed in more detail in Figure 1 (code available at https://github.com/evolbioinfo/gotree usecase and in Supp. Data 1), and consists of the following main steps (details of the tools and options are given in Supplementary Figure S1): (i) mapping RefSeq, HGNC (4) and OrthoDB identifiers ; (ii) retrieving each group of orthologous proteins in OrthoDB if they are present in at least 90% of the 25 primate species (Supp. Data 2) and in a single copy (this results in 1315 orthologous groups, listed in Supp. Data 3); (iii) for each orthologous group, downloading the sequences of all proteins and their associated metadata; (iv) renaming, cleaning and aligning sequences using MAFFT (6); (v) concatenating the alignments of the 1315 orthologous groups (this results in a single alignment of 674 089 amino-acids); (vi) cleaning the alignment using BMGE (7) (this keeps 516 999 sites); (vii) inferring the species tree using IQ-TREE (8); (viii) downloading the NCBI taxonomy; (ix) comparing the inferred tree with the NCBI taxonomy and (x) uploading both trees to iTOL (9). Each step and their parameters are described in detail in Supplementary Figure  S1.
Since it involves many very different steps, this example is challenging to implement and to run fully automatically, without manual intervention. Apart from tree inference and multiple sequence alignment, most of the operations required to develop this workflow are available in the Gotree/Goalign toolkit and are straightforward to execute (Supplementary Figure S1). The implementation of the workflow, made of 19 steps and ∼350 lines of code, is fully homogeneous in terms of file formats and input/outputs. Furthermore, the ability to run the full analysis, from the input data to the output tree visualization, without manual intervention facilitates its description and reproducibility.
The resulting primate phylogeny, given in Figure 2, is in almost complete agreement with the NCBI taxonomy, with 15 out of 16 branches of the NCBI taxonomy found in our tree. It is worth noting that the only conflicting branch (in red), grouping Callitrix and Aotus, has a lower bootstrap support than other branches and is also present in the tree inferred using maximum likelihood from gene concatenation in (3). In fact, our topology and theirs (3) are identical.

DISCUSSION
We developed the Gotree/Goalign toolkit to simplify the manipulation of phylogenetic trees and alignments, and to facilitate the development of reproducible phylogenetic workflows. Importantly, it is not a wrapper around major tree inference and multiple sequence alignment software, i.e. it does not take care of aligning sequences and inferring trees, but rather takes care of all the other complex, tedious and numerous tree and sequence operations that are necessary in phylogenetic workflows. The Gotree/Goalign toolkit is developed in Go and is easily installable on major operating systems and provides a public API usable in any Go project.
It is already used in several projects, and we are confident that Gotree/Goalign will be able to build a community interested in adding new functionnalities.