Abstract

Motivation

Phylogenetic placement (PP) is a process of taxonomic identification for which several tools are now available. However, it remains difficult to assess which tool is best suited to particular genomic data or a particular reference taxonomy. We developed Placement Evaluation WOrkflows (PEWO), the first benchmarking tool dedicated to PP assessment. Its automated workflows can evaluate PP at many levels, from parameter optimization for a particular tool to the selection of the most appropriate genetic marker when PP-based species identifications are targeted. Our goal is for PEWO to become a community effort and a standard support for future developments and applications of PP.

Availability and implementation

https://github.com/phylo42/PEWO.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

When a reference phylogeny is available, taxonomic identification of biological sequences can be achieved with phylogenetic placement (PP). PP provides the most informative type of classification because each query sequence is assigned to its putative origin in the tree. PP can be applied in many contexts, including community ecology, species diversity or medical studies. Several PP tools were developed for these purposes (Berger et al., 2011; Matsen et al., 2010; Mirarab et al., 2012; Zheng et al., 2018), with four recent tools capable of processing larger sequence volumes (Balaban et al., 2020; Barbera et al., 2019; Czech and Stamatakis, 2019; Linard et al., 2019). In the preliminary phase of experimental design, assessing which tools answer the needs of a given application remains a tedious task, often involving manual tests (Mangul et al., 2019). Although PP has a broad range of applications, it lacks user guidelines and benchmarking. Some procedures to evaluate PP accuracy were proposed (Matsen et al., 2010), but they were never automated in dedicated software. Benchmarking is essential to determine which tool better suits a given metagenomic task or a specific dataset (Sczyrba et al., 2017).

To fill this gap, we developed Placement Evaluation WOrkflows (PEWO), the first tool dedicated to PP benchmarking. PEWO automates existing evaluation procedures, which had not previously been implemented for the community, and introduces novel ones. Beyond benchmarking, PEWO can help decision making in any metagenomic or metabarcoding project that relies on PP-based taxonomic identification. With applications ranging from parameter optimization on particular genomic data to the selection of the most appropriate genetic marker, PEWO provides the user community with standardized workflows for easy and reproducible assessment of PP analyses.

2 Overview

PEWO implements evaluation workflows in Python and Snakemake (Köster and Rahmann, 2012), whose framework ensures flexibility, platform independence and reproducibility. Each workflow automatically performs multiple steps from query generation up to summary plots/tables, and can be tailored via Snakemake configuration files. PEWO and its dependencies are easily installed via a conda virtual environment. Currently, PEWO incorporates five state-of-the-art PP tools, which cover a majority of PP uses: EPA(RAxML), PPlacer, EPA-ng, RAPPAS and APPLES. Four are alignment-based tools, while RAPPAS is alignment-free. As input, each workflow takes a phylogenetic tree and the reference multiple sequence alignment from which it was built (Fig. 1). Optionally, the user can provide a set of query sequences. Below, we describe the workflows and some of their applications.

Fig. 1. (A) Overview of PEWO inputs and outputs. (B) An example of plots dynamically generated by the PAC procedure on a 16S rRNA bacterial reference. Mean eND values are reported (lower value = better accuracy). Panels report selected conditions for PPlacer and RAPPAS, e.g. different parameter values tested in different rows and columns. For PPlacer, varying parameters are ms (max-strikes, X axis) and sb (strike-box, Y axis). Parameter mp (max-pitches, gray box) is fixed. For RAPPAS, varying parameters are k (phylo-kmer size) and o (omega threshold). Parameters red (alignment reduction) and ar (software used for ancestral reconstruction) are fixed. (C) Four PAC procedures were run for different Coleopteran mitogenome loci (rows) and compiled. Average eND is measured for three tools (columns) using default parameters. For each locus, the lowest average eND is highlighted in bold. For RAPPAS, the last column shows that accuracy can be improved when increasing k-mer size (default is k = 8). Examples B and C are more extensively discussed in Supplementary Materials. (Color version of this figure is available at Bioinformatics online.)

2.1 PEWO procedures

  • Pruning-based accuracy evaluation (PAC): in this standard procedure for assessing placement accuracy (Berger et al., 2011; Matsen et al., 2010), a subset of sequences is randomly pruned from the reference phylogeny and alignment. Each pruned sequence then serves to generate queries for placement, and the accuracy of each tool is measured as the number of nodes separating the predicted placement from the true one. PEWO offers two versions of this topological metric: Node Distance and expected Node Distance (eND). The eND accounts for placement uncertainty (e.g. likelihood weight ratios). All selected tools are compared for a user-selected combination of parameters.

  • Likelihood-based accuracy evaluation (LAC): a new, faster evaluation procedure introduced in PEWO to assess the relative accuracy of PP. It iterates the following process for a set of queries: place the query, extend the phylogeny to include that query, optimize the branch lengths of this extended tree and return its log-likelihood (LL). The user can then compare the LL values obtained with different tools, or with different settings of the same tool (e.g. by inspecting the distribution of the differences between LL values obtained with two different tools). See the Supplementary Materials for a more detailed description.

  • Resource evaluation: outputs the runtime and memory usage of selected tools, with details for each placement step (e.g. profile alignment, database construction, placement). One can compare the impact of tool-specific parameter combinations on time and memory while searching for an appropriate accuracy/resource trade-off, or evaluate the tools' scalability with respect to input size.
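
The two PAC metrics can be illustrated with a minimal Python sketch. It is not PEWO's implementation: the toy tree, the function names and the normalization by the summed weights are assumptions made for illustration. Node Distance is taken here as the number of nodes crossed when moving from the predicted edge to the true edge, and eND as the average of these distances weighted by the likelihood weight ratio (LWR) of each candidate placement.

```python
from collections import deque

# Hypothetical unrooted toy tree: leaves A-D, internal nodes X and Y.
EDGES = [("A", "X"), ("B", "X"), ("X", "Y"), ("C", "Y"), ("D", "Y")]


def build_adjacency(edges):
    """Turn an edge list into an undirected adjacency mapping."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def hops(adj, start):
    """Breadth-first search: number of edges from `start` to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def node_distance(adj, edge_a, edge_b):
    """Number of nodes crossed when moving from edge_a to edge_b (0 if same edge)."""
    if set(edge_a) == set(edge_b):
        return 0
    # Closest pair of endpoints, counted in edges, plus the node where the path starts.
    closest = min(hops(adj, u)[v] for u in edge_a for v in edge_b)
    return closest + 1


def expected_node_distance(adj, true_edge, placements):
    """LWR-weighted mean node distance; `placements` is a list of (edge, lwr) pairs."""
    total = sum(lwr for _, lwr in placements)
    weighted = sum(lwr * node_distance(adj, edge, true_edge) for edge, lwr in placements)
    return weighted / total


ADJ = build_adjacency(EDGES)
# Hypothetical placements of one query whose true edge is (A, X).
PLACEMENTS = [(("A", "X"), 0.6), (("B", "X"), 0.3), (("C", "Y"), 0.1)]

if __name__ == "__main__":
    print("ND of best placement:", node_distance(ADJ, PLACEMENTS[0][0], ("A", "X")))
    print("eND:", expected_node_distance(ADJ, ("A", "X"), PLACEMENTS))
```

In this toy case the best placement falls on the true edge, so its Node Distance is zero, while the eND remains positive because part of the weight sits on neighboring edges.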

2.2 Applications

PEWO procedures cover numerous use cases arising with PP, as illustrated by six exemplar applications provided on GitHub (two are reported in Fig. 1B and C). Because new PP tools can be incorporated in PEWO, its procedures make it possible to compare existing and future tools on resource usage, scalability or accuracy in a reproducible way. With PEWO, users can optimize their PP pipeline design. For instance, for a given reference (tree and alignment), they can determine which tool and parameter combination maximizes placement accuracy, and at which computational cost. PEWO facilitates such tests, as in Figure 1B, which shows two plots automatically generated by the PAC procedure running PPlacer and RAPPAS for nine and six parameter combinations, respectively.
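
As a rough illustration of how such parameter combinations arise, the sketch below expands two small grids into nine and six runs. The parameter names follow Figure 1B (ms and sb for PPlacer, k and omega for RAPPAS), but the values are hypothetical and the dictionaries do not reproduce PEWO's actual configuration syntax, which is described in its documentation.

```python
from itertools import product

# Hypothetical parameter values; only the parameter names come from Figure 1B.
PPLACER_GRID = {"ms": [3, 6, 9], "sb": [1, 3, 5]}      # 3 x 3 = 9 combinations
RAPPAS_GRID = {"k": [6, 8, 10], "omega": [1.0, 1.5]}   # 3 x 2 = 6 combinations


def expand(grid):
    """Return every combination of a parameter grid as a list of dictionaries."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


if __name__ == "__main__":
    for tool, grid in (("pplacer", PPLACER_GRID), ("rappas", RAPPAS_GRID)):
        runs = expand(grid)
        print(tool, len(runs), "runs, first:", runs[0])
```

Each dictionary corresponds to one cell of a heatmap such as those in Figure 1B, with one accuracy value measured per combination.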

As a second example, we show how PEWO can be used to compare different genetic markers available for the same taxa, as the choice of marker may affect placement accuracy. We evaluated the placements for four loci (16S, 12S, cox1 and cyt) on their associated phylogeny for 900 Coleopteran mitochondrial genomes (Linard et al., 2018). Figure 1C displays the results (reproducible via GitHub example 4), highlighting that: (i) 12S yields the most accurate placements, despite being the second shortest locus, (ii) the tool achieving the best accuracy depends on the marker and (iii) with RAPPAS, a larger k-mer size is required to obtain accuracy similar to or better than that of alignment-based methods.

2.3 Availability and implementation

PEWO, with full documentation and example workflows, is freely available from its repository: https://github.com/phylo42/PEWO. Its modular, well-documented and evolvable source code enables the community to easily extend it by adding new tools, procedures or metrics. Notably, users can develop their own evaluation procedures by using PEWO Snakemake rules as templates for their own workflows. Any PP tool can be integrated as long as it outputs results in jplace format [a JSON specification, standard in PP, see Matsen et al. (2012)], can be parameterized via the command line, and is available on a conda or pip repository (see the documentation for guidelines).
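
To make the jplace requirement concrete, the following sketch reads a small jplace document with Python's standard json module and keeps, for each query, the placement with the highest likelihood weight ratio. The embedded document is a hypothetical hand-written example that follows the field layout of the jplace specification (Matsen et al., 2012), and the helper name best_placements is an assumption, not part of PEWO.

```python
import json

# Hypothetical jplace content; values are made up for illustration.
JPLACE_TEXT = """
{
  "tree": "((A:0.1{0},B:0.2{1}):0.05{2},C:0.3{3});",
  "placements": [
    {"p": [[0, -1234.5, 0.7, 0.01, 0.02],
           [2, -1236.0, 0.3, 0.02, 0.03]],
     "n": ["query_1"]}
  ],
  "fields": ["edge_num", "likelihood", "like_weight_ratio",
             "distal_length", "pendant_length"],
  "version": 3,
  "metadata": {"invocation": "hypothetical example"}
}
"""


def best_placements(jplace):
    """Map each query name to its placement with the highest like_weight_ratio."""
    fields = jplace["fields"]
    best = {}
    for entry in jplace["placements"]:
        # Each row of "p" is one candidate placement, ordered as in "fields".
        rows = [dict(zip(fields, row)) for row in entry["p"]]
        top = max(rows, key=lambda row: row["like_weight_ratio"])
        # Query names are stored under "n", or under "nm" with multiplicities.
        names = entry.get("n") or [name for name, _ in entry.get("nm", [])]
        for name in names:
            best[name] = top
    return best


if __name__ == "__main__":
    for name, placement in best_placements(json.loads(JPLACE_TEXT)).items():
        print(name, "-> edge", placement["edge_num"],
              "LWR", placement["like_weight_ratio"])
```

Because the column order is declared in the "fields" entry rather than fixed, reading rows through that list keeps the parser independent of the tool that produced the file.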

3 Conclusion

Reproducibility of computational analyses in life sciences is a crucial issue, even more so when large-scale data come into play, as in the case of metagenomics. With PEWO, we provide a resource that facilitates the evaluation and comparison of PP tools under a unified framework. It combines flexibility and extensibility with ease of use, and it inherits a standardized installation procedure from the conda framework. The set of workflows in PEWO is intended to grow as a community effort, and extensions are welcome. In PEWO, we introduce the LAC procedure, which is complementary to existing procedures (Matsen et al., 2010). PEWO will help the community in its efforts to develop future PP tools and will facilitate experimental decisions when PP is chosen as a means of species identification. With the help of future contributors, we hope that PEWO will evolve into a standard for PP benchmarking and accommodate applications that are not yet foreseen.

Acknowledgements

The authors thank Vincent Lefort for technical assistance, the ATGC bioinformatic platform and the Institut Français de Bioinformatique [ANR-11-INBS-0013].

Funding

This work was supported by France Génomique [ANR-10-INBS-0009] and by an MNERT fellowship to N.R.

Conflict of Interest: B.L. is a research scientist at a private company specializing in the use of eDNA for species detection.

References

Balaban M. et al. (2020) APPLES: scalable distance-based phylogenetic placement with or without alignments. Syst. Biol., 69, 566–578.

Barbera P. et al. (2019) EPA-ng: massively parallel evolutionary placement of genetic sequences. Syst. Biol., 68, 365–369.

Berger S.A. et al. (2011) Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Syst. Biol., 60, 291–302.

Czech L., Stamatakis A. (2019) Scalable methods for analyzing and visualizing phylogenetic placement of metagenomic samples. PLoS One, 14, e0217050.

Köster J., Rahmann S. (2012) Snakemake-a scalable bioinformatics workflow engine. Bioinformatics, 28, 2520–2522.

Linard B. et al. (2018) The contribution of mitochondrial metagenomics to large-scale data mining and phylogenetic analysis of Coleoptera. Mol. Phylogenet. Evol., 128, 1–11.

Linard B. et al. (2019) Rapid alignment-free phylogenetic identification of metagenomic sequences. Bioinformatics, 35, 3303–3312.

Mangul S. et al. (2019) Systematic benchmarking of omics computational tools. Nat. Commun., 10, 1393.

Matsen F.A. et al. (2010) pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11, 538.

Matsen F.A. et al. (2012) A format for phylogenetic placements. PLoS One, 7, e31009.

Mirarab S. et al. (2012) SEPP: SATé-enabled phylogenetic placement. Pac. Symp. Biocomput., 247–258.

Sczyrba A. et al. (2017) Critical assessment of metagenome interpretation – a benchmark of metagenomics software. Nat. Methods, 14, 1063–1071.

Zheng Q. et al. (2018) HmmUFOtu: an HMM and phylogenetic placement based ultra-fast taxonomic assignment and OTU picking tool for microbiome amplicon sequencing studies. Genome Biol., 19, 82.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Arne Elofsson