MetaPathways v2.5: quantitative functional, taxonomic and usability improvements

Summary: Next-generation sequencing is producing vast amounts of sequence information from natural and engineered ecosystems. Although this data deluge has an enormous potential to transform our lives, knowledge creation and translation need software applications that scale with increasing data processing and analysis requirements. Here, we present improvements to MetaPathways, an annotation and analysis pipeline for environmental sequence information that expedites this transformation. We specifically address pathway prediction hazards through integration of a weighted taxonomic distance and enable quantitative comparison of assembled annotations through a normalized read-mapping measure. Additionally, we improve LAST homology searches through BLAST-equivalent E-values and output formats that are natively compatible with prevailing software applications. Finally, an updated graphical user interface allows for keyword annotation query and projection onto user-defined functional gene hierarchies, including the Carbohydrate-Active Enzyme database.
Availability and implementation: MetaPathways v2.5 is available on GitHub: http://github.com/hallamlab/metapathways2.
Contact: shallam@mail.ubc.ca
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Since the publication of MetaPathways (Konwar et al., 2013), a modular annotation and analysis pipeline that enables construction of environmental pathway/genome databases using Pathway Tools (Karp et al., 2002b, 2010) and MetaCyc (Caspi et al., 2012; Karp et al., 2000, 2002a), the software has been improved through the Knowledge Engine data structure, a graphical user interface (GUI) for data management and browsing, and a master-worker model for task distribution on grids and clouds (Hanson et al., 2014b). Version 2.5 features faster and more accurate quantitative functional and taxonomic inference. Inspired by the pathway-centric analysis of the Hawaii-Ocean Time-series (Hanson et al., 2014a), a weighted taxonomic distance (WTD) has been integrated to detect taxonomic divergence of predicted MetaCyc pathways. Next, because it is difficult to determine relative open reading frame (ORF) abundance in assembled datasets, we adopt reads per kilobase per million mapped (RPKM) to provide a quantitative measure of sequence coverage on a per-ORF basis (Patil et al., 2011). Additionally, the LAST code has been modified to calculate BLAST-equivalent Bit-score and E-value statistics (Altschul et al., 1990).

© The Author 2015. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Methods
Here, we describe MetaPathways v2.5 improvements in more detail.

Weighted taxonomic distance
MetaPathways runs the PathoLogic algorithm without taxonomic pruning, which allows MetaCyc pathways to be predicted outside their expected taxonomic range. WTD measures the taxonomic divergence of a predicted pathway between its observed RefSeq taxonomy and its expected taxonomic range (Hanson et al., 2014a). Briefly, for each predicted pathway, the WTD $D$ is calculated on the connecting path $P(x_{exp}, x_{obs})$ between $x_{obs}$, the lowest common ancestor of observed annotations, and each member $x_{exp}$ of its expected taxonomic range, where $e_{a,b}$ is an edge between nodes $a$ and $b$ on the connecting path $E_{P(x_{exp}, x_{obs})}$, and $d(a)$ is the depth of node $a$. (For complete algorithm details and motivation, see Online Methods and Supplementary Note S2 of 'Metabolic pathways for the whole community' (Hanson et al., 2014a).)
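To make the path-based calculation concrete, the following minimal Python sketch walks the connecting path between two nodes of a toy taxonomy and sums a depth-dependent weight over its edges. The exact WTD weighting and sign convention are defined in Hanson et al. (2014a) and are not reproduced here; the halving weight `0.5 ** d`, the choice to weight each edge by the depth of its parent node, the toy `parent` map and all function names are assumptions made purely for illustration.

```python
def lineage(parent, node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def depth(parent, node):
    """Depth d(node): number of edges between the node and the root."""
    return len(lineage(parent, node)) - 1


def lowest_common_ancestor(parent, nodes):
    """Deepest node shared by the lineages of all `nodes`."""
    lineages = [lineage(parent, n) for n in nodes]
    shared = set(lineages[0]).intersection(*lineages[1:])
    # Walking up from the first node, the first shared node is the deepest one.
    return next(n for n in lineages[0] if n in shared)


def weighted_path_distance(parent, x_exp, x_obs, weight=lambda d: 0.5 ** d):
    """Sum a depth-dependent weight over the edges e_{a,b} joining x_exp and x_obs.

    ASSUMPTION: each edge is weighted by the depth of its parent node `a`,
    using an illustrative halving weight. This is an unsigned placeholder,
    not the published WTD definition.
    """
    lca = lowest_common_ancestor(parent, [x_exp, x_obs])
    total = 0.0
    for start in (x_exp, x_obs):
        node = start
        while node != lca:
            a = parent[node]  # edge e_{a,b} with b = node
            total += weight(depth(parent, a))
            node = a
    return total


# Hypothetical toy taxonomy (child -> parent), not real NCBI Taxonomy data.
parent = {
    "root": None,
    "Bacteria": "root",
    "Archaea": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Alphaproteobacteria": "Proteobacteria",
    "Gammaproteobacteria": "Proteobacteria",
}

# x_obs is the lowest common ancestor of the observed annotations.
x_obs = lowest_common_ancestor(parent, ["Alphaproteobacteria", "Gammaproteobacteria"])
distances = {
    x_exp: weighted_path_distance(parent, x_exp, x_obs)
    for x_exp in ["Firmicutes", "Archaea"]
}
```

Under this placeholder weighting, edges near the root cost more than edges near the leaves, so disagreement at the domain level yields a larger distance than disagreement between sister classes.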

Reads per kilobase per million mapped
Functional analysis of de novo assembled environmental sequence information is impeded by the lack of quantitative ORF annotations. ORF counts are affected by both sequencing depth and ORF length: longer ORFs naturally encompass more reads, making quantitative comparisons between samples difficult. To resolve this, we have implemented a bwa-based version of the RPKM (Li and Durbin, 2010). Intuitively, RPKM is a simple proportion of the number of reads mapped to a sequence section, normalized for sequencing depth and ORF length:

$$\mathrm{RPKM} = \frac{\text{reads mapped to ORF}}{\left(\text{ORF length}/10^{3}\right) \times \left(\text{total mapped reads}/10^{6}\right)}$$
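As a minimal illustration of this normalization, the Python sketch below computes RPKM from per-ORF read counts and ORF lengths. The function names, the input dictionaries and the default of using the sum of per-ORF counts as the total number of mapped reads are assumptions for the example; they do not describe the MetaPathways implementation or its bwa output parsing.

```python
def rpkm(reads_mapped, orf_length_bp, total_mapped_reads):
    """Reads per kilobase of ORF per million mapped reads."""
    if orf_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("ORF length and total mapped reads must be positive")
    kilobases = orf_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return reads_mapped / (kilobases * millions_mapped)


def rpkm_table(orf_counts, orf_lengths, total_mapped_reads=None):
    """RPKM for every ORF in `orf_counts`.

    ASSUMPTION: if no total is supplied, the sum of the per-ORF counts is
    used as the number of mapped reads in the sample.
    """
    if total_mapped_reads is None:
        total_mapped_reads = sum(orf_counts.values())
    return {
        orf: rpkm(count, orf_lengths[orf], total_mapped_reads)
        for orf, count in orf_counts.items()
    }


# Hypothetical counts: ORF_B is twice as long and recruits twice the reads.
counts = {"ORF_A": 100, "ORF_B": 200}
lengths = {"ORF_A": 1_000, "ORF_B": 2_000}
table = rpkm_table(counts, lengths)
```

In this toy input both ORFs receive the same RPKM, because the extra reads recruited by ORF_B are explained entirely by its greater length.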

LAST bit-score and E-value
Although both LAST and BLAST are dynamic programming seed-and-extend approximations to the Smith-Waterman algorithm (Altschul and Erickson, 1986; Smith and Waterman, 1981), in practice LAST's adaptive seed lengths and simpler code base make it 20 to 100 times faster, more accurate and more portable. However, LAST adoption has lagged owing to the absence of BLAST-like output formats and statistics. We modified the LAST code to produce BLAST-compatible Bit-score and E-value calculations.
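For orientation, BLAST-style statistics follow the standard Karlin-Altschul relations: a raw alignment score $S$ is converted to a bit-score $S' = (\lambda S - \ln K)/\ln 2$, and the E-value is $E = m\,n\,2^{-S'}$ for query length $m$ and database length $n$. The Python sketch below implements only these textbook formulas. Whether the modified LAST code applies them in exactly this form (for example, with effective length corrections) is not stated here, and the parameter values shown are illustrative assumptions.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized bit-score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits, E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# ASSUMPTION: illustrative Karlin-Altschul parameters and search-space sizes;
# real values depend on the scoring matrix, gap costs and database.
LAMBDA, K = 0.267, 0.041
bits = bit_score(raw_score=100, lam=LAMBDA, k=K)
expect = e_value(bits, query_len=300, db_len=1_000_000_000)
```

Because the bit-score absorbs $\lambda$ and $K$, scores from different scoring systems become directly comparable, which is what allows downstream tools written for BLAST output to consume LAST results.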

Conclusions
MetaPathways v2.5 addresses pathway prediction hazards through WTD and enables quantitative functional comparison through RPKM. It also provides fast LAST searches with BLAST-equivalent output, and more flexible annotation subsetting and projection via GUI keyword searches. Together, these improvements support large-scale comparative analysis of next-generation environmental sequence information.