Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data

Motivation: The characterization of phylogenetic and functional diversity is a key element in the analysis of microbial communities. Amplicon-based sequencing of marker genes, such as 16S rRNA, is a powerful tool for assessing and comparing the structure of microbial communities at a high phylogenetic resolution. Because 16S rRNA sequencing is more cost-effective than whole metagenome shotgun sequencing, marker gene analysis is frequently used for broad studies that involve a large number of different samples. However, in comparison to shotgun sequencing approaches, insights into the functional capabilities of the community get lost when restricting the analysis to taxonomic assignment of 16S rRNA data. Results: Tax4Fun is a software package that predicts the functional capabilities of microbial communities based on 16S rRNA datasets. We evaluated Tax4Fun on a range of paired metagenome/16S rRNA datasets to assess its performance. Our results indicate that Tax4Fun provides a good approximation to functional profiles obtained from metagenomic shotgun sequencing approaches. Availability and implementation: Tax4Fun is an open-source R package and applicable to output as obtained from the SILVAngs web server or the application of QIIME with a SILVA database extension. Tax4Fun is freely available for download at http://tax4fun.gobics.de/. Contact: kasshau@gwdg.de Supplementary information: Supplementary data are available at Bioinformatics online.


Datasets
All publicly accessible amplicon and metagenome data sample files from the Human Microbiome Project (HMP) (Human Microbiome Project Consortium, 2012), mammalian guts (Muegge et al., 2011), soils (Fierer et al., 2012), Guerrero Negro hypersaline microbial mat (Kunin et al., 2008;Harris et al., 2013) were downloaded in December 2013. Due to the credit system of SILVAngs, we restricted the analysis of the HMP data to a subset of 49 samples for all taxonomic and functional profiling approaches. Subsequently the 16S profile from the amplicon data was estimated using SILVAngs (Quast et al., 2013) and QIIME (Caporaso et al., 2010) (see Supplementary Methods section 2.2 and 2.3). Finally, the functional profile of the microbial communities was predicted using the PICRUSt (Langille et al., 2013) and Tax4Fun approach. In total, we were able to process 49 paired HMP samples, 56 mammalian guts samples, 13 paired soil samples, and 10 paired Guerrero Negro microbial mat samples (see Supplementary Excel File).

Precomputation of the association matrix
In Tax4Fun, the linking of 16S rRNA gene sequences with the functional annotation of sequenced prokaryotic genomes is realized with a linear transformation of the SILVA-based 16S rRNA profile to a taxonomic profile based on the prokaryotic organisms in the KEGG database (Kanehisa and Goto, 2000;Kanehisa et al., 2014). For the transformation, we precomputed an association matrix. The matrix is built from a BLASTN analysis where we extract 16S rRNA gene sequences of all prokaryotic organisms in KEGG (Release 64.0) and search them against the SILVA SSU Ref NR database (Release 115) (Quast et al., 2013). For the assignment, we require a sufficient sequence similarity according to a threshold on the BLAST bitscore (>1500). A non-zero entry in this sparse matrix represents a valid assignment of a SILVA sequence identifier to one of the KEGG organisms. In case that K different KEGG organisms simultaneously show significant hits for a SILVA 16S rRNA gene sequence the corresponding entries in the association matrix are set to 1/K.

Precomputation of the functional reference profiles
Organism-specific functional profiles are precomputed for all prokaryotic genomes in KEGG. The genomes are downloaded and subsequently fragmented into overlapping reads simulating a two-fold coverage of the genomes as previously described in (Klingenberg et al., 2013). To take different sequencing lengths into account, we generate overlapping reads of length 400 bp with 200 bp overlap for long read data and of length 100 bp with 50 bp overlap for short read data.
UProC (Meinicke, 2014) and PAUDA (Huson and Xie, 2014) are used for computation of the functional reference profiles in terms of KEGG Orthologs (KOs) of bacterial and archaeal origin. The UProC protein classification tool is executed in short read mode for the simulated short read data and in long read mode otherwise. The PAUDA homology search is performed in --fast mode with default parameters. In the case of multiple matches, only the best hit is considered.
In total, 6977 KOs of bacterial and archaeal origin are considered. By using UProC, we obtained 6644 (long reads) and 6671 (short reads) KO abundances for 1,943 genomes. By applying PAUDA, we captured only 3725 (long reads) and 3805 (short reads) KO abundances.

SILVAngs
First, 16S rRNA sequence data (see Supplemental Material section 1.1) was uploaded to SIL-VAngs. The SILVAngs analyses were performed using default parameters. Only the sequence type and sequencing technology were selected according to the specific characteristics of the amplicon data. Then, the results of the SILVAngs analysis were downloaded and the file: projectName---ssu---fingerprint----Total---sim 93---tax silva---td 20.csv in the directory results/ssu/tax breakdown/fingerprint/ was used as input for the Tax4Fun approach.

QIIME (quantitative insights into microbial ecology)
Sequences were initially clustered into operational taxonomic units (OTUs) employing the pick otus.py script implemented in the QIIME software package (v 1.8.0) (Caporaso et al., 2010) with a similarity threshold of 0.97. Representative sequences (one per OTU) were selected with the pick rep set.py script. Taxonomy was assigned to these sequences using the parallel assign taxonomy blast.py script and the most recent SILVA database (SSURef 115 NR) (Quast et al., 2013) as reference. OTU tables were subsequently calculated using the make otu table.py script.

PICRUSt (phylogenetic investigation of communities by reconstruction of unobserved states)
Processed sequences were initially clustered into operational taxonomic units (OTUs) employing the pick closed reference otus.py script implemented in the QIIME software package with a similarity threshold of 0.97 and reverse strand matching enabled. The most recent greengenes reference database (gg 13 5) (DeSantis et al., 2006) was used as reference for clustering. Obtained OTU tables were subsequently normalized by copy number using the normalize by copy number.py script implemented in PICRUSt (v 1.0.0) (Langille et al., 2013). Afterwards, functional predictions (KOs) and NSTI values were calculated based on normalized OTU tables employing the predict metagenomes.py script. Here, the predicted functional trait abundances are inferred from 6,909 KO abundances of 2,590 sequenced bacterial and archaeal genomes from the IMG database (Markowitz et al., 2012) v3.5.

Quality of the SILVA to KEGG assignment matrix
Because Tax4Fun was intended to realize a conservative mapping from SILVA 16S sequences to KEGG genomes, Tax4Fun requires a very high similarity over the whole 16S rRNA gene. For a sufficient sequence similarity, the BLASTN bitscore threshold was determined by inspecting the quality of the assignments at the phylum level. For different phyla, the percentage of correctly and incorrectly assigned SILVA entries to KEGG organisms was assessed.

Quality survey of the prediction methods
For all taxonomic and functional annotation approaches, the quality of the predictions was assessed. For the functional metagenome annotation using UProC and PAUDA the Fraction of Sequences Unexplained (FSU) (Klingenberg et al., 2013) was calculated. The coverage of the QIIME and the SILVAngs analysis pipelines was assessed in terms of the fraction of reads that could be classified by QIIME/SILVAngs. For Tax4Fun, we calculated the Fraction of Taxonomic units Unexplained (FTU). The FTU measures the fraction of sequences assigned to taxonomic units that can not be mapped to KEGG organisms using the association matrix. For PICRUSt, we used the weighted nearest sequenced taxon index (NSTI) for the quality survey. For all datasets and profiling methods, the quality values can be obtained from the Supplementary Excel File "SupplementalExcelFile QualitySurvey.xlsx".