GATK PathSeq: a customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts

Walker, Mark A; Pedamallu, Chandra Sekhar; Ojesina, Akinyemi I; Bullman, Susan; Sharpe, Ted; Whelan, Christopher W; Meyerson, Matthew

doi:10.1093/bioinformatics/bty501

Abstract

Summary

We present an updated version of our computational pipeline, PathSeq, for the discovery and identification of microbial sequences in genomic and transcriptomic libraries from eukaryotic hosts. This pipeline is available in the Genome Analysis Toolkit (GATK) as a suite of configurable tools that can report the microbial composition of DNA or RNA short-read sequencing samples and identify unknown sequences for downstream assembly of novel organisms. GATK PathSeq enables sample analysis in minutes at low cost. In addition, these tools are built with the GATK engine and Apache Spark framework, providing robust, rapid parallelization of read quality filtering, host subtraction and microbial alignment in workstation, cluster and cloud environments.

Availability and implementation

These tools are available as a part of the GATK at https://github.com/broadinstitute/gatk.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Modern microbial genomic approaches utilize a combination of genomic technologies and bioinformatics tools to directly retrieve the genetic content of entire communities of organisms. Methods based on aligning sequencing reads to large databases of micro-organism genome references continue to illuminate the true microbial composition of deep sequenced biological samples. However, a limitation of such reference alignment-based analyses is that candidate pathogens may have limited identity to micro-organisms with sequenced genomes. To overcome this limitation, we pioneered the identification of novel non-human sequences (putative pathogens) in a comprehensive and unbiased manner using a ‘computational subtraction’ approach (Bhatt et al., 2013; Kostic et al., 2011; Weber et al., 2002). This approach relies on the removal of host reads and analysis of the non-host portion of reads, resulting in the detection of both known and novel pathogens as well as any resident micro-organisms. Other methods for identifying novel micro-organisms in sequencing data from host organisms include digital transcript subtraction (Feng et al., 2007), Pathoscope (Francis et al., 2013), SURPI (Naccache et al., 2014) and VirusSeq (Chen et al., 2013).

We present an updated implementation of the PathSeq pipeline (Kostic et al., 2011) that makes substantial improvements on the original version. First, computational efficiency has been improved by incorporating faster computational approaches (Materials and Methods Section). Second, unlike the original version, Genome Analysis Toolkit (GATK) PathSeq permits users to configure the workflow for multiple use cases such as different library types (i.e. whole-genome and RNA sequencing), sample types (e.g. blood, tissue, sputum, etc.), or host species. Third, the tool suite is implemented in Java with the GATK engine (McKenna et al., 2010) and Apache Spark framework (Zaharia et al., 2010), enabling parallelized data processing on local workstations, computing clusters and Google Cloud computing services (https://cloud.google.com).

2 Materials and methods

PathSeq begins with removal of low quality, low complexity, host-derived and duplicate reads (Supplementary Fig. S1, Supplementary Material S1). Several methodological improvements have been made to these steps in order to improve performance. Apache Spark is used to process batches of sequencing reads asynchronously in memory, thus maximizing resource utilization and minimizing slow hard-disk operations between pipeline stages. To accelerate filtering of low-complexity sequences, the RepeatMasker step (http://www.repeatmasker.org/) has been replaced with the symmetric DUST algorithm (Morgulis et al., 2006). To rescue reads that are partially informative, the new version also incorporates trimming of low-quality bases, sequencing adapter artifacts and low-complexity sequences. The reads are then subjected to a fast k-mer search using a Bloom filter (Bloom, 1970) to detect short sequences from the host reference. Reads containing at least one host k-mer (k = 31) are removed. This step typically subtracts ∼90% of the host reads prior to performing sequence alignment, thus greatly reducing run time. The BLAST (Altschul et al., 1990) aligner has been replaced with the faster BWA-MEM aligner (Li, 2013), which is used to map the remaining reads to the host reference (Supplementary Table S1).

PathSeq then aligns the remaining non-host reads to a reference of microbial genomes (viruses, bacteria, fungi, etc.; Supplementary Table S2) and classifies each read taxonomically. Alignment efficiency is again improved using BWA-MEM. Classification specificity has been increased by running the aligner in paired-end mode and requiring that both reads in a pair map to the same organism during classification. Finally, a report is generated containing microbial abundance estimates at each taxonomic level (e.g. species, genus, family, etc.).

3 Results

We used two datasets to demonstrate the performance of GATK PathSeq: (i) a total RNA sequencing library from HeLa cervical cancer cell line (Kostic et al., 2011) and (ii) mRNA and WGS sequencing libraries from a cervical cancer patient (SGCX-NOR-030; Ojesina et al., 2014). The cloud computing environment used for this study are described in Supplementary Material S2. A cutoff of 70% sequence identity was used for microbial read classification; this cutoff can be set by the user.

The HeLa library (10.3 million high-quality 76 bp paired-end reads) was used as a positive control and compared to the earlier PathSeq version. GATK PathSeq identifies 38 350 reads, comprising 0.37% of the library, aligned to HPV-18 (Supplementary Table S3), compared with the original PathSeq pipeline, which identified 25 879 HPV-18 reads (Kostic et al., 2011). While we did not identify the cause of this increase, it may be due to improvements in the completeness of the microbe reference, read quality filtering and host read subtraction methodology. GATK PathSeq also reveals 4078 unmapped reads, or approximately 0.04%. The total CPU time taken for the analysis is 0.8 h.

The mRNA and WGS libraries derived from patient SGCX-NOR-030 contained approximately 218 million (76 bp) and 1.58 billion (101 bp) paired-end reads, respectively (Supplementary Material S3). These libraries were available as BAM files aligned to a human reference. Mapped reads were filtered before running PathSeq, reducing each library size by an order of magnitude. After quality filtering and host subtraction, the SGCX-NOR-030 mRNA and WGS samples contain 23 803 non-host reads and 2.2 million non-host reads including 1.7 million bacterial reads, respectively. GATK PathSeq identifies HPV-16 as the predominant HPV strain in SGCX-NOR-030 mRNA and WGS samples with read counts of 1984 and 9882 reads, respectively. This finding is consistent with previously published data (Ojesina et al., 2014). The identity cutoff does not substantially affect the number of detected HPV reads, which only decreases by up to 13% if the identity cutoff is raised from 60% to 90% (Supplementary Table S4). Note that in both samples HPV-16 reads comprise <0.01% of the input reads and <0.001% of the original libraries, demonstrating PathSeq’s ability to detect extremely low abundance organisms. Residual unmapped reads were 12 215 (∼0.006% of the original reads) from the SGCX-NOR-030 mRNA library and 453 423 (∼0.03%) from the SGCX-NOR-030 WGS library (Supplementary Table S3). The total CPU time taken for analysis is approximately 2.3 h for the mRNA sample and 6.7 h for the WGS sample. Processing of large libraries scales efficiently in cluster environments (Supplementary Table S5).

Using a BLAST search, 36% of the unmapped reads of the HeLa library, 28% of the SGCX-NOR-030 mRNA library and 11% from the WGS library could be identified. These consisted primarily of human, HPV, bacteria, fungi and microbial eukaryote sequences (Supplementary Table S6) not represented in the PathSeq reference. Assembly of the unmapped reads using Fermi-lite (Li, 2015) yielded little further insight into the identity of these sequences in the HeLa and SGCX-NOR-030 mRNA libraries. However, in the SGCX-NOR-030 WGS library, 11 of the 20 largest assembled contigs mapped to Sphingomonadaceae, indicating the possible presence of a novel bacterium from this family (Supplementary Table S7). Understanding these residual unmapped reads will help us to improve GATK PathSeq and will also provide insight into the completeness of the human reference genome and the sequencing quality control process.

In summary, we have developed an adaptable and easily configurable pipeline for identification of microbial sequences in next generation sequencing data. This tool allows for customized analyses of biological samples with substantially reduced computational time.

Acknowledgements

This work is dedicated to our former colleague and outstanding scientist, the late Dr. Joonil Jung. We would like to thank Dr. Ami Bhatt, Dr. Gad Getz and Mrs. Fujiko Duke for their kind discussions.

Funding

This work was supported by funding from the Avon Foundation for Women (AIO and MM) and the V Foundation for Cancer Research (AIO).

Conflict of Interest: none declared.

References

Altschul

S.F.

et al. (

1990

)

Basic local alignment search tool

.

J. Mol. Biol

.,

215

,

403

–

410

.

Bhatt

A.S.

et al. (

2013

)

Sequence-based discovery of Bradyrhizobium enterica within cord colitis syndrome

.

N. Engl. J. Med

.,

369

,

517

–

528

.

Bloom

B.H.

(

1970

)

Space/time trade-offs in hash coding with allowable errors

.

Commun. ACM

,

13

,

422

–

426

.

Google Scholar

Crossref

WorldCat

Chen

Y.

et al. (

2013

)

VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue

.

Bioinformatics

,

29

,

266

–

267

.

Feng

H.

et al. (

2007

)

Human transcriptome subtraction by using short sequence tags to search for tumor viruses in conjunctival carcinoma

.

J. Virol

.,

81

,

11332

–

11340

.

Francis

O.E.

et al. (

2013

)

Pathoscope: species identification and strain attribution with unassembled sequencing data

.

Genome Res

.,

23

,

1721

–

1729

.

Kostic

A.D.

et al. (

2011

)

PathSeq: software to identify or discover microbes by deep sequencing of human tissue

.

Nat. Biotechnol

.,

29

,

393

–

396

.

Li

H.

(

2013

) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: 1303.3997v2.

Li

H.

(

2015

)

FermiKit: assembly-based variant calling for Illumina resequencing data

.

Bioinformatics

,

31

,

btv440

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Morgulis

A.

et al. (

2006

)

A fast and symmetric DUST implementation to mask low-complexity DNA sequences

.

J. Comput. Biol

.,

13

,

1028

–

1040

.

McKenna

A.

et al. (

2010

)

The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data

.

Genome Res

.,

20

,

1297

–

1303

.

Naccache

S.N.

et al. (

2014

)

A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples

.

Genome Res

.,

24

,

1180

–

1192

.

Ojesina

A.I.

et al. (

2014

)

Landscape of genomic alterations in cervical carcinomas

.

Nature

,

506

,

371

–

375

.

Weber

G.

et al. (

2002

)

Identification of foreign gene sequences by transcript filtering against the human genome

.

Nat. Genet

.,

30

,

141

–

142

.

Zaharia

M.

et al. (

2010

)

Spark: cluster computing with working sets

.

Proc. 2nd USENIX Conf. Hot Top. Cloud Comput

.,

10

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Author notes

The authors wish it to be known that, in their opinion, the Mark A.Walker and Chandra Sekhar Pedamallu authors should be regarded as Joint First Authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Associate Editor:

Download all slides

Month:	Total Views:
July 2018	235
August 2018	208
September 2018	145
October 2018	135
November 2018	100
December 2018	420
January 2019	176
February 2019	127
March 2019	128
April 2019	139
May 2019	182
June 2019	99
July 2019	95
August 2019	71
September 2019	129
October 2019	98
November 2019	113
December 2019	97
January 2020	91
February 2020	120
March 2020	92
April 2020	105
May 2020	55
June 2020	108
July 2020	108
August 2020	104
September 2020	74
October 2020	103
November 2020	127
December 2020	77
January 2021	103
February 2021	94
March 2021	144
April 2021	134
May 2021	150
June 2021	112
July 2021	81
August 2021	116
September 2021	77
October 2021	94
November 2021	92
December 2021	107
January 2022	110
February 2022	112
March 2022	103
April 2022	101
May 2022	122
June 2022	103
July 2022	66
August 2022	102
September 2022	136
October 2022	147
November 2022	213
December 2022	244
January 2023	152
February 2023	131
March 2023	191
April 2023	179
May 2023	179
June 2023	157
July 2023	155
August 2023	170
September 2023	119
October 2023	168
November 2023	153
December 2023	154
January 2024	188
February 2024	164
March 2024	140
April 2024	91

Article Contents

GATK PathSeq: a customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts

Abstract

1 Introduction

2 Materials and methods

3 Results

Acknowledgements

Funding

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

GATK PathSeq: a customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts

Abstract

1 Introduction

2 Materials and methods

3 Results

Acknowledgements

Funding

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only