Christopher M Gibb, Robert Jackson, Sabah Mohammed, Jinan Fiaidhi, Ingeborg Zehbe, Pathogen–Host Analysis Tool (PHAT): an integrative platform to analyze next-generation sequencing data, Bioinformatics, Volume 35, Issue 15, August 2019, Pages 2665–2667, https://doi.org/10.1093/bioinformatics/bty1003
Abstract
The Pathogen–Host Analysis Tool (PHAT) is an application for processing and analyzing next-generation sequencing (NGS) data as it relates to relationships between pathogens and their hosts. Unlike custom scripts and tedious pipeline programming, PHAT provides an integrative platform encompassing raw and aligned sequence and reference file input, quality control (QC) reporting, alignment and variant calling, linear and circular alignment viewing, and graphical and tabular output. This novel tool aims to be user-friendly for life scientists studying diverse pathogen–host relationships.
The project is available on GitHub (https://github.com/chgibb/PHAT) and includes convenient installers, as well as portable and source versions, for both Windows and Linux (Debian and RedHat). Up-to-date documentation for PHAT, including user guides and development notes, can be found at https://chgibb.github.io/PHATDocs/. We encourage users and developers to provide feedback (error reporting, suggestions and comments).
1 Introduction
Analysis of pathogen data, especially of their genomes (Xiang et al., 2007) via high-throughput or next-generation sequencing (NGS), is essential to understanding intricate pathogen–host relationships. While the ease of producing NGS data has grown significantly, bottlenecks still exist in its processing and analysis. Short-read alignment algorithms and the tools that implement them have matured to the point that they no longer represent the major hurdle in the data analysis process (Li and Homer, 2010). Instead, the availability of fast and user-friendly tools has become the limiting factor (Milne et al., 2010). While there are excellent tools which perform one or several discrete functions in the same domain, e.g. Bowtie2 (Langmead and Salzberg, 2012) and SAMtools (Li et al., 2009), all-in-one platforms can offer a breadth of features that help lower the barrier to entry (i.e. the ease with which users can set up and perform analyses). Integrative multi-tool platforms such as Comparative Genomics (CoGe) (Lyons and Freeling, 2008), VirBase (Li et al., 2014), Pathogen–Host Interaction Data Integration and Analysis System (PHIDIAS) (Xiang et al., 2007), Galaxy (Afgan et al., 2016) and Unipro UGENE (Okonechnikov et al., 2012) exist, but they are often server- or cloud-based. The infrastructure behind some of these projects, and their cloud-based nature, introduce roadblocks in the transfer of data to and from their servers (Li and Homer, 2010). One solution to such a limitation is to establish an onsite computational cluster. However, its technical and infrastructure requirements may pose further barriers to entry for data analysis.
We sought to develop the Pathogen–Host Analysis Tool (PHAT) to alleviate these issues by presenting an easy-to-set-up and easy-to-use platform for life scientists conducting pathogen–host NGS analysis on common desktop computers (e.g. those running Windows).
2 Features
Pathogen–host NGS analysis typically begins with high-throughput sequencing output files: experimentally relevant nucleic acid read information. PHAT is a platform for analyzing these data, with a focus on pathogen sequences within NGS data (Fig. 1). Reads are entered into PHAT either as FASTQ files (Cock et al., 2010), which comprise sequence reads with per-base nucleotide identities and quality scores, or as pre-aligned SAM/BAM files (Li et al., 2009) generated via powerful cloud-based tools such as Galaxy (Afgan et al., 2016). Quality control can be performed on individual files, with graphical reports generated. Reference genomes, recorded as FASTA files, must be indexed before they can be visualized or used for analysis. Once a pair of forward and reverse reads (paired FASTQ files) and a reference have been input, alignment can occur. PHAT also supports unpaired alignment and visualization of pre-aligned sequences.
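To make the FASTQ input concrete, the following is a minimal TypeScript sketch of how a four-line FASTQ record pairs each base with a quality score. It is not PHAT's code (PHAT delegates quality control to FastQC), and the names FastqRecord, parseFastq and meanQuality are hypothetical. It assumes Phred+33 quality encoding and records that are not wrapped across multiple lines.

```typescript
// One FASTQ record: header, sequence, separator and quality line.
interface FastqRecord {
  id: string;
  sequence: string;
  qualities: number[]; // one Phred score per base
}

// Assumption: Phred+33 encoding (quality = ASCII code - 33).
const PHRED_OFFSET = 33;

// Parse unwrapped, four-line-per-record FASTQ text (hypothetical helper).
function parseFastq(text: string): FastqRecord[] {
  const lines = text.split(/\r?\n/).filter((line) => line.length > 0);
  if (lines.length % 4 !== 0) {
    throw new Error("FASTQ input must contain four lines per record");
  }
  const parsed: FastqRecord[] = [];
  for (let i = 0; i < lines.length; i += 4) {
    const [header, sequence, separator, quality] = lines.slice(i, i + 4);
    if (!header.startsWith("@") || !separator.startsWith("+")) {
      throw new Error(`Malformed FASTQ record number ${i / 4 + 1}`);
    }
    if (sequence.length !== quality.length) {
      throw new Error(`Sequence/quality length mismatch in ${header}`);
    }
    parsed.push({
      id: header.slice(1),
      sequence,
      qualities: Array.from(quality, (ch) => ch.charCodeAt(0) - PHRED_OFFSET),
    });
  }
  return parsed;
}

// Mean Phred score of a read; returns 0 for an empty read.
function meanQuality(record: FastqRecord): number {
  if (record.qualities.length === 0) return 0;
  const total = record.qualities.reduce((sum, q) => sum + q, 0);
  return total / record.qualities.length;
}

// Usage: 'I' is ASCII 73, so each base in this read has Phred score 40.
const exampleReads = parseFastq("@read1\nACGT\n+\nIIII\n");
console.log(exampleReads[0].id, meanQuality(exampleReads[0])); // read1 40
```

A per-read mean is only one of many summaries; FastQC reports considerably more (e.g. per-position quality distributions).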
The core functions of the PHAT platform, including FASTQ quality control, sequence alignment, visualization and automated analyses, are performed through well-known, established implementations. FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc) is used for quality control scoring. Sequence alignment is done by Bowtie2 (Langmead and Salzberg, 2012) or HISAT2 (Kim et al., 2015), while linear alignment visualization is via pileup.js (Vanderkam et al., 2016). Circular genomes are viewed with our enhancements to AngularPlasmid (http://angularplasmid.vixis.com/), which we make available as a new project called ngPlasmid (https://github.com/chgibb/ngPlasmid). Automated variant calling of single-nucleotide polymorphisms (SNPs) is by VarScan2 (Koboldt et al., 2012).
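As an illustration of how paired and unpaired alignment map onto an underlying aligner, the sketch below builds Bowtie2 command lines in TypeScript. The exact commands PHAT issues are not described here, so this is an assumption-laden example rather than PHAT's implementation: the names ReadInput, ToolCommand, indexCommand and alignCommand are hypothetical, and only standard Bowtie2 options are used (bowtie2-build to index a reference; -x, -1, -2, -U, -S and -p for alignment).

```typescript
// Either a forward/reverse FASTQ pair or a single unpaired FASTQ file.
type ReadInput =
  | { kind: "paired"; forward: string; reverse: string }
  | { kind: "unpaired"; reads: string };

// A program name plus its argument list, ready to hand to a process spawner.
interface ToolCommand {
  program: string;
  args: string[];
}

// Index a reference FASTA: bowtie2-build <reference.fa> <index_base>
function indexCommand(referenceFasta: string, indexBase: string): ToolCommand {
  return { program: "bowtie2-build", args: [referenceFasta, indexBase] };
}

// Align reads against a previously built index and write SAM output.
function alignCommand(
  indexBase: string,
  input: ReadInput,
  samOut: string,
  threads: number = 1
): ToolCommand {
  const readArgs =
    input.kind === "paired"
      ? ["-1", input.forward, "-2", input.reverse] // mate 1 and mate 2 files
      : ["-U", input.reads]; // unpaired reads
  return {
    program: "bowtie2",
    args: ["-x", indexBase, ...readArgs, "-S", samOut, "-p", String(threads)],
  };
}

// Usage with hypothetical file names.
const exampleAlign = alignCommand(
  "hpv16",
  { kind: "paired", forward: "sample_R1.fastq", reverse: "sample_R2.fastq" },
  "sample.sam",
  2
);
console.log([exampleAlign.program, ...exampleAlign.args].join(" "));
```

Keeping the arguments as an array, rather than a single shell string, avoids quoting problems when file paths contain spaces.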
The graphical user interface, built on GitHub’s Electron project (https://electronjs.org/docs), uses a client-server architecture. Each window acts as a client, communicating with a background server process. The server manages the saving and propagation of workspace data, as well as the spawning of additional processes for tasks such as sequence alignment and quality control. This mechanism lets spawned processes act like threads: data flows between each created process and the application window that invoked it. On systems with limited power, the server process limits the number of concurrently running processes and the amount of data propagated between windows to reduce memory and central processing unit (CPU) usage. We utilize an internal pipeline, spawning new processes as others end and passing data (e.g. alignment output) from one application window to another. The server process and the application windows themselves are implemented in TypeScript. These windows can be conveniently undocked from the main toolbar.
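The idea of capping concurrent work and starting new jobs as others end can be sketched in TypeScript as a small job queue. This is a minimal illustration under stated assumptions, not PHAT's server code: JobQueue, submit, activeCount and pendingCount are hypothetical names, and each job is modelled as a function returning a Promise, which in a real application might wrap a spawned child process.

```typescript
// A job is any function that starts work and resolves when it ends.
type Job<T> = () => Promise<T>;

// Runs at most `maxConcurrent` jobs at once; the rest wait in FIFO order.
class JobQueue {
  private active = 0;
  private readonly maxConcurrent: number;
  private readonly pending: Array<() => void> = [];

  constructor(maxConcurrent: number) {
    if (maxConcurrent < 1) {
      throw new Error("maxConcurrent must be at least 1");
    }
    this.maxConcurrent = maxConcurrent;
  }

  get activeCount(): number {
    return this.active;
  }

  get pendingCount(): number {
    return this.pending.length;
  }

  // Start the job now if a slot is free, otherwise queue it.
  submit<T>(job: Job<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const start = () => {
        this.active++;
        // Wrapping in a Promise also captures synchronous throws from job().
        new Promise<T>((res) => res(job()))
          .then(resolve, reject)
          .then(() => {
            // A slot has been freed: start the next waiting job, if any.
            this.active--;
            const next = this.pending.shift();
            if (next) next();
          });
      };
      if (this.active < this.maxConcurrent) {
        start();
      } else {
        this.pending.push(start);
      }
    });
  }
}
```

With a limit of 2 and five submitted jobs, two start immediately and three wait; each completion frees a slot for the next waiting job. A lower limit trades throughput for reduced memory and CPU pressure on modest hardware.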
3 Future work
With the development of PHAT, we aim to bring simple-to-use cross-platform NGS analysis to off-the-shelf hardware for life scientists studying pathogen–host relationships. In our own lab, we study human papillomavirus type 16 (HPV16) variants and their tumourigenicity in epithelia using NGS (Jackson et al., 2016), but PHAT can be applied to a wide variety of pathogen–host relationships (e.g. genotyping of microbes such as viruses, bacteria, fungi and protozoans) from host NGS samples. To aid in our own experimental work, including analysis of HPV sequences within curated datasets (e.g. The Cancer Genome Atlas, TCGA), we are currently testing a viral-host integration detection feature in PHAT, with linkage to sequence databases. Additional features could include advanced alignment options as well as tools for further exploring pathogen–host interactions. We plan to actively develop, update and support PHAT based on user feedback and needs, with auto-updating features already included, in anticipation of building an active user and developer community.
Acknowledgements
The need for PHAT was conceived by RJ and IZ. The interface was designed by CMG and RJ, with programming by CMG. Manuscript writing was carried out by CMG, RJ, SM, JF and IZ. Intellectual property considerations were made by SM and JF. Thanks to Zehbe Lab members for user testing, Dr. M. Togtema for feedback, as well as students M. Pynn, J. Braun, S. Liu, Z. Moorman and N. Catanzaro for improvements. We are thankful to the developers of the open-source tools that are used in PHAT, as well as the GitHub and Reddit communities, for helpful discussions.
Funding
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) grants to IZ (#RGPIN-2015-03855) and RJ (CGS-D #454402-2014). The funding body had no role in study design, data collection, data analysis and interpretation, or preparation of the manuscript.
Conflict of Interest: none declared.
References
Author notes
The authors wish it to be known that, in their opinion, Christopher M. Gibb and Robert Jackson should be regarded as Joint First Authors.