Heat*seq: an interactive web tool for high-throughput sequencing experiment comparison with public data

Summary: Better protocols and decreasing costs have made high-throughput sequencing experiments accessible even to small experimental laboratories. However, comparing one or a few experiments generated by an individual lab to the vast amount of relevant data freely available in the public domain can be limited by a lack of bioinformatics expertise. Although several tools, including genome browsers, allow such comparisons at the single-gene level, they do not provide a genome-wide view. We developed Heat*seq, a web tool that allows genome-scale comparison of high-throughput experiments (chromatin immuno-precipitation followed by sequencing, RNA-sequencing and Cap Analysis of Gene Expression) provided by a user to the data in the public domain. Heat*seq currently contains over 12 000 experiments across diverse tissues and cell types in human, mouse and drosophila. Heat*seq displays interactive correlation heatmaps, with the ability to dynamically subset datasets to contextualize user experiments. High-quality figures and tables are produced and can be downloaded in multiple formats.
Availability and Implementation: Web application: http://www.heatstarseq.roslin.ed.ac.uk/. Source code: https://github.com/gdevailly.
Contact: Guillaume.Devailly@roslin.ed.ac.uk or Anagha.Joshi@roslin.ed.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
High-throughput sequencing is becoming routine for many biological assays, including transcriptome analysis through RNA-sequencing (RNA-seq) and identification of transcription factor (TF) binding sites through chromatin immuno-precipitation followed by sequencing (ChIP-seq). Additionally, collaborative projects such as Bgee (Bastian et al., 2008), ENCODE (Bernstein et al., 2012) and Roadmap Epigenomics (Kundaje et al., 2015) have generated genome-wide datasets across hundreds of cell types or tissues. Although these large datasets are freely available in the public domain, the lack of computational tools accessible to experimental scientists with no or elementary computational skills prevents the data from being used to their full potential for discovery.
Although genome browsers, including the summary tracks provided by many consortia, are extremely useful for studying a few genes, promoters or single nucleotide polymorphisms, they lack a genome-wide overview. Only a few public resources, such as the CODEX database (Sánchez-Castillo et al., 2015a) and the BLUEPRINT GenomeStats tool (Zerbino et al., 2014), allow a genome-wide comparison with user data. We therefore developed Heat*seq, a free, open source web application providing fast and interactive comparison against high-throughput sequencing experiments in the public domain. Users can upload a processed text file containing gene expression values (Fragments Per Kilobase of transcript per Million (FPKM) or Tags Per Million (TPM)), peak coordinates, or peak coordinates with corresponding expression values for Cap Analysis of Gene Expression (CAGE). The application provides clustered correlation heatmaps summarising global similarities between all samples in the dataset and the user sample. Heat*seq provides over 12 000 publicly available genome-wide experiments in human, mouse and drosophila for fast and interactive comparison. In summary, Heat*seq is an interactive web tool that allows users to contextualize their sequencing data with respect to vast amounts of public data in a few minutes, without requiring any programming skills.

Data collection
We collected gene expression data (RNA-seq), TF ChIP-seq data and CAGE data (over 12 000 individual experiments) from Bgee (Bastian et al., 2008), Blueprint epigenome (Pradel et al., 2015), CODEX (Sánchez-Castillo et al., 2015b), ENCODE (Bernstein et al., 2012), FANTOM5 (Forrest et al., 2014), FlyBase (Attrill et al., 2016), GTEx (Lonsdale et al., 2013), modENCODE (Celniker et al., 2009) and Roadmap Epigenomics (Bernstein et al., 2010), in human, mouse and drosophila. Data formatting was done using R (R scripts available on GitHub). The source for each dataset is listed in Supplementary Table S1. Heatmaps represent Pearson's correlation values between experiments, calculated from one of three matrices: a Genes × Experiments numeric matrix of gene expression values (log scaled) for expression data, a Genomic regions × Experiments binary matrix indicating the presence or absence of a peak for TF ChIP-seq data, and a Genomic regions × Experiments numeric matrix of expression values (log scaled) for CAGE data. Importantly, we constructed a metadata table which provides a weblink to the original data and allows users to sub-select experiments within each dataset.
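The matrix columns and correlation described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function names, the `log2(x + 1)` transform (the text says only "log scaled"), the half-open interval convention and the toy coordinates are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / math.sqrt(var_x * var_y)

def log_scale(values, pseudocount=1.0):
    """log2(value + pseudocount); the pseudocount is an assumption of this sketch."""
    return [math.log2(v + pseudocount) for v in values]

def peak_presence(regions, peaks):
    """One column of a binary Genomic regions x Experiments matrix.

    regions and peaks are (chrom, start, end) tuples with half-open
    coordinates; an entry is 1 if any peak overlaps the region, else 0.
    """
    return [
        1 if any(p_chrom == r_chrom and p_start < r_end and p_end > r_start
                 for p_chrom, p_start, p_end in peaks) else 0
        for r_chrom, r_start, r_end in regions
    ]

# Toy expression columns (hypothetical values for four genes).
expr_a = log_scale([0, 3, 7, 15])    # -> [0, 2, 3, 4]
expr_b = log_scale([1, 7, 15, 31])   # -> [1, 3, 4, 5]
r_expression = pearson(expr_a, expr_b)

# Toy peak columns over four hypothetical reference regions.
regions = [("chr1", 0, 100), ("chr1", 100, 200), ("chr1", 200, 300), ("chr2", 0, 100)]
col_a = peak_presence(regions, [("chr1", 50, 120), ("chr2", 10, 20)])
col_b = peak_presence(regions, [("chr1", 150, 160), ("chr2", 90, 140)])
r_peaks = pearson(col_a, col_b)
```

On binary vectors, Pearson's correlation reduces to the phi coefficient, so the same function serves both the expression and the peak matrices.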

Web-application development
Heat*seq is an open source, interactive R Shiny tool that computes correlation values between the user file and each experiment in a dataset.
Detailed user instructions are on the application website.
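As a rough illustration of the comparison step, the Python sketch below correlates one user sample with every experiment in a dataset and ranks the results. It is a sketch under stated assumptions, not the Shiny implementation: the dictionary-based data layout, the restriction to shared genes, the function names and all gene names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def correlate_with_dataset(user_sample, dataset):
    """Correlate a user sample with each experiment of a dataset.

    user_sample: dict mapping gene -> log-scaled expression value.
    dataset: dict mapping experiment name -> dict of gene -> value.
    Only genes present in both profiles are used (an assumption of this
    sketch). Returns (experiment, r) pairs sorted by decreasing r.
    """
    results = []
    for name, profile in dataset.items():
        shared = sorted(set(user_sample) & set(profile))
        if len(shared) < 2:
            continue  # too few shared genes to compute a correlation
        r = pearson([user_sample[g] for g in shared],
                    [profile[g] for g in shared])
        if not math.isnan(r):
            results.append((name, r))
    results.sort(key=lambda item: item[1], reverse=True)
    return results

# Hypothetical log-scaled values, for illustration only.
user = {"GeneA": 5.0, "GeneB": 1.0, "GeneC": 3.0, "GeneD": 0.0}
dataset = {
    "brain_1": {"GeneA": 4.8, "GeneB": 1.2, "GeneC": 2.9, "GeneD": 0.1},
    "liver_1": {"GeneA": 0.5, "GeneB": 4.0, "GeneC": 1.0, "GeneD": 3.5},
    "brain_2": {"GeneA": 5.1, "GeneB": 0.7, "GeneC": 3.3},
}
ranked = correlate_with_dataset(user, dataset)
for name, r in ranked:
    print(f"{name}\t{r:.3f}")
```

Reading off the top-ranked experiments is the kind of check used in the quality-control scenario: a sample should correlate best with public experiments from the same tissue or cell type.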

Application description
The Heat*seq tool comprises three applications, one per supported data type: HeatRNAseq, HeatChIPseq and HeatCAGEseq. Data upload, correlation calculation and heatmap generation take about a minute. Importantly, users can interactively sub-select relevant experiments using the metadata information (e.g. cell type, TF name). The interactive heatmap also allows users to choose between different clustering methods and to zoom in and out. High-resolution figures and tables can be downloaded in multiple formats. Thus, Heat*seq provides a global overview of the relationships between public experiments and the user data. Four user scenarios are discussed below.
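Metadata-driven sub-selection can be pictured with the short Python sketch below. It assumes a metadata table keyed by experiment name and a precomputed square correlation matrix; the field names (`cell_type`, `tf`), the experiment identifiers and the correlation values are arbitrary placeholders, not results or the tool's actual schema.

```python
def subselect(metadata, **criteria):
    """Return names of experiments whose metadata match every criterion.

    Each criterion is either a single accepted value or a collection
    (list, tuple or set) of accepted values.
    """
    selected = []
    for name, fields in metadata.items():
        matches = True
        for key, wanted in criteria.items():
            accepted = wanted if isinstance(wanted, (list, tuple, set)) else {wanted}
            if fields.get(key) not in accepted:
                matches = False
                break
        if matches:
            selected.append(name)
    return selected

def subset_matrix(names, matrix, keep):
    """Restrict a square correlation matrix to the kept experiments."""
    idx = [i for i, name in enumerate(names) if name in keep]
    sub_names = [names[i] for i in idx]
    sub_matrix = [[matrix[i][j] for j in idx] for i in idx]
    return sub_names, sub_matrix

# Hypothetical metadata table and placeholder correlation values.
metadata = {
    "exp1": {"cell_type": "MCF-7", "tf": "ER"},
    "exp2": {"cell_type": "T-47D", "tf": "ER"},
    "exp3": {"cell_type": "ECC-1", "tf": "ER"},
    "exp4": {"cell_type": "H1-hESC", "tf": "MYC"},
}
names = ["exp1", "exp2", "exp3", "exp4"]
matrix = [
    [1.0, 0.6, 0.3, 0.1],
    [0.6, 1.0, 0.2, 0.1],
    [0.3, 0.2, 1.0, 0.0],
    [0.1, 0.1, 0.0, 1.0],
]

er_only = subselect(metadata, tf="ER")
er_breast = subselect(metadata, tf="ER", cell_type=["MCF-7", "T-47D"])
sub_names, sub_matrix = subset_matrix(names, matrix, set(er_only))
```

Filtering before plotting keeps the heatmap focused on the experiments relevant to the question at hand, for example all ChIP-seq experiments for one TF.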

User data quality control
We compared a neocortex RNA-seq sample taken 10 days post-partum (Ray et al., 2015) with the Bgee mouse RNA-seq data using HeatRNAseq. The top five correlation values (Pearson correlation coefficient > 0.9) correspond to Bgee brain samples (Supplementary Table S2). Thus, Heat*seq can be used as a fast data quality check for next-generation sequencing data.

Cell context identification
We compared an oestrogen receptor alpha (ERα) ChIP-seq experiment in MCF7 cells (Zhuang et al., 2015) to the ENCODE TFBS dataset, sub-selecting the ENCODE ER ChIP-seq experiments. This revealed that the binding pattern of ERα in MCF7 cells was more similar to its binding pattern in T-47D cells than in ECC-1 cells (Fig. 1A). MCF7 and T-47D were derived from mammary tumours, while ECC-1 is an endometrial cell line.

New hypotheses by data integration
Comparing CpG islands (CGIs) from UCSC (Karolchik et al., 2004) to public experiments using HeatChIPseq showed that RNA polymerase II and TAF1 were enriched at CGIs (Supplementary Table S4), consistent with the fact that ~50% of human gene promoters contain a CGI (Illingworth and Bird, 2009). Interestingly, we identified factors avoiding CGIs, including MAFK, GATA3 and ZNF274. Similarly, using HeatChIPseq, tRNA promoters were highly correlated with RNA polymerase III and its co-factors BDP1, RPC155 and BRF1 (Supplementary Table S4). Interestingly, comparison with BRF family data revealed that BRF1, but not BRF2, was bound at tRNA genes (Supplementary Fig. S1B).

Public data assessment
Heat*seq can also be used to assess data in the public domain, as the following two examples, amongst others, illustrate. First, a MYC ChIP-seq experiment in H1-hESC cells does not cluster with other ENCODE MYC ChIP-seq experiments (Fig. 1C), including an H1-hESC sample from a different experimental group (Devailly et al., 2015).
Second, two out of seven erythroblast RNA-seq samples from the Blueprint Epigenome consortium are more correlated with endothelial cells than with the rest of the erythroblast samples (Fig. 1D).

Conclusion
With Heat*seq, comparing RNA-seq, ChIP-seq or CAGE experiments to hundreds of publicly available datasets becomes a trivial task. Researchers can now investigate the relationships between various high-throughput sequencing experiments quickly and interactively, without requiring any programming skills. Such analyses can be used to assess data quality, explore cell variability and generate novel regulatory hypotheses.