No Promoter Left Behind (NPLB): learn de novo promoter architectures from genome-wide transcription start sites

Abstract
Summary: Promoters have diverse regulatory architectures and thus activate genes differently. For example, some have a TATA-box, while many others do not. Even those that have one can differ in its position relative to the transcription start site (TSS). No Promoter Left Behind (NPLB) is an efficient, organism-independent method for characterizing such diverse architectures directly from experimentally identified genome-wide TSSs, without relying on known promoter elements. As a test case, we show its application in identifying novel architectures in the fly genome.
Availability and implementation: Web-server at http://nplb.ncl.res.in. A standalone version is also available at https://github.com/computationalBiology/NPLB/ (Mac OS X/Linux).
Contact: l.narlikar@ncl.res.in
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Promoters play a key role in transcription initiation by harbouring specific DNA elements, which act as transcription factor recognition sites. But how these promoter elements (PEs) contribute to the diversity in transcriptional regulation is not yet clear. While high-throughput technologies are increasingly used to produce accurate maps of transcription start sites (TSSs) (Ohler and Wassarman, 2010), the subsequent step of characterizing promoters and their functions is still done using two rather dated approaches. The first involves classifying promoters based on known PEs such as the INR motif or TATA-box. Unfortunately, a majority of promoters and their activities cannot be explained by the presence or absence of these few PEs. In the second approach, de novo motif discovery methods are used to identify overrepresented elements directly from the sequences. These methods can miss PEs present only in a small fraction of promoters. Since promoters have diverse mechanisms of activation, most PEs fall in this category (Juven-Gershon et al., 2008). Even methods that identify cis-regulatory modules fail here: although they look for motif combinations, these combinations are still required to be common across the full set (Van Loo and Marynen, 2009).
No Promoter Left Behind (NPLB) is a new method modelled along the lines of unsupervised learning with feature selection that partitions TSS-aligned promoter sequences into distinct promoter architectures (PAs), each characterized by its own set of PEs, all learned de novo (Narlikar, 2014). Since it explicitly allows for diversity, NPLB can be applied to the full dataset, leaving out no promoter, in contrast to the standard approach of presorting/preselecting promoters on the basis of criteria such as presence of known PEs (Chen et al., 2014) or TSS peak characteristics (Ni et al., 2010). In this new parallel software, the numbers of PAs and PEs are determined automatically using a mix of Bayesian modelling and cross validation. All other positions follow a background categorical distribution, common for all PAs. Parameters of models with various numbers of PAs are learned using Gibbs sampling and the best model is decided using cross validation. Key advantages of NPLB are that it (1) identifies novel and possibly diverse architectures and elements, with the only input being the set of promoters; (2) is organism- and cell-type-independent; (3) can be applied directly to the full set; (4) employs a likelihood-based approach, and can therefore be used to predict new promoters as well as to classify promoters by architecture; and (5) uses multiprocessing, which makes it fast: a run takes about 2 h for bacteria and 10 h for fly on an Intel i7-3770K desktop (Supplementary Fig. S1 shows how runtime scales with number of promoters).
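The likelihood-based view described above can be illustrated with a minimal Python sketch. It is not the NPLB implementation: it assumes that each architecture is represented simply as a map from selected sequence positions to categorical distributions over A, C, G and T, with every other position scored by a shared background distribution. The names `log_likelihood` and `classify` and the toy distributions are hypothetical, and the sketch ignores both the prior weight of each architecture and the Gibbs sampling and cross validation used to learn the model.

```python
import math
from typing import Dict, List

Dist = Dict[str, float]  # categorical distribution over the bases A, C, G, T


def log_likelihood(seq: str, pe_positions: Dict[int, Dist], background: Dist) -> float:
    """Log-likelihood of one TSS-aligned sequence under one architecture.

    Positions listed in pe_positions use an architecture-specific distribution;
    all other positions use the shared background distribution.
    """
    total = 0.0
    for i, base in enumerate(seq):
        dist = pe_positions.get(i, background)
        total += math.log(dist[base])
    return total


def classify(seq: str, architectures: List[Dict[int, Dist]], background: Dist) -> int:
    """Index of the architecture under which seq has the highest likelihood."""
    scores = [log_likelihood(seq, pa, background) for pa in architectures]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy parameters, invented for illustration only.
BG: Dist = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def peaked(base: str) -> Dist:
    """Distribution that strongly prefers one base (no zero probabilities)."""
    return {b: (0.85 if b == base else 0.05) for b in "ACGT"}


ARCHS = [
    {0: peaked("T"), 1: peaked("A"), 2: peaked("T"), 3: peaked("A")},  # TATA-like element
    {},  # architecture with no selected positions: pure background
]
```

With these toy parameters, `classify("TATAGC", ARCHS, BG)` selects the first architecture and `classify("GCGCGC", ARCHS, BG)` selects the second, because only the first sequence matches the position-specific distributions better than the background.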

Overview of NPLB
Written in C and Python, NPLB requires a prior installation of gnuplot 4.6+. A modified version of WebLogo 3.3 (Crooks et al., 2004) is used to generate sequence logos.

NPLB output
A successful run of PROMOTERLEARN produces the following outputs: A successful run of PROMOTERCLASSIFY produces all the aforementioned files except CVLikelihoods.txt, settings.txt and the likelihood plots.
Here, 12 PAs were identified (Supplementary Fig. S4a); PROMOTERLEARN was run again on each of them. Eight PAs were split further into a total of 23 PAs (Supplementary Fig. S4b), three of which were split to get a final set of 30 PAs (Fig. 1b). A1-A6 contain the TATA-box, but differ in its distance from the TSS. Interestingly, the INR motif TCAGTY varies slightly with the TATA-box position in A3-A6. Standard analyses miss such variations, either because they rely on known PEs or look for elements overrepresented in the full set. For instance, in the sequences left out in the original study, NPLB finds PAs characterized by known as well as novel PEs (Supplementary Fig. S3b).
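The repeated application of PROMOTERLEARN to each PA can be pictured as an iterative refinement loop. The sketch below is only a schematic under stated assumptions: `learn` is a hypothetical stand-in for one PROMOTERLEARN run that returns a partition of its input sequences, `refine` and `toy_learn` are illustrative names that do not exist in NPLB, and the toy learner merely splits sequences at the first position where they differ.

```python
from typing import Callable, List, Sequence

# Hypothetical learner: takes sequences, returns a partition into groups.
Learner = Callable[[Sequence[str]], List[List[str]]]


def refine(seqs: Sequence[str], learn: Learner, max_rounds: int = 2) -> List[List[str]]:
    """Partition seqs, then re-run the learner on each group for up to max_rounds.

    A group is replaced by its sub-groups only if the learner splits it;
    otherwise it is kept unchanged.
    """
    groups = learn(seqs)
    for _ in range(max_rounds):
        next_groups: List[List[str]] = []
        changed = False
        for group in groups:
            parts = learn(group)
            if len(parts) > 1:
                next_groups.extend(parts)
                changed = True
            else:
                next_groups.append(list(group))
        groups = next_groups
        if not changed:  # no group was split further, so stop early
            break
    return groups


def toy_learn(seqs: Sequence[str]) -> List[List[str]]:
    """Toy stand-in for a learner: split at the first position where bases differ."""
    seqs = list(seqs)
    for i in range(min(len(s) for s in seqs)):
        if len({s[i] for s in seqs}) > 1:
            groups = {}
            for s in seqs:
                groups.setdefault(s[i], []).append(s)
            return list(groups.values())
    return [seqs]  # all sequences agree: one group
```

Two refinement rounds correspond to the two further rounds of splitting reported here (12 to 23 to 30 PAs); in practice each round would be a separate PROMOTERLEARN run on the sequences of one PA.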
The characteristic file with the number of tags at each TSS and 5′ UTR length was used to construct two box-plots (Fig. 1c and d). A30 contains the ribosomal TCT motif (Parry et al., 2010) in place of the INR, which explains the significantly higher number of tags at those promoters (P < 10^-21). This PA was missed in the original analysis possibly since it contains <2% of all promoters. Interestingly, A7-A11, which contain variants of the DPE, but no obvious upstream element, create transcripts with longer 5′ UTRs than other PAs (P < 10^-62). This has not been noted before. A more detailed description of the PAs is available in the Supplementary methods. PAs can be further analysed for function through conservation analysis (Karolchik et al., 2014; Supplementary Fig. S5) and GO term enrichment studies (Huang et al., 2007; Supplementary Table S1).
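The numbers behind such per-PA box-plots can be sketched as follows. This is an assumption-laden illustration rather than NPLB's plotting code: it supposes that each promoter has already been paired with its PA label and one characteristic value (for example tag count or 5′ UTR length), and the function name `summarize_by_pa` and the record layout are hypothetical. The format of the actual characteristic file is not assumed here.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize_by_pa(records: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[float, float, float]]:
    """Group (pa_label, value) records by PA and return (Q1, median, Q3) per PA.

    These three quartiles are the box of a box-plot. Each PA needs at least
    two values on Python versions before 3.13.
    """
    grouped = defaultdict(list)
    for pa, value in records:
        grouped[pa].append(value)

    summary = {}
    for pa, values in grouped.items():
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary[pa] = (q1, median, q3)
    return summary
```

A call such as `summarize_by_pa([("A30", 5), ("A30", 1), ("A1", 10), ("A1", 20)])` returns one quartile triple per PA label; testing whether two PAs differ significantly, as with the P-values reported above, would require a separate statistical test that this sketch does not attempt.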

Conclusion
Data from new and advanced high-throughput technologies are increasingly making it clear that cells employ diverse mechanisms for transcriptional regulation. NPLB seeks to fulfil the need for an efficient and unbiased method that can identify these mechanisms directly from such data. Although NPLB has been designed for TSS maps, it can be applied to any DNA sequences that are aligned on the basis of a common genomic event, such as splicing, eRNA synthesis or protein-DNA binding, and that are expected to have distinct sequence architectures in the immediate neighbourhood.

Funding
This work was supported by an Early Career Fellowship from Wellcome Trust/DBT India Alliance to L.N.
Conflict of Interest: none declared.