Chromatin signature discovery via histone modification profile alignments

We report on the development of an unsupervised algorithm for the genome-wide discovery and analysis of chromatin signatures. Our Chromatin-profile Alignment followed by Tree-clustering algorithm (ChAT) employs dynamic programming of combinatorial histone modification profiles to identify locally similar chromatin sub-regions and provides complementary utility with respect to existing methods. We applied ChAT to genomic maps of 39 histone modifications in human CD4+ T cells to identify both known and novel chromatin signatures. ChAT was able to detect chromatin signatures previously associated with transcription start sites and enhancers as well as novel signatures associated with a variety of regulatory elements. Promoter-associated signatures discovered with ChAT indicate that complex chromatin signatures, made up of numerous co-located histone modifications, facilitate cell-type specific gene expression. The discovery of novel L1 retrotransposon-associated bivalent chromatin signatures suggests that these elements influence the mono-allelic expression of human genes by shaping the chromatin environment of imprinted genomic regions. Analysis of long gene-associated chromatin signatures points to a role for the H4K20me1 and H3K79me3 histone modifications in transcriptional pause release. The novel chromatin signatures and functional associations uncovered by ChAT underscore the ability of the algorithm to yield novel insight into chromatin-based regulatory mechanisms.


Contents
Instructions for installing and running the ChAT software
Supplementary Table S1
Supplementary Figure S1

Instructions for installing and running the ChAT software

Preparation:
In order to run ChAT, you need to:
1) Download the compressed folder "ChAT_package.tar.gz" from http://jordan.biology.gatech.edu/page/software/ChAT.
2) Decompress the folder. The created folder contains three files: A) ChAT, B) clustering.R and C) cluster_figure.R.
3) Make sure these files are always kept in the same folder.
4) Make sure R is already installed on your computer.
5) Check the shebang line of the ChAT file and, if necessary, correct it to the path of env on your computer.
6) Add the directory of the ChAT_package folder to your PATH.
The detailed explanations of the parameters can be found in the "Parameters" section below.
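The preparation steps can be checked before the first run. The following is a minimal sketch, not part of the ChAT package: the function names are hypothetical, and it assumes that the R installation exposes an executable named "R" on the PATH.

```python
import shutil
from pathlib import Path

# The three files that must stay together in the ChAT_package folder.
REQUIRED_FILES = ("ChAT", "clustering.R", "cluster_figure.R")


def missing_package_files(folder):
    """Return the names of required package files that are absent from folder."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


def setup_problems(folder):
    """Return a list of setup problems; an empty list means the checks passed."""
    problems = [f"missing file: {name}" for name in missing_package_files(folder)]
    # Assumption: R is reachable as an executable called "R" on the PATH.
    if shutil.which("R") is None:
        problems.append("R was not found on PATH")
    # ChAT itself should be reachable once its folder has been added to PATH.
    if shutil.which("ChAT") is None:
        problems.append("ChAT was not found on PATH")
    return problems
```

Calling setup_problems("ChAT_package") after decompressing the archive returns an empty list when the three files are present and both executables can be found.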
One example of running ChAT is:
$ ChAT -i /home/CD4_sample_data -o sample_pattern -m mark_name.txt -d critical_mark.txt -c chromosome.txt -p 0.05 -b 200
This command takes the Wiggle format histone modification files (which must be named *.wig) located in the "/home/CD4_sample_data" folder as inputs and creates the directory "sample_pattern" to store all the final and intermediate results. "mark_name.txt" contains the list of histone modifications (each row has one histone modification name), which must be consistent with the file names in the input directory. "critical_mark.txt" contains a subset of histone modifications used for initial grouping. "chromosome.txt" contains a list of the chromosomes under consideration (each row has one chromosome name, which must be consistent with the chromosome names in the Wiggle format input files). The p-value threshold used to cut the hierarchical tree is set to 0.05 using "-p". The bin size is set to 200 bp using "-b".
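Because the names in "mark_name.txt" must match the *.wig file names, the list can be derived from the input directory itself. The sketch below is one possible way to do this; the helper name write_mark_list is hypothetical and is not part of ChAT.

```python
from pathlib import Path


def write_mark_list(wig_dir, out_file):
    """Write one histone modification name per row, taken from the *.wig file names."""
    # "H3K36me3.wig" becomes "H3K36me3"; files with other extensions are ignored.
    marks = sorted(path.stem for path in Path(wig_dir).glob("*.wig"))
    Path(out_file).write_text("".join(mark + "\n" for mark in marks))
    return marks
```

For the example above, write_mark_list("/home/CD4_sample_data", "mark_name.txt") would produce the file passed to "-m". The list of critical modifications still has to be chosen by hand, since it reflects a priori biological knowledge.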

File Format: (A) Input files
For each individual histone modification, there is a Wiggle format file of the ChIP-seq data. All of the files need to be named "histone_mark_name.wig", for example "H3K36me3.wig" for H3K36me3. All the files must be stored in the same directory. The name of this directory is the most important parameter for ChAT.
A file listing all the histone modifications under consideration needs to be provided. Each row has the name of one histone modification.
A file listing the critical histone modifications used for initial grouping needs to be provided. These modifications are important marks based on a priori biological knowledge. This list must be a subset of the modifications under consideration. Each row has the name of one critical histone modification.
A file listing all the chromosomes under consideration needs to be provided. Each row has the name of one chromosome. The names need to be consistent with the chromosome names in the Wiggle format input files.
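The consistency requirements among the input files can be verified before a run. The following is a minimal sketch with hypothetical function names. It assumes that chromosome names appear in the "chrom=" field of the standard Wiggle "fixedStep" and "variableStep" declaration lines; it does not reproduce any checks ChAT may perform internally.

```python
from pathlib import Path


def read_rows(path):
    """Return the non-empty rows of a list file, stripped of whitespace."""
    return [row.strip() for row in Path(path).read_text().splitlines() if row.strip()]


def wig_chromosomes(wig_path):
    """Collect chromosome names from the declaration lines of a Wiggle file."""
    chroms = set()
    with open(wig_path) as handle:
        for line in handle:
            if line.startswith(("fixedStep", "variableStep")):
                for field in line.split()[1:]:
                    if field.startswith("chrom="):
                        chroms.add(field[len("chrom="):])
    return chroms


def check_inputs(wig_dir, mark_file, critical_file, chrom_file):
    """Return a list of problems; an empty list means all checks passed."""
    wig_dir = Path(wig_dir)
    marks = read_rows(mark_file)
    critical = read_rows(critical_file)
    chroms = read_rows(chrom_file)
    problems = []
    # Every listed modification needs a matching "<name>.wig" file.
    for mark in marks:
        if not (wig_dir / f"{mark}.wig").is_file():
            problems.append(f"missing input file: {mark}.wig")
    # The critical modifications must be a subset of the listed modifications.
    for mark in critical:
        if mark not in marks:
            problems.append(f"critical mark not in mark list: {mark}")
    # Every listed chromosome should occur in every available Wiggle file.
    for mark in marks:
        wig = wig_dir / f"{mark}.wig"
        if wig.is_file():
            found = wig_chromosomes(wig)
            for chrom in chroms:
                if chrom not in found:
                    problems.append(f"{chrom} not found in {mark}.wig")
    return problems
```

Scanning every Wiggle file line by line can be slow for genome-wide data, so in practice the chromosome check might be limited to a single file.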
(B) Output files
All the output files are stored in the created folder specified by "-o". The most important final results are saved in 2 folders.
The BED format tracks of genomic locations sharing specific combinatorial chromatin signatures are stored in "BED_tracks".
The average histone modification profiles of each signature and the corresponding enrichment curves in PDF files are stored in "Signature_info".
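The BED tracks can be summarized with a few lines of code once a run has finished. The sketch below counts the regions and the total covered length for each track. It assumes that the files in "BED_tracks" carry a ".bed" extension and follow the standard BED layout (chromosome, start, end in the first three columns); the file naming used by ChAT is not specified here, so the glob pattern may need to be adjusted.

```python
from pathlib import Path


def summarize_bed_tracks(bed_dir):
    """Map each BED file name to (number of regions, total covered base pairs)."""
    summary = {}
    # Assumption: signature tracks are stored as "*.bed" files.
    for bed in sorted(Path(bed_dir).glob("*.bed")):
        regions, covered = 0, 0
        with open(bed) as handle:
            for line in handle:
                fields = line.split()
                # Skip blank lines, comments, and track/browser header lines.
                if len(fields) < 3 or line.startswith("#"):
                    continue
                if fields[0] in ("track", "browser"):
                    continue
                regions += 1
                covered += int(fields[2]) - int(fields[1])
        summary[bed.name] = (regions, covered)
    return summary
```

With the example command above, summarize_bed_tracks("sample_pattern/BED_tracks") would give a quick overview of how many genomic locations share each signature.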

Parameters:
-i: The directory where all the Wiggle format input files (one file for each histone modification) are located. The Wiggle format files must be named "*.wig".
-o: The directory created to store all the final and intermediate results.
-m: The file with the list of histone modifications under consideration. Each row has the name of one histone modification. They need to be consistent with the names of the Wiggle format input files.
-d: The file with the list of critical histone modifications used for initial grouping. Each row has the name of one histone modification.
-c: The file with the list of chromosomes under consideration. Each row has the name of one chromosome. They need to be consistent with the chromosome names in the wiggle format input files.
-p: The p-value threshold used to cut the hierarchical tree, default value: 0.05.
-b: The size of bin, default value 200 (bp).
-h,-help: Display brief explanations of parameters.
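When ChAT is called from a script, the parameters above can be assembled programmatically. The following sketch builds the argument list with the documented defaults for "-p" and "-b"; the function name build_chat_command is hypothetical and the wrapper is not part of the ChAT package.

```python
def build_chat_command(input_dir, output_dir, mark_file, critical_file,
                       chrom_file, p_value=0.05, bin_size=200):
    """Return the ChAT command line as a list of arguments."""
    return [
        "ChAT",
        "-i", str(input_dir),      # directory with the *.wig input files
        "-o", str(output_dir),     # directory created for the results
        "-m", str(mark_file),      # list of histone modifications
        "-d", str(critical_file),  # list of critical histone modifications
        "-c", str(chrom_file),     # list of chromosomes
        "-p", str(p_value),        # p-value threshold for cutting the tree
        "-b", str(bin_size),       # bin size in bp
    ]
```

The resulting list can be passed to subprocess.run(command, check=True) from the Python standard library, which avoids shell quoting problems with directory names.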

Computational performance:
ChAT was tested on an Ubuntu Linux server (with 8 GB of memory) to identify combinatorial signatures based on ChIP-seq datasets of 14 histone methylations on human chromosome 2; it took 25.5 minutes to produce the combinatorial chromatin signatures.

Supplementary Figure S1: Algorithm performance comparison. A set of histone methylation ChIP-seq datasets on human chromosome 2 was used to test four specific algorithmic features of ChAT, ChromaSig and CoSBI. The three programs identify similar chromatin signatures for a standard mono-modal pattern (A). For a bi-modal pattern identified by ChAT, ChromaSig and CoSBI only found mono-modal signatures (B). For a set of genomic locations enriched with the same set of histone modifications, ChAT discriminates two patterns with distinct enrichment shapes (C). ChAT identified a complex large-sized signature (~82 kb) for a set of genomic locations, while ChromaSig found a number of small-sized signatures as parts of the large signature (D).

Supplementary Figure S4: Average histone modification profiles for the large pattern example B. Each curve shows the average profile of a specific histone modification over the genomic locations with the same pattern.