SPIB and BATF provide alternate determinants of IRF4 occupancy in diffuse large B-cell lymphoma linked to disease heterogeneity

Interferon regulatory factor 4 (IRF4) is central to the transcriptional network of activated B-cell-like diffuse large B-cell lymphoma (ABC-DLBCL), an aggressive lymphoma subgroup defined by gene expression profiling. Since cofactor association modifies transcriptional regulatory input by IRF4, we assessed genome occupancy by IRF4 and endogenous cofactors in ABC-DLBCL cell lines. IRF4 partners with SPIB, PU.1 and BATF genome-wide, but SPIB provides the dominant IRF4 partner in this context. Upon SPIB knockdown IRF4 occupancy is depleted and neither PU.1 nor BATF acutely compensates. Integration with ENCODE data from lymphoblastoid cell line GM12878, demonstrates that IRF4 adopts either SPIB- or BATF-centric genome-wide distributions in related states of post-germinal centre B-cell transformation. In primary DLBCL high-SPIB and low-BATF or the reciprocal low-SPIB and high-BATF mRNA expression links to differential gene expression profiles across nine data sets, identifying distinct associations with SPIB occupancy, signatures of B-cell differentiation stage and potential pathogenetic mechanisms. In a population-based patient cohort, SPIBhigh/BATFlow-ABC-DLBCL is enriched for mutation of MYD88, and SPIBhigh/BATFlow-ABC-DLBCL with MYD88-L265P mutation identifies a small subgroup of patients among this otherwise aggressive disease subgroup with distinct favourable outcome. We conclude that differential expression of IRF4 cofactors SPIB and BATF identifies biologically and clinically significant heterogeneity among ABC-DLBCL.


ChIP-Seq
Overview of ChIP-seq data:

ChIP-Seq alignment and peak discovery
Before aligning, reads were trimmed to remove adapters and low confidence regions using a python script. A 4 base sliding window was run along each read, if the average Q Phred score for a window was < 20 the read was trimmed at the window start, finally any match to adapter sequences were trimmed.
Trimmed reads were aligned using Bowtie2 with the --very-sensitive parameter (1). The resultant SAM files were converted to BAM using Samtools with the quality filter set to 20 (i.e. -q 20) (2).
The BAM files were analysed for peaks using GEM, with quality filter set to 1 (i.e. -q 1) (3).
The resultant BAM files were converted to BED files and read cross-correlation was assessed using MaSC (4). Reads were extended to the estimated fragment length, and a scaled (reads per million; rpm) BED file generated. This was converted to a coverage file using the UCSC genomeCoverageBed tool and then to a BigWig file using UCSC bedGraphToBigWig (5). These coverage files were used for visualisation via IGV and the extended BED files for fold change analysis (see Fold-change analysis) (6).

ChIP-seq peak overlap
The output from GEM was analysed using a python script to find overlaps between the different transcription factors (TFs). The peak centres for all TFs were ordered per chromosome. Starting at the beginning of each chromosome peaks were added to a cluster. New peaks were only added to the cluster if the distance between them and the first peak was < 250 bp, else a new cluster was started. As new clusters were generated if peaks in the earlier cluster were closer to the new cluster's centre they were moved into the new cluster (thus a peak can only belong to one cluster). The minimum GEM Q score for a peak to start a new cluster was set to >= 2, however, this was lowered to 1 for the addition of peaks to existing clusters.

Peak annotation
Version 14 of the Gencode gene annotation data was downloaded from UCSC, the genes were reannotated using the HUGO Gene Nomenclature Committee annotations (2013/06/03 version) (5, 7). Each GEM peak was assigned the nearest gene as its primary gene, peaks lying 2000 bp up/down-stream of a TSS were deemed to be in promoters, those outside this region but within a gene body were termed intragenic and all others were termed intergenic.

Motif detection and enrichment analysis
BED files were generated for GEM peaks, and peak overlap sets, -/+ 125 bases around each peak centre (or overlap cluster centre). These were analysed for de novo motifs of length 8 -14 using HOMER (8).

PWM analysis
De novo position weight matrices (PWM) from HOMER were scanned across the entire genome using MOODS (9). For each PWM the matches that accounted for <= 30% of total genome wide matches were retained.
Plots of PWM occurrence around ChIP-seq peak centres were generated using a python script.
Using the results from MOODS the top 200 most significant peaks for each peak set (see ChIPseq peak overlap) were searched for PWM matches -/+ 200 bp around each peak centre. The nearest PWM to the peak centre was assigned as the primary match. All matches within -/+ 100 bp around the primary motif were retained.
The resulting matches were drawn using Matplotlib, with the primary match in the centre, significance indicated by intensity of colour and orientation by arrow direction (10). The peak centres are represented as short vertical lines.

ENCODE analysis
In addition to locally generated data, BED files were downloaded for BATF, IRF4 and PU.1 from the ENCODE data set GM12878 (11). The genome was split into 100 bp windows, and any window overlapped by a TF's BED file was assigned a 1, while those without were set to 0. This 2D matrix of TF's vs bp windows was then used to calculate a Pearson's correlation using numpy.

Fold-change analysis
For the SPIB knock-down ChIP-seq data sets we wanted to explore the quantitative occupancy differences for IRF4, PU.1 and SPIB upon SPIB knockdown (e.g. IRF4_Scrm vs IRF4_SPIBkd).
This was accomplished with a simple python script. Peaks generated by GEM and the extended reads in the form of BED files (see ChIP-Seq alignment and peak discovery) were used as input.
For a given transcription factor the peaks from the scramble and the SPIB_knock-down ChIP-seq were merged to form the union, i.e. all peak locations before/after knockdown in one file. Then for each of peak locations the sum of reads (as reads per million; rpm) was calculated -/+ 50bp around the peak centre in the scramble and SPIBkd data set, the input read count in the same region subtracted from each and a pseudocount added (0.25). Finally, the log fold-change was calculated between the scramble/SPIBkd conditions. Several filtering steps were employed to minimise artefacts. Only peaks with a GEM log p-value of >= 2 were used. Peaks for which the read count in either scramble/SPIBkd < 1 rpm (after input subtraction) were discarded.
In certain contexts we wanted to be able to study the quantitative changes to a single TF in isolation, e.g. IRF4_Only. Here our main aim was to be confident that this TF was binding alone (i.e. for IRF4_Only that PU.1 and SPIB were not binding), this could be confounded by GEM not calling peaks for the other TFs or calling peaks with a log p-value < 2. Thus for such cases the read numbers for the TFs that were to be excluded were also calculated (e.g. for IRF4_Only read numbers for SPIB_ Scrm, SPIB_SPIBkd, PU1_Scrm and PU1_SPIBkd were also calculated) and if the maximum read number > 0.5 rpm the peak was excluded.

Gene expression data
RNA was run on Illumina HumanHT-12 v4 expression bead arrays, scanned with the Illumina BeadScanner and initial data processing carried out with Illumina GenomeStudio using the Gene Expression Module. The resulting final-report file was re-annotated using the NCBI API and then using the HUGO Gene Nomenclature Committee annotations (2013/06/03 version).

Differential gene analysis
The re-annotated data was quantile normalised using the Lumi package for R (12). A linear model was fitted to the gene expression data using the R Limma package (13). Differentially expressed genes between scramble and SPIB knock down were gauged using the Limma empirical Bayes statistics module, adjusting for multiple testing using Benjamini & Hochberg correction.

DLBCL data sets and analysis
The diffuse large b-cell data sets used are described in full in our previous work (14).

Contingency table groups
The ABC-DLBCL data was ranked by SPIB and BATF, the top/bottom 50% of each assigned to high/low and then split into 4 groups (high/high, high/low, low/high and low/low), e.g.

Survival analysis
For each DLBCL data set the survival of the SPIB high /BATF low and SPIB low /BATF high groups were compared. The Survival library for R was used to analyse this right-censored survival data, overall survival was estimated using the Kaplan-Meier method, modelled with Cox Proportional Hazards technique (15).

Meta-profile generation
The 10 DLBCL data sets were previously classified into ABC/GCB/TypeIII (14). For each data set the ABC classified patients were split into contingency groups (see Contingency table group). The SPIB high /BATF low and SPIB low /BATF high groups were used for differential gene expression analysis (see Differential gene analysis). The genes up-regulated (p-value < 0.05) in SPIB high /BATF low and SPIB low /BATF high were used to create a meta-profile by finding genes commonly differentially expressed.

Comparison of meta-profile against B-cell differentiation
Meta-profile genes up-regulated in >= 4 data sets were compared against the top 15% most variable probes (n=3200) from a B-cell differentiation time series (GSE41208) (16). The probes were merged per gene by taking the median expression (n=2743). The significance of overlap was then assessed using a hypergeometric distribution.
To assess potential skewing of meta-profile genes to stages of B-cell differentiation the 2743 genes from the differentiation time course were split into groups according to their maximal expression: Day0 BC, Day3 ABC, Day6 PB and >Day6 PC. The number of metaprofile genes up-regulated in >=4 data sets that fell into each group was compared against 1x10 7 random selections of genes (chosen from the 2743, n=size of meta-profile) and p-values generated.

Comparison of meta-profile against ChIP-seq data
Meta-profile genes up-regulated in >= 4 data sets were compared against genes occupied by BATF and SPIB.
Genes with significant peaks (GEM log p-value >=2) lying -5kb upstream of TSS or intragenic were deemed to be occupied by that TF. All genes with a log 2 expression >= 7 were considered as the total gene pool. The significance of overlap between the meta-profile genes and the ChIPseq occupied promoters was then assessed using a hypergeometric distribution. Gene signature enrichment was carried out for the SPIB high /BATF low and SPIB low /BATF high metaprofiles (see Contingency table groups) using a hypergeometric analysis. To avoid any bias BATF/SPIB were removed from the gene signatures.  Figures 5B, 6A, 7A were generated using ggplot2 (24). Wordles in Figures 3B and 8D were generated using http://www.wordle.net/. The heatmap in Figure 7A was created using GenePattern with text highlighted in Adobe Illustrator (25). The heatmap in Figure 8D was created using a customised version of ggplot2 heatmap.2. ChIP-seq visualisations were generated using output from IGV, with the floor set to 0.25 rpm and the ceiling rpm displayed (6).

Figure generation
Motif logos throughout the figures were created using STAMP (26).