excluderanges: exclusion sets for T2T-CHM13, GRCm39, and other genome assemblies

Abstract Summary Exclusion regions are sections of reference genomes with abnormal pileups of short sequencing reads. Removing reads overlapping them improves biological signal, and these benefits are most pronounced in differential analysis settings. Several labs created exclusion region sets, available primarily through ENCODE and Github. However, the variety of exclusion sets creates uncertainty which sets to use. Furthermore, gap regions (e.g. centromeres, telomeres, short arms) create additional considerations in generating exclusion sets. We generated exclusion sets for the latest human T2T-CHM13 and mouse GRCm39 genomes and systematically assembled and annotated these and other sets in the excluderanges R/Bioconductor data package, also accessible via the BEDbase.org API. The package provides unified access to 82 GenomicRanges objects covering six organisms, multiple genome assemblies, and types of exclusion regions. For human hg38 genome assembly, we recommend hg38.Kundaje.GRCh38_unified_blacklist as the most well-curated and annotated, and sets generated by the Blacklist tool for other organisms. Availability and implementation https://bioconductor.org/packages/excluderanges/. Package website: https://dozmorovlab.github.io/excluderanges/.

Removing reads overlapping those regions, referred hereafter as exclusion sets, improves normalization of the signal between samples, correlation between replicates, and increases accuracy of both peak calling and differential ChIP-seq analysis (Kharchenko et al. 2008;Carroll et al. 2014;Wimberley and Heber 2020). Therefore, standardized availability of those exclusion sets is critical for improving reproducibility and quality of bioinformatics analyses.
Finding and choosing an exclusion set can be a non-trivial task. The ENCODE project returns 94 hits using the "exclusion" search term (as of 11 August 2022) (https://www.encodeproject.org/search/? searchTerm=exclusion), most of them having minimal annotation and unknown curation methods. These sets are available for human and mouse genome assemblies; however, the ENCODE project lacks exclusion sets for the latest Telomere-to-Telomere (T2T-CHM13) human and Genome Reference Consortium Mouse Build 39 (GRCm39/mm39) mouse assemblies. Converting exclusion set coordinates between genomic assemblies using liftOver is not advisable since new artifact-prone regions are added and others are lost due to closed gaps (Amemiya et al. 2019); therefore, exclusion sets should be generated and used for their respective genome assemblies. Furthermore, exclusion regions have been observed in genomes of other species and many exclusion sets for model organisms remain unpublished and scattered across GitHub repositories. We curated a collection of exclusion sets for six model organisms and 12 genome assemblies, including the newly generated T2T and mm39 exclusion sets. We included two other types of potentially problematic regions: University of California Santa Cruz (UCSC)-annotated gap sets, e.g. centromere, telomere, short arm, and nuclear mitochondrial (NUMT) sets containing mitochondrial sequences present in the nuclear genome (Qu et al. 2008). We assemble a total of 82 uniformly processed and annotated exclusion sets in the excluderanges R/Bioconductor data package and provide API access via BEDbase.org.

Implementation
An overview of the excluderanges data is shown in Fig. 1. To create this resource, we performed a systematic internet and literature search. The ENCODE project was the largest source of exclusion sets for human (11 sets) and mouse (6 sets) organisms, covering hg19, hg38, mm9, and mm10 genome assemblies. We also obtained exclusion sets generated by the Blacklist (Amemiya et al. 2019) and PeakPass (Wimberley and Heber 2020) software. Additionally, we obtained exclusion sets for Caenorhabditis elegans (C. elegans, ce10 and ce11 genome assemblies), Drosophila melanogaster (D. melanogaster, dm3 and dm6), Danio rerio (D. rerio, danRer10), and A. thaliana (TAIR10). Using the Blacklist software, we generated exclusion sets for the latest Telomere-to-Telomere (T2T-CHM13) human and Genome Reference Consortium Mouse Build 39 (GRCm39/ mm39) mouse assemblies (Table 1, Supplementary Table S1).
Mitochondrial DNA sequences (mtDNA, 100-600K mitochondria per human cell) transferred to the nucleus give rise to the so-called mitochondrial DNA sequences in the nuclear genome (NUMTs). These sequences are found in genomes of various species (Qu et al. 2008), suggesting NUMTs may be a pervasive phenomenon. In the settings of DNA/chromatin sequencing (e.g. ATAC-seq), up to 80% of mitochondrial sequencing reads (Montefiori et al. 2017) may pile up in the NUMT sequences. Similar to exclusion sets, genomic regions highly homologous to mtDNA can be masked to improve biological signal. The reference human NUMT sequences have been available in the UCSC genome browser for hg18 [RHNumtS.2 database (Simone et al. 2011)] and lifted over to hg19 human genome assembly. Similarly, mouse NUMTs [RMNumtS database (Calabrese et al. 2012)] are available for the mm9 mouse genome assembly. However, recent human, mouse, and other organism genome assemblies lack NUMTs annotations in the UCSC database. We collected NUMT sets for more recent human and mouse genome assemblies, including hg38, T2T-CHM13, mm10, generated by Caleb Lareau in the mitoblacklist GitHub repository (https://github.com/caleblareau/mitoblacklist).
Gaps in the genome represent another type of problematic regions. These include centromere and telomere sequences, short arms, gaps from large heterochromatin blocks, etc. While some are present in genome assemblies of most organisms (centromeres, telomeres, short arms, covering 2.47% 6 1.64, 0.01% 6 0.01, and 15.39% 6 3.66 of hg38 chromosomes, respectively), many are assembly-specific (e.g. gaps between clones, contigs, scaffolds in hg19 and hg38 assemblies). Gap data are available from the UCSC Genome Browser database or UCSChosted data hubs. The T2T-CHM13 assembly lacks assembly-specific gaps by the definition of telomere-to-telomere sequencing (Nurk et al. 2022); however, coordinates of centromeres and telomeres are available from the CHM13 GitHub repository (https://github.com/marbl/ CHM13). Additionally, we obtained T2T peri/centromeric satellite annotations, known to be associated with constitutive heterochromatin and span sites involved in kinetochore assembly or sequences epigenetically marked as centromeres (Yunis and Yasmineh 1971). We also included the rDNA gap regions and regions unique to T2T-CHM13 v2.0 as compared with GRCh38/hg38 and GRCh37/hg19 assemblies under the rationale that alignments within these previously problematic regions might warrant extra attention. We characterized hg38 exclusion sets for overlap with gap regions and found that hg38.Kundaje. GRCh38_unified_Excludable, hg38.Boyle.hg38-Excludable.v2, and hg28.Wimberley.peakPass60Perc_sorted cover 99.40%, 99.08%, and 59.60% of centromeric regions, respectively. Notably, relatively few large regions were responsible for these overlaps (e.g. 27 out of 910 in hg38.Kundaje.GRCh38_unified_Excludable). In contrast, over 60% of the hg38.Nordin.CandRblacklist_hg38 exclusion set for the CUT&RUN technology overlapped centromeres on chromosomes 1 and 13. Only sets generated by the Blacklist software overlapped centromeres, telomeres, and short arms, and there results were consistent across organisms and genome assemblies (Supplementary Table S2). Given the distinct properties of gap regions and inconsistency of their presence in exclusion sets, the aforementioned NUMTs and gap sets may be combined with other exclusion sets.
The large number of exclusion sets (e.g. nine for hg38 human genome assemblies) creates uncertainty in which set to use for a given genome assembly. We annotated exclusion sets by their creation methods, date of last update, width distribution, percent of the genome covered, and other properties (Supplementary Table S1, BEDbase.org; example of BEDbase overview screen for hg38.Kundaje.GRCh38_unified_blacklist: http://bed base.org/#/bedsplash/1a561729234c2844303a051b16f66656). Only sets generated by the Boyle's lab Blacklist (Amemiya et al. 2019) or PeakPass by Eric Wimberley (Wimberley and Heber 2020) software had published methods. While methods for some sets may be inferred (e.g. the hg38 Yeo.eCLIP_Excludableregions.hg38liftover set may have been lifted over from hg19), we advise against using poorly annotated sets. We also characterized hg38 exclusion sets and found they vary dramatically in terms of number (12 052-38) and width (median 10 151-30 bp) (Supplementary Fig. S1A  ð Þ to minimize the effect of set size differences. We similarly found Kindaje-generated sets showing the best overlap with other sets, followed by hg38.Boyle.hg38-Excludable.v2. We also observed hg38. Wold.hg38mitoExcludable and hg38.Lareau.hg38.full.Excludable sets overlapping hg38.Kundaje.GRCh38_unified_Excludable, suggesting it contains NUMTs (Supplementary Fig. S1D). Because of its agreement with other sets, we recommend hg38.Kundaje. GRCh38_unified_Excludable set and list other recommended sets Table 1. Figure 1. Schematic overview of the excluderanges package. Data for each type of problematic region [exclusion sets, gaps, nuclear mitochondrial (NUMT) sets] were obtained from public sources for each model organism and the corresponding genome assemblies. Exclusion sets for T2T-CHM13 and GRCm39 genome assemblies were de novo generated. Three vertical dots indicate more categories in the corresponding section

Discussion
Limited annotation remains the main problem when selecting exclusion sets as it remains unclear which method and/or data were used. Examples include Wold's lab-generated "mitoblack" sets for mm9 and mm10 assemblies. Their curation method is unknown, and the exact number (123 regions), width distribution, and other characteristics suggest that one may be a liftOver version of the other. Similarly, it remains unknown why Bernstein's lab-generated "Mint_Blacklist" hg19 and hg38 exclusion sets have a very large number of regions (9,035 and 12 052, respectively) as compared with under 1000 regions for other exclusion sets. Additionally, hg19 and hg38 "full.blacklist" sets were generated by Caleb Lareau as a combination of NUMTs and unknown ENCODE exclusion sets, the source of which we were unable to infer. Given annotation shortcomings, we discourage creating a union of all exclusion sets and recommend using exclusion sets generated with published methods and, if relevant, combining them with gap sets.
Most annotated exclusion sets were created via Blacklist, a tool for detecting regions with abnormally high signal and/or low mappability (Amemiya et al. 2019). These genomic properties are commonly accepted as problematic; however, they may not be exhaustive. The Peakpass algorithm was developed to learn genomic properties associated with problematic regions using a random forest model (Wimberley and Heber 2020). It reported distance to nearest assembly gap or gene, and frequency of unique 4-mers or softmasked base pairs, as the most predictive of problematic regions. A limitation of Peakpass is that its extensive collection of Python, R, and bash scripts is poorly documented. A limitation of Blacklist, on the other hand, is computational resource requirements (64þ GB; CPU: 24þ cores, 3.4þ GHz/ core) and disk storage (1 TB) due to a large number of required BAM files (hundreds). A recent preprint introduced the Greenscreen pipeline, a promising tool for identifying exclusion sets using as few as three ChIP-seq data. It reports a 99.9% overlap with a Blacklistgenerated exclusion set, identical performance on ChIP-seq quality metrics but a smaller genome footprint (Klasfeld and Wagner 2022). We utilized Blacklist as the most well-known tool to generate exclusion sets for the T2T-CHM13 and GRCm39 genome assemblies. The aforementioned tools detect problematic regions in ChIP-seq data; however, they may be different in data generated by other technologies due to different biochemical procedures (Nordin et al. 2022). Additional collaborative efforts are needed to develop a consensus approach for defining well-documented exclusion sets.