Benchmark of software tools for prokaryotic chromosomal interaction domain identification

Abstract Motivation The application of genome-wide chromosome conformation capture (3C) methods to prokaryotes provided insights into the spatial organization of their genomes and identified patterns conserved across the tree of life, such as chromatin compartments and contact domains. Prokaryotic genomes vary in GC content and the density of restriction sites along the chromosome, suggesting that these properties should be considered when planning experiments and choosing appropriate software for data processing. Diverse algorithms are available for the analysis of eukaryotic chromatin contact maps, but their potential application to prokaryotic data has not yet been evaluated. Results Here, we present a comparative analysis of domain calling algorithms using available single-microbe experimental data. We evaluated the algorithms’ intra-dataset reproducibility, concordance with other tools and sensitivity to coverage and resolution of contact maps. Using RNA-seq as an example, we showed how orthogonal biological data can be utilized to validate the reliability and significance of annotated domains. We also suggest that in silico simulations of contact maps can be used to choose optimal restriction enzymes and estimate theoretical map resolutions before the experiment. Our results provide guidelines for researchers investigating microbes and microbial communities using high-throughput 3C assays such as Hi-C and 3C-seq. Availability and implementation The code of the analysis is available at https://github.com/magnitov/prokaryotic_cids. Supplementary information Supplementary data are available at Bioinformatics online.

We used Armatus v. 2.1 and varied the gamma parameter from 0.01 to 5.00 with a step of 0.01 to find the optimal CIDs segmentation of a genome. [1]

Arrowhead
The Juicer Tools v. 1.11.09 arrowhead module with default parameters was used to determine CIDs. The nested domains were then discarded to remove the hierarchy. [2] CaTCH From the CaTCH package we used the domain.call function with no parameters to annotate CIDs. [3]

CHDF
We executed CHDF with parameters Length, Number and Size equal to the dimensions of the corresponding contact map. [4] chromoR The function segmentCIM from the chromoR was used on the raw contact maps to find the domains. [5]

Chromosight
The Chromosight package was used in the detect mode with parameter pattern = 'borders'. [6] ClusterTAD We used Java implementation of ClusterTAD with the parameters window set to 10 and minimum CID size set to 4 bins. The search for optimal CID boundaries was performed among CID sets identified with different K values used for clustering. [7] deDoc We used deDoc(M) algorithm implementation with default parameters to identify CIDs. [8]

Directionality_Index
We used the Directionality Index approach based on paired t-test. For the matrices with resolution higher than 10 kb, the number of bins considered for t-statistics calculation was set to 20, while for the lower resolution matrices it was set to 10. We then manually annotated CID boundaries using the directionality index profile and t-statistics threshold at significance level 0.1. [9]

EAST
The EAST script was executed with parameters maxW = 50 and minL = 4. [10] GMAP GMAP was executed on the raw contact maps with parameters dom_order set to 1, min_dp set to 4, max_dp set to 100, min_d set to 10 and fcthr set to 0.75. [11] HiCExplorer To identify CIDs with HiCExplorer, we used its hicFindTADs module with minDepth parameter ranging from 3 to 20 matrix resolutions and delta from 0.005 to 0.1 with a step of 0.005. A correction for multiple testing was performed using the FDR method. [12]

HiCseg
The HiCseg was executed with parameters distrib = "G" and model = "D". We varied the nb_change_max parameter from 2 to 100 to find the optimal domains. [13]

IC-Finder
In order to identify CIDs, IC-Finder was executed with default parameters. [14]

Insulation_Score
We used the original Perl implementation of the Insulation Score algorithm and varied the IS parameter from 5 to 15 resolution sizes with a step of 1 and the IDS parameter from 2 to 10 resolution sizes with a step of 2 in order to find the optimal segmentation of CIDs. [15]

lava.armatus
The Lavaburst package (https://github.com/nvictus/lavaburst) includes four domain calling algorithms: armatus, corner, modularity and variance. For each algorithm, we varied the gamma parameter in the range from 0.01 to 5.00 with a step of 0.01 to find the optimal CIDs.

MrTADFinder
We used MrTADFinder v. 1.2 and varied the res parameter from 0.5 to 3.5 with a step of 0.1 to find the optimal segmentation of CIDs. [16]

OnTAD
We used OnTAD with parameter minsz set to 4. For C.crescentus, B.subtilis and E.coli we then merged the annotated domains at levels 2, 3 and 4 into one set. [17] spectral For spectral algorithm we varied the lambda threshold parameter from 0.01 to 1.00 with a step of 0.01 to find the optimal CIDs annotation. [18]

SpectralTAD
SpectralTAD was executed using a raw contact map as input data and a minimum CID size of 4 bins. [19]

TADbit
We used a raw contact map as input for TADbit and set the resolution parameter to correspond to each matrix. [20] TADpole We used TADpole with parameter resol corresponding to each matrix size. [21] TADtool.directionality We used directionality index and insulation index algorithms from the TADtool. We varied the cutoff parameter from 0 to 0.02 with a step of 0.0001 and the window_size parameter from 20000 to 200000 with a step of 10000 in order to find the optimal CIDs. [22] TADtool.insulation

TADtree
In order to obtain domain annotation using TADtree, we set the S to 50 and the N to 100. The domains annotated at all N values were merged and nested domains were then discarded to remove the hierarchy. [23] TopDom For TopDom, we varied the window_size parameter from 2 to 20 with a step of 1 to find the optimal segmentation. [24] Supplementary