High-complexity regions in mammalian genomes are enriched for developmental genes

Abstract Motivation Unique sequence regions are associated with genetic function in vertebrate genomes. However, measuring uniqueness, or absence of long repeats, along a genome is conceptually and computationally difficult. Here we use a variant of the Lempel-Ziv complexity, the match complexity, Cm, and augment it by deriving its null distribution for random sequences. We then apply Cm to the human and mouse genomes to investigate the relationship between sequence complexity and function. Results We implemented Cm in the program macle and show through simulation that the newly derived null distribution of Cm is accurate. This allows us to delineate high-complexity regions in the human and mouse genomes. Using our program macle2go, we find that these regions are twofold enriched for genes. Moreover, the genes contained in these regions are more than 10-fold enriched for developmental functions. Availability and implementation Source code for macle and macle2go is available from www.github.com/evolbioinf/macle and www.github.com/evolbioinf/macle2go, respectively; Cm browser tracks from guanine.evolbio.mgp.de/complexity. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Since the 1960s DNA reassociation kinetics have been used to quantify the repetitiveness of DNA. In a pioneering study of the reassociation kinetics of CpG islands, Bird et al. (1985) discovered that the 1% of the mouse genome making up such islands was unique in the sense that it had no matches elsewhere in the genome. In subsequent years, CpG islands attracted a huge amount of interest as they are associated with the promoters of housekeeping genes (Bird, 1986; The ENCODE Project Consortium, 2012) and influence chromatin structure (Wachter et al., 2014). In addition, Elango and Yi (2011) found that promoters containing CpG islands longer than 2 kb were enriched for developmental genes. In the present study we directly search for unique regions by delineating intervals where exact matches to other parts of the genome are short.
Uniqueness and repetitiveness are complementary, and Haubold and Wiehe (2006) proposed an early measure of genome repetitiveness, the I r . This was based on the lengths of matches starting at every position in the genome. Regions with similarity elsewhere in the genome were characterized by long matches, unique regions by short matches. In a sliding window analysis they found that some regions in the human genome including the Hox clusters were characterized by extremely low I r . The Hox genes encode transcriptional regulators that specify the anterior/posterior axis in all animals (Raff, 1996). Moreover, in the publication of the first draft of the human genome the Hox clusters had been singled out as containing very few transposon insertions compared to the rest of the genome (International Human Genome Sequencing Consortium, 2001). Recent transposon insertions would create long exact matches and hence increase the I r .
As a statistic, the I r has two disadvantages: Its distribution is unknown, and its implementation too slow for convenient genomics. When Haubold et al. (2009) derived the null distribution of match lengths for a random sequence, where all bases are independently drawn given the GC content, this opened the way for constructing a match-based statistic with known null distribution.
The classical match-based statistic for strings is the Lempel-Ziv complexity (Lempel and Ziv, 1976). It is computed from the decomposition of a string, S, into a set of substrings, S½i . . . j, where S½i . . . j is the longest substring that has an exact match to the left of S½i. The number of such maximal matches divided by the length of S is the Lempel-Ziv complexity. In a refinement of this measure, Odenthal-Hesse et al. (2016) proposed the match complexity, C m , where maximal matches of S½i . . . j can occur to the left and the right of S½i. In contrast to the I r and the Lempel-Ziv complexity, C m has known bounds. Its lower bound is 0 for sequences with the minimum number of two matches and its upper bound is reached in long random sequences, for which the expectation of C m is 1.
Here we derive the null distribution of C m , which allows us to delineate unique genomic regions. These are defined as regions where the C m is indistinguishable from that found in random sequences. The computation of C m , like that of the I r and the Lempel-Ziv complexity, is based on suffix arrays (Ohlebusch, 2013, p. 59ff). A suffix array is essentially an index to some text, in this case the nucleotide sequence of a genome. A standard method for ensuring programs based on this technology are fast, is to separate index computation, which may take hours, from index querying, which is often a matter of seconds. Our implementation of C m , macle for MAtch CompLExity, makes use of this separation leading to querying times for the complete human genome of half a minute or less.
We apply macle to the human and mouse genomes. Since we are particularly interested in regions unique within these genomes, we first need to establish by simulation that the newly derived null distribution of C m is accurate. Next, we scan the human and mouse genomes and ask two questions: First, are highly complex regions enriched for promoters? Second, are the genes with promoters in high-complexity regions enriched for particular functions? We find that high-complexity regions are mildly enriched for promoters, but that these promoters are strongly enriched for developmental genes.

The match complexity
The match complexity, C m , was first described by Odenthal-Hesse et al. (2016). Consider a string S ¼ S½1 . . .L and extend it by S½L þ 1 ¼ $, where $ is a unique character. Given the first match factor S½1 . . .F 1 , where F 1 ¼ maxfk : S½1 . . .k matches elsewhere in S}, we define recursively the nth match factor S½F nÀ1 þ 1 . . .F n , which ends at F n ¼ maxfk : S½F nÀ1 þ 1 . . .k matches elsewhere in S}. We stop with the Nth match factor if F N ¼ L and set N L :¼ N. For example, S ¼ CGGGCGGGCT has N L ¼ 3 factors, CGGGC:GGGC:T.
Following Odenthal-Hesse et al. (2016), the match decomposition of a string is computed from its sorted suffixes. Table 1 shows the sorted suffixes of S as column suf½i. The suffix array, sa½i, abstracts from this the starting positions. It is 'enhanced' by the largest common prefix array, lcp½i, which denotes the length of the longest prefix match between suf½i and suf½i À 1; lcp½1 ¼ À1, as there is no suffix to compare with (Ohlebusch, 2013, p. 79ff). To decompose S, the lcp array is traversed in the order in which the suffixes appear in S. The mapping between positions in S and in sa is the inverse suffix array, isa½sa½i ¼ i. As summarized in Algorithm 1, the longest match starting at position i is determined by looking up lcp½isa½i and lcp½isa½i þ 1. The greater of these is the length of the desired match factor. The algorithm reports the factor, skips it and repeats until it has traversed the entire sequence.
If we apply Algorithm 1 to the enhanced suffix array of S in Table 1, we first look up lcp½isa½1 ¼ lcp½1, which is À1, and lcp½2, which is 5. Hence the first match factor S½1 . . . 5 is reported and the algorithm repeats by looking up lcp½isa½6 ¼ lcp½9 ¼ 4 and lcp½10 ¼ 0. The second factor S½6 . . . 9 is reported, and so on.
Computation of the lcp array is carried out in time proportional to the length of the corresponding sa by first computing its isa (Ohlebusch, 2013, p. 79ff). In practice, suffix array construction consumes the bulk of the resources necessary for match decomposition.
In order to define C m , we need the following three quantities for a sequence, S, of length L: First, C o ¼ N L =L is the observed number of match factors per base; second, C i ¼ 2=L is the theoretical minimum; third, C a is the expected match count per base in a random sequence of length L with the same GC-content as S, which we explain below. With these quantities, we define Subtraction of C i ensures that C m is bounded by 0 and an expectation of 1. l 1 lcp½isa½i 5: l 2 lcp½isa½i þ 1 6: j i þ maxðl 1 ; l 2 ; 1Þ À 1 7: reportMatchFactorðS½i . . . jÞ 8: i j þ 1 To compute C a , we use the distribution of the lengths, Y Ã i , of the longest matches starting at position i in a random sequence of GCcontent 2p (Haubold et al., 2009): Here, k is the vector of nucleotide counts, k ¼ ðk A ; k C ; k G ; k T Þ, which sum to the threshold length, x ¼ k A þ k C þ k G þ k T ; and p is the vector of nucleotide frequencies, p ¼ ð0:5 À p; p; p; 0:5 À pÞ. From Equation (1) we compute the mean, l, and variance, r 2 , of the match length distribution: Given the match decomposition of, say, the human genome, we wish to compute local values of C m by sliding a window of length W across the decomposition, and computing C o ; C i and C a with respect to the current window: We define highly complex regions as those that are indistinguishable from random. In order to detect such regions, we need to calculate appropriate threshold values, or quantiles, of C m . For this purpose we model the null distribution of C m by a normal distribution. This is justified by assuming that L ) W ) 1. Now let N i denote the number of factors up to position i. Then ðN i Þ i¼0;1;2;... is a renewal process, since its increments are-by assumption-independent and equally distributed according to the distribution of Y Ã i . According to the central limit theorem for renewal processes (Serfozo, 2009, Example 67), where B W is normally distributed with mean 0 and variance Wr 2 =l 3 ; B W $ Nð0; r 2 W=l 3 Þ. This leads to (2) which allows us to approximate quantiles for C m using the quantile function where p is the probability covered up to that point, say 5%, and erf is the error function.

Implementation
We used our program macle to compute C m in sliding windows of length 10 kb, which advanced in steps of 1 kb, thus generating sets of overlapping windows. Macle is written in Cþþ and calls the software library libdivsufsort (available from github) for suffix array computation. This library implements one of the fastest suffix sorting algorithms known, the divSufSoft algorithm recently described by Fischer and Kurpicz (2017). Given the human genome in FASTA format, it is first indexed, and can then be queried repeatedly.
We wrote the program macle2go in Go to annotate the output of macle. Macle2go implements three functions, quantile, annotate and enrichment.
Quantile implements the quantile computation outlined above.
Annotate first identifies the n windows of a given minimum C m . It then finds the g o genes whose promoters intersect one or more of these n windows; we defined the promoter of a gene as the 4 kb interval centered on its transcription start site (Saxonov et al., 2006). Annotate also repeatedly draws n random windows to determine the number of genes expected by chance alone, g e . In addition, it counts the number of times, f, that n windows are found containing ! g o genes in i iterations. Then the P-value of H 0 : Enrichment connects the genes found by annotate to the functional categories of the gene ontology (GO) (The Gene Ontology Consortium, 2000). The result is a list of GO terms and the number of genes observed in that category. Enrichment also carries out a randomization procedure similar to that used by annotate to test the significance of finding more genes than expected in a particular GO category.

Data
Human genome version GRCh38.p2 and mouse genome version GRCm38.p3 were used throughout. The RefGene annotation data for both organisms was downloaded from the UCSC genome browser. To connect genes to GO-terms, we used the files Homo_sapiens.gene_info, Mus_musculus.gene_info and gene2go from the NCBI website. In addition, we downloaded CpG islands from the UCSC genome browser. All the primary data files mentioned here are posted together with our C m browser tracks for human and mouse.

Null distribution
To check the accuracy of the null distribution in Equation (2), we simulated a random 100 Mb sequence and carried out a sliding window analysis with 10 kb windows. Figure 1 shows the distribution Fig. 1. The simulated and expected null distribution of C m . The simulated distribution was computed from a 100 Mb random sequence with a 10 kb sliding window. The expected distribution is given in Equation (2) of local C m values compared to Equation (2). The fit is not perfect, but reasonable.

Time and memory consumption
We investigated the resource consumption of macle using simulated sequences. Our test computer ran Ubuntu 18.04 on Intel Xeon 2.10 GHz processors with 256 GB RAM. Macle first computes a permanent index, which can then be queried repeatedly. Figure 2 shows that index construction is slightly more than linear in the length of the input sequence. Still, 1 Mb per s can be taken as a rule of thumb. Index traversal, on the other hand, is expected to take time proportional to index length, which in turn is proportional to sequence length. Figure 2 shows that querying the index by sliding a window across the entire input sequence is indeed linear in sequence length and takes 1.15 s for 256 Mb, while the corresponding index construction takes 302.3 s, that is, over 250 times longer.
Memory consumption of index construction and querying is strictly linear in the length of the input sequence (not shown

Application to the human and mouse genomes
Indexing the 3.1 Gb of the human genome took 1 h, 19 min, 3 s and 128.2 GB RAM. Similarly, indexing the 2.7 Gb of the mouse genome took 1 h, 8 min, 33 s and 111.1 GB RAM. The first thing we calculated off these indexes was genome-wide C m , which is 0.8071 in human and 0.7868 in mouse. In other words, the mouse genome is overall slightly less complex, or more repetitive, than the human genome. However, these genome-wide values hide a large diversity of chromosome-specific complexity. Figure 3 shows the C m for the 19 mouse autosomes, the 22 human autosomes and their sex chromosomes. In humans the chromosome-wide complexity varies between 0.40 in the Y chromosome and 0.85 in chromosome 3. Interestingly, shorter chromosomes have significantly lower complexity than longer chromosomes with a correlation of r ¼ 0.61 (P ¼ 1:5 Â 10 À3 ). In mouse, C m ranges from an extraordinarily low C m ¼ 0:06 in the Y chromosome to C m ¼ 0:86 in chromosome 11. In contrast to human, there is no correlation between chromosome length and complexity (r ¼ 0.17, P ¼ 0.45). Notice also that in mouse the X chromosome (C m ¼ 0:68) has the second lowest complexity, while the human X chromosome (C m ¼ 0:76) has a complexity similar to that of equally long autosomes.
We now zoom into the genomes by carrying out sliding window analyses. Figure 4 shows the frequency distribution of C m in 10 kb sliding windows across the human and mouse genomes. This is bimodal with a large mode at 0.91 representing the bulk of both  genomes, and a smaller mode at 0.05 for highly repetitive regions. Notice that mouse has a larger proportion of low-complexity regions than human, presumably due to the extremely low complexity of its Y chromosome (Fig. 3).
Next, we investigate how the C m values summarized in Figure 4 are distributed along individual chromosomes. Figure 5 shows C m along human chromosome 2, which contains one of the four human Hox clusters, HoxD, at 176.2 Mb. The green horizontal line at C m ¼ 0:9954 is the 5% quantile of the C m distribution in random sequences obtained from Equation (2) and delineates highcomplexity regions. There are 79 such regions ranging from 10 kb to 77 kb and totaling 1.2 Mb, or 0.50% of chromosome 2. Figure 5 depicts these regions as vertical lines. At the other extreme of the C m distribution is the centromere, which is characterized by 4 Mb of very low C m .
In order to visualize how C m highlights genes, Figure 6 shows our C m results integrated with the UCSC genome browser in the HoxD region. Notice the two 100 kb-spanning regions of high complexity. These overlap a large portion of the HoxD genes shown as the 'GENCODE' track. They also correspond to a high density of CpG islands and a low density of RepeatMasker elements.
In mouse there are 2908 high-complexity windows that merge into 772 distinct intervals (Supplementary Table S2). These range in length from 10 to 56 kb totaling 10.1 Mb or 0.37% of the mouse genome. We were surprised to find that the 56 kb interval chr2: 76 703 660-76 759 659 contains no promoter. It does, however, intersect the gene encoding titin, Ttn, a component of muscles (chr2: 76 703 983-76 982 547). Homozygous mutations in Ttn lead to developmental defects and premature death (www.informatics.jacx.org). In total, the promoters of 958 genes are found in these intervals, compared to an expectation of 401.94. This amounts to a 2.38-fold enrichment of genes, which include members of HoxA, HoxB and HoxD; HoxC is missing, as its C m remains slightly below the cutoff.

Functional enrichment
The 1443 human promoters in high-complexity regions cluster in 211 biological processes with at least 10 members. We ran our Monte-Carlo procedure to test whether the observed number of genes in a particular GO category is larger than expected by chance alone with 10 8 iterations. This resulted in 45 categories enriched with maximal significance (P < 10 À8 ). When Bonferroni-corrected for the 211 tests, this amounts to P < 2:1 Â 10 À6 . The degree of enrichment ranged from 18.7 to 2.1 (Supplementary Table S3). The enriched categories are involved in cell differentiation, morphogenesis and organ development. Table 2 lists the top 10 enriched categories. The genes underlying these functional categories contain many well-known transcription factors, including Lhx in the category 'spinal cord association neuron differentiation', Fox in 'dopaminergic neuron differentiation' and Hox in 'anterior/posterior pattern specification' (Supplementary Table S3). The 958 mouse genes in high-complexity regions cluster in 173 processes with at least 10 members, of which 51 are maximally significant (P < 10 À8 Â 173 ¼ 1:7 Â 10 À6 ) with enrichment factors ranging from 14.4 to 2.8 (Supplementary Table S4). Again, they are involved in a broad range of developmental processes. Table 3 lists the top 10 enriched categories. As in the case of human, the genes underlying these functional categories contain numerous widely studied transcription factors, such as Pax in 'cell fate determination', Gata in 'tissue development' and Hox in 'embryonic skeletal system morphogenesis'. Elango and Yi (2011) reported that CpG islands longer than 2 kb are also associated with developmental genes. So we asked, whether our high-complexity regions coincided with CpG islands in general, and specifically with long CpG islands. In human, 88% of highcomplexity regions intersect one or more of the 30 477 CpG islands. In mouse almost the same proportion, 87%, of high-complexity regions intersect one or more of its 16 023 CpG islands. This proportion drops if we restrict the analysis to long CpG islands, of which the human genome contains 1426. Only 35% of high-complexity regions intersect a member of this class of CpG islands. Similarly, in mouse only 19% of high-complexity regions intersect a long CpG island.

Discussion
The relationship between raw nucleotide sequence and its biological function has been at the center of molecular biology since the discovery of the double helix. An early insight was that the genomes of eukaryotes are riddled with non-functional sequences, especially transposons. Devising fast methods for finding repetitive elements has been a major concern of bioinformatics, as genomes are routinely delivered with repeats annotated (Fig. 6) or masked (Bao et al., 2015). In human, approximately half the genome is masked.
Instead of identifying repeats, we have concentrated on finding repeat-free regions, because uniqueness as defined by reassociation kinetics has been linked to CpG islands for decades (Bird et al., 1985), and CpG islands are functional markers in vertebrate genomes. We measure uniqueness using the match complexity, C m (Odenthal-Hesse et al., 2016), thereby effectively carrying out an in silico reassociation experiment. C m is calculated by augmenting suffix array techniques (Algorithm 1) with the mathematics of the match length distribution summarized in Equation (2). This equation is based on the assumption that the number of factors in a long window is approximately normally distributed, which fits the simulations (Fig. 1). Equipped with this formalism we computed C m across the human and mouse genomes, and connected the results with genes and their functions.
Our program macle is designed for efficiency. The enhanced suffix array it computes is written to disk in binary form to allow querying of arbitrary regions hundreds of times more quickly than the one-off index construction (Fig. 2). However, further speedup of index computation might be forthcoming due to the recent publication of a parallel version of the divSufSort algorithm on which macle is based (Labeit et al., 2016). In contrast, the hypothesis testing in macle2go already runs in parallel, as the problem of repeatedly drawing sets of windows easily lends itself to this type of optimization.
When querying individual chromosomes, the C m values for human in Figure 3 are more widely scattered than for mouse. The one exception to this rule is the mouse Y chromosome, which is a true outlier among the chromosomes studied with C m ¼ 0:06. Correspondingly, the sliding window graph of this chromosome contains long stretches of low C m and looks different from all other chromosomes (see online browser tracks). This might come as a surprise since the male-specific region of the Y chromosome is 99.9% euchromatic and contains approximately 700 protein-coding genes (Soh et al., 2014). However, these genes form an 'ampliconic' structure consisting of recently duplicated copies of genes involved in spermatogenesis.
The mouse Y chromosome illustrates a peculiarity of the C m : Regions with low complexity are usually assumed to be gene-poor and heterhochromatic. The mouse Y chromosome shows that this need not be the case. A low C m merely indicates a recent duplication, regardless of the length of the region involved, or its copy number.
In contrast, high C m , the focus of this study, has an unambiguous interpretation: It indicates the absence of recent duplication, perhaps due to selection against it. Approximately 0.50% of chromosome 2 is high-complexity (Fig. 5), which is close to the 0.56% highcomplexity across the entire human genome. Haubold and Wiehe (2006) had previously observed in a less systematic fashion that such regions contained developmental genes such as members of the four Hox clusters.
We carried out a comprehensive sliding window analysis to study this rigorously. Its most basic parameter is window length, which we arbitrarily set to 10 kb, as the enrichment for developmental genes remains highly significant in mouse and human regardless of whether windows of 5, 10 or 20 kb are analyzed: In humans the most highly enriched categories for 5 and 20 kb windows are 'spinal  cord association neuron differentiation', and 'proximal/distal pattern formation', respectively (Supplementary Tables S5 and S6), which fits with the top category for 10 kb windows, which like for 5 kb windows is 'spinal cord association neuron differentiation' (Table 2). Similarly, in mouse the most highly enriched category detected with 5 kb windows is 'embryonic skeletal system development', and with 20 kb windows 'anterior/posterior pattern specification' (Supplementary Tables S7 and S8). These developmental categories fit the category most highly enriched using 10 kb windows, 'cell fate determination' (Table 3).
However, with increasing window length the high-complexity fraction of the genome decreases. In human, 5 kb windows cover 63.8 Mb, 10 kb windows 17.3 Mb and 20 kb windows cover merely 4.7 Mb (for raw data see Supplementary Tables S1, S9 and S10). Similarly, in mouse, 5 kb windows cover 47.8 Mb, 10 kb windows 10.1 Mb and 20 kb windows cover merely 1.0 Mb (Supplementary Tables S2, S11 and S12). So the numerical details of our analysis depend strongly on the window size, but not the general conclusion that high-complexity regions in human and mouse are enriched for developmental genes.
Another potential issue with our analysis is our decision to count promoters intersecting the high-complexity regions rather than whole genes. However, we have programmed our annotation tool, macle2go, such that it can also use whole genes as the unit of comparison. Again, the choice makes no qualitative difference (not shown).
Finally, we investigated the relationship between highcomplexity regions and CpG islands. Over 85% of high-complexity regions in human and mouse contain CpG islands. The preponderance of high-complexity regions in GC-rich regions is perhaps not surprising, because fewer matches are found in regions where the local GC content is significantly higher than the global GC content, as is the case in CpG islands. 'General' CpG islands are not enriched in developmental genes, while CpG islands longer than 2 kb are (Elango and Yi, 2011). However, only between one fifth and one third of our high-complexity regions intersect long CpG islands, and the high-complexity region in mouse Ttn contains neither short nor long CpG islands. Still, we suspect that both attributes, highcomplexity and CpG enrichment, are tied to the same phenomenon, biological function; the difference being that high match complexity captures a particular subset of functions, those sensitive to transposon insertion and copy number variation.
We conclude that the match complexity can be used to identify genomic regions highly enriched in developmental genes. The type of analysis established in this study is applicable to any genome with complete sequence and reasonably comprehensive annotation. We therefore plan to analyze the high-complexity regions in other mammals and then across the vertebrates. Genomes with less complete annotations than human or mouse are likely to result in more regions lacking annotation. Among these, those with the highest complexity would be the most promising candidates for further, functional study.
Conflict of Interest: none declared.