## Abstract

A tandem repeat in DNA is two or more contiguous, *approximate* copies of a pattern of nucleotides. Tandem repeats have been shown to cause human disease, may play a variety of regulatory and evolutionary roles and are important laboratory and analytic tools. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to easily detect them in genomic sequence data. In this paper, we present a new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size. we model tandem repeats by percent identity and frequency of indels between adjacent pattern copies and use statistically based recognition criteria. We demonstrate the algorithm's speed and its ability to detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human β T cell receptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to 700 kb. a world wide web server interface at c3.biomath.mssm.edu/trf.html has been established for automated use of the program.

## Introduction

DNA molecules are subject to a variety of mutational events. One of the less well understood is *tandem duplication* in which a stretch of DNA, which we call the *pattern*, is converted into two or more *copies*, each following the preceding one in a contiguous fashion. For example we could have … TCGGA … → … TCGGCGGCGGA … in which the single occurrence of triplet CGG has been transformed into three identical, adjacent copies. The result of a tandem duplication event is termed a *tandem repeat*. Over time, individual copies within a tandem repeat may undergo additional, uncoordinated mutations so that typically, only *approximate* tandem copies are present.

Tandem repeats are presumed to occur frequently in genomic sequences, comprising perhaps 10% or more of the human genome. But, accurate characterization of the properties of tandem repeats has been limited by the inability to easily detect them. In recent years, the discovery of the trinucleotide repeat diseases has piqued interest in tandem repeats. These diseases, including fragile-X mental retardation (1), Huntington's disease (2), myotonic dystrophy (3), spinal and bulbar muscular atrophy (4) and Friedreich's ataxia (5), are the result of a dramatic increase in the number of copies of a trinucleotide pattern. In afflicted individuals, the copy number has been amplified from the normal range of tens of copies to hundreds or thousands, resulting in the disease. It has been suggested that the repeats themselves produce unusual physical structures in the DNA, causing polymerase slippage and the resulting amplification (6,7).

A more salubrious potential role for tandem repeats is gene regulation, in which the repeats may interact with transcription factors, alter the structure of the chromatin or act as protein binding sites (8–12). Tandem repeats have an apparent function in development of immune system cells. Breakpoints for immunoglobulin heavy chain switch recombination occur within tandem repeats preceding the heavy chain constant region genes (13). Because the number of copies in any specific tandem repeat is often polymorphic in the population, tandem repeats have proven useful in linkage analysis and DNA fingerprinting (14,15). Recent studies of allele diversity at tandem repeat loci have provided support for the ‘Out of Africa’ hypothesis of modern human evolution (16,17).

To date, much of the research on tandem repeats has focused on those with short patterns (2–5 nt), presumably because such repeats are relatively easy to spot by eye in printed sequences. Repeats with long patterns (sometimes called variable number of tandem repeats or VNTRs) are notoriously harder to detect [even when the copies are identical, for example see Benson (18) for the 101 bp repeats *undetected* in Hellman *et al.* (19), a paper on the role of tandem repeats as hot spots for recombination]. Given the importance of known and potential biological roles for tandem repeats and their usefulness in other biological studies, it seemed essential to us to develop an *efficient* and *sensitive* algorithm for detecting these repeats so that they may receive further study.

A number of algorithms already exist which either directly or indirectly detect tandem repeats. All suffer from significant limitations. One group of algorithms is based on computing alignment matrices (20–22). Their primary limitation is excessive running time. The best algorithm in this group (22) has time complexity *O*[*n*^{2} polylog(*n*)] for a sequence of length *n* and would not be useful for sequences much longer than several thousand bases. (In this paper we report on our analysis of sequences up to 700 kb in length.)

Another group of algorithms finds tandem repeats indirectly using methods from the field of data compression. An algorithm by Milosavljevic and Jurka (23) detects ‘simple sequences’, i.e. mixtures of fragments that occur elsewhere. Simple sequences may or may not contain tandem repeats and this algorithm makes no attempt to deduce a repeated pattern. An algorithm by Rivals *et al.* (24) bases the compression on the presence of small preselected patterns (all those of size 1–3) and is not readily generalized to longer patterns for which there is an algorithmic need. To their credit, both of these methods provide a measure of statistical significance based on the amount of compression.

Another collection of algorithms aim more directly at finding tandem repeats. Of these, one exact algorithm (25) is limited by its definition of approximate patterns, requiring that two copies differ either by *k* or fewer substitutions (Hamming distance) or by *k* or fewer substitutions and indels (unit cost edit distance). Besides treating substitutions and indels as equals, the requirement for a fixed number of differences rather than a *percentage* difference is unsatisfactory. Any fixed number of differences suitable for small patterns (say five differences for patterns of size 20) would be unreasonably restrictive for larger patterns (five differences for patterns of size 100). Conversely, any fixed number for large patterns would allow too much variability in small patterns. A heuristic algorithm by Karlin *et al.* (26) is similarly hampered by the use of matching blocks separated by error blocks of fixed size. The remaining two algorithms in this group require input from the user which limits their usefulness. An earlier heuristic algorithm by Benson (27) finds tandem repeats only if they have a pattern size which is specified in advance. An exact algorithm by Myers and Sagot (28) (limited to patterns with size of at most 40 bases) requires that the approximate pattern size and a range for the number of copies be specified.

The algorithm (29) presented in this paper is designed to overcome many of the aforementioned limitations: (i) it uses the method of *k-tuple matching* to avoid the need for full scale alignment matrix computations; (ii) it requires no *a priori* knowledge of the pattern, pattern size or number of copies; (iii) there are no restrictions on the size of the repeats that can be detected; (iv) it uses percentage differences between adjacent copies and treats substitutions and indels separately; (v) it determines a consensus pattern for the smallest repetitive unit in the tandem repeat. The program has already been used as a preprocessor in a new alignment algorithm where tandem duplication augments the standard mutation set of insertion, deletion and substitution (18).

A number of ideas incorporated into this new algorithm have been utilized in earlier homology detection programs (30,31), yet the goals and methods differ. Instead of looking for highest scoring homologous regions, the algorithm looks for tandem repeats which are often hidden in larger homologous regions or which may fall well below the level of significance required for other programs to report a match. The detection criteria are based on a stochastic model of tandem repeats specified by percent identity and frequency of insertions and deletions, rather than some minimal alignment score. Finally, the program aligns repeat copies against a consensus sequence, revealing patterns of common mutations. These patterns yield insight into the history of duplications that produced the tandem repeat, thus providing a potentially valuable tool for phylogenetic research.

The remainder of this paper is organized as follows. In Methods we present a probabilistic model of tandem repeats, an algorithm overview and the set of criteria that guide the recognition process. In the Discussion we present our analysis of the frataxin (Friedreich's ataxia) gene sequence, the human b T cell receptor locus and two yeast chromosomes. Finally, in the Conclusion we describe directions for future research.

## Methods

### Probabilistic model of tandem repeats

We model *alignment* of two tandem copies of a pattern of length *n* by a sequence of *n*-independent Bernoulli trials (coin tosses). The probability of success, *P* (*heads*), which we also call *p*_{M} or *matching probability*, represents the *average* percent identity between the copies. Each head in the Bernoulli sequence is interpreted as a match between aligned nucleotides. Each tail is a mismatch, insertion or deletion. A second probability, *p*_{I} or *indel probability*, specifies the average percentage of insertions and deletions between the copies. Figure 1 illustrates the underlying idea for the model.

While Figure 1 is an interpretation of a particular alignment as a Bernoulli sequence, we are more generally interested in the distribution of Bernoulli sequences and the properties of alignments that they represent when dealing with a specific pair (*p*_{M}, *p*_{I}), for example (*p*_{M} = 0.80, *p*_{I} = 0.10). Note that these *conservation parameters* serve as a type of extremal bound, i.e. as a quantitative description of the *most divergent copies* we hope to detect.

### Program outline

Our program has *detection* and *analysis* components. The detection component uses a set of statistically based criteria to find *candidate* tandem repeats. The analysis component attempts to produce an alignment for each candidate and if successful gathers a number of statistics about the alignment (percent identity, percent indels) and the nucleotide sequence (composition, entropy measure).

*Detection component*. We assume that adjacent copies of any pattern will contain some matching characters in corresponding positions. Just how many matches and how the distance between those matches should vary depend on the fixed values of *p*_{M} and *p*_{I}. In the next section, we develop the statistical criteria to answer these questions. Here, we describe how the matches are detected.

The algorithm looks for matching nucleotides separated by a common distance *d*, which is not specified in advance. For reasons of efficiency it looks for *runs of k matches*, which we call *k-tuple matches*. A *k-tuple* is a window of *k* consecutive characters from the nucleotide sequence. Matching *k*-tuples are two windows with identical contents and if aligned in the Bernoulli model would produce a run of *k* heads. Because we limit ourselves to *k*-tuple matches, we will *not* detect all matching characters. For example, if *k* = 6 and two windows contain TCATGT and TCTTGT we will *not* know that there are 5 matching characters because the window contents are not identical. Put in terms of the Bernoulli model, the aligned windows would be represented by the sequence HHTHHH, which is *not* a run of 6 heads.

The basic operation of the detection component is illustrated in Figure 2. Let *S* be a nucleotide sequence. We select a small integer *k* for the tuple or window size (*k* = 5 for example) and keep a list of all possible *k* length strings (there are 4^{k} for the DNA alphabet A,C,G,T) which we call the *probes*. By sliding the window across the sequence, we determine the probe at each position *i* in *S*. For each probe *p*, we maintain a *history list H*_{p} of the positions at which *p* occurs.

When a position *i* is added to *H*_{p}, we scan *H*_{p} for *all* earlier occurrences of *p*. Let one earlier occurrence be at *j*. Since *i* and *j* are the indices of matching *k*-tuples, the distance *d* = *i − j* is a *possible pattern size* for a tandem repeat. For the criteria tests, we need information about other *k*-tuple matches at the same distance *d* where the leading tuple occurs in the sequence between *j* and *i*. A *distance list D*_{d} stores this information. It can be thought of as a sliding window of length *d* which keeps track of the positions of matches and their total.

List *D*_{d} is updated every time a match at distance *d* is detected. Position *i* of the match is stored on the list and the total is increased. The right end of the window is set to *i* and matches that occurred before *j* = *i − d* are dropped from the list and subtracted from the total. Lists for other nearby distances are also updated at this time (Random Walk Distribution in the next section), but only to reset their right ends to *i* and remove matches that have been passed by the advancing windows. Information in the updated distance lists is used for the sum of heads and apparent size criteria tests as described in the next section. If both tests are passed, the program moves on to the analysis component.

*Statistical criteria*. The statistical criteria are based on runs of heads in Bernoulli sequences, corresponding to matches detected with the *k*-tuples and stored in the distance lists. The criteria are based on four distributions which depend upon: (i) the pattern length, *d*; (ii) the matching probability, *p*_{M}; (iii) the indel probability, *p*_{I}; (iv) the tuple size, *k*. For each distribution, we either calculate it with a formula or estimate it using simulation. Then, we select a cut-off value that serves as our criterion. Below we describe the distributions and criteria in more detail.

*Sum of heads distribution*. This distribution indicates how many matches are required. Let the random variable *R*_{d,k,pM} = the total number of heads in head runs of length *k* or longer in an iid Bernoulli sequence of length *d* with success probability *p*_{M}. The distribution of *R*_{d,k,pM} is well approximated by the normal distribution and we have previously shown that its exact mean and variance can be calculated in constant time (32). For the *sum of heads criterion*, we use the normal distribution to determine the largest number, *x*, such that 95% of the time *R*_{d,k,pM} ≥ *x*. For example, if *p*_{M} = 0.75, *k* = 5 and *d* = 100, then the criterion is 26. Put another way, if a pattern has length 100 and aligned copies are expected to match in 75 positions, then by counting only matches that fill a window of length 5, we expect to count at least 26 matches 95% of the time.

*Random walk distribution*. This distribution describes how distances between matches may vary due to indels. Because indels change the distance between matching *k*-tuples (Fig. 3), there will be situations where the pattern has size *d*, yet the distance between matching *k*-tuples is *d* ± 1, *d* ± 2, etc. In order to test the sum of heads criterion, we count the matches in *D*_{d ± Δd}, for Δ*d* = 0, 1, …, Δ*d*_{max} for some Δ*d*_{max}. In our model, indels are single nucleotide events occurring with probability *p*_{I}. Insertions and deletions are considered equally likely and we treat the distance change as a problem of random walks. Let the random variable *W*_{d,pI} = the maximum displacement from the origin of a one-dimensional random walk with expected number of steps equal to *p*_{I}·*d*. It can be shown (33) that 95% of the time, *W*_{d,pI} ranges between . We set Δ*d*_{max} = . For ⌋example if *p*_{I} = 0.1 and *d* = 100, then Δ*d*_{max} = 7.

*Apparent size distribution*. This distribution is used to distinguish between tandem repeats and non-tandem direct repeats (Fig. 4). For tandem repeats, the leading tuples in matching *k*-tuples will be distributed throughout the interval from *j* to *i*, whereas for non-tandem repeats, they should be concentrated on the right side of the interval near *i*. Let the random variable *S*_{d,k,pM} = the distance between the first and last run of *k* heads in an iid Bernoulli sequence of length *d* with success probability *p*_{M}. *S*_{d,k,pM} is the apparent size of the repeat when using *k*-tuples to find the matches and will usually be shorter than the pattern size *d*. We estimate the distribution of *S*_{d,k,pM} by simulation because we make it conditional on first meeting the sum of heads criterion. For given *d, k* and *p*_{M}, random Bernoulli sequences are generated using *p*_{M}. For every sequence that meets or exceeds the sum of heads criteria, the distance between the first and last run of heads of length *k* or larger is recorded. From the distribution, we determine the maximum number *y* such that 95% of the time *S*_{d,k,pM} > *y*. We use *y* as our *apparent size criterion*. For example, if *p*_{M} = 0.75, *k* = 5 and *d* = 100, then the criterion is 56. In order to test the apparent size criterion, we compute the distance between the first and last tuple on list *D*_{d}. If the distance between the tuples is smaller than the criterion, we assume the repeat is not tandem or that we have not yet seen enough of it to be convinced.

*Waiting time distribution*. This distribution is used to pick tuple sizes. Tuple size has a significant inverse effect on the running time of the program because increasing tuple size causes an exponential decrease in the expected number of tuple matches. If the nucleotides occur with equal frequency, then increasing the tuple size by Δ*k* increases the average distance between randomly matching tuples by a factor of 4^{Δk}. If *k* = 5, the average distance between random matches is ∼1 kb, but if *k* = 7, the average distance is ∼16 kb. Thus, by using a larger tuple size, we keep the history lists short. On the other hand, increasing the tuple size decreases the chance of noticing approximate copies because they may not contain a long, unbroken run of matches. Let the random variable *T*_{k,pM} = the number of iid Bernoulli trials with success probability *p*_{M} until the first occurrence of a run of *k* successes. *T*_{k,pM} follows the *geometric distribution of order k*. If we let *p* = *p*_{M} and *q* = 1 − *p* then the exact probability *P*(*T*_{k,pM} = *x*) for *x* ≥ 0 is given by the recursive formula (34)

*p*

_{M}= 0.75 and

*k*= 5 then we need at least 31 trials (coin tosses) to have a 95% chance of seeing a run of 5 heads. For patterns smaller than 31 characters, we need to use a smaller

*k*-tuple. The waiting time distribution allows us to balance the running time and sensitivity of our algorithm by picking a set of tuple sizes,

*each applying to a different range of pattern sizes*. The program processes the sequence once, simultaneously checking these different tuple sizes. We require that the smallest pattern for tuple size

*k*have a sum of heads criterion of at least

*k*+ 1. Table 1 shows the range of tuple sizes and the corresponding pattern sizes currently used by the program.

*Analysis component*. If the information in the distance list passes the criteria tests, a candidate pattern consisting of positions *j* + 1 … *i* is selected from the nucleotide sequence and aligned with the surrounding sequence using wraparound dynamic programming (WDP) (35,36). If at least two copies of the pattern are aligned with the sequence, the tandem repeat is reported. Several implementation details of the analysis component are described below.

*Multiple reporting of repeat at different pattern sizes*. When a single tandem repeat contains many copies, several pattern sizes are possible. For example, if the basic pattern size is 26, then the repeat may be reported at sizes 26, 52, 78, etc. We limit this redundancy in the output to, at most, three pattern sizes. Note that we do not automatically limit the output to the smallest period size because a much better alignment may come from a larger size (for example Table 5, indices 410172–410459).

*Narrow band alignment*. Alignments are the program's most time intensive calculations. To decrease running time, we limit WDP calculations to a narrow diagonal band in the alignment matrix for patterns larger than 20 characters. In accordance with the random walk results, the band radius is Δ*d*_{max}. The band is periodically recentered around a run of matches in the current best alignment.

*Consensus pattern and period size*. An initial candidate pattern *P* is drawn from the sequence, but this is usually not the best pattern to align with the tandem repeat. To improve the alignment, we determine a *consensus pattern* by majority rule from the alignment of the copies with *P*. The consensus is used to realign the sequence and this final alignment is reported in the output. Period size is defined as the most common matching distance between corresponding characters in the alignment and may not be identical to consensus size.

### Program usage and output

Input to the program consists of a sequence file and the following parameters: (i) alignment weights for match, mismatch and indels; (ii) *p*_{M} and *p*_{I}; (iii) a minimum size for patterns to report; (iv) a minimum alignment score to report. We have developed a web based interface for the program. Using an HTML form at c3.biomath.mssm.edu/trf.html, the user provides an input DNA sequence file. Defaults can be used for the remaining parameters. After program execution, two files are returned. The first is a summary table describing the location and statistical properties of the tandem repeats found. The second contains the alignment of each repeat with its consensus sequence. The files are linked so that selecting an entry from the table opens a second browser window which contains the proper alignment. The summary table includes the following information: (i) indices of the repeat in the sequence; (ii) period size; (iii) number of copies aligned with the consensus pattern; (iv) size of the consensus pattern (may differ from the period size); (v) percent of matches between adjacent copies overall; (vi) percent of indels between adjacent copies overall; (vii) alignment score; (viii) percent composition for each of the four nucleotides; (ix) entropy measure based on percent composition.

## Results

To demonstrate the capabilities of our program, we used it to analyze four sequences, the human frataxin gene sequence (Friedreich's ataxia) (5), the human β T cell receptor locus sequence (37) and two yeast chromosomes (I and VIII). [The frataxin gene sequence and the human β T cell receptor sequences were obtained from GenBank. The yeast chromosomes sequences were obtained via ftp from ftp.ebi.ac.uk directory pub/databases/yeast in files chri_230209.ascii and chrviii_562638.ascii. Indexing in this paper is relative to the sequences in these files. Data file accession numbers for these sequences are: frataxin gene promoter and intron 1, U43748; human T cell receptor, L36092; yeast chromosome 1, U12980, L20125, L05146, L22015, L28920; yeast chromosome 8, U11583, U11582, U11581, U10555, U10400, U10399, U00062, U00061, U10556, U00060, U00059, U10398, U10397, U00027, U00028, U00030, U00029.] In our analysis, we searched for all pattern sizes between 1 and 500 bases (the implementation's current upper size limit, to be extended in subsequent versions). We used one of two sets of alignment parameters (match, mismatch, gap), either (+2,−7,−7) or (+2,−5,−7). Only those repeats scoring at least 50 with these parameters are reported. Occasionally, the same repeat is reported at different pattern sizes. We have omitted these redundancies.

We performed two searches on each sequence, using different conservation parameter values, (*p*_{M} = 0.75, *p*_{I} = 0.20) and (*p*_{M} = 0.80, *p*_{I} = 0.10). While the first search is slower than the second, the detected repeats are nearly identical. Table 2 shows running times of the program and Tables 4–7 list the tandem repeats found.

### Human frataxin gene (Friedreich's ataxia), intron 1

Friedreich's ataxia is one of the triplet repeat diseases (5). It is caused by copy number expansion of the triplet GAA in the first intron of the frataxin gene. Table 4 lists the repeats found in the sequence. Besides the triplet repeat, our program found two others which were apparently unknown, a 44 bp pattern and a 14 bp pattern. Figure 5 shows the program's alignment of the 44 bp repeat.

### Human β T cell receptor locus sequence

This sequence (37) contains a family of immune recognition coding elements, the T cell receptor variable, diversity, joining and constant gene segments. It was selected for its size and because many tandem repeats within the sequence had already been identified. Table 5 lists the new repeats we found. Of the 83 repeats that we found, 38 were previously annotated and most of those were for patterns of size 5 or smaller. We missed 6 annotated repeats: 4 dinucleotide repeats and 1 tetranucleotide repeat (alignment scores were below our cut-off) and 1 repeat with period size 10 567 bases (beyond the current implementation's pattern upper size limit). Of the 45 unannotated repeats, 13 have short patterns (2–6 bp) and may be polymorphic and thus useful for linkage analysis. Six unannotated repeats have large pattern sizes (116, 65, 52, 49, 34 and 30 bp). The 116 base pattern is also reported at size 39 with a lower scoring alignment. The annotated 60 base pattern repeat (indices 12596–13266) is indicative of the program's ability to find repeats with substantial amounts of mutation between adjacent copies (74% matching characters and 7% indels overall).

### Yeast chromosomes

Tables 6 and 7 list the tandem repeats found for the yeast sequences. Of special interest are the clusters of tandem repeats which show up repeatedly at the ends of the chromosomes, suggesting recent swapping of the ends. Chromosome 8, in particular, has two different clusters on its right end.

### The (27, 21, 48, 15, 135) cluster

The *FLO1* gene and its paralogous pseudogenes in chromosomes 1 and 8 contain a cluster of 5 tandem repeats with pattern sizes 27, 21, 48, 15 and 135. We designate these clusters *C*_{1} and *C*_{2} (adjacent on the left end of chromosome 1), *C*_{3} (right end, opposite strand) and *C*_{4} (right end of chromosome 8). The 27, 48 and 135 base patterns are not reported in every cluster in Tables 6 and 7. Subsequent analysis of the surrounding sequences, however, revealed that every pattern is present but not necessarily as two or more copies (Table 3). For each pattern size, the number of copies varies among the four clusters. More specifically, no cluster is identical in its copy number to any other cluster. This implies that duplication or excision events (deletion of copies) have occurred since the separate clusters were incorporated into the chromosomes. The sequences around these clusters also reveal close homology. For example, *C*_{3} and C_{4} are nearly identical over 18 000 bases and C_{2} and C_{3} display homology over 15 000 bases.

### The (13, 10, 36) cluster

A cluster of 3 tandem repeats with pattern sizes 13, 10 and 36 bases appears on both ends of chromosome 8 (Table 7). The 36 bp pattern also appears on the left end (low index numbers) of chromosome 6 (not shown). For the 36 bp pattern, each occurrence has a different copy number. The 10 and 13 bp patterns are identical in their occurrences. Surrounding sequences comprising 4200 bases are nearly identical for these three clusters.

## Conclusion

In this paper, we have presented a new algorithm for finding tandem repeats in DNA sequences without the need to specify either the pattern or pattern size. The algorithm is based on the detection of *k*-tuple matches. It uses a probabilisitic model of tandem repeats and a collection of statistical criteria based on that model. We have demonstrated the speed and utility of the algorithm by analyzing four sequences ranging in size up to 700 kb. Several avenues for future research are raised by this work, including methods to estimate statistical significance for tandem repeats and algorithms to determine plausible mutational histories.

### Statistical issues

We have yet to develop a good statistical significance measure for tandem repeats. For now, we use a cut-off alignment score based on simulations with random sequences. Difficulties include the local variation in nucleotide content in real sequences, which is decidedly non-random, and the problem of accounting for copy number as well as total repeat length. Estimates of significance developed in Benson and Waterman (27) are too high in this application because they apply to tandem repeats of one pattern size only, rather than the range of sizes considered here.

### Mutational history

Analyzing the mutational history of tandem repeats requires utilizing the pattern of mutations among adjacent copies to describe the interwoven progression of substitutions, indels and duplication/excision events leading from a single copy of the pattern to the present day sequence. Such histories can suggest how the boundaries and size of the duplication unit vary and may reveal details about the duplication mechanism.

## Acknowledgements

The author would like to thank Xiaoping Su for his help in analyzing the sum of heads criterion, Astrid Jervis for her help in simulating many of the statistical measures and examining the program output and Lan Dong for her help with some of the programing and examining the output. Thanks also to Mike Waterman, Richard Arratia and Rolf Backofen for helpful discussions. This work was partially supported by NSF grant CCR-9623532.

## Comments