The centromeric and telomeric heterochromatin of eukaryotic chromosomes is mainly composed of middle-repetitive elements, such as transposable elements and tandemly repeated DNA sequences. Because of this repetitive nature, Whole Genome Shotgun Projects have failed in sequencing these regions. We describe a novel kind of transposon-based approach for sequencing highly repetitive DNA sequences in BAC clones. The key to this strategy relies on physical mapping the precise position of the transposon insertion, which enables the correct assembly of the repeated DNA. We have applied this strategy to a clone from the centromeric region of the Y chromosome of Drosophila melanogaster. The analysis of the complete sequence of this clone has allowed us to prove that this centromeric region evolved from a telomere, possibly after a pericentric inversion of an ancestral telocentric chromosome. Our results confirm that the use of transposon-mediated sequencing, including positional mapping information, improves current finishing strategies. The strategy we describe could be a universal approach to resolving the heterochromatic regions of eukaryotic genomes.
The centromeric and telomeric heterochromatin of eukaryotic chromosomes is mainly composed of middle-repetitive elements, such as transposable elements, and tandemly repeated DNA sequences. As these chromosomal regions play an important role in chromosome segregation and nuclear architecture, a comprehensive description of their sequences is essential for understanding chromosome behavior and chromosome evolution.
Recently, the Drosophila Heterochromatin Genome Project has shown that current shotgun strategies are capable of determining high-quality contiguous sequence of heterochromatic regions with a high density of transposable elements, but not of regions containing satellite DNA repeats (1). Likewise, in all ‘finished’ genomes, there are many heterochromatic regions, including all the centromeres, not yet sequenced. Sequencing such regions is likely to rely on both an understanding of the structural organization of centromeric regions and the development of special approaches to sequence individual large-insert clones (2).
Although BAC clones may be sequenced efficiently using shotgun-based sequencing strategies, in many cases when the insert is repetitive DNA, shotgun techniques fail. The reasons are varied: difficulties in the subcloning, elimination of certain repetitive sequences, rearrangements and principally, uncertainties in the final assembly process. At present, despite the substantial efforts made by specialized sequence finishing groups, the highly repetitive heterochromatin regions of genomes remain unfinished.
Currently, transposon-mediated sequencing is broadly used as an effective alternative to classical shotgun strategies, since it simplifies the making of clone libraries. However, no benefit has yet been derived from one of the most potentially powerful tools for sequencing by transposon: the possibility of mapping the position of the transposon insertion (3). Therefore, it is important to test whether a transposon-based sequencing strategy using positional information is an effective approach to the sequencing of large-insert heterochromatic clones.
Drosophila maintains its telomeres by occasional targeted transposition of telomere-specific non-LTR retrotransposons (HeT-A, TART and TAHRE in the D. melanogaster subgroup) to chromosome ends, and not by the more common mechanism of telomerase-generated G-rich repeats (4–8). However, arrays of telomeric elements have been found in centromeric and pericentromeric heterochromatin (9–15). The centromeric region of the Y chromosome of D. melanogaster comprises the heterochromatic bands h17–h18. Past cytomolecular studies of the region h18 showed that it contains a tandem array of HeT-A- and TART-related sequences, collectively named the 18HT satellite, as well as several degenerate HeT-A elements (Figure 1) (13). The discovery of telomeric retrotransposons at this centromeric site led to the suggestion that the centromere of the Y was derived from a telomere by an intra-chromosomal rearrangement (13,16). In support of this hypothesis, Berloco and collaborators (15) have found HeT-A-related sequences in the centromeric region of the Y chromosome of all species analyzed, independently of the position of the centromere within these chromosomes.
Since the initial finding of telomeric sequences in the centromeric region of the D. melanogaster Y chromosome, we have been working towards the determination of the sequence of this region. Once BACs from region h18 were identified (14), BACR26J21 was selected because it contains both 18HT satellite repeats and degenerate HeT-A elements. In addition, BACR26J21 represents the typical problematic heterochromatic BAC, making it an ideal candidate to test our transposon-based sequencing strategy.
We conclude that the use of transposon-mediated sequencing, including positional mapping information, allows a substantial improvement of the current finishing strategies. This could become a universal approach to resolving the heterochromatic regions of the genomes. Importantly, the complete sequence of this clone has allowed us to prove that this centromeric region evolved from a telomere, possibly after a pericentric inversion of an ancestral telocentric chromosome.
MATERIALS AND METHODS
Media and reagents
Bacterial cultures were grown in LB broth supplemented with 1.5% agar when appropriate. SOC medium (2% bacto-tryptone, 0.5% yeast extract, 10 mM NaCl, 2.5 mM KCl, 10 mM MgCl2, 10 mM MgSO4, 20 mM glucose) was used for recovery of cells following electroporation.
Restriction enzymes and T4 ligase were purchased from New England Biolabs (Beverly, MA) and Promega (Madison, WI). All oligonucleotides were purchased from Isogen (IJsselstein, The Netherlands). The pMOD<oriV/KAN-2> transposon construction vector was a generous gift from Epicentre (Madison, WI).
Qiagen (Valencia, CA) products were used for plasmid purification and DNA gel extraction. Hyperactive mutant Tn5 transposase, carrying the mutations EK54/MA56/LP372 (17), was a generous gift from W. S. Reznikoff.
Construction of plasmids carrying artificial transposons
To obtain the pMOD2<oriV/KAN-2/I-SceI> vector, a 0.6-kb EcoRI–SalI fragment containing the recognition sequence for I-SceI was prepared from pSCM522 (18), blunted, gel-purified, and ligated to HindIII-digested, blunted and dephosphorylated pMOD2<oriV/KAN-2>. A 0.75-kb HindIII–PstI fragment containing the recognition sequence for PI-SceI was prepared from pBSVDEX (19), blunted, gel-purified and ligated to SacI-digested, blunted and dephosphorylated pMOD2<oriV/KAN-2/I-SceI>, resulting in pMOD2<oriV/KAN-2/I-SceI/PI-SceI> vector (pJW599) constructed by J.W. (Supplementary Figure 1a).
Integration of Tn5-like transposons into BAC DNA
Transposome complexes were formed by incubating 1 μg of the 3.3-kb PshAI-released transposon DNA from plasmid pJW599 (Supplementary Figure 1b) with 10 μg/μl of hyperactive mutant Tn5 transposase for 2 h at 37°C in 40 μl of a buffer containing 50 mM Tris–acetate (pH 7.5), 150 mM potassium acetate, 10 mM magnesium acetate and 4 mM spermidine. Transposition reactions were carried out by incubating 14 μl of transposon–transposase complexes with 1 μg of BACR26J21 DNA for 2 h at 37°C in a 20 μl reaction volume. Reactions were stopped with 0.1% SDS and 10 min incubation at 70°C. The transposon–transposase complexes were desalted on agarose cones and transformed into EPI300T1R cells (Epicentre, Madison, WI) by electroporation. Cells were plated on LB medium containing 30 μg/ml of kanamycin and 12.5 μg/ml of chloramphenicol. Around 2500 transformants were obtained in a single experiment (Figure 2a-1).
Induction of oriV-controlled replication and DNA extraction
Overnight cultures grown in LB supplemented with 30 μg/ml kanamycin and 12.5 μg/ml chloramphenicol were used to inoculate fresh cultures that were grown in the same medium. At an A590 = 0.2–0.3, l-arabinose inducer was added to the final concentration of 0.01% and cultures were grown for an additional 4–5 h before the cells were harvested. DNA from the induced recombinant BACs carrying transposon insertions was prepared by the alkaline lysis method with the R.E.A.L. Prep 96 (Quiagen. Valencia, CA).
Mapping of Tn599 transposon insertions
To determine the position of the transposon insertion, each clone was digested with PI-SceI (Figure 2a-2). In each clone there are two PI-SceI recognition sites: one of them is at a fixed position in the vector used to construct the RPCI-98 library, and the other is the one we have introduced in the transposon Tn599. After digestion of the clones, two fragments were generated and fractionated by PFGE in a CHEF-DRII apparatus (Bio-Rad, Hercules, CA). 1% (w/v) agarose gels were run in 0.5× TAE (1× TAE is 40 mM Tris–acetate/2 mM EDTA) at 6V/cm, 14°C for 20 h with a pulse time ramp from 0.4–10 s (Figure 2a-3). However, in order to distinguish the two fragments, the DNA from the PFG was transferred to a Hybond™-N+ nylon membrane and hybridized with probe P (Figure 2a-4). This probe derived from a region of the plasmid vector used to construct the BAC library and was obtained by PCR using pBACe3.6 DNA as a template, the primers P-Fw: 5′ AAGGCCGTAATATCCAGCTG 3′ and P-Rv: 5′ CTTCGTGTAGACTTCCGTTG 3′ and the following cycling parameters: an initial denaturalization step at 94°C for 5 min, 30 cycles of 94°C for 1 min; 55°C for 30 s; 72°C for 1 min followed by a final extension at 72°C for 10 min. The obtained PCR product was labeled with [32P] dCTP using ‘Rad prime DNA radiolabeling kit’ (Invitrogen) according to supplier's recommendations. Figure 2a-1 displays an example where two clones with the transposon inserted at different positions, generate (after PI-SceI digestion) two sets of fragments of the same two sizes (Figure 2a-2). Taking the sizes alone one could not distinguish the relative position of the transposon from the fixed PI-SceI recognition site. It was only after hybridization with the probe P that one could discern the two fragments, and thus, place them in the correct order (Figure 2a-4).
Sequencing and sequence assembly
Transposon-based DNA sequencing was performed using the standard dye terminator chemistry with two different primers; TN-FP (5′-GCCAACGACTACGCACTAGCCAAC-3′) and TN-RP (5′-GAGCCAATATGCGAGAACACCCGAGAA-3′). The following PCR cycling parameters were used: initial denaturation step at 95°C for 5 min, 35 cycles of 95°C for 30 s; 55°C for 20 s; 60°C for 4 min, followed by a final extension step at 60°C for 5 min. Reactions were analyzed in an ABI Prism 3730 Sequencer (Applied Biosystems, Foster City, CA) at the sequencing facility of the Unidad de Genómica Antonia Martín Gallardo, PCM-UAM.
Sequence assembly of the transposon-based sequencing approach was carried out using preGap4 and Gap4 assembly tools, from the Staden package (20). After the sequences were clipped for transposon's ME, each mate pair was linked by aligning the 9-bp duplicated region, and minicontigs for each clone, of around 1500 bases long, were built. This union is very important when dealing with highly similar repetitive sequences (>99% identity among different repeat units), since it provides longer reads, where single nucleotide differences are more likely to show up, narrowing the chances of misassemblies. Then small assembly projects were generated separately for eight windows of 20 kb. The size of the window was decided considering that PFGE fragment size measurement is associated with an inevitable error. Each minicontig was placed in one subproject, according to its Tn599 insertion position. In each small project, assembly was done by sequence overlap, still bearing in mind the relative position within the window. Afterward, each subproject was joined to its neighbor subproject, and assembled by sequence overlap until all sequences had been assembled together into a single contig (Figure 2a-5). For the final sequence (FM992409), 933 reads (with an average read length of 817 nucleotides per read) were assembled, giving 4.7-fold coverage of BACR26J21.
The strategy used at The Wellcome Trust Sanger Institute for sequencing and finishing repetitive clones was the following: the clone BACR26J21 was shotgunned using a 4–6-kb pUC subclone library and an assembly was produced using Phrap (Philip Green) (Figure 2b-1). Gap4 (20), Orchid (Flowers) and Confirm (Attwood) were then used to assess the integrity of the assembly (Figure 2b-2). Where misassemblies occurred in the repeat, copies were compared against each other and base pair differences were identified and tagged. These differences were then used to identify regions of unique sequence within the repeat. Using paired read information it is often possible to then position reads accurately based on their overlap in unique sequence to contiguate the clone. Because this approach was not sufficient in the case of BACR26J21, a LIL–TIL strategy was employed. The LIL–TIL strategy involves the generation of a Large Insert Library (LIL), which is then combined with the existing data (Figure 2b-3). Large insert subclones (9–12 kb) are then chosen to generate a Transposon Insert Library (TIL), based on overlap in sequence believed to occur only once in the clone (assuming roughly uniform shotgun coverage). TILs were generated using a kit-based approach (Figure 2b-4). Transposons randomly inserted into the subclone were sequenced outwards from the transposon insertion site. TIL read-pairs were then joined up based on the 9-bp duplication site to give contiguous TIL read-pairs in excess of 1 kb. The combination of the LIL-TIL strategy, existing read-pair information and restriction digest data was used to build a correct assembly (Figure 2b-5). For the final sequence (CU076040) 6354 reads (with an average read length of 872 nucleotides per read) were assembled, giving12.2-fold coverage of BACR26J21.
The BACR26J21 assembled sequence was analyzed by BLAST through the FlyBase server (http://flybase.bio.indiana.edu) and by dot plot analyses using Dotter (21). The comparison of the two independent sequences for BACR26J21 was carried out by BLAST2.
In order to develop a technique for successful sequencing, in an automated manner, of any large heterochromatic insert clone, we modified an existing transposon used for sequencing. Most vectors used in the construction of BAC libraries contain a homing endonuclease recognition site to provide a tool for linearization (http://bacpac.chori.org/vectorsdet.htm). The introduction of the same recognition site into the transposon facilitates the mapping of its insertion by measuring the two fragments generated by a single digestion. Combining the positional information and the sequence data at the assembly stage, similar sequences that derive from different positions within the repetitive DNA would never be assembled together.
Our method relies on the use of a Tn5-like transposon to generate, in vitro, a library of clones, with the transposon inserted randomly along the entire length of the sequence of interest (22). Since repetitive sequences are rather unstable, a single-copy-number clone (BAC) is necessary. However, BACs have the disadvantage of a very low yield of DNA. Therefore, for the construction of the transposon, we have used the plasmid pMOD2<oriV/KAN-2> that carried the inducible origin of replication oriV, which, only upon production of TrfA in the host cell, transforms the clone from single-copy to multiple copies [for advantages of this system see Wild et al. (23)].
To allow the mapping of the insertion we have implemented the transposon with two rare sites, PI-SceI and I-SceI. These double stranded DNAses are extremely unusual in that they have large recognition sites (18 bp for I-SceI and 39 bp for PI-SceI). For instance, an 18-bp recognition sequence will occur only once in every 7 × 1010 bp of random sequence, which is equivalent to only one site in 20 mammalian-sized genomes (24). However, unlike standard restriction endonucleases, homing endonucleases tolerate some sequence degeneracy within their recognition sequence (25,26). In any case, the chance of finding their recognition sites in the target sequence of a large insert clone is extremely low. This makes the mapping strategy universal, independent of the nature of the genome DNA being used.
Taking all these facts into consideration, the plasmid construction was carried out as described in Methods, producing the final plasmid pMOD2<oriV/KAN-2/I-SceI/PI-SceI>, from which the artificial transposon Tn599 was obtained by digestion with the restriction enzyme PshAI. In summary, the final transposon comprised the following significant traits: (i) two mosaic-end (ME) sequences at the ends of the transposon [necessary for the transposon to be able to integrate (27)], (ii) known sequences adjacent to the ME, that serve as priming sites for sequencing the insert region contiguous to the transposon, (iii) a kanamycin-resistance gene, (iv) an inducible origin of replication, oriV, (v) a recognition site for I-SceI and (vi) a recognition site for PI-SceI (Supplementary Figure 1).
An optimized transposon-based strategy for sequencing highly repetitive DNA
To test whether our novel strategy improves upon the standard shotgun sequencing and directed finishing approach, a challenging BAC from the RPCI-98 D. melanogaster BAC Library was independently sequenced by both procedures. The cloning vector used in the construction of this library was pBACe3.6, which has a PI-SceI recognizion site at its position 11371–11407 (http://bacpac.chori.org/vectorsdet.htm). We chose BACR26J21 based on our previous physical analyses of the centromeric region h18 of the Y chromosome (Figure 1). According to our data, BACR26J21 should carry about 100 kb of 18HT satellite repeats and decayed HeT-As elements interspersed with transposable elements (13,14). The length of the insert of BACR26J21 was estimated to be around 160 kb.
The optimized transposon-based strategy consisted of: (i) in vitro transposition of Tn599 into the BAC clone (Figure 2a-1), (ii) physical mapping of 620 clones from the transposon insertion library, which provides around 5-fold coverage of the clone (Figure 2a-2–4, see Methods section) and (iii) sequencing and assembly using the positional information (Figure 2a-5). Figure 3a and b display an example of the information required for correct mapping of the transposon in each clone. Figure 3a is the image of a PFGE gel with PI-SceI-digested DNA, and Figure 3b displays the same DNA hybridized to the radioactive labeled probe P.
Although Kang and collaborators (22) observed randomness of the Tn5 transposition in the Escherichia coli genome, we decided to check the randomness of insertion of our Tn599 transposon into the highly repetitive DNA. In Figure 4 we have represented, in the x-axis, all the mapped BACR26J21-Tn599 clones and in the y-axis their correspondent sizes. As can be seen, the distribution of Tn599 insertions was even along the entire heterochromatic BACR26J21 (perfect randomness is represented by a black line). Thus, it seems that Tn5-like transposons can also be used to sequence heterochromatic BACs with satellite regions.
BLAST2 comparison of the two independent sequences of BACR26J21 obtained by the two sequencing approaches (CU076040 and FM992409) shows them to be nearly identical (Figure 5a). There are seven single-nucleotide differences, two additional A's in a poly(A), 33 extra nucleotides in the non-repeat region of the BAC clone and 3097 extra nucleotides in the tandem repeated region. The 33 extra nucleotides are present in the middle of a small palindrome within a HeT-A element. These 33 nucleotides correspond to the expected ones for such a position in the canonical HeT-A element (Figure 5b). The 3097 extra nucleotides in the tandem repeat region correspond exactly with one additional 18HT repeat unit. The standard restriction digest data, used as a quality control, does not provide enough information to be able to discern in tandem repeated regions. As the optimized transposon-based approach contains positional information, it seems reasonable to consider its consensus sequence to be more reliable.
Overall organization of the newly acquired sequence from the centromeric region h18 of the D. melanogaster Y chromosome and parsimonious reconstruction of its evolution
Dot plot analyses of the sequence obtained for BACR26J21 show, as we anticipated, that around two thirds of the insert corresponds to tandem repeats (large circle in Figure 6a). Specifically, this repeated region contains 29 18HT satellite repeats (Figure 6b). Looking at the transition between the 18HT repeats and the decayed telomeric elements, we were able to precisely determine the DNA region that underwent amplification, generating the more abundant 3.1-kb repeat unit (Supplementary Data). In addition, three types of smaller repeats (around 2 kb) were also detected (see asterisks in Figure 6b). These repeats were probably generated by major deletions, and two of them subsequently amplified by unequal crossovers (Figure 7d to a). Along the repeats we found three transposons: 1360, 412 and copia with 95%, 99% and 99% nucleotide identity to the canonical elements, respectively (Figures 6b and 7a). After the satellite repeats we found degenerated HeT-A elements (87% nucleotide identity), interspersed with several transposable elements: one mdg1, one diver, one 1731 and four F-elements with 95%, 96%, 100% and 97% nucleotide identity, respectively. We also found a small segmental duplication from the euchromatic region 42A present at two different sites at the distal end of the sequence (Figure 7a). These observations showed that, in addition to deletions and transposon insertions, several regions, indicated by circles in Figure 6a, had gone through amplification events. Thus, based on regional sequence homology, the four F-elements appear to derive from a single transposition event, and the region where it was inserted would have been subsequently amplified. The same would have happened to the region containing the duplication from 42A. (Figure 7 and Supplementary Data).
Although the evolution of this chromosomal region could have followed an intricate path, the most parsimonious reconstruction suggests that the ancestral structure was a head-to-tail array of nine telomeric elements: four truncated HeT-As (elements 1, 2, 3 and 5 in Figure 7e), one truncated TART (element 4) and four full-length HeT-As (elements 6, 7, 8 and 9). The HeT-As 1, 2 and 3 and part of the TART element conformed the 3.1-kb segment that was originally amplified to form the 18HT satellite, which had already been described (10,13). We know that four of the HeT-As are full-length because for each of them we see its ORF sequence, its 3′ UTR sequence and in its most 5′UTR region we can even see the characteristic short sequences that derive from the 3′ end of the upstream element to the master copy from which each HeT-A element was retrotranscribed (28). In fact, it is the difference in the number and composition of these short sequences at the 5′UTR that has been used to distinguish among the HeT-A elements. There are several aspects of the reconstructed ancestral telomeric array that make it resemble a current Drosophila telomere. First of all, the existence of complete HeT-A elements is a typical feature of D. melanogaster telomeres (29,30). Second, the head-to-tail orientation of the telomeric retrotransposons is also a characteristic feature of telomeric arrays. Third, the reconstructed ancestral telomeric array has a similar length to the ones found at present Drosophila telomeres (29,30). For all these evidences we claim that this centromeric region has a large array of telomeric retrotransposons, and thus, that it evolved from an ancestral telomere.
The possibility of sequencing tandemly repeated regions by the presented transposon-based approach represents major progress in the development of a universal technique that will enable the characterization of the centromeres of eukaryotic genomes.
Although the sequences obtained by the two approaches are similar, the use of positional information reduces the number of reads required. Thus, the optimized transposon-mediated sequencing method needed seven times less sequences than the currently used strategy (933 reads versus 6354 reads), which represents a decrease in the cost. In addition, it is important to highlight that the final assembly of the highly repetitive DNA in BACR26J21 was achieved by the presented optimized method with coverage as low as 4.7-fold, thanks to the inclusion of positional information. When comparing the coverage obtained by both methods (4.7-fold versus 12.2-fold), the current method coverage is only increased 2.6 times (while the number of reads used was seven times larger). This means that the presented transposon method generates a less piled up assembly where the reads were distributed more homogeneously along the sequence. Taken altogether, these observations show the optimized transposon-mediated strategy to be a very effective and efficient method for sequencing highly repetitive regions.
The potential of this technique resides in the feasible automation of the transposon insertion mapping. We envision that the sizes of the PI-SceI-digested fragments can be easily measured with high resolution by optical mapping (31). This technique would serve not only to measure the sizes but also to make the necessary distinction between the two fragments. Instead of using probes to hybridize and mark one of the fragments upon measuring the sizes, we would suggest the use of triple helix formation with fluorescently labeled oligonucleotides (32), prior to extending the DNA fragments into the surface. In this manner, the transposon insertion mapping could be done at the single molecule level by automated fluorescent microscopy. Developing a single molecule approach to obtaining the positional information would increase the accuracy of the mapping, which implies a reduction in the number of clones needed to obtain a high quality sequence.
Consistent with the major morphological changes undergone by Drosophila Y chromosomes, Berloco et al. (15) found that the Y chromosome in the melanogaster subgroup are either telocentric or metacentric (Supplementary Figure 2). However, they have always found HeT-A-related sequences in the centromeric region, independently of the position of the centromere. It seems that intra-chromosomal rearrangements have recurrently occurred in these species, probably by means of an inversion.
Understanding the presence of telomere-specific retroelements (HeT-A and TART) at the centromeric region of the Y chromosome is a long standing issue. Two hypotheses have been considered so far: one favors the idea that small fragments of telomeres could have been moved into the Y chromosome (10,30) while the other considers the possibility of an entire or almost entire telomere being present (13,14). The two scenarios resemble one another, with the only apparent difference in the size of the fragment being moved. However, the mechanisms that would give rise to each are different. It is well known that Y chromosomes frequently acquire segmental duplications during evolution (33). Consequently, one could consider that the telomeric sequences in non-telomeric regions of the Y chromosome derive from small telomeric fragments, moved into the Y via transposition. On the other hand, if we found evidences of the presence of an entire telomere at the centromere of the Y chromosome, then the most likely origin would be an intra-chromosomal rearrangement. To definitively resolve this issue the sequence of a large portion of this region has been determined in this work, and its analysis suggests that the centromeric region h18 evolved from a telomere, possibly after a pericentric inversion of an ancestral telocentric chromosome. Finally, the genomic structure of this centromeric region could also be considered to support the hypothesis that centromeres were derived from telomeres during the evolution of the eukaryotic chromosome (34,35).
Supplementary Data are available at NAR Online.
Ministerio de Ciencia e Innovación [BFU2008-02947-C02-01/BMC to A.V.]; an institutional grant from the Fundación Ramón Areces to the Centro de Biología Molecular ‘Severo Ochoa’; the Molecular Genetics Fund established by W. Szybalski at the UW Foundation; and The Wellcome Trust. Funding for open access charge: Ministerio de Ciencia e Innovación (BFU2008-02947-C02-01).
Conflict of interest statement. None declared.
We would like to thank Drs William S. Reznikoff and I. Y. Goryshin for the kind gift of the hyperactive Tn5 transposase.