Burge, Campbell, and Karlin (1992) observed that the relative frequencies of di- and trinucleotides characterize a genome, independent of its base composition and of the coding or noncoding capacity of the regions analyzed. Species thus differ with regard to this genomic signature, which is constant within a given genome and similar between related species (Gentles and Karlin 2001). The variation in the relative abundance of dinucleotides is interpreted as reflecting differences between species in the cellular machinery for replication and repair, which may select specific dinucleotides in the sequence (Campbell, Mrázek, and Karlin 1999). A tendency toward the suppression of CG is often observed and is interpreted as resulting from methylation activity (Bird 1986). The dinucleotide pattern of the mitochondrial genome has also been shown to differ from that of the nuclear genome, and the proposed explanation is that nuclear and mitochondrial genomes use independent DNA polymerase machinery and different methods of replication (Campbell, Mrázek, and Karlin 1999). We therefore wanted to find out whether transposable elements (TEs), which have been shown to have a greater AT content than their host genes in various species (Shields and Sharp 1989; Lerat, Capy, and Biémont 2002), have the same dinucleotide pattern as their host.
TEs are repeated sequences that are able to move from one position to another along chromosomes. They were first discovered in maize by Barbara McClintock (1984) in the 1950s and seem to exist in all living organisms. They are divided into two main classes, according to the transposition intermediate they use (Capy et al. 1997, pp. 1–197). Class I consists of retrotransposons, which use an RNA intermediate and are subdivided into two subclasses, LTR retrotransposons and non-LTR retrotransposons, according to whether or not they have long terminal repeats (LTRs) at their extremities. Class II consists of transposons, which use a DNA intermediate for transposition and code for a transposase. A third class consists of foldback elements and miniature inverted-repeat transposable elements (MITEs), the transposition mechanism of which has not yet been elucidated.
The complete genomes of Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster, chromosomes 2 and 4 of Arabidopsis thaliana, and chromosomes 21 and 22 of Homo sapiens were downloaded from the Genome On Line Database site (wit.integratedgenomics.com/GOLD/) (Kyrpides 1999). Entire sequences of transposons, LTR retrotransposons, non-LTR retrotransposons, and class-III elements from C. elegans, D. melanogaster, H. sapiens, and A. thaliana were downloaded from GenBank. Other Arabidopsis TEs were obtained from the Arabidopsis transposable element database (soave.biol.mcgill.ca/clonebase/main.html). The positions of TEs in the sequenced genome of Saccharomyces were obtained from the transposable element resources site (www.public.iastate.edu/∼voytas/resources/resources.html). The resulting TE data set consisted of 40 sequences from D. melanogaster, 50 from S. cerevisiae, 19 from C. elegans, 25 from H. sapiens, and 31 from A. thaliana. The TE sequences for each species were concatenated. Of the 25 TE sequences from H. sapiens, 10 were retroviruses (HERV-K, HERV-K-T47D, HERV-K101, HERV-KC4, HIV1, HIV2, HTLV1, HTLV2, HSRV, and v-oncogene), which are class-I elements and can be considered to belong to the LTR retrotransposon family.
We used the indices defined by Burge, Campbell, and Karlin (1992). For a dinucleotide XY, the index ρXY = fXY/(fX fY) was computed for each sequence, where fX and fY are the frequencies of bases X and Y, respectively, and fXY is the frequency of the dinucleotide XY. When the coding sequences of TEs and genes were used, the indices were calculated from the single strand only. For complete sequences, we took into account the antiparallel and complementary structure of double-stranded DNA (Burge, Campbell, and Karlin 1992). We thus computed f*A = f*T = (fA + fT)/2 for base A and its complementary T in the double-stranded sequence and f*G = f*C = (fG + fC)/2 for base G and its complementary C. Dinucleotide frequencies were symmetrized in the same way; for example, the frequency of the GT dinucleotide, whose reverse complement is AC, was computed as f*GT = (fGT + fAC)/2. The indices ρ*XY = f*XY/(f*X f*Y) were then estimated. According to Karlin and Burge (1995), the XY dinucleotide was considered to be underrepresented if ρ*XY ≤ 0.78 and overrepresented if ρ*XY ≥ 1.23.
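As an illustration of these indices, the following Python sketch computes the symmetrized values ρ*XY by pooling counts from a sequence and its reverse complement, which is equivalent to the averaging of frequencies described above. The function names (`rho_star`, `classify`) are hypothetical and are not taken from the original analysis; skipping positions that contain ambiguous bases is likewise an assumption.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def rho_star(seq):
    """Strand-symmetrized dinucleotide relative abundances rho*_XY.

    Counting on both strands gives f*_A = f*_T = (f_A + f_T)/2 and,
    for example, f*_GT = f*_AC = (f_GT + f_AC)/2.
    """
    seq = seq.upper()
    base_counts = dict.fromkeys(BASES, 0)
    pair_counts = {x + y: 0 for x, y in product(BASES, repeat=2)}
    for strand in (seq, reverse_complement(seq)):
        for i, base in enumerate(strand):
            if base in base_counts:
                base_counts[base] += 1
            pair = strand[i:i + 2]
            if pair in pair_counts:  # skips the last position and ambiguous bases
                pair_counts[pair] += 1
    n_bases = sum(base_counts.values())
    n_pairs = sum(pair_counts.values())
    if n_pairs == 0:
        raise ValueError("sequence contains no unambiguous dinucleotide")
    freq = {b: c / n_bases for b, c in base_counts.items()}
    # Dinucleotides whose constituent bases are absent are left out (undefined ratio).
    return {
        xy: (c / n_pairs) / (freq[xy[0]] * freq[xy[1]])
        for xy, c in pair_counts.items()
        if freq[xy[0]] > 0 and freq[xy[1]] > 0
    }


def classify(rho, low=0.78, high=1.23):
    """Apply the under/overrepresentation thresholds quoted in the text."""
    if rho <= low:
        return "under"
    if rho >= high:
        return "over"
    return "normal"
```

By construction, a dinucleotide and its reverse complement receive the same value (for instance GT and AC), so only 10 of the 16 indices are distinct for double-stranded sequences.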
The relative distance between two sequences, f and g, was calculated as the mean absolute difference between their ρ*ij indices over the 16 dinucleotides ij: δ*(f,g) = (1/16)Σij |ρ*ij(f) − ρ*ij(g)| (Karlin and Ladunga 1994; Karlin and Mrázek 1997). Relative distances were computed between the genomic sequences and the concatenated TEs for all species, between the fragments of genomic sequences and complete TEs for all species, and between the host genes and coding parts of TEs for each species separately. The distance matrix obtained was analyzed using a principal coordinates analysis, a multivariate method that transforms a distance matrix into a Euclidean representation before extracting the principal components (Gower 1966). This analysis makes it possible to visualize neighboring sequences in terms of their relative abundance of dinucleotides. These analyses were done using the ADE-4 package (Thioulouse et al. 1997).
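The distance δ* and the first step of a principal coordinates analysis can be sketched as follows. This is a minimal illustration in plain Python, not the ADE-4 procedure used for the analyses: the function names are hypothetical, profiles are assumed to be dictionaries holding all 16 ρ* values, and only Gower's double centering of the squared distances is shown. The principal axes would then be obtained from the eigenvectors of the centered matrix with a linear algebra library.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def delta_star(rho_f, rho_g):
    """delta*(f, g) = (1/16) * sum over ij of |rho*_ij(f) - rho*_ij(g)|."""
    return sum(abs(rho_f[d] - rho_g[d]) for d in DINUCLEOTIDES) / 16


def distance_matrix(profiles):
    """Pairwise delta* distances; profiles maps a sequence name to its rho* dict."""
    names = sorted(profiles)
    matrix = [[delta_star(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix


def gower_center(dist):
    """Double-center -1/2 * d^2, the matrix whose eigenvectors give the PCoA axes."""
    n = len(dist)
    sq = [[-0.5 * d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [sq[i][j] - row_mean[i] - col_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]
```

After centering, every row and column of the resulting matrix sums to zero, which provides a quick sanity check before the eigendecomposition.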
The relative abundances of dinucleotides in TE and genomic sequences were calculated for the five species listed previously (detailed data available upon request). Whatever the species, the dinucleotide TA appeared to be underrepresented in both genomes and TEs, except in the yeast retrotransposons. The dinucleotide CG was underrepresented in both genomes and TEs in A. thaliana and H. sapiens and in the LTR retrotransposons Ty1, Ty4, and Ty5 in Saccharomyces. In the Caenorhabditis and Drosophila genomes, AA/TT was overrepresented. For a given species, the TE and genomic sequences displayed the same global pattern of relative dinucleotide abundance, as revealed by the positive correlation coefficients for the relative abundance of dinucleotides between TEs and host genomes (r = 0.98, P < 0.05 for Arabidopsis; r = 0.93, P < 0.05 for Caenorhabditis; r = 0.94, P < 0.05 for Drosophila; r = 0.87, P < 0.05 for H. sapiens). For Saccharomyces, the coefficient of correlation between the genome and TEs was not significantly different from zero (r = 0.54, P = 0.40).
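A correlation of this kind can be sketched as a Pearson coefficient between the TE and genome profiles. The text does not state which coefficient or which set of dinucleotides was used, so both choices below are assumptions: the sketch uses Pearson's r and, by default, the 10 dinucleotide classes that remain distinct once the two strands are pooled. The function names are hypothetical.

```python
from math import sqrt

# One representative per reverse-complement class (e.g. AA stands for AA/TT).
DISTINCT_CLASSES = ["AA", "AC", "AG", "AT", "CA", "CC", "CG", "GA", "GC", "TA"]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def profile_correlation(rho_te, rho_genome, dinucleotides=DISTINCT_CLASSES):
    """Correlate two rho* profiles (dicts keyed by dinucleotide)."""
    xs = [rho_te[d] for d in dinucleotides]
    ys = [rho_genome[d] for d in dinucleotides]
    return pearson_r(xs, ys)
```

Restricting the comparison to distinct classes avoids counting each reverse-complement pair twice, which would otherwise inflate the apparent sample size of the test.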
To check for a codon signature in coding regions, we calculated the relative abundance of dinucleotides according to their position in codons along the single-stranded DNA (data available upon request). The strong positive correlation detected at position 1–2 of codons between genes and TEs for each species (r = 0.93, P < 0.05 for Arabidopsis; r = 0.90, P < 0.05 for Caenorhabditis; r = 0.70, P < 0.05 for Drosophila; r = 0.77, P < 0.05 for human; r = 0.91, P < 0.05 for Saccharomyces) suggests that there were only a few differences between TE and gene sequences in the patterns of relative dinucleotide abundance. The correlation was also positive at position 2–3 for Arabidopsis (r = 0.88, P < 0.05), for Caenorhabditis (r = 0.64, P < 0.05), for human (r = 0.80, P < 0.05), and for Saccharomyces (r = 0.64, P < 0.05) but was not statistically different from zero in D. melanogaster (r = 0.17, P = 0.40). At position 3–1, the relative abundance of dinucleotides showed no significant correlation between genes and TEs in D. melanogaster and S. cerevisiae (r = 0.40, P = 0.40 and r = 0.50, P = 0.40, respectively), in contrast to the other species (r = 0.87, P < 0.05 for Arabidopsis; r = 0.77, P < 0.05 for Caenorhabditis; r = 0.90, P < 0.05 for human). The dinucleotide TA was strongly underrepresented at all positions in both genes and TEs in all the species, except Saccharomyces, where TA was underrepresented only at position 1–2 of the codons. TT and TC were strongly overrepresented, and CG and GT were underrepresented at position 1–2 in all the data sets. The TG and CA dinucleotides were well represented at positions 2–3 and 3–1: ρTG and ρCA were often greater than 1 and sometimes reached values indicative of overrepresentation (ρ > 1.23).
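The position-specific calculation can be sketched as below for a single in-frame coding sequence. The text does not specify the normalization, so the sketch assumes that each dinucleotide frequency is divided by the base frequencies observed at the two codon positions involved; using overall base frequencies instead would be an equally plausible reading. Position 3–1 pairs the third base of a codon with the first base of the next codon. The function name is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def codon_position_rho(cds):
    """Dinucleotide relative abundances at codon positions 1-2, 2-3, and 3-1.

    cds is assumed to be an in-frame coding sequence read on the coding strand.
    """
    cds = cds.upper()
    cds = cds[:len(cds) - len(cds) % 3]  # drop an incomplete trailing codon
    result = {}
    for label, offset in (("1-2", 0), ("2-3", 1), ("3-1", 2)):
        pairs = [cds[i:i + 2] for i in range(offset, len(cds) - 1, 3)]
        pairs = [p for p in pairs if p[0] in BASES and p[1] in BASES]
        n = len(pairs)
        pair_counts = Counter(pairs)
        first = Counter(p[0] for p in pairs)   # base counts at the first position
        second = Counter(p[1] for p in pairs)  # base counts at the second position
        result[label] = {
            x + y: (pair_counts[x + y] / n) / ((first[x] / n) * (second[y] / n))
            for x, y in product(BASES, repeat=2)
            if first[x] > 0 and second[y] > 0
        }
    return result
```

For a gene set, the counts would be pooled over all coding sequences before taking the ratios rather than averaged gene by gene; that choice is also left open by the text.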
Figure 1 shows the projection of TEs and genomes onto the plane defined by the first two axes of a principal coordinates analysis of the matrix of distances between the dinucleotide relative abundance indices of genomic and TE sequences. TE and genomic sequences from the same species were close, except for Saccharomyces, for which TE and genomic sequences showed no correlation in dinucleotide relative abundance. In this analysis, TE sequences were compared with genomic sequences that were themselves likely to include TEs; we therefore carried out a more detailed principal coordinates analysis on complete TE sequences and on TE-free genomic fragments. To do this, genomic sequences were broken down into fragments of 9,000 bp, which is roughly the mean length of the complete TEs. For each species, 100 fragments were randomly selected, and a BLASTN analysis (Altschul et al. 1997) was done to compare the genomic fragments with the TE sequences, allowing us to eliminate the fragments that included TEs. In this way, we obtained a total of 459 TE-free genomic fragments and 165 complete TE sequences for the five species. The distances between the indices of relative dinucleotide abundance were then computed. The relative abundances of dinucleotides in the genomic fragments were nearly the same as the values obtained for the overall genomic sequences. With the exception of Saccharomyces, TE sequences and genomic fragments from a given species were found to be clustered (figure available upon request).
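The fragment sampling step can be sketched as follows. The sketch assumes non-overlapping 9,000-bp windows and a fixed random seed, neither of which is stated in the text, and it represents the BLASTN screening by a caller-supplied predicate `contains_te`, a hypothetical placeholder for whatever similarity search is actually run.

```python
import random


def sample_te_free_fragments(genome, contains_te, size=9000, n=100, seed=0):
    """Draw up to n non-overlapping fragments and keep those without TE hits.

    contains_te is a stand-in for the BLASTN comparison against TE sequences:
    it takes a fragment and returns True if the fragment should be discarded.
    """
    starts = list(range(0, len(genome) - size + 1, size))
    rng = random.Random(seed)  # seeded so that the selection is reproducible
    chosen = rng.sample(starts, min(n, len(starts)))
    fragments = [genome[s:s + size] for s in chosen]
    return [frag for frag in fragments if not contains_te(frag)]
```

Because the screening is applied after the random draw, fewer than 100 fragments per species can remain, which is consistent with the total of 459 fragments retained for the five species.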
Figure 2 shows the plot of the dinucleotide relative abundance distances between genes and coding parts of TEs for each species separately. Coding regions of the TEs and host genes appeared to be located together in Caenorhabditis and Arabidopsis. In H. sapiens, some of the TEs were located with the host genes, whereas the rest, corresponding to retrovirus sequences, formed a distinct group. In Drosophila and Saccharomyces, the TEs were not located with host genes. In Drosophila, the TEs furthest from the host genes corresponded to LTR retrotransposons with an env gene, i.e., retrovirus-like elements (Tirant, 297, ZAM and, to a lesser extent, 17.6, gypsy, idefix, and nomad).
In the five species analyzed, A. thaliana, C. elegans, S. cerevisiae, D. melanogaster, and H. sapiens, TEs appear to display a pattern of relative dinucleotide abundance similar to that of their host genome. In all our analyses, we found that the TA dinucleotide was underrepresented in both genomes and TEs. Such underrepresentation of TA, which seems to be a general feature, is attributed to (1) the avoidance of inappropriate termination codons TAA or TAG in coding sequences, (2) selection for mRNA stability by avoiding UpA, which is susceptible to RNase activity (Beutler et al. 1989), or (3) the avoidance of having too many transcription signals (Burge, Campbell, and Karlin 1992). We also observed CG suppression in both genomes and TEs in Arabidopsis and human. Such global CG suppression is believed to reduce the stacking energies of DNA, thus facilitating replication and transcription (Karlin and Burge 1995). The fact that no CG suppression was observed in C. elegans, S. cerevisiae, and D. melanogaster suggests, however, that this explanation is far from universally applicable. We show here that CG suppression, which has already been reported in small eukaryotic viruses (Karlin, Doerfler, and Cardon 1994), also exists in the elements Ty1, Ty4, and Ty5 of Saccharomyces, in many LTR retrotransposons of Arabidopsis, and in all the LTR retrotransposons of H. sapiens. In Drosophila, however, LTR retrotransposons with an env gene do not exhibit this underrepresentation of CG. The combination of these findings suggests that CG suppression does not affect all kinds of transposable elements and is not related to the size of the TE sequence.
Multivariate analysis showed that the retroviruses of H. sapiens and the LTR retrotransposons with env genes of Drosophila were very distant from their host genes. This specific grouping of the coding parts of retrovirus-like elements and of retroviruses relative to the host genes was not found when entire sequences were used, suggesting that there are differences in the transcription mechanisms for the coding parts of these elements. The coding parts of HERV (human endogenous retrovirus) were also located with the other retroviruses, although such endogenous retroviruses are not infectious because of deletions or the presence of stop codons in their coding parts (Bock and Stoye 2000; Tristen 2000). It has been shown, however, that the HERV-K element can theoretically be trans-complemented and then become infectious (Bock and Stoye 2000). If the large dinucleotide relative abundance distances observed between host genes and the genes of retroviruses and some LTR retrotransposons are an indication of infectivity, then we can expect the Drosophila elements 297, Tirant, 17.6, and idefix to be infectious or to have been infectious in the recent past. Infectious capacity has been clearly demonstrated for gypsy (Kim et al. 1994), but the other five elements are only suspected of being retroviruses (Dessat et al. 1999; Canizares et al. 2000). Experimental evidence is therefore required to test the theoretical expectation of the present analysis.
Wolfgang Stephan, Reviewing Editor
Keywords: transposable elements retrovirus dinucleotide abundance
Address for correspondence and reprints: Christian Biémont, Laboratoire de Biométrie et Biologie Évolutive, UMR CNRS 5558, Université Lyon 1, 69622 Villeurbanne Cedex, France. firstname.lastname@example.org .