We traced the sequence evolution of the active lineage of LINE-1 (L1) retrotransposons over the last ∼25 Myr of human evolution. Five major families (L1PA5, L1PA4, L1PA3B, L1PA2, and L1PA1) of elements have succeeded each other as a single lineage. We found that part of the first open-reading frame (ORFI) had a higher rate of nonsynonymous (amino acid replacement) substitution than synonymous substitution during the evolution of the ancestral L1PA5 through the L1PA3B families. This segment encodes the coiled coil region of the protein-protein interaction domain of the ORFI protein (ORFIp). Statistical analysis of these changes indicates that positive selection had been acting on this region. In contrast, the coiled coil segment hardly changed during the evolution of the L1PA3B to the present L1PA1 family. Therefore, selective pressure on the coiled coil segment has changed over time. We suggest that the fast rate of amino acid replacement in the coiled coil segment reflects the adaptation of L1 either to a changing genomic environment or to host repression factors. In contrast, the second open-reading frame and the nucleic acid–binding domain of the first open-reading frame are extremely well conserved, attesting to the strong purifying selection acting on these regions.
L1 (LINE-1) non–long terminal repeat (LTR) retrotransposons (fig. 1A) constitute the major family of autonomously transposing elements in mammals (Smit 1999). The human genome contains ∼500,000 L1 elements that account for 17% of its mass (Lander et al. 2001), attesting to the profound effect that L1 replication has had on mammalian genomes. L1 retrotransposition generates mostly defective copies, which remain in the genome and accumulate mutations at the pseudogene rate (Voliva et al. 1983; Hardies et al. 1986; Pascale et al. 1993). Novel replication-competent L1 variants are also produced and can subsequently generate a family of several hundreds or thousands of copies that share the diagnostic features of their progenitor (or group of closely related progenitors) (reviewed in Furano 2000). This process is exemplified in humans, where five major families (L1PA5, L1PA4, L1PA3B, L1PA2, and L1PA1) have succeeded each other as a single lineage over the last 25 Myr in the primate ancestors of modern-day humans (fig. 1B) (Boissinot, Entezam, and Furano 2001).
About 70% of the full-length (potentially active) members of the ancestral L1PA5–L1PA2 families have apparently been cleared from the genome (Boissinot, Entezam, and Furano 2001). As this loss was more profound from recombining regions of the genome when compared with regions of low or no recombination (e.g., the Y chromosome), we suggested that the full-length elements have been deleterious enough to be subjected to purifying selection. This has not yet occurred for L1PA1, which is actively transposing in humans, generating new inserts that cause both neutral polymorphisms and genetic defects (reviewed in Kazazian and Moran 1998). Although these latter events would reduce the fitness of the host, we have proposed that a more global (dysgenic) effect of L1 activity accounted for the selection against the full-length L1 elements (Boissinot, Entezam, and Furano 2001).
Whatever the case, the fact that L1 activity is deleterious enough to be subject to purifying selection suggests that control of L1 transposition would be important for maintaining host fitness. Aside from evidence suggesting that the general transcriptional inhibition imposed by DNA methylation (Jones 1999) could also silence L1 transcription (Nur, Pascale, and Furano 1988; Thayer, Singer, and Fanning 1993; Hata and Sakaki 1997; Woodcock et al. 1997), neither the regulation of L1 replication nor any other possible role played by the host has been examined in detail. Host factors that specifically repress or reduce L1 activity would be highly advantageous. In turn, such factors would constitute a selective pressure on L1 to evade repression. Thus, L1 evolution may in part reflect interactions between the element and its host.
To identify regions of L1 that could be involved in host-L1 interactions, we examined the changes that occurred during the evolution of the active lineage of L1 elements from the ancestral L1PA5 family to the currently active L1PA1 family. In particular, we identified a region of the first open-reading frame (ORFI) that uniquely shows a high rate of nonsynonymous (amino acid replacement) substitutions, which is the typical signature of positive selection. The fact that this region of ORFI encodes a coiled coil domain that has been shown to mediate protein-protein interaction (Hohjoh and Singer 1996; Martin, Li, and Weisz 2000) suggests that the ORFI protein (ORFIp) could be involved in host-L1 interaction.
Materials and Methods
Full-length elements belonging to the five most recent human L1 families (L1PA5, L1PA4, L1PA3B, L1PA2, and L1PA1 following the nomenclature in Smit et al. 1995; Boissinot, Chevret, and Furano 2000; Boissinot, Entezam, and Furano 2001) were collected from the GenBank public database (table 1). These families encompass the last ∼25 Myr of L1 evolution in humans. We limited our analysis to these five families because they are less likely to be interrupted by more recent internal insertions than older families. The selected full-length elements were part of our previous work in which their classification was confirmed by phylogenetic analysis (Boissinot, Entezam, and Furano 2001). L1 elements differ from each other by the family-specific mutations that they inherit from their progenitors, and by the mutations that they have accumulated at the neutral rate since insertion in the genome. As we were interested in the evolution of the active lineage, we focused only on the family-specific differences. We eliminated the mutations that arose after insertion by deriving a consensus sequence for each family. Mutations in the highly mutable CpG dinucleotides were eliminated from the consensus except when they corresponded to fixed differences among families. Alignment and consensus were created using the GCG program (Wisconsin Package, Version 10.0, Genetics Computer Group, Madison, Wis.). The alignment is available from the EMBL-ALIGN database (accession number, ALIGN_000165).
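The consensus step described above can be illustrated with a minimal sketch. It assumes a simple majority rule per alignment column, which is not necessarily the rule applied by the GCG package, and it does not implement the special handling of CpG dinucleotides. The function name `majority_consensus` and the toy alignment are hypothetical and serve only to show how mutations private to individual copies drop out of the consensus.

```python
from collections import Counter


def majority_consensus(aligned):
    """Return the most frequent character in each column of an alignment.

    Assumption: plain majority rule; ties go to the character seen first.
    """
    if not aligned:
        raise ValueError("alignment is empty")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        # most_common(1) returns [(character, count)] for the top character
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Toy alignment (hypothetical data): five copies of one family, two of
# which carry a private mutation acquired after insertion.
copies = [
    "ACGTTGCA",
    "ACGTTGCA",
    "ACATTGCA",  # private G->A change in column 3
    "ACGTTGTA",  # private C->T change in column 7
    "ACGTTGCA",
]
print(majority_consensus(copies))
```

Because each post-insertion mutation is present in only one copy, it is outvoted at its column, and the consensus retains only the states shared by the family.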
Phylogenetic reconstructions were performed using the maximum-likelihood (ML) method as implemented in the PAML program package (Yang 2000). Variations in the rate of evolution along a sequence can result from either selection or recombination. We used the program PLATO 2.0 (Grassly and Holmes 1997) to identify regions showing such variation. First, an ML tree based on the entire coding region of L1 was calculated with PAML (Yang 2000). Using the ML tree as the null hypothesis, PLATO employs a sliding window through which deviations from the ML-generated branch lengths are calculated, thereby identifying regions that differ in evolutionary rate from the complete coding sequence.
ORFIp was analyzed for coiled coil domains using the program COILS (Lupas, Van Dyke, and Stock 1991) at http://www.ch.embnet.org/software/COILS_form.html. COILS compares a given sequence to a database of sequences that are known to form coiled coil structures and calculates the probability that the sequence of interest will adopt a coiled coil conformation.
Test for Selection
The effect of selection on a coding sequence can be estimated by comparing the synonymous (dS) and nonsynonymous (dN) substitution rates (for a review, see Yang and Bielawski 2000). The value of the ratio ω = dN/dS is an indicator of the type and strength of selection. If nonsynonymous mutations have no effect on fitness, they are fixed at the same rate as synonymous mutations and a value of ω = 1 is expected. If nonsynonymous mutations are deleterious, they are fixed at a lower rate than synonymous mutations (i.e., negative or purifying selection) and ω will be <1. If nonsynonymous mutations are advantageous, they are fixed faster than synonymous mutations (i.e., positive or adaptive selection) and ω will be >1. The parameter ω was estimated using the ML method of Goldman and Yang (1994). In this method, parameters of a model of codon substitution are estimated from the data by ML and are used to calculate dN and dS. To test whether dN is significantly different from dS, ω was fixed at 1 in the null model (i.e., neutrality), whereas ω was estimated as a free parameter in the alternative model (Yang 1998). Twice the log-likelihood difference between the two models is compared with a χ² distribution with one degree of freedom to test whether ω is different from 1. All these calculations were performed using the codeml program of the PAML package (Yang 2000).
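The likelihood-ratio test described above can be sketched in a few lines. The sketch assumes that the two log-likelihoods have already been obtained from an ML program. The function names (`lrt_omega`, `selection_regime`) and the numerical log-likelihoods are hypothetical and are not values from this study. For one degree of freedom, the upper tail of the χ² distribution equals erfc(√(x/2)), so only the standard library is needed.

```python
import math


def lrt_omega(lnl_null, lnl_alt):
    """Likelihood-ratio test of omega = 1 (null) versus omega free (alternative).

    Returns the test statistic 2 * (lnL_alt - lnL_null) and its P value
    under a chi-square distribution with one degree of freedom.
    """
    # Guard against tiny negative values caused by numerical noise.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Chi-square survival function for 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def selection_regime(omega, p_value, alpha=0.05):
    """Interpret an omega estimate given the outcome of the test."""
    if p_value >= alpha:
        return "not distinguishable from neutrality"
    return "positive selection" if omega > 1.0 else "purifying selection"


# Hypothetical log-likelihoods, for illustration only.
stat, p = lrt_omega(lnl_null=-520.0, lnl_alt=-515.0)
print(f"2*dlnL = {stat:.2f}, P = {p:.4f}")
print(selection_regime(omega=3.2, p_value=p))
```

A value of ω above 1 is therefore read as positive selection only when the test rejects ω = 1. Otherwise the data remain compatible with neutral evolution.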
Consensus sequences for the L1PA5 (13 elements), L1PA4 (13), L1PA3B (14), L1PA2 (22), and L1PA1 (14) families were derived from the elements listed in table 1. We used the Ta-1d subset of L1PA1 to derive the L1PA1 consensus because it is the most recently evolved version of L1PA1 (Boissinot, Chevret, and Furano 2000). The alignment (see Materials and Methods) contains 282 nt substitutions and seven indels (table 2). Figure 1B shows that the five families have succeeded each other and that a single lineage of L1 families best describes the evolution of the active lineage of L1.
Positive Selection on the Coiled Coil Domain of ORFI
The PLATO analysis identified a single region of ORFI that has undergone nucleotide substitutions at a statistically different rate than the complete coding sequence (Z = 9.65, P < 0.05, sliding window size = 5). Increasing the window size or modifying the parameters of the analysis identified the same approximate region and yielded similar Z values, all of which were statistically significant. This region starts 153 nt from the beginning of ORFI and is 217 nt long. It coincides almost perfectly with the coiled coil (C-C on fig. 1A) encoding region of the protein-protein interaction domain of ORFI. All but one of the 29 substitutions that differentiate L1PA5 from L1PA3B in this region are nonsynonymous (fig. 1C and table 2). This fast rate of amino acid replacement indicates that this region is evolving either under relaxed selection (i.e., neutrally) or under positive selection.
We distinguished between these possibilities by analyzing the nonsynonymous to synonymous rate ratio (ω, see Materials and Methods). The parameter ω was calculated independently for the coiled coil encoding region (from codon 51 to 148) and for the entire coding sequence (ORFI and ORFII), excluding the coiled coil domain (table 3). Because some of the pairwise comparisons in the coiled coil domain include very few if any synonymous substitutions, the parameter κ (=transition-transversion rate ratio for synonymous substitutions) was first estimated by ML from the complete coding sequence. Values of κ for the entire L1 range from 1.8 to 3.1, and several values within this range were incorporated into the ML model to calculate ω. These analyses gave congruent results (only values with κ = 2.5 are shown in table 3). Pairwise comparisons among L1PA5, L1PA4, and L1PA3B give values for ω significantly higher than 1, indicating that nonsynonymous mutations have been fixed at a faster rate than synonymous mutations (i.e., faster than if they had been neutral). Thus, by these criteria, the coiled coil domain of ORFI has evolved under positive selection during the evolution of L1PA5 to L1PA3B. Although higher than 1, values of ω are lower when L1PA5 or L1PA4 are compared with L1PA2 or L1PA1 because of purifying selection acting during the evolution of L1PA3B to L1PA1 (see later). Purifying selection increases the relative number of synonymous substitutions between the older (L1PA5 or L1PA4) families and the younger (L1PA2 or L1PA1) families. As the number of nonsynonymous mutations between the older and younger families has hardly changed, the value of ω derived from these comparisons is lower.
In contrast to the fast rate of amino acid replacement from L1PA5 to L1PA3B, the coiled coil domain has remained almost unchanged from L1PA3B to L1PA1. Indeed, the coiled coil domains of L1PA2 and L1PA1 are identical although they differ in other parts of the sequence (fig. 1C and table 3). The ratio, ω, is lower than 1 although not significantly so. It is difficult to determine whether the conservation of recent L1 families is due to purifying selection, the absence of selection, or recombination that would have homogenized the sequence of different families. In any case, it seems that positive selection has not played a significant role in the evolution of the L1 families (i.e., L1PA2 and L1PA1) derived from the L1PA3B family. The alternation of a high (L1PA5 to L1PA3B) and a low (L1PA3B to L1PA1) amino acid replacement rate indicates that the nature of the selective pressure acting on the coiled coil domain has changed over time.
By comparison, the amino acid sequence outside the coiled coil domain has always been highly conserved (table 3, above diagonal); in all comparisons ω is significantly lower than 1. This low rate of amino acid replacement indicates that strong purifying selection has been acting on most regions of the L1 proteins. In ORFII, sequence conservation is not limited to the endonuclease (EN) and reverse transcriptase (RT) encoding domains (fig. 1C). The segments that separate EN, RT, and the 3′ terminal region of ORFII are also highly conserved, suggesting that these regions are functionally important because they either encode some yet to be described function or play a role in the conformation of the ORFII protein.
Pattern of Amino Acid Replacements in the Coiled Coil Domain
Coiled coil structures are formed by the intertwining of two or more α-helical peptide chains that have a repeating arrangement of nonpolar side chains (reviewed in Lupas 1996). Typically, domains that can form coiled coil structures consist of seven-residue repeats (heptads), with nonpolar or hydrophobic residues in the first (a) and fourth (d) positions of the heptad (fig. 2). The coiled coil domain of ORFIp ranges from amino acid 52 to 131 and consists of a first group of four or five heptads (depending on the family) separated from a group of six heptads by three amino acids (fig. 2). The COILS program indicates a 90%–100% probability that these heptads will adopt a coiled coil conformation (Lupas, Van Dyke, and Stock 1991).
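The heptad arrangement described above can be made concrete with a small sketch that labels each residue with its heptad position (a to g) and reports the fraction of (a) and (d) positions occupied by nonpolar residues. This is an illustration only and is not the scoring method used by COILS. The nonpolar residue set, the function names, and the toy peptide are assumptions and are not taken from the ORFIp sequence.

```python
HEPTAD = "abcdefg"
# Assumption: a common textbook grouping of nonpolar amino acids.
NONPOLAR = set("AGVLIMFWP")


def heptad_register(seq, offset=0):
    """Pair each residue with its heptad position.

    `offset` is the index into "abcdefg" of the first residue
    (0 means the first residue sits at position a).
    """
    return [(aa, HEPTAD[(i + offset) % 7]) for i, aa in enumerate(seq)]


def core_nonpolar_fraction(seq, offset=0):
    """Fraction of a and d positions that carry a nonpolar residue."""
    core = [aa for aa, pos in heptad_register(seq, offset) if pos in "ad"]
    if not core:
        return 0.0
    return sum(aa in NONPOLAR for aa in core) / len(core)


# Toy peptide (hypothetical): three identical heptads with L at a and d.
peptide = "LKELEDK" * 3
print(heptad_register(peptide)[:7])
print(core_nonpolar_fraction(peptide))            # register starting at a
print(core_nonpolar_fraction(peptide, offset=1))  # register shifted by one
```

Shifting the register by one residue places charged residues at (a) and (d) in this toy peptide, which shows why the predicted conformation depends on where the nonpolar residues fall within the heptad.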
Because the probability that a sequence will form a coiled coil structure depends on the position of hydrophobic or nonpolar residues, change in the polarity of an amino acid can affect the conformation of the protein. Figure 2 shows the distribution of the amino acid replacements that occurred between L1PA5 and L1PA1. Conservative changes (indicated by a + in fig. 2) are substitutions between two polar or two nonpolar amino acids (following the classification in Li 1997, pp. 13–17). The replacement of nonpolar amino acids by C or Y (usually considered polar) at position (a) or (d) of the heptad is also considered conservative because these two amino acids are found preferentially at positions (a) or (d) of known coiled coils (Lupas, Van Dyke, and Stock 1991). Out of the 28 amino acid replacements that differentiate L1PA5 from L1PA1, 26 are conservative and therefore not likely to affect the potential coiled coil conformation of the protein (fig. 2). Only the two amino acid substitutions, V83T and M96K, are not conservative with regard to polarity. Substitution M96K is at position (g) of heptad VI, which is not as critical to the coiled coil conformation as positions (a) and (d). On the other hand, V83T is a radical change at position (d) of heptad V and this change does disrupt heptad V in L1PA1 (fig. 3). Indeed, T is very rarely found at position (d) of any known coiled coil (Lupas, Van Dyke, and Stock 1991) and substitution V83T could affect the coiled coil conformation of the protein. However, probabilities of forming coiled coil (as calculated by COILS, with a window of 21 or 28) are not significantly affected by any of the 28 amino acid changes, including these two radical changes.
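The classification rule used above (a replacement is conservative when it preserves polarity, or when C or Y replaces a nonpolar residue at heptad position a or d) can be sketched as follows. The nonpolar residue set is an assumed textbook grouping and may differ in detail from the classification in Li (1997). The function names are hypothetical. The two substitutions named in the text are used as examples, together with two invented ones.

```python
import re

# Assumption: a common grouping of nonpolar amino acids; all others are
# treated as polar (including C and Y).
NONPOLAR = set("AGVLIMFWP")


def parse_substitution(label):
    """Split a label such as 'V83T' into (old residue, position, new residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"cannot parse substitution label: {label!r}")
    old, position, new = match.groups()
    return old, int(position), new


def is_conservative(label, heptad_pos=None):
    """Apply the polarity rule, with the C/Y exception at positions a and d."""
    old, _position, new = parse_substitution(label)
    if (old in NONPOLAR) == (new in NONPOLAR):
        return True  # both nonpolar or both polar
    if old in NONPOLAR and new in "CY" and heptad_pos in ("a", "d"):
        return True  # C and Y are tolerated at core positions
    return False


examples = [
    ("V83T", "d"),  # named in the text: radical, at a core position
    ("M96K", "g"),  # named in the text: radical, outside the core
    ("L60Y", "a"),  # hypothetical: nonpolar to Y at position a
    ("I70V", "d"),  # hypothetical: nonpolar to nonpolar
]
for label, pos in examples:
    kind = "conservative" if is_conservative(label, pos) else "radical"
    print(f"{label} at position ({pos}): {kind}")
```

Under these assumptions, V83T and M96K are the only radical changes among the four examples, in agreement with their description in the text.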
Our results demonstrate that the coiled coil domain of ORFI has been subjected to an episode of intense positive selection (L1PA5 → L1PA4 → L1PA3B) resulting in a high rate of amino acid replacement. Since then (L1PA3B → L1PA2 → L1PA1), the coiled coil domain has been highly conserved. Thus, either the strength or nature of the selective pressure on this region has changed over time. As the major amplification of L1PA2 occurred only in the African apes (subfamily Homininae, i.e., gorillas, chimpanzees, and humans; unpublished data), the change in selective pressure on ORFI occurred after the divergence between Asian (e.g., orangutan) and African apes. Interestingly, the same region of ORFI is hypervariable in murine rodents and galagos (prosimians) as indicated by a high rate of amino acid replacement, duplications, and deletions (Kolosha and Martin 1995; Cabot et al. 1997; Mayorov, Rogozin, and Adkison 1999; Furano 2000). However, whether these changes are the result of positive selection is yet to be determined (Mayorov, Rogozin, and Adkison 1999; Furano 2000).
Because many aspects of L1 biology are still unknown, we can only speculate about the possible causes of positive selection. As the 5′UTR (untranslated region) of L1 evolves at a very high rate, including its wholesale replacement (Adey et al. 1994; Furano 2000), adaptive changes in ORFI could be a response to changes in the 5′UTR. Although the number of base changes between the 5′UTR of L1PA4 and L1PA3B and between L1PA2 and L1PA1 are roughly the same, ORFI has evolved under positive selection between L1PA4 and L1PA3B but remained very conserved between L1PA2 and L1PA1. Thus, positive selection in ORFI is not correlated with the global rate of base substitution in the 5′UTR. Although changes in the amino acid sequence of ORFI may be a response to particular sequence changes in the 5′UTR, the selective pressure on ORFI may also lie elsewhere.
Most of the genes for which positive selection has been documented are involved in interactions between the organism and its environment (see Yang and Bielawski 2000). By analogy, we propose that positive selection in L1 may reflect an interaction between the L1 element (the organism) and the host (its environment). For instance, the rapid evolution of the coiled coil domain could have been driven by L1 adaptation to a host factor required by L1 for replication. Rapid evolution of the putative host factor might have occurred for a number of reasons, including avoidance of recruitment by L1. Alternatively, rapid evolution of the coiled coil domain could have resulted from the evasion by L1 of a host-encoded repressor of L1 replication. This would be similar to positive selection in pathogenic genes that evade a host's immune system (Zanotto et al. 1999; Haydon et al. 2001).
In both cases, the alternation between periods of positive and purifying selection on ORFI can be correlated with changes in L1 activity. In rodents (Pascale, Valle, and Furano 1990) and primates (unpublished data), L1 activity (amplification) is episodic, and therefore its deleterious effect on the host changes over time. Possibly, very active (deleterious) families would induce a strong response by the host, leading to intense positive selection for both the host and the element. Conversely, families that generate just enough copies to persist in the genome, but not enough to cause serious damage, would probably be ignored by the host, and the action of positive selection would be very limited.
Figure 2 shows that positive selection in ORFI resulted in substitutions among amino acids that share similar physicochemical properties. Therefore, the effects of positive selection on the coiled coil domain have been limited by structural constraints, i.e., the ability to form a coiled coil structure. This suggests that the potential to form a coiled coil structure is an important functional feature of ORFIp. This conclusion is supported by the fact that, although the N-terminal one-third of ORFI shows no sequence homology among murine rodents (old world rats and mice), rabbits, galagos, and humans (Kolosha and Martin 1997), all possess the potential to form coiled coil structures (data not shown; Martin, Li, and Weisz 2000). The ability of ORFIp to form a coiled coil structure is also shared by nonmammalian L1-like elements, like the Xenopus Tx1L (cited in Pont-Kingdon et al. 1997), the teleost Swimmer (Duvernell and Turner 1998), and the bird CR1 elements (Haas et al. 1997, unpublished data). Coiled coils often mediate protein-protein interactions with themselves or other proteins. In mouse and human L1, the coiled coil domain mediates ORFIp binding to itself (Hohjoh and Singer 1996; Martin, Li, and Weisz 2000), but the possibility of interactions with other proteins has not been explored. The ORFIps of two divergent mouse L1 families (Tf and L1MdA) readily interact (Martin, Li, and Weisz 2000), suggesting that conservative changes in the coiled coil domain would not significantly affect ORFIp interaction with itself. Thus, interactions between ORFIp and other proteins could well be responsible for positive selection on ORFI.
Thomas Eickbush, Reviewing Editor
Keywords: L1/LINE-1 human retrotransposon positive selection
Address for correspondence and reprints: Anthony V. Furano, NIH, Building 8, Room 203, 8 Center DR MSC 0830, Bethesda, Maryland 20892-0830.