We report the 897 kb sequence of a cluster of olfactory receptor (OR) genes located at the distal end of the major histocompatibility complex (MHC) class I region on mouse chromosome 17 of strain 129/SvJ (H2bc). With additional information from the mouse genome draft sequence, we identified 59 OR loci (∼20% pseudogenes) in contrast to only 25 OR loci (∼50% pseudogenes) in the corresponding centromeric OR cluster that is part of the ‘extended MHC class I region’ on human chromosome 6. Comparative analysis leads to three major observations: (i) most of the OR subfamilies have evolved independently in the two species, expanding more in the mouse, and resulting in co-orthologs, i.e. subfamilies of highly similar paralogs that keep orthologous relationships with their human counterparts; (ii) three of the mouse OR subfamilies have no orthologs in humans; and (iii) MHC class I loci are interspersed in the OR cluster in mouse but not in human, and were subjected to co-duplication with OR genes. Screening of our sequence against the available sequences of other strains/haplotypes revealed that most of the OR loci are polymorphic and that the number of OR loci may vary among strains/haplotypes. Our findings that MHC-linked OR loci share duplication with MHC class I loci, have duplicated extensively and are polymorphic revive questions about potential reciprocal influences acting on the dynamics and evolution of the H2 region and the H2-linked OR loci.
The major histocompatibility complex (MHC) is known for the crucial roles it plays in immunity, and it is one of the most studied regions of the genome, in human and in many other vertebrate species (1). The MHC is also known for the extensive polymorphism of the genes encoding the histocompatibility antigens, the MHC class I and class II molecules. These molecules play a central role in immune recognition by presenting peptides to T lymphocytes. Hughes and Nei (2,3) showed that, while a conservative (negative) pressure is exerted over the backbone of the molecule, the codons that encode the peptide-binding region (PBR) are subject to positive selection increasing polymorphism. Many hypotheses have since been put forward to translate this observation into population genetics and natural selection. Pathogen-driven selection, i.e. the generation of positive selection pressure through resistance to infection, could result in the preferential selection of heterozygotes over homozygotes (heterozygote-advantage or overdominance) or in the preferential selection of relatively rare haplotypes (negative frequency-dependent selection) or, more likely, in some combination of the two (4–7). Furthermore, disassortative mating preference has been suggested to maintain MHC diversity (8). This theory is based on observations that mice preferentially mate with a partner of dissimilar H2 haplotype (9,10). The detection of this dissimilarity appears to be mediated by olfaction through the ability of mice to distinguish urine from conspecifics of different MHC type (11,12). Attempts to test the mating preference hypothesis in humans (13–15) produced controversial results (16). Another form of sexual selection observed is that of selective abortion (17,18). For instance, abortion is triggered in female mice if they are exposed to an odortype dissimilar from that of the sire (19).
The traditional definition of the MHC and its partition into three regions, class II, class III and class I from centromere to telomere (Fig. 1), was extended to the flanking regions that contain genes in tight linkage with the MHC, some of them but not all involved in immunity. In the human genome, the classical MHC region ends at the HLA-F gene. The 800 kb region distal to HLA-F, which extends up to the end of the conserved synteny between human Chr 6p21.3 and mouse Chr 17, belongs to the ‘extended MHC class I region’. The evolutionary history of the MHC is not uniform: the class II and class III regions appear quite stable, with the strict exception of the MHC class II and C4 genes, which were subjected to species-specific local duplications, while the class I region is especially plastic among and within species (reviewed in 20). This reflects the diversification of the MHC class I genes by rounds of duplication and deletion, leading to the loss of orthologous relationships between mammalian orders (e.g. human and mouse) (21–24) and to differences in the number of MHC class I genes even between populations and strains (25–28). Non-MHC class I genes, orthologous between human and mouse, were identified among the MHC class I sequences (29), and the MHC class I region as a whole was eventually revealed as stable too, the plasticity being strictly limited to the MHC class I genes. A model suggesting that MHC class I expansions occur within a framework of conserved non-MHC class I genes was therefore proposed (30).
Since their discovery by Buck and Axel in 1991 (31), olfactory receptor (OR) genes have been the subject of many studies. In addition to offering ‘a molecular basis for odor recognition’, OR loci became a model gene family for studying genome dynamics and evolution. They are the largest superfamily of genes in mammals, mainly organized in clusters disseminated over all the chromosomes (32–34). This distribution probably reflects macro-evolution of the genome, such as segmental duplications, and also suggests active duplications. OR genes were an obvious target for whole genome analyses after the release of the draft sequences of the human and mouse genomes. Independent studies offer different approaches and interesting views of the ‘olfactory subgenome’ (32–37).
The finding of OR genes right next to the MHC in both humans and mice (38–40) revived the interest in the ‘mating preference’ hypothesis, with suggestions that the MHC-linked OR genes could be specifically involved in the detection of MHC diversity, thereby contributing to mate choice. The conservation of MHC-linked OR genes in mouse also raised a number of other questions, including:
Are the OR genes members of the ‘framework’, i.e. stable and conserved, or rather like the MHC class I genes, plastic and diversified?
Do the OR genes exhibit polymorphism and would polymorphism, if any, be concentrated in codons defining the ligand binding region, as it is for MHC class I genes?
Do OR genes vary in copy number between strains as do MHC class Ib genes?
Do OR genes play a role in the MHC-associated mating preference?
Does the proximity of MHC class I and OR loci induce or reflect reciprocal evolutionary influences?
To tackle these questions, the MHC-linked OR clusters of human and mouse were extensively analyzed. Of the human ‘extended class I region’, the 800 kb region telomeric of HLA-F has been sequenced, and found to contain a cluster of 25 OR genes, with limited individual polymorphism (41,42). We have now sequenced 897 kb of the telomeric part of the mouse MHC of strain 129/SvJ and report here the analysis of the MHC-linked OR cluster, the comparison to the human counterpart and polymorphism of the OR genes between mouse strains. Our data yield insights into the evolution of both OR genes and MHC class I genes.
Gene content and genomic environment
From the contig described in the methods section, eight overlapping clones from strain 129/SvJ were selected for sequencing (from centromere to telomere: CT7-573K1, RP21-374N22, CT7-87K14, CT7-332P19, RP21-383N7, RP21-538M10, RP21-639N14 and CT7-350K7), which provided a final sequence of 897 212 bp. A total of 17 non-OR genes and pseudogenes and 46 OR-related sequences were identified. The gene content and features of the sequence are presented in Figure 2. Details about the genes are reported as supplementary data.
The OR-like sequences identified within this sequence consist of 38 potentially functional, full-length ORFs and eight pseudogenes. Two additional OR genes (Olfr136 and Olfr138) were sequenced from strain 129/SvJ after subcloning of a series of clones that were positive with the D17Leh89 probe (a fragment of an OR sequence) (40) but that could not be linked to the contig. The genes were localized by restriction mapping based on YACs 91E7 and 117H3 (40,43). We resorted to the ENSEMBL mouse genome draft sequence (April 2002 version) from strain C57BL/6J to locate our markers and complete the cluster. This analysis identified nine additional OR genes and two fragments, resulting in a cluster of 59 OR sequences that span 1.3 Mb. The size and content cannot be considered final because the mouse draft genome sequence includes unfinished sequence and possible misassemblies (34). The distal end of the cluster cannot currently be defined because both ENSEMBL and CELERA assemblies (April 2002 versions) lack sequence in this region.
The OR sequences have been named here according to the recommendations of the Mouse Genome Nomenclature Committee, until a unified nomenclature for OR genes is agreed upon. We shall also refer to the subfamily nomenclature used by Zhang and Firestein (37) in their phylogenetic analysis on the whole-genome scale. Some peculiarities should be noted:
Olfr94 does not have an ATG start codon in the vicinity of the first conserved motif but instead a CTG. An ATG is present 228 nucleotides upstream and, if translated, would result in a large extracellular N-terminal domain of 80 amino acids. Such a long extracellular domain might be related to those of the V2R vomeronasal receptors and the metabotropic glutamate receptor, in which they are involved in ligand binding (44).
The three genes Olfr104, Olfr105 and Olfr106 lack a starting methionine within the 10 amino acids preceding the first conserved residue. They could, however, use a splicing mechanism to add a methionine. Examples of splicing of OR genes have been reported in rat, human and mouse (45–48). Although these spliced exons generally are in the 5′ untranslated region, Linardopoulou and collaborators (46) suggest that 5′ exons might be involved in producing coding sequence. Alternatively, these genes may use a start codon downstream of the conserved ‘FILLG’ (or equivalent) motif in the first extracellular domain (EC1) (Fig. 6). In the case of the human OR gene hs6M1–16, the methionine at amino acid position 79, between the second and third transmembrane (TM) regions, is likely to be used as an alternative start codon in some transcripts (42,48), and Olfr104–106 may use a similar mechanism.
Within these 897 kb of finished sequence, ≈20% of the OR loci being pseudogenes is consistent with the ratio estimated from the whole OR repertoire (36,37). However, the status of pseudogenes is not always clear and may vary between strains, as it has been shown to vary between haplotypes in humans (41,49).
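As a quick consistency check, the locus counts reported above can be tallied in a few lines. The sketch below uses only numbers stated in the text; the dictionary keys are illustrative labels, not identifiers from the original analysis.

```python
# Counts of OR sequences as reported in the text (keys are illustrative labels).
or_loci = {
    "full_length_orfs_in_finished_sequence": 38,
    "pseudogenes_in_finished_sequence": 8,
    "subcloned_129_genes": 2,        # Olfr136 and Olfr138
    "draft_genome_genes": 9,         # located with the C57BL/6J draft sequence
    "draft_genome_fragments": 2,
}

# OR-related sequences within the 897 kb of finished sequence.
finished = (or_loci["full_length_orfs_in_finished_sequence"]
            + or_loci["pseudogenes_in_finished_sequence"])

# Size of the whole cluster once the additional loci are included.
total = sum(or_loci.values())

# Pseudogene fraction within the finished sequence only.
pseudogene_fraction = or_loci["pseudogenes_in_finished_sequence"] / finished

print(f"finished sequence: {finished} OR sequences")
print(f"whole cluster:     {total} OR sequences")
print(f"pseudogene fraction (finished sequence): {pseudogene_fraction:.2f}")
```

The fraction obtained for the finished sequence is close to the rounded ≈20% quoted in the text.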
The genomic environment of the MHC-linked OR genes is remarkably similar to that of the human counterpart. The mouse OR cluster is characterized by a low GC content (average between 0.35 and 0.40). This low GC content is characteristic of an L isochore (50), rich in long interspersed nuclear elements (LINE), which comprise 74% of the total repeat content here. Long terminal repeat (LTR) elements, also known as retroviral-like elements, are also prevalent within this region: they contribute 15% of the repeat content. These repeat elements are thought to mediate duplication and relocation (51,52), although few supporting data have yet been published (e.g. 53,54). By contrast, the 150 kb between Gabbr1 and Ubiquitin D is richer in short interspersed nuclear elements (SINE). SINEs account for 23% of the repeat content, while LINEs account for 38% and LTRs for 25%. The GC content rises to 0.45, as in the proximal neighboring region (55), and is similar to that of the 200 kb equivalent region in man, distal to HLA-F (42).
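For illustration, GC content along a sequence is typically computed in sliding windows. The following is a minimal sketch, assuming a plain nucleotide string as input; the function names, the window and step sizes, and the toy sequence are hypothetical and are not taken from the original analysis.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def windowed_gc(seq: str, window: int = 1000, step: int = 500):
    """Return (start, gc_fraction) for each full window along the sequence."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence with two G/C bases in every ten (illustration only).
toy = "ATATATATGC" * 10
profile = windowed_gc(toy, window=20, step=20)
for start, gc in profile:
    print(start, round(gc, 2))
```

On real data, a window of several kilobases would be more appropriate for distinguishing isochore-like domains such as the two regions contrasted above.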
Amplification of OR genes through tandem duplications
It has been proposed that the OR repertoire has been amplified through a number of large genomic duplications (32,56–59). OR clusters also expanded by local duplications, generating contiguous, highly similar genes (42,60,61). Both mechanisms are corroborated by observations of the entire mouse OR genomic repertoire (36,37). In the case of the MHC-linked cluster, local duplications are well illustrated by the color code identifying the subfamilies in both the maps and the phylogenetic tree (Figs 2–4). Local tandem duplications are easily identified by a self-versus-self dot-plot of the masked sequence, while a dot-plot with the unmasked sequence reveals the repeats included in the duplications (not shown):
Olfr92 and Olfr92-L, which share 96.4% identity, are included in the duplication of an 11 kb segment, which contains a Smt3ip2-L sequence and is flanked by MIRs and a B1 SINE.
Olfr97 and Olfr98 (95.1% identity) were generated by a direct duplication of a 6–8 kb segment flanked by B1_MM and BGLII repeats.
Within family 250, genes Olfr104, Olfr105 and Olfr106, which all lack a starting methionine, were generated by direct duplications of a 10 kb segment; the Olfr104 and Olfr105 duplication units are flanked by an RSINE and an L1_MM repeat, while the Olfr104 and Olfr106 duplication units share the same SINE at one end but no obvious repeat at the other. Similarly, the Olfr101 and Olfr102 duplication units comprise a B4A and PB1D10 repeat, while the Olfr100 and Olfr101 duplication units involve a RMER1A repeat at one end but no shared repeat at the other.
Olfr110 and Olfr111 (98.1% identity) were created by the duplication of a 7 kb segment that contains an imperfect (GA)n repeat, but no repeat shared by the borders of the two regions was identified. The segment containing Olfr110 is extended by 2 kb of SINE, LTR and LINEs inserted closely 5′ of the OR.
Olfr112–116 were generated by the duplication of an 8–9 kb segment flanked by L1_MM and Lx5 repeats. The core duplicated segment shared by all genes has a length of 3 kb, but a larger segment, enlarged by the insertion of LINE repeats, can be recognized.
Olfr119, Olfr121 and Olfr122 were generated by duplication of a 15 kb segment flanked by Lx repeats at both ends; from Olfr121, a smaller segment of 11–13 kb flanked by a B2_Mm2 SINE and a tract of (CT)n repeats has duplicated twice to produce Olfr120 and Olfr118.
Olfr126 and Olfr127 were generated by the duplication of a 5 kb segment, which includes a B1_MM.
These few examples illustrate how difficult it can be to track the history of the duplications by the repeats, as long as the mechanisms driving local duplication remain obscure. Some duplicated segments end with similar repeats, suggesting that repeats may be involved in the process, possibly through unequal recombination. A role for the repeats, if any, cannot, however, be dominant because as many duplicated segments are not flanked by shared repeats. Multiple mechanisms are thus likely to have driven the local expansion of the OR genes.
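The principle of the self-versus-self dot-plot used to detect these tandem duplications can be sketched with exact k-mer matches. This is an illustration only: the dot-plot program and parameters used for the analysis are not stated here, repeat masking is not implemented, and the function names, the value of k and the toy sequence are assumptions.

```python
from collections import defaultdict


def self_dotplot(seq: str, k: int = 8):
    """Return sorted (i, j) pairs, i < j, where the k-mers at i and j are identical."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    dots = []
    for hits in positions.values():
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                dots.append((hits[a], hits[b]))
    return sorted(dots)


def diagonal_offsets(dots):
    """Count dots per diagonal (j - i); a dominant offset suggests a
    direct duplication whose copies lie that many bases apart."""
    counts = defaultdict(int)
    for i, j in dots:
        counts[j - i] += 1
    return dict(counts)


# Toy example: a 20 bp unit duplicated in tandem (illustration only).
unit = "ACGTTGCAAGGCTTACGATC"
dots = self_dotplot(unit + unit, k=8)
offsets = diagonal_offsets(dots)
print(offsets)
```

A run of dots on a single off-diagonal corresponds to the duplicated segment; comparing the plot of the masked sequence with that of the unmasked sequence would show which repeats lie inside or at the borders of the duplication units.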
Complex duplications including the MHC class I gene H2-M3
Repeats inserted within the duplicated segments often help to date the copies, but can as well make the deduction of an evolutionary schema more difficult. Complex events have involved the OR families 121 and 156 and the MHC class I sequence H2-M3: a segment containing one member of each family has duplicated several times. Figure 5 shows a pairwise analysis of the sequence similarity through exons and repeats of the duplicated units. It suggests that a primary block, block 3 (H2-M3–Olfr96-L2–Olfr99) was duplicated to produce block 4 (H2-M3-L2–Olfr96-L3–Olfr107). Insertions and deletions of repeats then turned H2-M3-L2 into a pseudogene while the first part of Olfr96-L3 was lost. Part of block 4 was duplicated to create block 5 (H2-M3-L3–Olfr96-L5–Olfr108). The latter genes were then separated by the insertion of repeats and Olfr96-L4.
The events that produced the proximal segments, blocks 1 and 2, are less easy to unravel; the puzzle is illustrated in Figure 5, where block 1, composed of the single pair (Olfr96–Olfr97), is compared to both block 2 (H2-M3-L1–Olfr96-L1–Olfr98) and block 3 (H2-M3–Olfr96-L2–Olfr99). Olfr96 and Olfr96-L1 share a 14–18 kb duplicated segment that contains SINEs, B1_MM and B2_Mm2 repeats and also a MYSERV retroviral element (marked by a green triangle) that is not present in Olfr96-L2. On the other hand, both Olfr96-L1 and Olfr96-L2 are disrupted by a RMER4 LTR element (red triangle) at exactly the same position and a duplicated segment can be delineated by a SINE at one end and an L1_MM at the other end. One possible schema is that block 1 was created by a partial duplication of block 3 that did not include H2-M3, followed by the insertion of RMER4 into the ancestral Olfr96, turning it into Olfr96-L2, and part of the block duplicated to create block 2, leaving out half of H2-M3. New L1 insertions would then have taken place between Olfr96-L1 and Olfr98. This schema is consistent with the observed traces of the H2-M3 members, but would imply that the MYSERV shared by block 1 and block 2 but not block 3 came from independent insertions. The sequence similarity, however, favors a direct filiation between Olfr96, Olfr96-L1 and Olfr96-L2. It is very unlikely that the RMER4 element could have been deleted from Olfr96-L1 without leaving scars in Olfr96. The flow would thus have been from Olfr96 to Olfr96-L1, followed by the RMER4 disruption and a duplication into Olfr96-L2, which then would have lost the MYSERV element. This schema would thus imply an H2-M3 ancestor in either block 1 or block 2, which would have been deleted, the copy in block 3 remaining functional.
An alternative could be proposed, hypothesizing a gene conversion event between block 2 and block 1: the MYSERV sequences are very similar and so are the intervening sequences, between the ends of the Olfr96 ORF and the starts of MYSERV. The sequence alignment (data not shown) reveals, however, that within the MYSERV, the similarity drops abruptly after 200 bp; there is a 250 bp sequence of block 2 and 200 bp sequence of block 3 that do not harbor similarity before the sequences can be aligned again. A gene conversion involving the MYSERV element may explain the sequence similarity between the intact Olfr96 and the RMER4-disrupted Olfr96-L1. Unfortunately, the phylogenetic analysis of the Olfr genes did not help: the Olfr96-L pseudogenes are too degenerate, and family 156, lacking a human homolog for correct rooting, offers poor bootstrap values. No certain schema can thus be deciphered. The on-going analysis of the rat MHC, where three RT1-M3 loci (homologous to H2-M3) have been identified (62,63), should help to unravel the relationships between H2-M3 and OR genes.
Phylogeny and comparison to human
The phylogenetic relationships between the OR genes of this cluster illustrate expansion by duplication and correlate well with the whole genome analysis published by Zhang and Firestein (37). We therefore used the nomenclature that they established for the OR subfamilies. Comparison to the homologous human cluster on 6p21.3 (Figs 3 and 4) highlights that the two clusters are orthologous, and that the mouse cluster contains many more OR genes than the human counterpart. For most of the subfamilies, a single gene in human corresponds to several in mouse. Analysis of the sequential duplications and of the repeat content supports a recent expansion in the mouse rather than deletion in the human genome. The exception is the single Olfr137 (family 256 group 5), which is orthologous to a group of four genes in human, the pairs hs6M1-4 and hs6M1-3 and hs6M1-5 and hs6M1-6, which have been shown to result from a tandem duplication (42). Four out of the nine genes that are single copy in the human cluster are pseudogenes.
All the human OR genes have orthologs in the mouse MHC-linked cluster, with the exception of hs6M1-23(p) and hs6M1-24(p). These two pseudogenes group with hs6M1-15 and hs6M1-8(p), which, based on BLAST similarity and the conserved synteny, appear to have orthologs on mouse Chr 13. This suggests that hs6M1-23(p) and hs6M1-24(p) were duplicated from hs6M1-15 and hs6M1-8(p) (located distal to the HSA6p21.3-MMU17 conserved synteny breakpoint) and relocated proximally after the rodent divergence. Similarly, hs6M1-16 was predicted to have duplicated from hs6M1-12/13 after the rodent divergence (42).
Three mouse subfamilies have no human orthologs, either within the 6p21.3 cluster or in the whole genome draft:
Subfamily 263 group 1 that surrounds the Scoc-L putative gene—this family is likely to be the ortholog of the rat SCR-D family of OR genes (for Spermatid ChemoReceptor), which are expressed in spermatids and undergo 5′-splicing (45). The 5′-UTR sequence of SCRD-8 mRNA (AF034896) matches 170 bp upstream of Olfr118, while the 5′-UTR of SCRD-9 (AF034899) matches sequences upstream of genes Olfr119, Olfr121 and Olfr122.
Subfamily 256 group 8 that is constituted by the single Olfr131.
Subfamily 156, the members of which are all included in the duplication units that involve H2-M3.
A remarkable feature of the comparative scheme in Figure 3 is that, despite the extensive duplications in the mouse, the order of the genes and their transcriptional orientation are conserved between species. The only exception is the relative inversion of Olfr124 and Olfr125 with regard to hs6M1-28 and hs6M1-22. A single inversion in the mouse, marked by a double arrow, would re-establish the order and transcriptional orientation, the breakpoints being located between Olfr128 and Olfr129, and between Olfr116 and Olfr118. Such an inversion would reunite the members of family 263 group 1 and the members of family 218, all with the same transcriptional orientation within families. However, the transcriptional orientations of the orthologous mouse Olfr123 and human hs6M1-25 would then be reversed, suggesting subsequent rearrangements involving the families 263 group 1 and 256 group 3.
Analysis of polymorphism among strains
Ehlers et al. (41) showed that the human MHC-linked OR genes exhibit limited polymorphism. We therefore wondered if the mouse OR genes would differ similarly between mouse strains. A comparison between strains might show two kinds of change: a difference in the composition of the cluster, by the number and/or nature of the genes, and polymorphism at the level of individual OR genes.
We used the ENSEMBL mouse GoldenPath assembly (April 2002 version), which is from C57BL/6J (H2b), and the CELERA mouse fragments, which are annotated with their strain of origin: 129X1/SvJ, 129S1/SvImJ (both H2bc), DBA/2J (H2d) and A/J (H2a). We searched for polymorphism in each of the 49 OR genes exhibiting full-length ORFs, and in Olfr92-L, which is categorized as a pseudogene. The detailed results are presented as electronic supplementary data. The analysis of the polymorphism provides two main pieces of information:
The great majority of the OR genes exhibit polymorphism, either at the protein level or as SNPs. For 11 genes, three alleles each were identified. Of the 41 OR genes for which we have determined the sequence in strain 129, only one, Olfr116, exhibits no polymorphism between the four strains. The data were judged insufficient to make a conclusion for four genes: Olfr96, Olfr98, Olfr100 and Olfr103. In contrast, the genes located at the distal end of the contig, which were extracted from the C57BL/6J mouse genome assembly, did not exhibit polymorphism, whereas their human counterparts do (41). The fragments in the CELERA database are raw sequence, and we only considered polymorphism when at least two sequences exhibited the same difference. These fragments are also often short, and the four strains are not equally represented: the total database is 5× coverage and the contribution of DBA/2J is 10 × 10^6, of A/J 11 × 10^6, of 129X1 8 × 10^6 and of 129S1 0.13 × 10^6. The consequence for our study is that individual genes are not always entirely represented in all four strains. We have therefore ignored many sequence differences that were not well supported, and the extent of polymorphism and the number of alleles has certainly been underestimated.
Three genes have alleles with mutations affecting highly conserved residues and may therefore be functionally inactive (Fig. 6): the R118 of the conserved motif MAFDRYVAIC within cytoplasmic loop 2 (CP2) is changed to H in gene Olfr114 in DBA/2J, and to C in gene Olfr118 in 129/Sv; for the gene Olfr122 in C57BL/6J, T237 of the conserved motif KAFSTC within cytoplasmic loop 3 (CP3) is changed to I. The R118 residue is supposed to play a crucial role in signaling-related conformational changes. Interestingly, drastic changes of this highly conserved arginine (R) were also reported for the human genes hs6M1-17 and hs6M1-20 (41), and for OR3A1 on Chr 17p13.3 (49).
The distribution of the mutations along the OR molecules has been investigated. Figure 6 presents a summary of the polymorphic positions plotted on the consensus of the 50 genes of the cluster. The color-coded consensus sequence illustrates the diversity among the OR genes (see legend). Non-synonymous changes observed among alleles are reported above the consensus while synonymous changes are shown below. Table 1 reports the number and frequency of the non-synonymous (N) and synonymous (S) mutations grouped according to the degree of conservation. On residues >90% conserved among genes (red), negative selection pressure is expected; on the other hand, residues highly variable between ORs (<50% conserved, black) could be subject to fewer constraints and a larger proportion of non-synonymous mutations would be expected. The frequency of non-synonymous mutations does not, however, differ substantially from that of synonymous changes at these positions.
The ratio dS/dN (rate of synonymous changes per synonymous site/rate of non-synonymous changes per non-synonymous site) was calculated between alleles. If a gene is subject to negative selection pressure, the rate of synonymous mutations, which do not change the residues in the protein, is expected to exceed the rate of non-synonymous mutations, which may alter the structure/function of the protein. In the case of the classical class I and class II MHC genes, substitutions affecting the codons encoding the peptide binding region are mainly non-synonymous, testifying to selection pressures that favor diversity. The ratio dS/dN could only be calculated for 19 out of the 37 genes that harbor SNPs: it ranges from 0.83 (Olfr125) to 16.26 (Olfr99) with an average of 3.75 (±4.24); for the remaining genes, either the number of SNPs was inadequate or the full-length sequence was not available. No trend could be extracted from these data, probably because of the few substitutions observed; the genes, however, seem rather under neutral selection, or mild purifying selection. We also examined the dS/dN ratio with regard to the conservation of the residue and across the domains of the proteins (data not shown), but again, no trend was observed.
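To make the definition of the ratio concrete, the sketch below counts synonymous and non-synonymous sites and differences between two aligned alleles, then applies a Jukes-Cantor correction, in the spirit of the Nei-Gojobori counting approach. It is a simplified illustration, not the procedure used for the values reported above, which is not specified here: it assumes gap-free, in-frame, upper-case sequences of equal length, it counts changes to stop codons as non-synonymous, and it ignores codons that differ at more than one position. All function names and the toy alleles are hypothetical.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon: str) -> float:
    """Number of synonymous sites in one codon (between 0 and 3)."""
    s = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODE[mutant] == CODE[codon]:
                    s += 1 / 3
    return s


def count_sites_and_diffs(a: str, b: str):
    """Return (S, N, Sd, Nd): synonymous and non-synonymous sites and differences."""
    S = N = 0.0
    Sd = Nd = 0
    for i in range(0, len(a) - len(a) % 3, 3):
        c1, c2 = a[i:i + 3], b[i:i + 3]
        s = (syn_sites(c1) + syn_sites(c2)) / 2   # average over the two alleles
        S += s
        N += 3 - s
        if sum(x != y for x, y in zip(c1, c2)) == 1:
            if CODE[c1] == CODE[c2]:
                Sd += 1
            else:
                Nd += 1
        # Codons with two or three differences need pathway averaging; skipped here.
    return S, N, Sd, Nd


def jukes_cantor(p: float) -> float:
    """Correct a proportion of differences for multiple hits (requires p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy alleles: one synonymous (GCT/GCC) and one non-synonymous (AAA/AGA) change.
allele_1 = "ATGGCTAAA"
allele_2 = "ATGGCCAGA"
S, N, Sd, Nd = count_sites_and_diffs(allele_1, allele_2)
dS, dN = jukes_cantor(Sd / S), jukes_cantor(Nd / N)
print(f"S={S:.2f} N={N:.2f} Sd={Sd} Nd={Nd} dS/dN={dS / dN:.2f}")
```

With so few substitutions per gene, as in the data discussed above, such ratios carry large sampling errors, which is consistent with the wide range of values observed.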
Another question that we attempted to address was whether the OR genes would exhibit differences in the number of copies between strains as do, for example, the nearby MHC class I genes of the H2-Q region (64). In the case of highly homologous genes, like Olfr101 and Olfr102, or Olfr107 and Olfr108, BLAST screening of the CELERA database, even with very stringent E-values, produced groups of matching fragments, mixing highly similar genes and alleles, and the fragments had to be sorted out by eye. Some fragments could not be assigned and more likely represent new genes than odd alleles. Two fragments identical in sequence, one from A/J and the other from DBA/2J, present a sequence intermediate between Olfr127 and Olfr128. Four fragments, three from DBA/2J and one from A/J, define a sequence intermediate between Olfr118 and Olfr120. If confirmed with finished genomic sequence, these observations would argue that, within a species, OR genes, like MHC class I genes, diversify not only by mutations, but also by duplication and/or gene conversion. One case of gene conversion can be suspected: it concerns gene Olfr112 and the two conserved motifs MAFDRYVAIC (CP2) and KAFSTC (CP3). Genes of family 218, which have been separated into two groups by a 300 kb inversion, share motifs different from the consensus. The sequence of Olfr112 is close to the distal group regarding the first motif, but identical to the proximal group regarding the second motif (Fig. 7).
Expansion of the mouse repertoire of OR genes
As the MHC region is prone to species-specific, local duplications, particularly involving the MHC class I genes (55,65,66), we wondered whether the observed expansion of the MHC-linked mouse OR genes (59 versus 25 in human) was greater than that of other OR clusters in the mouse genome. A review of other studies on human–mouse orthologous OR clusters does not provide a clear answer to this question. The OR clusters located on human Chr 17 and mouse Chr 11B3–B5 and analyzed by Lapidot et al. (61) contain seven orthologous groups: one group shows an increase of the number of OR genes in the mouse (two versus five), one group shows a significant increase in man (four versus one) while the five other groups show a one-to-one or one-to-two relationship. Three other analyses (47,60,67) also suggest that one-to-one relationships are prevalent. On the other hand, Young et al. (36), analyzing an OR cluster on mouse Chr 2 and its ortholog on human Chr 9, reported that two human OR genes correspond to six and eight co-orthologs, respectively, in the mouse; we compared another OR cluster on mouse Chr 2 with its ortholog on human Chr 11, and found that one group of OR genes increased its members from two in the mouse to five in man. Even if the extent of duplication in the MHC-linked OR cluster were greater than in other genomic locations, it is not exceptional.
Three subfamilies in the mouse MHC-linked cluster do not have orthologs in human. A similar example was reported by Lane et al. (60). It suggests that the estimated 50% increase of the functional OR repertoire in mouse compared with humans (36) results from both more genes within the orthologous groups and more OR groups in the mouse. The question remains whether these groups arose in the rodent lineage or were lost in the primate lineage.
In the human cluster, four out of the nine subfamilies represented by a single locus are pseudogenes. The MHC-linked OR genes seem therefore to be subject to fewer local tandem duplications and less selective pressure in humans than in mice. This is again consistent with the analyses of the whole OR repertoire and the general assertion that humans have lost smelling acuity compared to other mammals (36,37,68).
‘Rodent specific’ H2-M and OR genes
In comparisons of the human and mouse MHCs, the MHC class Ib genes have always occupied a peculiar position: by sequence similarity, they group within species and no clear relationship can be established between individual human and mouse class I genes (20). Functional similarities suggest that the Qa1 molecule, encoded by the H2-T23 gene, could be the homolog of HLA-E, which is encoded in a similar position within the human MHC (30,69). Several studies also imply functional similarities between HLA-G and Qa2 (70), but no clear evidence has yet been produced, and the genes have distinct locations and evolutionary histories. No human counterpart has ever been suggested for H2-M3, which presents N-formylated peptides in an early response to bacterial infection (71,72). The question of a human counterpart to H2-M2 (73) is even more difficult as its function has not yet been elucidated.
Up to now, the H2-M genes appear rodent-specific and the question therefore is whether they are murid creations, recently derived from MHC class Ia loci, as suggested by Hughes and Nei (74), or whether their ancestors were lost in non-murid lineages, as was recently proposed by Doyle et al. (75).
The tripartite duplication unit (H2-M3–Olfr family 121–Olfr family 156) was subjected to complex events, more complex than the usual local tandem duplications of OR genes. These events are more similar to the duplications observed for the mouse H2-Q (64) and H2-M (55) or the human HLA MHC class I genes (65,66). For the (H2-M3–Olfr family 121–Olfr family 156) unit, one can only wonder if the MHC class Ib gene was the driving force.
Our analysis has shown that H2-M3 was involved in complex duplication events together with an ‘ancestral’ OR locus: family 121, orthologous between human and mouse, and a ‘rodent-specific’ OR locus: family 156. According to the analysis of the whole mouse OR repertoire made by Zhang and Firestein (37), all the members of these families are located in the MHC-linked cluster: family 121 is estimated to include four members and family 156 five members. OR genes of families 121 and 156 could thus be the ideal markers that will help tracking H2-M3 relatives in murid and non-murid species.
No polymorphism was observed for 10 OR genes out of the 50 examined, and data are missing for two genes, so at least 76% of the MHC-linked OR genes harbor polymorphism. The total number of SNPs observed in this study is 138 (3.8/gene), of which 67 are synonymous (49%) and 71 non-synonymous (51%). The level of polymorphism that we observed is difficult to put into perspective, due to the lack of comparable studies. Two analyses have been performed on human clusters. Ehlers et al. (41) have reported polymorphism in all 13 OR genes of the human MHC-linked cluster that they analyzed. A total of 52 SNPs was observed (range one to 10 per gene, average four per gene), 16 synonymous (31%) and 36 non-synonymous (69%). Sharon et al. (76) studied the polymorphism of the 17 OR genes of the cluster located on human Chr 17p13.3. Fourteen genes were polymorphic (82%) and 21 SNPs were reported (1.5/gene), seven synonymous (33%) and 14 non-synonymous (67%). It appears that the degree of polymorphism of MHC-linked OR genes is not particularly elevated.
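The proportions quoted above can be recomputed from the counts given in the text. The following is a minimal sketch; the dictionary layout and function name are illustrative choices, and only the numbers come from the paragraph above.

```python
# Counts quoted in the text; no new data are introduced here.
studies = {
    "mouse MHC-linked cluster (this study)": {"syn": 67, "nonsyn": 71},
    "human MHC-linked cluster (Ehlers et al.)": {"syn": 16, "nonsyn": 36},
    "human Chr 17p13.3 cluster (Sharon et al.)": {"syn": 7, "nonsyn": 14},
}


def percentages(syn, nonsyn):
    """Return (synonymous %, non-synonymous %) rounded to whole percent."""
    total = syn + nonsyn
    return round(100 * syn / total), round(100 * nonsyn / total)


# Lower bound on the fraction of polymorphic genes: the two genes with
# missing data are counted as if they were non-polymorphic.
genes_examined = 50
no_polymorphism = 10
missing_data = 2
polymorphic_min = genes_examined - no_polymorphism - missing_data
percent_polymorphic = 100 * polymorphic_min / genes_examined

print(f"at least {percent_polymorphic:.0f}% of genes are polymorphic")
for name, counts in studies.items():
    syn_pct, nonsyn_pct = percentages(**counts)
    print(f"{name}: {syn_pct}% synonymous, {nonsyn_pct}% non-synonymous")
```

Counting the two genes without data as non-polymorphic is what makes the 76% figure a lower bound rather than an estimate.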
Two observations suggest that the MHC-linked OR genes do not have a high rate of mutation: (i) Olfr92-L, which harbors the exact same STOP codon within TM3 in all strains, did not accumulate mutations at a higher rate than the potentially expressed genes; and (ii) genes Olfr101 and Olfr102 both use the GTC codon for V130 in strain 129, while in B6 both genes use GTG. Either this synonymous mutation and the strain divergence arose before the gene duplication, or these genes have been subject to gene conversion. If the rate of mutation is not enhanced, the polymorphism must be fixed or selected, as is the case for the MHC genes. It is difficult, however, to speculate on the selection pressures responsible for this preservation without conclusive functional data. Olfr90 and Olfr91 match ESTs from skin, and Olfr112 and Olfr126 match ESTs from testis. If, as is widely thought (77), some ORs also have functions other than odorant recognition in the olfactory epithelium, the blind analysis of the polymorphism, including the dS/dN ratio, may remain inconclusive. More work on the expression of these genes may shed light on the function-associated conservation of these genes: for example, Branscomb et al. (78) compared the evolution of OR genes orthologous between rat and mouse and found that OR genes transcribed in the testis are more conserved than the OR genes transcribed in the olfactory epithelium.
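The statement that GTC and GTG are synonymous at V130 can be checked against the standard genetic code. The sketch below builds the standard codon table in the conventional TCAG order; the helper name `is_synonymous` is an illustrative choice and not part of any tool used in this study.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_synonymous(codon_a, codon_b):
    """True if both codons encode the same amino acid (or are both stops)."""
    return CODON_TABLE[codon_a.upper()] == CODON_TABLE[codon_b.upper()]


# Olfr101/Olfr102 codon 130: GTC in 129, GTG in B6; both encode valine.
print(CODON_TABLE["GTC"], CODON_TABLE["GTG"], is_synonymous("GTC", "GTG"))
```

The same table classifies the TTC→GTC change used for Olfr104 in the Materials and Methods as non-synonymous (phenylalanine to valine).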
A general view about duplication is that the duplicated copies are free to accumulate mutations and diversify. We used the opportunity of the multicopy OR families to check the dS/dN ratio among members within the subfamilies. The mean values are reported in Table 2. The result clearly shows that the synonymous substitutions greatly exceed the non-synonymous, even within family 256, the members of which are very diverse.
OR genes and MHC haplotypes
The polymorphism that we describe is likely to represent an underestimate of both the divergence and the number of alleles. It is clear, however, that B6 is as different from 129 as is DBA/2J. On the other hand, B6 and A/J are often similar and different from 129. A pairwise comparison of the fraction of OR genes showing at least one mutation is presented in Table 3. From these data, we conclude that the polymorphism of the mouse MHC-linked OR genes is not dependent on the classical MHC: C57BL/6J and 129 share the b haplotype over about 1.5 Mb of the classical MHC from H2-K to H2-D, whereas DBA/2J is of the d haplotype and differs in both the classical and non-classical MHC class I and class II genes. If there were an association between the haplotypes of the classical MHC and of OR genes, one would expect DBA/2J but not C57BL/6J to differ from 129. Furthermore, DBA/2J shares the d haplotype with strain A/J over part of the classical MHC, at least from Eb through to H2-D.
In humans, Ehlers et al. (41) suggested that extended human MHC haplotypes include the polymorphism of the OR genes, in line with the study of Malfroy et al. (79). This view has been challenged on the basis of studies of one OR gene in the Hutterites (80), but the issue is controversial (81). In the mouse, however, the situation appears different, based on the results that we present here. The lack of association between the MHC and the MHC-linked OR cluster may be related to the difference in the structure of the two MHCs (Fig. 1): polymorphic classical class I genes, HLA-B, HLA-C and HLA-A, are distributed over the MHC class I region in human while, in the mouse, the most polymorphic MHC class I genes, H2-K and H2-D, are located in the proximal part of the MHC, the most distal 1-Mb part being occupied by non-polymorphic H2-M class Ib genes.
A second major difference is that no extended linkage disequilibrium has been observed in the mouse. A major chromosomal event occurred within the OR cluster, which split the ancestral synteny unit into Chr 17 and Chr 13 (29). The mouse homologs of the human genes hs6M1-1, hs6M1-15 as well as RFP belong to the block of conserved synteny on mouse Chr 13. Evolutionary breakpoints in the vicinity of OR genes have also been observed in other areas of the mouse genome: in a large study of human Chr 19 and the homologous regions in the mouse, Dehal et al. (82) noted that OR clusters were associated with evolutionary rearrangements. OR clusters have also been implicated in translocations with patho-physiological consequences (83). The increased polymorphism observed over the MHC overall has been often viewed as a probable consequence of the selection pressures applying to the polymorphic MHC genes, through hitch-hiking (84). It is therefore striking that no polymorphism was observed for the 10 OR genes located distal to the 300 kb inversion (Olfr117–Olfr128), whereas the human corresponding genes do harbor SNPs (41). As much as the break of synteny marks the end of the physical linkage, the 300 kb inversion may mark the end of the tight genetic linkage with the MHC.
The lack of association between MHC haplotypes and OR alleles in the four strains of mice studied here, therefore, does not support the putative involvement of the MHC-linked OR genes in olfaction-associated mate choice mechanisms, considering that the primary observations were based on the classical MHC haplotypes (9,10,85). However, it would be possible to test such involvement directly by comparing the mating choice of C57BL/6J with that of various congenic ‘MHC-KO’ (knock-out) mice, which carry the MHC (albeit with one or another gene deletion) and OR genes of 129/Sv on a C57BL/6J background. As 129 and B6 have a similar classical MHC, the MHC-KO mice thus differ from B6 by the MHC-linked OR genes and the non-classical MHC loci. Comparing the mating choices of B6, 129 and MHC-KO mice would differentiate classical MHC from non-classical MHC and OR genes, and MHC from non-MHC genetic background.
MATERIALS AND METHODS
The BAC clones (CT7-) are from the CITB-CJ7 library from strain 129/SvJ (H2bc) (Research Genetics, Huntsville, AL, USA) and were already described (43,86). The PAC clones (RP21-) are from the RPCI-21 library from strain 129/SvJ. The library was provided as nylon filters by the Roswell Park Cancer Institute. The BACPAC Resources Center is now accessible at www.chori.org/bacpac/home.htm. The filters were screened by hybridization: 2 h prehybridization at 65°C in 5× Denhardt's, 6× SSC, 1% SDS, 100 µg/ml blocking DNA, hybridization overnight, then two washes in 2× SSC, 0.1% SDS. The library was screened with Olfr94 (subcloned from CT7-573K1), Olfr121 (subcloned from CT7-350K7), the H2-M2 cDNA, D17Leh525, and D17Leh89 (Olfr136/138). Additional series of clones were obtained by hybridizing with the whole YACs, 91E7 and 117H3 (described in 43) as probes after blocking the repeats by a 2 h incubation with cot-1 mouse DNA. A backbone of overlapping clones was constructed by restriction mapping using the probes described above and various fragments obtained by subcloning, and by fingerprinting after hybridization with cot-1 mouse DNA. The final contig was established by restriction digest fluorescent fingerprinting (87): the method is based on labeling HindIII sites created in a double digest of the clones and using the restriction patterns from the various clones to assess the degree of overlap between clones.
The clones from the selected tiling path were each subcloned randomly and sequenced (88). For the shotgun phase, pUC plasmids with inserts of 1.4–2 kb were sequenced from both ends using the dideoxy chain termination method (89) with big dye terminator chemistry (90) and analyzed on ABI3700 capillary sequencing machines. The resulting data were processed by a suite of in-house programs (www.sanger.ac.uk/Software/sequencing/) prior to assembly with the PHRED (91,92) and PHRAP (www.phrap.org/) algorithms. For the finishing phase, we used the GAP4 program (93) to help assess, edit and select reactions to eliminate ambiguities and close sequence gaps. Sequence gaps were closed by a combination of primer walking, PCR, short/long insert sublibraries (94), oligo screens of such sublibraries and, in rare cases, transposon sublibraries. Unless annotated otherwise, each clone has been finished according to the agreed international finishing standard (http://genome.wustl.edu/gsc/Overview/finrules/hgfinrules.html). The 897 212 bp sequence was built up as follows: CT7-573K1 from 1 to 154 614; RP21-374N22 from 152 615 to 180 744; CT7-87K14 from 178 743 to 281 658; CT7-332P19 from 281 559 to 452 307; RP21-383N7 from 450 307 to 491 106; RP21-538M10 from 489 107 to 679 843; RP21-639N14 from 677 844 to 736 277; CT7-350K7 from 734 278 to 897 212. The sequences reported here have been submitted to the EMBL/GenBank/DDBJ database under the following accession numbers: AL078630 (CT7-573K1), AL590433 (RP21-374N22), AL359381 (CT7-87K14), AL133159 (CT7-332P19), AL450393 (RP21-383N7), AL136158 (RP21-538M10), AL365336 (RP21-639N14), AL359352 (CT7-350K7).
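The tiling path above can be checked for continuity by computing the overlap between consecutive clones. The sketch below assumes the coordinates are 1-based and inclusive, which is an interpretation and is not stated in the text; the function name is illustrative.

```python
# Clone name, start and end in the assembled sequence, as listed in the text.
tiling_path = [
    ("CT7-573K1", 1, 154_614),
    ("RP21-374N22", 152_615, 180_744),
    ("CT7-87K14", 178_743, 281_658),
    ("CT7-332P19", 281_559, 452_307),
    ("RP21-383N7", 450_307, 491_106),
    ("RP21-538M10", 489_107, 679_843),
    ("RP21-639N14", 677_844, 736_277),
    ("CT7-350K7", 734_278, 897_212),
]


def clone_overlaps(path):
    """Return (left clone, right clone, overlap in bp) for adjacent clones.

    Assumes 1-based inclusive coordinates; a value <= 0 would indicate a gap.
    """
    result = []
    for (left, _, left_end), (right, right_start, _) in zip(path, path[1:]):
        result.append((left, right, left_end - right_start + 1))
    return result


for left, right, bp in clone_overlaps(tiling_path):
    print(f"{left} / {right}: {bp} bp overlap")

total_length = tiling_path[-1][2] - tiling_path[0][1] + 1
print(f"total length: {total_length} bp")
```

Under this assumption every junction overlaps by about 2 kb except CT7-87K14/CT7-332P19, where the listed coordinates give 100 bp.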
Assessing Celera assembly
We have compared our data with the set of OR genes extracted from the CELERA genome assembly by Zhang and Firestein (37) (see electronic supplementary data). Nine out of 42 genes (21%) that belong to the cluster are a mixture of alleles. This risk was not ignored by the authors, who stressed that the sequences represent a consensus and possibly not the actual genes. More dramatically, four of Zhang and Firestein's set of OR genes belonging to this cluster are in fact assembled from different members of a subfamily (subfamily 256 group 1 and subfamily 263 group 1). It is clear from the annotations at the CELERA web site that the computer-directed assembly is unable to discriminate between very similar and tandemly duplicated genes. A similar entanglement was observed for the tandemly duplicated H2-Q class I genes (A. Kumánovics, unpublished data).
To check whether the four mouse strains (129, A/J, C57BL/6J and DBA/2J) could indeed be expected to differ in the OR region, we first examined the polymorphism of known markers distributed over the region: D17Tu42, D17Mit232, D17Mit148, D17Leh525, D17Mit24, D17Leh89, D17Mit64 and D17Mit234 (40,43). The location of the first five markers is reported in Figure 2; the sequence of D17Leh89 marker matches to both Olfr136 and Olfr138, and D17Mit64 and D17Mit234 map distal to the OR cluster (95). Each of these markers is recorded in the Mouse Genome Database (www.informatics.jax.org) as having an allelic difference between at least two of the strains concerned.
We also checked the polymorphism for non-OR genes within the region. There is no change in the UbiquitinD gene among the four strains; for Smt3ip2-L, the sequence is identical in C57BL/6J and 129/Sv; for the Rex-2-L pseudogene, no comparable sequence was available from ENSEMBL but only the actual Rex-2 gene. From CELERA, fragments from A/J were found, which present SNPs with regard to Rex-2-L and harbor the TGA STOP codon in the middle of the C2H2 motif. No comparison was possible for Scoc-L because both databases only contain the sequence of the actual Scoc gene on Chr 8 (Genbank AF115778).
Repeats were identified by REPEATMASKER (http://ftp.genome.washington.edu/cgi-bin/RepeatMasker); genes were identified with the BLAST programs (96) (www.ncbi.nlm.nih.gov/BLAST), with the FGENE prediction program (http://genomic.sanger.ac.uk/gf/gf.shtml) and by comparison to the human sequence. Dot matrix comparisons were made with PipMaker (97) (http://bio.cse.psu.edu/pipmaker). The OR were aligned and the consensus sequence was determined by Multalin (98) (http://genopole.toulouse.inra.fr/multalin/multalin.html). To circumvent the variable size of both the N-terminal and C-terminal ends of the olfactory receptors, the protein sequences were cropped to 310 amino acids, starting with the first conserved asparagine (N1). The phylogenetic tree was built after alignment of the OR amino acid sequences by Multalin and rooted with the rat G protein-coupled receptor 27 (P34987) (99), using the DNASTAR package (Madison, WI, USA) and MEGA2 (100). A similar tree was obtained rooted on the Danio rerio odorant receptor NM_131739. The dS/dN ratio was calculated with SNAP at http://hiv-web.lanl.gov/content/hiv-db/SNAP/WEBSNAP/SNAP.html. In Table 2, the number given for the dS/dN is the average of the pairwise dS/dN ratios. The deletion in gene Olfr104 of strain DBA/2 was handled by changing it into a non-synonymous substitution (TTC→GTC).
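The averaging used for Table 2 can be illustrated as follows. This is a minimal sketch of the Jukes–Cantor correction named in the Table 2 legend and of a mean of pairwise dS/dN ratios; it is not the SNAP implementation, the input proportions are made-up values for illustration only, and skipping pairs with dN = 0 is an assumption about how undefined ratios might be handled.

```python
import math
from statistics import mean, stdev


def jukes_cantor(p):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3) for a proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def mean_pairwise_ratio(pairs):
    """Average dS/dN over gene pairs given (pS, pN) observed proportions.

    Pairs with dN == 0 are skipped because the ratio is undefined
    (an assumption made for this sketch).
    Returns (mean ratio, sample standard deviation).
    """
    ratios = []
    for p_syn, p_nonsyn in pairs:
        d_s = jukes_cantor(p_syn)
        d_n = jukes_cantor(p_nonsyn)
        if d_n == 0:
            continue
        ratios.append(d_s / d_n)
    sd = stdev(ratios) if len(ratios) > 1 else 0.0
    return mean(ratios), sd


# Hypothetical proportions for three gene pairs within one subfamily.
example_pairs = [(0.30, 0.05), (0.25, 0.06), (0.40, 0.08)]
avg, sd = mean_pairwise_ratio(example_pairs)
print(f"mean dS/dN = {avg:.2f}, SD = {sd:.2f}")
```

Averaging the pairwise ratios, as described for Table 2, is not the same as dividing the mean dS by the mean dN; the two can differ when the pairs are heterogeneous.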
Individual OR sequences are recorded at MGI (www.informatics.jax) and have been submitted to GenBank (accession numbers AY374988-AY375040). The details of the polymorphism of the OR genes described are reported as online Supplementary Material.
Supplementary Material is available at HMG Online.
We thank all members of the central sequencing and finishing facilities at the Sanger Institute, in particular, Michael Kay, Jamieson Lovell, Ben Phillimore and Anthony Tromans, and the nomenclature group at the Jackson Laboratory, in particular Robert Sinclair and Lois J. Maltais. S.S., L.H.M., J.R. and S.B. were funded by the Wellcome Trust. R.M.Y. was supported by a studentship from the UK Medical Research Council (MRC). C.A. was supported by HHMI, then by her parents and husband. A.Z. was supported by the VolkswagenStiftung (I/75 196) and a Biomedical Research Collaboration Grant from the Wellcome Foundation (jointly with S.B.).
To whom correspondence should be addressed. Tel: +1 2146485007; Fax: +1 2146485095; Email: firstname.lastname@example.org
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Present address: 10 Avenue du Mont Frouzy, F-31810 Venerque, France.
Present address: Department of Zoology, University of Oxford, South Parks Road, Oxford, OX1 3PS, UK.
| ||Black||Blue (≥50% identity)||Red (≥90% identity)|
|N||40/149 (27%)||22/106 (21%)||5/55 (9%)|
|S||30/149 (20%)||31/106 (29%)||10/55 (18%)|
|Number of genes||dS||dN||dS/dN||SD|
dS, Jukes–Cantor correction for multiple hits of the proportion of observed synonymous substitutions.
dN, Jukes–Cantor correction for multiple hits of the proportion of observed non-synonymous substitutions.
dS/dN, average of the pairwise dS/dN ratios.
SD, standard deviation.
|129||10/40a (25%)||21/44 (48%)||29/46 (63%)|
|C57BL/6J||21/50 (42%)||8/52 (15%)|
aThe number of genes with allelic differences/total number of genes that could be compared.