The Reproducibility of an Inferred Tree and the Diploidization of Gene Segregation after Genome Duplication

Abstract We previously introduced a numerical quantity called the stability (Ps) of an inferred tree and showed that for the tree to be reliable this stability as well as the reliability of the tree, which is usually computed as the bootstrap probability (Pb), must be high. However, if genome duplication occurs in a species, a gene family of the genome also duplicates, and for this reason alone some Ps values can be high in a tree of the duplicated gene families. In addition, the topology of the duplicated gene family can be similar to that of the original gene family if such gene families are identifiable. After genome duplication, however, the gene families are often partially deleted or partially duplicated, and the duplicated gene family may not show the same topology as that of the original family. It is therefore necessary to compute the similarity of the topologies of the duplicated and the original gene families. In this paper, we introduce another quantity called the reproducibility (Pr) for measuring the similarity of the two gene families. To show how to compute the Pr values, we first compute the Pb and Ps values for each of the MHC class II α and β chain gene families, which were apparently generated by genome duplication. We then compute the Pr values for the MHC class II α and β chain gene families. The Pr values for the α and β chain gene families are now low, and this suggests that the diploidization of gene segregation has occurred after the genome duplication. Currently higher animals, defined as animals with complex phenotypic characters, generally have a higher genome size, and this increase in genome size appears to have been caused by genome duplication and diploidization of gene segregation after genome duplication.


Introduction
In the previous paper, we introduced a numerical quantity called the stability for measuring the accuracy of an inferred tree from empirical data (Katsura et al. 2017). This stability refers to the bootstrap probability of the subtree to be defined (Ps), and we have shown that for the inferred tree to be reliable both the Ps values and the usual bootstrap values (Pb) must be high (Katsura et al. 2017). For the people who are not well acquainted with the statistical method used, we present the essence of Ps and Pr computation in figure 1. In this paper, we introduce another quantity called the reproducibility (Pr) of the inferred tree when two homologous gene families exist. The Pr value refers to the similarity of the topologies of the two homologous gene families. When genome duplication occurs in a species, a gene family of the genome should also duplicate. If the duplicated gene family shows the same topology as that of the original family, Pr is 100%, whereas if the two topologies are entirely different, it is 0%. A low value of reproducibility indicates that the topologies of the ancestral or duplicated gene families have been altered considerably by the postduplication events such as the partial gene deletion or partial gene duplication (Scannell et al. 2006;Nei 2013, pp. 139-140).
In this paper, we first explain the computational principle in an example case. We then use the a and b chain genes of the major histocompatibility complex (MHC) class II genes as the numerical example, because the a and b chain genes were apparently generated by genome duplication (see Nei and Hughes 1991;Katsura et al. 2017). In the class II region of the MHC, the a and b chains in the mammal immunoglobulins are encoded by separate genes in the DP, DQ, and DR gene regions. The genes in the DP, DQ, and DR gene regions are responsible for presenting the peptides derived from extracellular pathogens to cytotoxic T cell. We do not consider the DM and DO regions because they do not present antigens (Ting and Trowsdale 2002). In humans, the DP and DQ regions possess one functional gene for each of the a and b chains, and the DR region contains one copy of the functional a chain gene and two functional copies of the b chain gene (b1 and b3 genes) (Nei and Hughes 1991;Ting and Trowsdale 2002). In this study, we used the a1 and b1 genes in all of the DP, DQ, and DR gene regions (Nei and Hughes 1991). The a and b chain genes of the DP region (DPA and DPB genes, respectively) are absent from many currently known mammalian genomes except in primates, rodents, and elephants, and the DPB genes are pseudogenes in rodents (Katsura et al. 2017). In this paper, we consider the primate MHC class II genes (DNA sequences) only to avoid the problem of the gene deletion or gene duplication that has occurred in nonprimate species. We also compare the gene sequences from the same number of species for each of the DP, DQ, and DR gene regions. We used the nucleotide sequences from a total of nine primate species in which the entire genome sequence is available. These species included three hominoids, three Old World monkeys, two New World monkeys, and one Prosimian. The nine species used here are human (Homo sapiens), chimpanzee (Pan troglodytes), gorilla (Gorilla gorilla), rhesus macaque (Macaca mulatta), drill (Mandrillus leucophaeus), black snub-nosed monkey (Rhinopithecus bieti), common marmoset (Callithrix jacchus), Ma's night monkey (Aotus nancymaae), and gray mouse lemur (Microcebus murinus). The phylogenetic trees of these species were previously studied by Perelman et al. (2011), Springer et al. (2012, and Perez et al. (2013); and the topology of our tree was generally in agreement with that of their trees.  Figure 2 shows that the topology of the tree is generally consistent with that of the primate trees obtained in the previous studies (Perelman et al. 2011;Springer et al. 2012). The topology of DQA genes (sequences 10-18) is identical with a correct topology supported by the previous studies (Perelman et al. 2011;Springer et al. 2012). The topologies of the hominoids in DPA and DRA genes (sequences 1-3 and sequences 19-21) are not identical with that of DQA genes (sequences 10-12). The topology of the hominoid cluster is often different from that of the previous studies, although the Pb values of the hominoid cluster are always high. The Ps value of the hominoid cluster in the DPA gene is 40%, and this low Ps value suggests that the subtree of the hominoid cluster is unreliable in the DPA gene. The instability of the tree can be detected by computing both Pb and Ps values, and this conclusion is similar to that of figure 3. Figure 3 shows the Pb and Ps values for an interior branch of MHC class II b genes. The Ps values of primate MHC class II b genes are mostly lower than the Pb values. This was true also with our previous results obtained by using mammalian MHC class II b genes (Katsura et al. 2017).

Results
We now compute the Pr values for the trees of three groups of the a and b chain genes, that is, the DP, DQ, and DR family genes in figures 2 and 3. Let us explain how to compute the Pr values by considering the subtree for the DP gene region genes. The Pr value for this subtree is obtained by computing the bootstrap probability of obtaining the topology of the DPA subtree in the subtree of DPB genes when one of the close outgroups is used. Let us designate the topology of DPA genes in figure 2 by T1, and the topology of DPB The number of bootstrap replications used for each subtree topology is assumed to be 1,000. The probability of obtaining a subtree topology is shown as a percentage for each of the three subtree topologies (85%, 10%, and 5%). The Ps value is the bootstrap value for the tree topology showing the highest bootstrap probability among the three subtree topologies (85%), and in the present case we have assumed that it is the subtree topology having ((A,B), C). In this study, we compute the reproducibility of topologies between the original and the duplicated gene families. The reproducibility value (Pr) is the average of the bootstrap probability (Pr1) showing T2 among the replicates of the subtree of T1 and the bootstrap probability (Pr2) showing T1 among the replications of the original tree of T2. This figure shows that Pr1 is 10%, and Pr2 is not shown. genes in figure 3 by T2. Here we have two ways of computing the reproducibility for a given topology of DP genes. One way of computation is to consider the reproducibility of topology T2 of DPB genes in the subtree for DPA genes. In this case, we compute the Pr value by considering how often topology T2 is obtained by bootstrapping the subtree of DPA genes with the closest outgroup. Let us call this reproducibility Pr1. The Pr value is also computable by considering the subtree of DPB genes for obtaining the topology T1 with the closest outgroup. We designate this bootstrap probability by Pr2. The Pr value is then obtained by the average value of Pr1 and Pr2. Using this way of computation, we also calculate Pr1, Pr2, and Pr for DQ and DR genes. For this computation, the program RESTA can be useful.
The Pr1, Pr2, and Pr thus obtained are presented in table 1A. The topologies of the a and b chains are slightly different in DR and DP genes. The Pr values for DR and DP genes therefore become low (0%). The topologies of the genes are not reproducible between the a and b chains. On the other hand, the Pr for DQ genes between the a and b chains is 37.4% and relatively higher than that for DR and DP genes because the topologies of DQA and DQB genes are exactly the same. However, the Pr for DQ genes was not 100% but the Pr2 is 66.8% and Pr1 is 8.0%. This relatively HumanDPA (1) ChimpDPA (2) GorillaDPA (3) RhesusMacaqueDPA (4) DrillDPA (5) BlackSnubNosedMonkeyDPA (6) MarmosetDPA (7) NightMonkeyDPA (8) GrayMouseLemurDPA (9) HumanDQA (10) ChimpDQA (11) GorillaDQA (12) RhesusMacaqueDQA (13) DrillDQA (14) BlackSnubNosedMonkeyDQA (15) MarmosetDQA (16) NightMonkeyDQA (17) GrayMouseLemurDQA (18) HumanDRA (19) GorillaDRA (20) ChimpDRA (21) RhesusMacaqueDRA (22) DrillDRA (23) BlackSnubNosedMonkeyDRA (24) MarmosetDRA (25) NightMonkeyDRA (26) GrayMouseLemurDRA ( (Saitou and Nei 1987;Yoshida and Nei 2016). The number of nucleotides used was 939 bp per sequence. The Pb value is shown for each interior branch, and the number of bootstrap replications used was 1,000. The Ps value was also obtained by the 1,000 replications for each gene family using the RESTA program (Katsura et al. 2017) and is shown as a bold and italic number below the Pb value for each relevant interior branch. The aligned sequences are in Supplementary Material online. low Pr1 value is caused by the instability of the subtree in DQA gene. The Pb is 100% for the subtree of DQA genes (sequences 10-18), but the Ps is only 8% for the subtree (fig. 2). The low Ps value makes it hard for the subtrees in DQA genes to reproduce the topologies of DQB genes. Our results show that the Pr values become high if the topologies of the duplicated gene families are exactly the same. Furthermore, if trees of the duplicated gene families are highly reliable in terms of both Ps and Pb values, Pr values can become higher.

Discussion
We quantified the reproducibility (Pr) between the a and b chain genes of MHC class II genes. This could be done because MHC class II genes function as heterodimers of a and b chains, and we could identify the a and b chain genes as the duplicated gene families. Grimholt (2016), who studied the immune system in fish extensively, believes that the class II immune system originated in teleost fish about 250 Ma. Figures 2 and 3 show that the gene duplication events in the DP, DQ, and DR gene regions occurred more recently, so that all of them include the a and b chain genes. It is believed that the duplication of MHC class II a and b chain genes originated from the whole genome duplication event in the teleost ancestor (Grimholt 2016). Soon after a genome was duplicated, the topology of the duplicated gene families can be identical with that of the ancestral gene family. Then, HumanDPB (1) ChimpDPB (2) GorillaDPB (3) RhesusMacaqueDPB (4) BlackSnubNosedMonkeyDPB (5) DrillDPB (6) MarmosetDPB (7) NightMonkeyDPB (8) GrayMouseLemurDPB (9) HumanDQB (10) ChimpDQB (11) GorillaDQB (12) RhesusMacaqueDQB (13) DrillDQB (14) BlackSnubNosedMonkeyDQB (15) MarmosetDQB (16) NightMonkeyDQB (17) GrayMouseLemurDQB (18) HumanDRB (19) ChimpDRB (20) GorillaDRB (21) RhesusMacaqueDRB (22) BlackSnubNosedMonkeyDRB (23) DrillDRB (24) MarmosetDRB (25) NightMonkeyDRB (26) GrayMouseLemurDRB ( Pr values between the duplicated gene families would be high. Actually, however, the Pr values between the a and b chain genes are low in DP and DR gene regions since the topologies of the a and b gene families are different. We computed the Pr value for all pairs of the DP-DQ, DP-DR, and DQ-DR (table 1B). The Pr values are low (0.0-2.7%) in three pairs of the genes (DPB-DQA, DPB-DQB, and DRB-DPA), and the topologies of these duplicated genes are different at two interior branches for each pair (table 1B). Although the Pr values are very rarely 0% and low (1.4-22.1) in the other nine pairs of the gene families showing the topological difference at one interior branch (table 1B). The Pr values became lower depending on the difference of topologies in the trees of the duplicated and original gene families, but the Pr values are relatively low in any pair. Britten and Davidson (1969) showed that higher animals generally have a higher genome size, and this continuous increase of genome size has made it possible that higher organisms have complex phenotypic characters. Nei (1969) suggested that this continuous increase in genome size was possible if the majority of the increase occurred by means of unequal crossing over. However, the increase in genome size was also possible if the gene segregation of autotetraploids became diploidized (Ohno 1970, pp. 101-104). Scannell et al. (2006) showed that this type of diploidization indeed occurred in yeast. In this paper, the duplicated a and b chain genes maintained similar functions, and therefore it was easy to identify the a and b chains as the duplicate genes.
Following the accumulation of mutations, insertions, and deletions of the genes, the sequences of the duplicated genes diverged from those of the original genes (Nei 1969(Nei , 2013. The low Pr values of the primate a and b chain genes support the occurrence of diploidization after genome duplication. The Pr computation using duplicate genes can quantify the extent of diploidization in tetraploids in addition to the topological similarity of the gene families. In this study, we calculated the Pr for MHC class II a and b chain genes, but the computation of the Pr can be done in any duplicated gene families ( fig. 1).
Associate editor: George Zhang