Evolutionary Changes in Vertebrate Genome Signatures with Special Focus on Coelacanth

With a remarkable increase in genomic sequence data of a wide range of species, novel tools are needed for comprehensive analyses of the big sequence data. Self-organizing map (SOM) is a powerful tool for clustering high-dimensional data on one plane. For oligonucleotide compositions handled as high-dimensional data, we have previously modified the conventional SOM for genome informatics: BLSOM. In the present study, we constructed BLSOMs for oligonucleotide compositions in fragment sequences (e.g. 100 kb) from a wide range of vertebrates, including coelacanth, and found that the sequences were clustered primarily according to species without species information. As one of the nearest living relatives of tetrapod ancestors, coelacanth is believed to provide access to the phenotypic and genomic transitions leading to the emergence of tetrapods. The characteristic oligonucleotide composition found for coelacanth was connected with the lowest dinucleotide CG occurrence (i.e. the highest CG suppression) among fishes, which was rather equivalent to that of tetrapods. This evident CG suppression in coelacanth should reflect molecular evolutionary processes of epigenetic systems including DNA methylation during vertebrate evolution. Sequence of a de novo DNA methylase (Dntm3a) of coelacanth was found to be more closely related to that of tetrapods than that of other fishes.


Introduction
Oligonucleotide composition varies significantly among species even with the same genome GþC% and designated as a 'genome signature', which characterizes the genomic sequences of individual species. 1 -3 Kohonen's self-organizing map (SOM) is a powerful tool for clustering high-dimensional complex data on one map. 4,5 For codon and oligonucleotide compositions handled as high-dimensional data, we have modified the conventional SOM to BLSOM. 6,7 BLSOMs for oligonucleotide compositions can classify genomic fragments (e.g. 100 kb) from a wide range of species according to species, without species information during the BLSOM calculation. 8,9 BLSOM can also visualize diagnostic oligonucleotides responsible for the species-specific clustering, supporting an efficient knowledge discovery from a big sequence data.
The present paper studies vertebrate genome signatures and their evolutionary changes with special focus on coelacanth. Coelacanth is one of the oldest known living Sarcopterygii and of the nearest living relatives of tetrapods. 10 Therefore, analysis of this genome is believed to provide information concerning molecular evolutionary processes during the transition from fishes to tetrapods, and the genome of African coelacanth (Latimeria chalumnae) has recently been sequenced. 11,12 Oligonucleotide BLSOM for genome sequences from a wide range of vertebrates, including the coelacanth, shows that CG suppression (i.e. underrepresentation of CG dinucleotide) is the highest among fishes and equivalent to that of tetrapods. The evident CG suppression in the mammalian genomes has been connected with the methylation at C in CG dinucleotide, which is a well-characterized epigenetic maker. Therefore, phylogenetic tree analyses on DNA methylases of fishes and tetrapods have been conducted. The classification and visualization power of the oligonucleotide BLSOM is very high and can efficiently unveil a wide range of genome characteristics from massive amounts of genomic sequences without prior knowledge, and therefore, this unsupervised data-mining method is a powerful and timely tool useful for genome studies in the era of big sequence data obtained from high-throughput DNA sequencers.

BLSOM
SOM is an unsupervised clustering algorithm that non-linearly maps high-dimensional vectorial data onto a two-dimensional array of lattice points. 4,5 We previously modified the conventional SOM for genome informatics to make the learning process and resulting map independent of the order of data input, on the basis of batch-learning SOM (BLSOM). 6,7 The initial weight vectors were defined by principal component analysis instead of random values, and BLSOM learning for oligonucleotide composition was conducted as described previously. 8 The map size was chosen assuming that the average number of data per lattice point was 10. Diagnostic oligonucleotides for species-dependent separation were visualized as described previously. 8 The BLSOM programme was obtained from takaabe@ie.niigata-u.ac.jp.

Sequences
Coelacanth genome sequences were obtained from http://coelacanth.nig.ac.jp, and other genome sequences were obtained from the UCSC ftp site (http://www.ncbi. nlm.nih.gov/genomes/). When the number of undetermined nucleotides (Ns) in a fragment sequence (e.g. 100 kb) exceeded 20% of the sequence, the sequence was omitted from the analysis. In the case where the number of Ns was ,20%, the oligonucleotide frequencies were normalized to the length without Ns and included in the analysis. Amino acid sequences of coelacanth Dnmt3a, Dnmt3b and Dnmt1 (GENE ID: R03136, A01856 and A13991) were obtained from the Coelacanth Genome Project (http://coelacanth. nig.ac.jp/).

BLSOM for six fish genomes
To compare the coelacanth's genome signature with that of other fishes and to study molecular evolutionary mechanisms for the establishment of their signatures, we constructed BLSOMs with di-, tri-and tetranucleotide compositions in all 100-kb sequence fragments from six fish genomes, including the African coelacanth genome. The fishes other than coelacanth were chosen, since a major portion of their genomes was sequenced and registered in the International DNA Sequence Databanks (DDBJ/EMBL/GenBank) as complete or draft genome sequences. In the Databanks, only one strand of complementary sequences is registered, and the strand is chosen rather arbitrarily in the registration of fragment sequences (e.g. draft genome sequences). Our previous BLSOM analysis of a wide range of species 7 revealed that sequences from a single genome often gave a mirror-symmetrical split on BLSOM in the vertical direction, e.g. according to the replicational direction of genomic fragments. 13 When investigating the general characteristics of genome sequences such as genome signatures, differences between two complementary strands are not important. Furthermore, the obtained map should not be affected by the choice of strands registered to the Databanks. Therefore, we constructed a BLSOM, for which the frequencies of a pair of complementary oligonucleotides in each fragment were summed up (e.g. AAC and GTT; abbreviated as AACþGTT). 8 The BLSOM for this degenerate set of a pair of complementary di-, tri-and tetranucleotides is abbreviated as DegDi, DegTri and DegTetra. The DegTri is listed in Fig. 1A; other two are listed in Supplementary Fig. S1A and B. Lattice points containing sequences from a single species are coloured to indicate the species, whereas those containing sequences from multiple species are indicated in black. With no information concerning species during BLSOM calculation, species-specific clustering (self-organization) of the fragment sequences is clear; i.e. sequences from individual species have formed their own territories. The number of black lattice points on DegTri and DegTetra is smaller than on DegDi, showing their higher clustering power; for details of their differences (Fig. 1A). One of the basal characteristics of BLSOM separation for genomic sequence has been attributed previously to the GþC%. 7,9,14 Sequences with higher or lower GþC% are located on the left or right side of the map (Supplementary Figs S1Aii and S4Aii), showing that the GþC% is reflected primarily in the horizontal direction.
BLSOM can visualize diagnostic oligonucleotides responsible for species-specific clustering (self-organization) and provides information concerning possible molecular evolutionary mechanisms for establishing the present genome signatures. For this visualization,

460
Genome Signature of Coelacanth [Vol. 21, we have previously developed a method to unveil diagnostic oligonucleotides in a way unaffected by a simple difference in GþC% of genomic sequences. 8 After calculating the frequency of each oligonucleotide expected from the mononucleotide composition for each lattice point, the observed/expected ratio is indicated in red (over-represented) or blue (under-represented) ( Fig. 1Aii and Supplementary Fig. S1Aiii and Bii). We focus on the oligonucleotides whose occurrence in coelacanth (red in Fig. 1Ai) differs from that in other fishes, i.e. the oligonucleotides, for which the transition between red and blue coincides with the border between the coelacanth's and the other fishes' territories in Fig. 1Aii. CCCþGGG is over-represented (red) only in the coelacanth territory; and CG-possessing trinucleotides are more under-represented in the coelacanth territory (dark blue) than in the other fishes' territories ( pale blue). Importantly, all other CG-possessing trinucleotides ( Supplementary Fig.  S2A), as well as CG ( Supplementary Fig. S1Aiii) and all CG-possessing tetranucleotides ( Supplementary Fig.  S2B), are more under-represented in the coelacanth territory (dark blue) than in the other fishes' territories ( pale blue). This shows higher CG suppression in coelacanth than in other fishes; a numerical analysis of the CG suppression will be shown later.

BLSOM for a broad phylogenetic range of vertebrate genomes
Oligonucleotide BLSOM for fish genomes shows that a notable characteristic of coelacanth distinct from other fishes is the highest CG suppression. Importantly, this finding can be obtained with no prior knowledge, showing a prominent ability of this unsupervised datamining method in knowledge discovery. Once this type of information is obtained, we can focus on detailed research of the respective genome characteristics. A major mechanism for the evident CG suppression in the mammalian genomes has been connected with methylation at C in CG dinucleotide, which is followed by spontaneous deamination. As a result, CG steadily mutates to TG, resulting in CG suppression and in CA and TG over-representation. In the case of fish genomes, by analysing a wide range of animal genomes from various aspects, Simmen 15 has found that their CG suppression is also accountable in terms of 5-methylcytosine mutation; in the study, the author has proposed the following four criteria which support the 5-methylcytosine mutation. (i) Genome-level CG suppression. (ii) Genome-level TG and CA excess. (iii) Positive correlation between the local CG relative abundance (i.e. the observed/expected ratio of CG) and the local GþC% level. This is because regions of high GþC content possess a higher DNA melting temperature than low GþC content regions, and therefore, these are less susceptible to strand separation, resulting in a lower rate of deamination and a lesser degree of CG suppression. (iv) Negative correlation between local relative abundance of CG and TG (CA). We have found that all four criteria are applicable to coelacanth (Supplementary Fig. S3), showing that a major mechanism for the CG suppression in coelacanth is related to the 5-methylcytosine mutation, as previously reported for other fishes. 15 Methylated C is a well-characterized epigenetic marker, and therefore, the highest CG suppression in coelacanth among fishes should be of particular interest in terms of evolutionary changes in epigenetic systems. We next included five additional vertebrate genomes covering a broad phylogenetic range of tetrapods, for which complete or draft genome sequences were available. On DegTri and DegTetra (Figs 1B and  2A), as well as DegDi ( Supplementary Fig. S1C), for 100-kb sequences from the 11 vertebrate genomes, species-specific clustering is apparent. In Figs 1Bii and 2Aii, we mark the lattice points corresponding only to fish territories, which are located primarily in the lower part of the map.
The genomes of warm-blooded vertebrates are known to be composed of a long-range segmental GþC% distribution designated as 'isochore', which has been connected with chromosomal bands. 16 -20 A single genome, especially of a warm-blooded vertebrate, has multiple territories on DegTri [e.g. four chicken (Gallus gallus) territories marked in pink in Fig. 1Bi]. As previously mentioned, sequences with high or low GþC% are located on the left or right side of the map, and therefore, the split into several sub-territories in the horizontal direction should most likely relate to isochore structures.
On DegTri and DegTetra, the suppression of CG-possessing oligonucleotides in coelacanth and tetrapods and the high occurrence of CCþGG-possessing oligonucleotides in coelacanth and most tetrapods except chicken are clear (Figs 1Biii and 2Aiii); other examples of diagnostic oligonucleotides for phylotype-dependent clustering are listed in Supplementary Fig. S4. It should also be noted that the oligonucleotides diagnostic for one species are primarily common among the sub-territories of the species, e.g. the characteristic over-representation of CCCþGGG in the tetrapods other than chicken is observed regardless of intragenomic GþC% differences such as isochores of chicken (Fig. 1Biii). In addition to the characteristic occurrence of CG-and CCþGG-possessing tetranucleotides, phylotype-dependent characteristics that cannot be explained by the characteristics of the dinucleotide composition are observed on DegTetra, e.g. underrepresentation of GTCAþTGAC in coelacanth and tetrapods (Fig. 2Aiii). Such oligonucleotides' biological significance will be discussed later.

Distinction between unique and repeat sequences
Higher eukaryote genomes are composed of unique and repeat sequences with different biological significances. Since repeat sequences usually have peculiar oligonucleotide composition, the composition in repeat sequences might be a major causal factor for genome signatures detected by the present oligonucleotide BLSOMs. This possibility is tested as follows. In the UCSC Genome Browser, repeat sequences identified by RepeatMasker and Tandem Repeats Finder (with a period of 12) are specified in lower-case letters to distinguish from unique sequences specified in uppercase letters. We thus separately constructed oligonucleotide BLSOMs for repeat and unique sequences of the 11 vertebrates; we first concatenated repeat and unique sequences separately and divided these artificially concatenated sequences into 100-kb fragments, followed by counting oligonucleotide composition. DegTri for unique sequences shows the clear species-specific separation (Fig. 2B), proving that their genome signatures are not the peculiar characteristics of oligonucleotide compositions in repeat sequences; the same finding was obtained on DegDi and DegTetra (data not shown). In Fig. 2Ci, unique plus repeat sequences have been analysed on one DegTri; using the same map, unique and repeat sequences are separately marked (Fig. 2Cii and iii). Clear separation is observed among species and between unique and repeat sequences, showing differential oligonucleotide compositions among these different categories.

CG suppression in a wide phylogenetic range
of vertebrates A difference in CG suppression observed among different lineages of vertebrates may relate to the differential features in epigenetic systems. 21,22 Considering the fact that biologically important genome characteristics tend to be maintained stably during evolution, we analysed the CG suppression of many more vertebrates than those analysed with BLSOMs, because CG suppression can be analysed even if their contiguous sequences are rather short (e.g. shorter than 100 kb). Specifically,

462
Genome Signature of Coelacanth [Vol. 21, we analysed CG suppression for 44 vertebrate genomes covering a broad phylogenetic range. The CG/GC ratio (often called the CpG/GpC ratio) has long been used to examine CG suppression for individual species, because this ratio is not affected by the genome GþC% difference. Strictly speaking, however, this ratio should be affected by the possible species-specific GC dinucleotide occurrence. Therefore, the observed/ expected ratio of CG (i.e. normalization with the occurrence level expected from the mononucleotide composition) was analysed (Fig. 3A) Fig. S5A). Supplementary  Fig. S5A also shows that the evident CG suppression of coelacanth among fishes is observed in both unique and repeat sequences. Supplementary Fig. S5B and C shows that CA/AC and TG/GT ratios of coelacanth are higher than those of other fishes and equivalent to most of tetrapods, as expected from the mechanism responsible for the CG suppression; CA and TG overrepresentation. CpG islands, which belong to the unique sequence category, have been connected with transcriptional regulatory mechanisms. Figure 4 analysed the CG level in the sequences within and outside CpG islands, which were identified by 'newcpgreport' (http://emboss. sourceforge.net/apps/cvs/emboss/apps/newcpgreport. htm). The CG level outside CpG islands varies significantly among fishes and the coelacanth has the lowest CG level, as reflecting its global genome characteristics. In contrast, the CG level within CpG islands is rather similar among fishes including coelacanth. This level varies significantly among tetrapods; the level of lizard is similar to that of coelacanth, but those of chicken, mouse and human are clearly higher than that of coelacanth. This difference observed even between closely related species may reflect the difference in actual detailed molecular mechanisms in the transcriptional regulation. The higher CG suppression outside CpG islands (Fig. 4B) is consistent with the hypothesis that hypermutation of methylated C is the main cause of higher CG suppression in coelacanth, because methylation of C generally occurs outside CpG islands.  Fig. 1Bi and ii. (Aiii) Examples of tetranucleotides whose occurrence differs between coelacanth and other fishes. Lattice points are marked as described in Fig. 1Aii. (B) DegTri for 100-kb unique sequences from 11 vertebrate genomes. DegTri for repeat sequences also has shown the speciesspecific separation (data not shown).

DNA methylase genes
The CG occurrence in lamprey (Petromyzon marinus) is the highest (0.78) and rather similar to a level of the very slightly methylated genome of drosophila (Drosophila melanogaster), 23,24 but significantly differs from medaka (Oryzias latipes; 0.46) (Fig. 3A). This wide range of CG suppression among fishes indicates that various changes in DNA-methylation and -demethylation systems have occurred during the fish lineage evolution. Taking this into account, we next consider the possible molecular mechanisms responsible for the highest CG suppression of coelacanth among fishes. Since CG suppression in vertebrate genomes is believed to relate to methylation at CG dinucleotide, we first compared de novo DNA methylases found in the coelacanth genome with those of other fishes and tetrapods; de novo DNA methylases are responsible for the new addition of a methyl group to DNA. As opposed to the two catalytically active de novo DNA methylase genes (dnmt3a and dnmt3b) of mammals, zebrafish (Danio rerio) is known to have six de novo DNA methylase genes and lacks dnmt3l which is catalytically inactive but required for the establishment of genomic imprinting. 25,26 It is also known that the expression pattern of zebrafish dnmt6 and dnmt8, which more closely resemble mammalian dnmt3a in amino acid sequence, more closely resembles that of mammalian dnmt3a than dnmt3b, but the expression pattern of the other four zebrafish genes (dnmt3, dnmt4, dnmt5  and dnmt7) resembles the pattern of mammalian dnmt3b, i.e. dnmt6 and dnmt8 have been shown to more ubiquitously express in developing and adult tissues than the other four zebrafish genes. 27 In the coelacanth genome, we can find two dnmt3 genes (dnmt3a and dnmt3b) as observed for mammals but no dnmt3l as observed for zebrafish. Using multiple sequence alignment (MUSCLE) in Mega6, 28 phylogenetic analyses of coelacanth Dnmt3a and Dnmt3b were conducted at a protein-sequence level (Fig. 5 and Supplementary Fig. S6). All three programmes (maximum likelihood, minimum evolution and neighbour-joining) show that the coelacanth Dnmt3a is clustered with tetrapods' Dnmt3a, but not with other fishes' Dnmt3a. In the case of Dnmt3b, maximum likelihood shows that the coelacanth Dnmt3b is clustered with tetrapods' Dnmt3b as observed for Dnmt3a (Fig. 5), but minimum evolution and neighbour-joining show it to be outside of tetrapods' and other fishes' clusters (Supplementary Fig. S6). In the case of Dnmt1, which preferentially methylates hemimethylated DNA for maintaining the methylation pattern in the newly synthesized strand, all three programmes show the coelacanth Dnmt1 to be outside of tetrapods' and other fishes' clusters (Fig. 6).
The difference found among these three methylases, which have differential biological functions, may indicate the differential level of contribution to the CG suppression during evolution; the ubiquitously expressed de novo methylase Dnmt3a may have contributed most significantly to the evident CG suppression in coelacanth. For the adequate understanding of CG suppression, however, additional molecular mechanisms such as RNA-directed DNA methylation 29 and DNAdemethylation systems have to be considered. Since the CG-suppression level has stably been conserved phylogenetically ( Fig. 3 and Supplementary Fig. S5), analyses of this level from various aspects may provide novel knowledge about fundamental biological mechanisms related to epigenetics.

Oligonucleotides unrelated to CG
If we know in advance that a fundamental characteristic of the coelacanth genome distinct from other fishes is related to CG suppression, a simple analysis of dinucleotide composition should be enough to prove this characteristic. However, when we have not such background knowledge, an unsupervised data-mining method such as the present BLSOM becomes valuable because it can unveil a wide range of characteristics hidden in the genome sequence. Actually, the occurrence of various CG-non-possessing oligonucleotides in coelacanth was found distinct from that in other fishes, but equivalent to that in almost all tetrapods. A clear example is the over-representation of (CCþGG)possessing tri-and tetranucleotides on DegTri and DegTetra (Figs 1Biii and 2Aiii). Figure 3B presents the   Fig. S7A and B presents the results for CCCþGGG and CCCCþGGGG. As observed on BLSOMs ( Fig. 1Aii and Biii, and Supplementary Fig. S1Aiii, Bii and Ciii), their occurrences in coelacanth are clearly higher than in other fishes. The same is true for most tetrapods, including Marsupialia but excluding chicken and turkey (Meleagris gallopavo). It should also be noted that the occurrence of CCCCþ GGGG in coelacanth is the highest among the vertebrates analysed ( Supplementary Fig. S7B).

Discussion
The genome characteristics maintained stably in a certain phylogenetic lineage should reflect its fundamental biological significance. An apparent causative factor for the genome signature is the context-dependent DNA mutation and repair mechanisms. It should also be mentioned that the oligonucleotides longer than tetranucleotides often represent motif sequences responsible for sequence-specific protein binding (e.g. transcription factor binding). The occurrence of such motif oligonucleotides should differ from the level expected from the mononucleotide composition in the respective genome and may differ among genomic portions within one genome. We have recently found that DegPenta-and DegHexa-BLSOMs for the human genome can effectively detect characteristic enrichment of many transcription factor-binding motifs in pericentric heterochromatin regions. 30 The present study has found evident CG suppression and CCþGG over-representation in coelacanth and tetrapods, by using DegDi, DegTri and DegTetra. If the BLSOM analysis extends to oligonucleotides longer than tetranucleotides, we may assign in silico the oligonucleotides with important biological functions such as transcription factor-binding in the genome of coelacanth and other species, for which experimental studies other than DNA sequencing have been poorly done.
BLSOM can analyse big sequence data such as millions of sequences simultaneously 31 and, therefore, can analyse genome sequences from a wide range of species and visualize characteristics of their oligonucleotide composition on a single map. This can support an efficient knowledge discovery from the big data with no prior knowledge. Pointing out the oligonucleotides whose occurrence is peculiar commonly (and thus stably) in a certain lineage should provide information concerning biological roles of these oligonucleotides in the lineage. In conclusion, BLSOM is a powerful and timely tool useful for genome studies in the era of big sequence data obtained from high-throughput DNA sequencers.