Advances in genomics of bony fish

In this review, we present an overview of recent advances in genomic technologies applied to studies of fish species belonging to the superclass of Osteichthyes (bony fish), with a major emphasis on the infraclass of Teleostei, also called teleosts. This superclass, which represents more than 50% of all known vertebrate species, has gained considerable attention from genome researchers in the last decade. We discuss many examples that demonstrate that this highly deserved attention is currently leading to new opportunities for answering important biological questions on gene function and evolutionary processes. In addition to giving an overview of the technologies that have been applied for studying various fish species, we put the recent advances in genome research on the model species zebrafish and medaka in the context of their impact on studies of all fish of the superclass of Osteichthyes. We thereby want to illustrate how the combined value of research on model species together with a broad perspective on all bony fish species will have a huge impact on research in all fields of fundamental science and will speed up applications in many societally important areas such as the development of new medicines, toxicology test systems, environmental sensing systems and sustainable aquaculture strategies.


INTRODUCTION
In recent years there have been tremendous advances in genomic studies of many vertebrate species. In these studies the attention to various representatives of the bony fish species (the superclass of Osteichthyes) has been increasing enormously, especially focussing on the infraclass of Teleostei, which represents approximately 96% of the species of this superclass. This increase in attention is partly the result of the fact that this superclass, with about 27 000 living species, represents more than 50% of all known vertebrate species [1][2][3][4]. In our opinion, it also reflects the trend that fundamental and applied scientific interests in the genomics of bony fish are now converging. On the one hand, fish species such as zebrafish and medaka have clearly shown their broad applicability for studies of fundamental processes underlying development and disease. The tremendous attention these fish species have obtained for an extensive range of fundamental and applied research purposes has earned them the qualification of model fish species. On the other hand, the economic value of the bony fish as food resources coincides with their applicability for biomedical applications and toxicology studies. Together, these fundamental and applied scientific purposes have made it possible for the most advanced genomics technologies to be used for studies of many bony fish species, ranging from the model fish species zebrafish and medaka to 'living fossils' such as the coelacanths and the freshwater eels [5][6][7][8][9][10][11]. The freshwater eels have only recently been termed living fossils since apparently they have retained most of the genome duplication that occurred after the radiation of the bony fish from the common ancestor with the mammals. This example shows that these studies are already giving an unprecedented insight into the evolution of all bony fish species.
The teleost species are extremely interesting for evolutionary studies because they are widespread in an incredible range of aquatic microenvironments, ranging from the deepest levels of the oceans to caves completely devoid of light, and even environments that do not contain any water for most of the year. This has led to remarkable adaptations to life at extreme conditions, as exemplified by the tilapia species that can survive at 44 °C at very high salinity, Antarctic toothfish that can thrive at temperatures below 0 °C and deep sea fish such as those from the genus Coryphaenoides that can withstand pressures of more than 60 MPa [2,12]. This has made bony fish species very attractive for studies on the effects of adverse conditions such as high gravity that are applicable to space travel research [13][14][15], or the absence of light that has important implications for studies of circadian rhythm in adults and embryonic stages [16][17][18][19][20]. On the other hand, the response of many bony fish species such as trouts and minnows to toxic compounds is very similar to that in humans. Therefore, these fish have been extensively used for toxicology research for many decades [21][22][23][24], and recently this attention has been extended to the model fish species zebrafish and medaka [25][26][27][28][29][30][31][32]. In this review, we will first give an overview of the genome sequencing and assembly technologies that have been most popular for studying the bony fish, and of the near-future possibilities that have yet to gain in importance. Secondly, we will discuss the impact of fundamental and applied research on model fish species with special attention to the current status of genome sequencing and the impact for further genomic studies. Thirdly, we will give an overview of the advances in genomics of non-model bony fish species.
Finally, we will discuss the predicted impact of bony fish genomics on biomedical and aquacultural applications and their importance for future evolutionary studies in a broader perspective than the bony fish.

COMPARISON OF SEQUENCING PLATFORMS
Over the past 8 years a number of so-called next-generation sequencing platforms have hit the market. They are all based on parallel sequencing of immobilized targets and have revolutionized the genomics field by generating an abundance of sequencing data. Several different sequencing strategies are employed by these platforms, and each of them has its own characteristics. Here we will briefly discuss some of the more popular platforms that are widely used in fish genomics today. An overview of several characteristics of these platforms is shown in Table 1.
There are now four companies that together dominate the market. Roche (454 GS FLX) and Life Technologies (Ion Torrent machines) both developed systems that read the DNA sequence by flowing one nucleotide species at a time over the template: the 454 GS FLX uses pyrosequencing, whereas the Ion Torrent machines detect the hydrogen ions released on nucleotide incorporation. Although these techniques are fast, they have problems reading through homopolymers. The read length on the Ion Torrent machine does not match that of the 454 GS FLX but is likely to increase as new chips and chemistry become available.
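The homopolymer problem mentioned above can be made concrete with a small sketch. In flow-based chemistries a run of identical bases is read as a single signal whose intensity has to be converted into a run length, so long runs are the positions where insertion and deletion errors are most likely. The Python function below simply locates such runs in a read; the function name `homopolymer_runs` and the threshold `min_len=4` are illustrative assumptions, not part of any platform's software.

```python
from itertools import groupby


def homopolymer_runs(read, min_len=4):
    """Return (start, base, length) for every run of identical bases
    that is at least min_len long. min_len=4 is an arbitrary,
    illustrative threshold, not a platform specification."""
    runs = []
    pos = 0
    for base, group in groupby(read.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


if __name__ == "__main__":
    # Hypothetical read with a 6-base T run and a 4-base A run.
    print(homopolymer_runs("ACGTTTTTTGCAAAACCG"))
```

Flagging these positions before variant calling is one simple way to treat indels inside long runs with more caution than substitutions elsewhere in the read.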
Next to their Ion Torrent machines Life Technologies also has the SOLiD platform in its portfolio. This platform is more comparable in terms of throughput and costs per base to the Illumina platform. Whereas SOLiD employs a ligation system with dibase tags, Illumina's HiSeq and MiSeq use a process called sequencing by synthesis (SBS). This SBS technology has already been on the market for a few years now and lately the development of this technology has mainly resulted in longer read length and not so much in more reads per flow-cell.
All these machines need clonal copies of the DNA molecule to obtain enough signal for reliable base calling. The amplification step needed to obtain these copies can be a source of bias in the sequence data and information about DNA modifications is lost. An altogether different system is used by the PacBio RS II from Pacific Biosciences. In this machine strand synthesis is followed on single DNA molecules. Although this produces reads spanning several kilobases the raw error rate is high due to the nature of imaging single molecules. Since no amplification is needed it has the benefit that DNA modifications can also be detected and there is no bias in the sequence data.
Different applications, such as de novo genome sequencing, resequencing and transcriptome sequencing, place different demands on the sequencing platform, and these demands influence the choice of platform. For de novo genome sequencing it is important to have even coverage in all regions and to have a low error rate. To facilitate assembly the read length should be as long as possible. The combined use of the Illumina HiSeq and PacBio RS platforms is best suited for this type of application. When sequencing a transcriptome a high throughput is desirable but read length is a less important factor.
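As a rough illustration of how throughput and read length translate into coverage, the sketch below uses the classical Lander-Waterman approximation: the mean coverage is c = N x L / G for N reads of length L on a genome of size G, and, if reads are assumed to land uniformly at random, the expected fraction of bases left uncovered is exp(-c). The uniformity assumption is exactly what amplification bias violates, so these numbers should be read as optimistic estimates. All function names and the example figures are hypothetical.

```python
import math


def expected_coverage(n_reads, read_length, genome_size):
    """Mean per-base coverage c = N * L / G."""
    return n_reads * read_length / genome_size


def fraction_uncovered(coverage):
    """Expected fraction of bases with zero reads, assuming reads
    start uniformly at random (Poisson model)."""
    return math.exp(-coverage)


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical example: 1 Gb genome, 100 bp reads, 300 million reads.
    c = expected_coverage(300_000_000, 100, 1_000_000_000)
    print(c, fraction_uncovered(c))
```

The same arithmetic shows why longer reads help: for a fixed number of reads, doubling the read length doubles the mean coverage.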
In the coming years we can expect a further drop in cost/Mb driven by ongoing development of the current technologies and the introduction of new sequencing technologies like sequencing using nanopores. This will result in tools that will make de novo genome sequencing and resequencing even more efficient and easier.
The sequencing endeavours of non-model fish species are increasingly based on whole genome shotgun sequencing (WGS). This kind of sequence data is still inferior in coverage to map-based sequence data, for instance based on BAC sequencing. This is notwithstanding the fact that even in the absence of large scaffolded WGS data sets it is still possible to obtain highly valuable complete exome predictions that also make use of transcriptome data sets and improved gene prediction models.
However, chromosomal areas with many repetitive sequences in particular will be poorly covered by WGS assemblies. Furthermore, for polyploid species it will be very difficult to obtain a reliable estimate of the coverage of the entire genome. The bioinformatics needed for scaffolding of WGS data is still in the development stage. In Table 2, we present an overview of the software that has been used for de novo assembly and scaffolding of WGS data. It can be argued that in the future the technologies mentioned above will improve to such an extent that the disadvantages of WGS will become less pronounced. For instance, when PacBio read lengths and coverage increase further, this platform could be used to obtain larger scaffolds even for difficult areas of a WGS assembly. This was recently demonstrated by sequencing the genome of the Arabidopsis Ler-0 mutant solely using the PacBio RS II platform (data available from github.com/PacificBiosciences/DevNet/wiki/Datasets).
It should also be mentioned that alternative methods to BAC sequencing have been developed that are highly applicable to obtaining genetic maps of fish species. To obtain a genetic map of an organism restriction associated DNA (RAD) tag sequencing can be employed as demonstrated for the spotted gar [53], the threespine stickleback [54] and the Xiphophorus sequencing projects [43]. This method uses next-generation sequencing to map sequence variants in the neighbourhood of restriction sites in the offspring from a cross. From the inheritance of the variants a high-density genetic linkage map can be constructed. This map can then be used to align scaffolds in higher order structures. More recently optical mapping of nicking sites on the genome in nanochannel arrays has also been employed to create a high-density genome map that can be used to order contigs and scaffolds [55].
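The step from inherited variants to a linkage map can be sketched as follows. For each pair of markers, the recombination fraction r is estimated as the proportion of offspring in which the two markers carry alleles from different grandparental haplotypes, and r is then converted to a map distance. The sketch below uses Haldane's mapping function, d = -50 ln(1 - 2r) centimorgans, which assumes no crossover interference. The 0/1 genotype encoding, the use of `None` for missing calls and all function names are illustrative assumptions; the cited RAD-tag studies may well have used different scoring schemes and mapping functions.

```python
import math


def recombination_fraction(marker_a, marker_b):
    """Fraction of offspring in which the two markers differ in their
    parental origin. Each list holds one entry per offspring:
    0 or 1 for the inherited haplotype, None for a missing call."""
    pairs = [(a, b) for a, b in zip(marker_a, marker_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no offspring genotyped at both markers")
    recombinants = sum(1 for a, b in pairs if a != b)
    return recombinants / len(pairs)


def haldane_cm(r):
    """Map distance in centimorgans from recombination fraction r,
    using Haldane's function (assumes no crossover interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5) for linked markers")
    return -50.0 * math.log(1.0 - 2.0 * r)


if __name__ == "__main__":
    # Hypothetical genotypes of ten offspring at two RAD markers.
    m1 = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    m2 = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
    r = recombination_fraction(m1, m2)
    print(r, haldane_cm(r))
```

Markers with small pairwise distances are grouped into linkage groups and ordered, and scaffolds carrying those markers can then be placed along the resulting map.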

GENOMICS IN MODEL FISH SPECIES
The most frequently studied fish species are zebrafish (Danio rerio) and medaka (Oryzias latipes). Although the zebrafish is currently the most often used research model, the use of medaka has particular advantages, and the importance of the availability of two genomically well-characterized models for comparative purposes and tool development should not be underestimated [5,56,57]. For instance, the use of the Tol2 transposon from medaka in the zebrafish, where this transposon does not occur, is the basis for the most successful transgenesis protocols in zebrafish [58]. As a result of the combined efforts of a very large number of research groups these fish species have now established themselves in every field of biology, have also propagated the use of fish species for chemical, physical and mathematical studies [59][60][61], and have therefore earned the name model fish species. Although historically these models earned their fame through their contribution to large forward genetic screens linked to vertebrate developmental studies [62], in recent years they have also been extensively used for biomedical applications, and there are already several examples of medicines in clinical trials that were originally developed in zebrafish models. These studies have shown that research in model fish species can greatly speed up the discovery of new medicines [63][64][65][66]. Model fish species are also increasingly used for comparative studies in experiments with other fish species that are of importance for aquaculture, e.g. as a model for the effects of swimming exercise on muscle development [67]. Conversely, species that are very important in aquaculture, such as rainbow trout and common carp (Cyprinus carpio), have been shown to have benefits for fundamental research. Research with the latter species is especially relevant to biomedical studies in the very closely related zebrafish owing to its large body size, the availability of highly inbred lines and a very large spawn size that offers possibilities for high-throughput screening [41,68].
From a genomics perspective the zebrafish is now the most advanced model, in that the sequencing efforts have reached the stage in which the completed genome will be further perfected by the Genome Reference Consortium (http://genomereference.org) [9]. The recently published zebrafish reference genome will undoubtedly have a major impact on future genomics studies, for instance by aiding the identification of protein functions, as shown recently by Kettleborough et al. [69] and Varshney et al. [70], and by supporting the identification of mutations in forward genetic screens [71]. Howe et al. [9] have shown examples of how the available genomic sequence data can lead to new insights into the evolution of genome architecture and can identify new biological functions, for instance those involved in sex determination. The results obtained from the zebrafish models can now be compared with those from other fish species such as medaka, which has been extensively used for studies of sex determinants. Such comparisons are the basis for a better understanding of the evolution of sex determination in all bony fish, with implications for mammalian research on sex chromosome evolution [72][73][74][75]. Due to the rapid evolutionary turnover of sex chromosomes in fish, sex-linked markers found in medaka and zebrafish will not be directly translatable to other fish species. However, through comparative genomic studies with the data obtained in species such as medaka and rainbow trout [76], the resulting knowledge on sex determination mechanisms in several bony fish might also lead to predicted gender markers for other fish species. This will have applications for aquaculture, since methods for determining the sex ratios of offspring of cultured fish species are of economic value.
The genome sequence of the zebrafish demonstrates that even between closely related fish species there can be large differences in repetitive DNA content. For instance, in zebrafish the type II DNA transposable elements cover 39% of the entire genome sequence [9], whereas in common carp there is a very low number of repetitive elements, as low as in fugu [41]. This, together with smaller intron and intergenic region sizes, explains why common carp, as a pseudo-tetraploid species, has a DNA content similar to that of zebrafish. We have recently obtained a shotgun sequence of the giant Danio (genus Devario) showing that it has a diploid genome that resembles that of the zebrafish rather than that of common carp in its richness of repeat sequences (Spaink and Dirks, unpublished data).
In addition to these comparative studies, the available model fish genome sequences are an essential basis for the successful interpretation of the extensive transcriptome, proteome and metabolome data sets that are now rapidly accumulating, also for non-model fish species, as illustrated by a small selection of the many recent publications that have stimulated our research in this area [41][77][78][79][80][81][82][83][84][85][86][87][88][89][90][91][92][93]. The limited annotation of particular classes of genes, such as non-coding RNAs and genes that are only expressed during disease, is a bottleneck that still needs to be addressed. Furthermore, there is still a lack of information on orthology relationships between genes from different fish species and mammalian genes. This is unfortunate, since it means that results from the application in model fish of many new genomics technologies, for instance in epigenetic analysis [94][95][96][97][98], will be more difficult to translate to comparative epigenetic studies in other fish species and mammals.

NEW INSIGHTS FROM NON-MODEL TELEOST FISH GENOMES
Commercial availability of massive parallel sequencing or next-generation sequencing technologies in 2005 triggered an exponential growth of the number of species for which draft assemblies of complete genome sequences were released. The genome sequence of the giant panda was the first sequence of a vertebrate species that was de novo assembled based on next-generation technology alone [99]. As of 2 July 2013 a total of 3263 eukaryotic genomes were registered at NCBI's genome database (http://www.ncbi.nlm.nih.gov/genome/). Animal genomes accounted for 977 entries and the majority of these belong to the groups of mammals (378) and insects (285). Teleost fish, although the largest known group of vertebrates (approximately 27 000 species), are only poorly represented in this database, namely by 93 species and including 42 entries with the status 'no data' and 17 entries with the status 'SRA/traces'. A combined search for whole genome sequencing projects of ray-finned fish (Actinopterygii) and lobe-finned fish (Sarcopterygii) in three commonly used databases, namely NCBI, ENSEMBL (http://www.ensembl.org/index.html) and GOLD (www.genomesonline.org/), resulted in a list of 61 registered fish genomics projects (Table 3), some of which have the status 'Scaffolds or contigs' (27), or 'Chromosomes' (6), and more than half of which are still incomplete. Clearly, the orders of the Cypriniformes (6 projects), Cyprinodontiformes (11 projects) and Perciformes (18 projects) are currently the most popular for genomics projects.
Additional draft assemblies of complete teleost genomes have been published, but are not yet available from the NCBI database. For example, genomic scaffolds of the European eel (Anguilla anguilla) [7], Japanese eel (Anguilla japonica) [6], and the common carp (C. carpio) [41] are all accessible via the website www.zfgenomics.com. Recently, a draft assembly of the complete genome of Pacific bluefin tuna (Thunnus orientalis) was published [36], which is accessible via GenBank (accession nos. BADN01000001-BADN01133062).
Availability of the complete genome sequence of model and non-model fish species has a strong catalytic effect on a broad range of scientific disciplines and on applied science, as indicated by the following examples. Sequence analysis of the complete genome of the Atlantic cod (Gadus morhua) uncovered that these cold-adapted teleosts lack a functional major histocompatibility complex (MHC) II pathway. Apparently, this is compensated for by expansion of the number of MHCI genes and by specific adaptations in the Toll-like receptor (TLR) families, thereby providing new fundamental insight into the evolution of the adaptive immune system in vertebrates [39]. The draft genome sequences of the European eel (A. anguilla) and Japanese eel (A. japonica) showed that these fish species, in contrast to most other teleosts, retained fully populated Hox gene clusters, which may be correlated with their peculiarly complex life cycle that includes two larval stages [6,7]. In contrast, elasmobranch fishes, such as the cat shark (Scyliorhinus canicula) and the little skate (Leucoraja erinacea), seem to have lost all HoxC cluster genes [42]. This sheds a completely new light on the relative importance of this family of genes for body plan formation in the fish embryo. Detailed analysis of the genome sequence of the Pacific bluefin tuna (T. orientalis) revealed remarkable adaptations in multiple visual pigment genes, which may not only explain their specific predatory behaviour in the blue-pelagic ocean but may also contribute to improved aquaculture conditions [36]. The recent publication of the genome sequence of the platyfish (Xiphophorus maculatus) has already significantly broadened our understanding of a wide variety of phenomena, such as live-bearing fish reproduction, pigmentation patterns and melanoma tumorigenesis, and even complex behavioural traits [43].

CONCLUSIONS AND FUTURE OUTLOOK
The state of the art in genomics of the bony fish has advanced so enormously in the last few years that, even in the context of the recent large human sequencing projects, for example the ENCODE projects [104], it is no longer possible to capture the recent advances under catch phrases such as 'fishy genomics' or 'fish and chips'. The latter catch phrase will in any case become increasingly unpopular with the prediction that RNA and DNA microarray technologies will soon lose most of their importance, as they will be gradually replaced by methods based on sequencing technologies in the coming years. As explained above, teleost fish species have much to offer for research that is dependent on whole organism test models, and for biomedical applications they have in many respects advantages even over the use of mammalian test systems, as recently discussed by Spaink et al. [68]. Independently of their applied value, genome-wide studies of the bony fish have great impact on comparative genomics: they will provide a deep understanding of the most recent half billion years of evolution in vertebrates and of the more recent era that led to an extreme diversification of particular subgroups of the Teleostei, such as the cichlids that have been intensively studied from an evolutionary perspective [105]. They will also provide enormous opportunities for data mining and will make it possible to trace back the origins of genes from the organisms closest to the earliest evolutionary branches to their origins within invertebrates. For this purpose it is fortunate that many invertebrate species such as the tunicates are also increasingly being analysed with genomics technologies (http://www.tunicate-portal.org/wordpress/). That this can lead to unexpected findings is nicely illustrated by the recent discovery of a completely novel fluorescent protein in the Japanese eel [106].
Furthermore, it can lead to new insights into the origin of individual genes; for instance, the interesting example of horizontal gene transfer of a transposon between lamprey species and their hosts indicates that transfer of genetic material between species mediated by parasite-host interactions could be very frequent [107]. In addition to fundamental evolutionary research there will also be important applied aspects, for instance in nature conservation biology and in studying the impact of ancient climate changes on species diversification or extinction processes. This could lead to better prediction models for the effects of currently estimated climate changes on the biodiversity of the teleost fish species and could thereby provide better guidelines for knowledge-based fishery regulations. Sequencing technology has reached the stage at which instrument capacity is no longer limiting for sequencing a large number of vertebrates, in contrast to the period at the end of the 20th century when, as an illustration, one of the reasons for sequencing the genome of the Fugu (Fugu rubripes) was its small genome size. With the very high capacity of shotgun sequencing facilities it might already be possible to obtain WGS data for all teleost fish species. Although this would still be extremely costly and no plans have yet been proposed for this, there are bigger problems than cost involved: the bioinformatics and curation facilities are still not adapted to handle the next-generation sequencing data flow coming from many independent sequencing projects, at least not in a user-friendly way. Especially since the quality of WGS sequences does not yet make the data highly suitable for integration in a bioinformatic setting such as ENSEMBL, complementary bioinformatics and data curation solutions need to become available at low thresholds to analyse and compare the early versions of WGS assemblies [108].
In addition, it would be desirable to strive for common genome data curation and annotation facilities that cover all fish species, as is now offered for zebrafish within VEGA [109] (vega.sanger.ac.uk), and to obtain a comprehensive web site that links all bony fish gene annotations and functional studies, following the example presented by ZFIN for zebrafish (zfin.org).
In the context of genome evolution, great progress has been made in recent years in answering several old questions that have been extensively debated for decades, such as the origin of the Teleostei gene duplication. Since it is likely that a majority of all vertebrates will be sequenced within the coming decades, we can gain new insights in many fish species into the correlation between genome duplications and repeat content of genomes, on the one hand, and environmental selection pressures and particular adaptations of body architecture, on the other. We can also predict that we will soon obtain new insights into the mechanisms that caused the gene losses resulting in the trimmed genomes of the modern fishes that we are now studying. This will certainly give an amazing view of the genome dynamics that took place during a period of natural selection that lasted for many hundreds of millions of years. This knowledge can form a bridge between molecular biological studies carried out at the very basic molecular levels in microbes and lower vertebrates and studies in mammalian systems. We therefore have no doubt that genomic studies in the bony fish species will continue to play an important role in uniting the levels of molecular and evolutionary studies, e.g. by being perfect models for systems biology studies [60,61,110,111].

Key Points
Next-generation sequencing has revolutionized de novo assembly of fish genome sequences. Fish models are rapidly gaining importance at all levels of fundamental and applied science. We predict that advances will further accelerate and that the resulting genomic data sets will lead to unprecedented new insights into vertebrate gene functions and evolutionary mechanisms. The application of nucleotide sequencing in transcriptomics will further increase and will gradually replace expression microarray technologies. There is an increased need for better and more user-friendly bioinformatic tools, and curated database storage of data might become a bottleneck.