Initial data release and announcement of the Fish10K: Fish 10,000 Genomes Project

With more than 30,000 species, fish are the largest and most ancient vertebrate group. Despite their critical roles in many ecosystems and in human society, fish genomics lags behind work on birds and mammals. This severely limits our understanding of evolution and hinders progress on the conservation and sustainable utilization of fish. Here, we announce the Fish10K project, an international collaborative initiative aiming to sequence 10,000 representative fish genomes in a systematic manner within ten years, and officially welcome collaborators to join this effort. As a step towards this goal, we describe a feasible workflow for the procurement and storage of biospecimens, together with sequencing and assembly strategies. To illustrate the workflow, we present the genomes of ten fish species from a cohort of 93 species chosen for technology development.


Background
Fish genomes sequenced to date
As of writing, genome assemblies are publicly available for less than 1% of fish species (216 species of 56 orders) (Supplementary Table 1 .3Kb, respectively. There are 97 species with a scaffold N50 of more than 1 Mb, of which 31 have a contig N50 above 1 Mb (Figure 1). These genomes have fueled a number of studies on the phylogeny and evolution of fish (e.g., the African coelacanth genome and tetrapod evolution), evolutionary processes of specific fish subgroups (e.g., the elephant shark genome illustrating the phylogenetic relationship of Chondrichthyes as a sister group to bony vertebrates) [1], genetic mechanisms of adaptation to different environments (e.g., the deep-sea Mariana Trench snailfish and cave-dwelling fish) [2], and specific biological processes (e.g., the tonguefish Cynoglossus semilaevis genome for understanding ZW sex chromosome evolution) [3]. Nevertheless, the fish genomes sequenced so far are only a drop in the ocean, and numerous critical research questions remain to be resolved. A non-exhaustive list includes gaining a comprehensive and clear understanding of fish phylogeny, genome size diversity and chromosome evolution, diverse environmental adaptations, the evolution of morphology, the respiratory and immune systems, and the evolution and function of ultraconserved elements (UCEs) and conserved nonexonic elements (CNEEs).
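The contig and scaffold N50 values quoted above follow the usual definition: the length of the shortest sequence among the longest sequences that together cover at least half of the total assembly length. The following is a minimal Python sketch of that calculation; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the Fish10K pipelines.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Illustrative helper, not part of any Fish10K pipeline.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)   # longest sequences first
    half_total = sum(ordered) / 2             # 50% of the assembly length
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # Shortest sequence needed to reach half of the assembly
            return length


# Toy example: total length 300, so half is 150.
# 80 + 70 = 150 reaches the halfway point, hence N50 = 70.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Because a few very long sequences can dominate the statistic, N50 is normally reported alongside completeness measures such as BUSCO.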

The era of genome consortiums
With the rapid development of DNA sequencing technology, this is the time for large-scale, collaborative genomic studies. The first such project was the Genome 10K (G10K) Project established in 2009, which aimed to sequence and assemble genomes of about 10,000 vertebrate species [4]. Further advances in sequencing have extended this vision. The Vertebrate Genomes Project (VGP) was launched in 2017 to generate chromosome-level, haplotype-phased genome assemblies of vertebrate species [5]. The Bird 10,000 Genomes Project (B10K) was initiated [6] after the successful phylogenomic study on 45 avian genomes in 2014 [7]. The B10K project aims to sequence and assemble the genomes of all known bird species in three phases. Similar efforts have been made for bats [8], plants [9], and other species [10,11]. Despite current challenges in funding, sampling, sequencing, assembly, and data analysis, these projects have already made substantial progress. For fish, which make up more than half of all vertebrate species, no project of a similar scale has been initiated. The only large-scale genomic study to our knowledge was Fish-T1K, which aimed to sequence the transcriptomes (RNA-seq) of ray-finned fishes [12]. However, the insights gained from transcriptome data alone are relatively limited. Accelerating fish genomics through large-scale genome sequencing efforts would undoubtedly boost research into fish biodiversity, speciation, and adaptation, as well as aid the conservation and sustainable utilization of fish.

The Fish10K Genome Project
We here announce the Fish10K Genome Project, aiming to sample, sequence, assemble, and analyze genomes of 10,000 fish species. We are proposing an effective and integrated workflow, in which major genomics challenges are addressed, to construct high-quality reference genomes. Through developing and applying effective analysis methods, we will be able to address critical evolutionary and biological research questions related to fish. In order to prove the efficiency of our workflow and the feasibility of this large-scale genome project, we are releasing ten high-quality genomes as part of a pilot project. We hope the released genomes, along with the other genomes generated by Fish10K, will be valuable resources for fish researchers as well as for the fishery industry.

Main text
Feasibility test and the release of ten fish genomes
In order to establish cost-effective strategies and assess the feasibility of a large-scale genome project, we initiated a pilot study in June 2017. Over the last two years, we went on four expeditions across lakes, rivers, and coastal waters of China, collecting 324 fish species. After careful documentation of sample information and species identification, the tissues of 93 species were selected for DNA extraction and sequencing. We used single tube long fragment reads technology (stLFR) [13] and the DNBSEQ platform to sequence the species, and additionally generated long-read (Nanopore or PacBio) and Hi-C data for a subset. In this way, we were able to test the feasibility of three different sequencing and assembly strategies (Figure 2): stLFR data alone (synthetic long reads generated using a second-generation sequencing platform) (Strategy I); stLFR data combined with low-depth long reads (~10× raw Nanopore data to fill in the gaps) (Strategy II); and high-depth long reads (~80× raw Nanopore data) combined with second-generation short reads (either short insert size libraries or stLFR) (Strategy III). We have sequenced all 93 species using stLFR (Supplementary Table 2 Table 1). The contig N50s of seven of these genomes are more than 1 Mb and a minimum of 93% of BUSCO genes were found, indicating that the genome assemblies are of high quality. Three genomes were assembled at chromosome level, with more than 92% of the scaffolds anchored using Hi-C data.

The Fish10K Genome Project: from 100 to 10,000
With the experience gained in the Fish10K pilot study and our published results, we believe that the project can scale up. Thus, we are proposing a roadmap (Figure 3) in which we will construct high-quality reference genomes for representative species in all orders (Phase I) and families (Phase II), in concert with the generation of draft genome sequences for additional related species (Phase III).
An interrogation of FishBase [15] and "Fishes of the World" (5th ed.) [16] revealed information on 34,115 fish species from ~5,000 genera, ~529 families, and ~80 orders (Supplementary Table 2). The species were divided into six lineages (Elasmobranchii, Holocephali, Actinopterygii, Sarcopterygii, Cephalaspidomorphi, and Myxini), of which Elasmobranchii and Holocephali belong to Chondrichthyes (cartilaginous fishes), and Actinopterygii and Sarcopterygii belong to Osteichthyes (bony fishes). As mentioned above, reference genomes are available for at least one species in 56 orders, while reference genomes are still required for the remaining orders. In addition, some fish orders contain a large number of families and species (e.g., Perciformes has 62 families; Siluriformes has 40 families; and Scorpaeniformes has 39 families), suggesting that additional high-quality reference genomes are required to represent their diverse biological characteristics. Thus, in Phase I we aim to sequence 450 bony fish and 50 cartilaginous fish species, covering all 80 orders (Supplementary Table 2).
In Phase II, we aim to sequence approximately 3,000 species, covering almost all ~500 fish families. In Phase III, we will sequence ~6,500 fish genomes, covering ~5,000 genera.
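The phase targets stated above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary `PHASE_TARGETS` and the helper `cumulative_targets` are hypothetical names, and the numbers are the approximate targets given in the text (450 bony plus 50 cartilaginous species in Phase I, about 3,000 in Phase II, about 6,500 in Phase III).

```python
# Approximate per-phase species targets as stated in the text.
PHASE_TARGETS = {
    "Phase I": 450 + 50,   # bony + cartilaginous fish, covering all orders
    "Phase II": 3_000,     # covering almost all families
    "Phase III": 6_500,    # covering ~5,000 genera
}


def cumulative_targets(targets):
    """Return (phase, cumulative species count) pairs in phase order."""
    running = 0
    result = []
    for phase, count in targets.items():
        running += count
        result.append((phase, running))
    return result


for phase, total in cumulative_targets(PHASE_TARGETS):
    print(f"{phase}: {total:,} genomes in total")
```

Under these assumptions the three phases together reach the 10,000-genome goal of the project.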

Sampling, sequencing, assembly, and annotation
Sampling is a critical challenge in any large-scale genome consortium. We propose a centralised sampling mode (i.e., mirroring our 93-species pilot phase), with several sampling centres set up to collect samples. In addition to these sampling centres, we would like to obtain further samples from around the world. To make sure we have enough information for further analysis and to maximise the value of the genome data, we propose a sampling standard. The associated metadata were designed to include as much information as possible, stressing the importance of collecting images of each specimen and of adequate storage conditions (frozen or voucher specimen).
For sequencing, we propose to use both second- and third-generation sequencing technologies to generate high-quality genome assemblies. Based on our pilot study, and considering the feasibility of obtaining the required amount of high-molecular-weight DNA, we have chosen a 'stLFR data + low-depth Nanopore data + Hi-C data' strategy (Strategy II in Figure 2) for the majority of the species. For more complex genomes, we will generate high-depth Nanopore sequence data to ensure that good assemblies can be achieved (Strategy III, 'stLFR data + high-depth Nanopore data + Hi-C data'; Figure 2).
For key species (to be determined by the working groups; see below), we will employ Pacific Biosciences circular consensus sequencing (CCS) to generate long high-fidelity (HiFi) reads, which are highly accurate [17]. For the large-scale sequencing of 6,000 species in Phase III, we propose to employ stLFR alone (Strategy I in Figure 2). For assemblies of diploid species with a genome size of less than 10 Gb generated using our preferred Strategy II, we will require the contig N50 and scaffold N50 to be longer than 1 Mb and 10 Mb, respectively, and (if applicable) more than 90% of the assembled sequences to be anchored to chromosomes.
The same criteria will apply to assemblies generated using a high-depth long-read strategy (Strategy III). Assemblies generated using stLFR sequencing alone (Phase III) must have a contig N50 and scaffold N50 longer than 100 Kb and 1 Mb, respectively. All assemblies must have a BUSCO completeness estimate higher than 90%. Finally, genome feature annotations (e.g., repeat and gene annotations) will be performed using well-established in-house pipelines.
The Fish10K website (http://icg-ocean.genomics.cn/index.php/fish10kintroduction) will provide detailed information on the project status, as well as continuously updated information on the sequenced species. It also provides a portal for data download (in particular for assembled genomes).
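The assembly acceptance criteria described above lend themselves to a simple automated check. The following Python sketch encodes the stated thresholds; the class `AssemblyStats`, the table `MIN_N50`, and the function `meets_criteria` are hypothetical names introduced for illustration and do not describe the project's in-house pipelines. As a simplifying assumption, the sketch does not model the stated restriction to diploid genomes smaller than 10 Gb.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssemblyStats:
    contig_n50: int                     # in base pairs
    scaffold_n50: int                   # in base pairs
    busco_complete: float               # BUSCO completeness, in percent
    anchored: Optional[float] = None    # percent anchored to chromosomes; None if not applicable


# Minimum (contig N50, scaffold N50) per strategy, as stated in the text.
MIN_N50 = {
    "I": (100_000, 1_000_000),       # stLFR alone
    "II": (1_000_000, 10_000_000),   # stLFR + low-depth long reads + Hi-C
    "III": (1_000_000, 10_000_000),  # high-depth long reads + stLFR + Hi-C
}


def meets_criteria(stats: AssemblyStats, strategy: str) -> bool:
    """Return True if the assembly passes the thresholds for its strategy."""
    min_contig, min_scaffold = MIN_N50[strategy]
    if stats.contig_n50 <= min_contig or stats.scaffold_n50 <= min_scaffold:
        return False
    if stats.busco_complete <= 90.0:
        return False
    # The anchoring requirement applies only to Strategies II and III,
    # and only where chromosome anchoring is applicable.
    if strategy in ("II", "III") and stats.anchored is not None:
        if stats.anchored <= 90.0:
            return False
    return True


# Made-up example values, not results from the pilot study.
example = AssemblyStats(contig_n50=2_500_000, scaffold_n50=25_000_000,
                        busco_complete=95.1, anchored=93.0)
print(meets_criteria(example, "II"))
```

Keeping the thresholds in a single table makes it straightforward to adjust them if the consortium revises its standards.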

Organisation of Fish10K consortium
Fish10K has been initiated by a core group of researchers, forming the steering committee of Fish10K (Figure 4). The steering committee oversees the project and is responsible for fundraising, expanding the steering committee, organising the scientific groups and species groups, and coordinating sampling, sequencing, assembly, and analysis strategies. The steering committee is also responsible for the generation of genomic data. Various scientific groups will focus on technical and scientific questions related to this project. The scientific groups, which will have advance access to all generated data, will include a sampling group, a sequencing and assembly group, and a series of groups focusing on different fish-related scientific questions. We welcome proposals from researchers who would like to take part in scientific groups. We also invite researchers who are studying fish species that are rare or extinct to join Fish10K as members of the species group (with or without associated funding for sequencing). In addition to obtaining the genome sequences of their species of interest, members who join the consortium gain immediate access to all genomes currently being assembled by Fish10K.

Conclusions
Fish10K will generate an unprecedented, comprehensive data set of fish, the largest and most diverse vertebrate group. Our effort will allow us to complete the genomic tree for fish and, in concert with other projects such as VGP and B10K, for vertebrates in general.

N50 is the sequence length of the shortest contig (or scaffold) at 50% of the total genome length.
Figure 2. The sequencing and assembly strategies. In the preferred strategy (Strategy II), high-quality DNA fragments (≥40 Kb) are used to construct an stLFR library, which is sequenced using the DNBSEQ platform. Low-depth long reads are used only to improve the continuity of highly complex regions (increasing the contig N50). In the alternative Strategy III, high-depth long reads are used to construct contigs, while low-depth stLFR reads are used to polish the contigs and link the scaffolds. Hi-C data are used to generate a chromosome-level assembly.