Full genome survey and dynamics of gene expression in the greater amberjack Seriola dumerili

Abstract Background Teleosts of the genus Seriola, commonly known as amberjacks, are of high commercial value in international markets due to their flesh quality and worldwide distribution. The Seriola species of interest to Mediterranean aquaculture is the greater amberjack (Seriola dumerili). This species holds great potential for the aquaculture industry, but in captivity, reproduction has proved to be challenging, and observed growth dysfunction hinders their domestication. Insights into molecular mechanisms may contribute to a better understanding of traits like growth and sex, but investigations to unravel the molecular background of amberjacks have begun only recently. Findings Illumina HiSeq sequencing generated a high-coverage greater amberjack genome sequence comprising 45 909 scaffolds. Comparative mapping to the Japanese yellowtail (Seriola quinqueriadiata) and to the model species medaka (Oryzias latipes) allowed the generation of in silico groups. Additional gonad transcriptome sequencing identified sex-biased transcripts, including known sex-determining and differentiation genes. Investigation of the muscle transcriptome of slow-growing individuals showed that transcripts involved in oxygen and gas transport were differentially expressed compared with fast/normal-growing individuals. On the other hand, transcripts involved in muscle functions were found to be enriched in fast/normal-growing individuals. Conclusion The present study provides the first insights into the molecular background of male and female amberjacks and of fast- and slow-growing fish. Therefore, valuable molecular resources have been generated in the form of a first draft genome and a reference transcriptome. Sex-biased genes, which may also have roles in sex determination or differentiation, and genes that may be responsible for slow growth are suggested.

"BUSCO WAS USED TO EVALUATE THE ASSEMBLED GENOME AND THE TRANSCRIPTOME AND OUT OF 303 BUSCO GROUPS (AKA CONSERVED ORTHOLOGS), SPECIFIC TO EUKARYOTES MORE THAN 93% WERE IDENTIFIED TO BE ENCODED BY BOTH GENOME AND THE TRANSCRIPTOME ASSEMBLY (TABLE 1)." 2.2. Please include how close is the zona pellucida sperm-binding protein 4 gene you identify on Linkage group 12 to the Japanese marker that identifies the sex locus in S. quinqueradiata. It would be nice to see this explored more to build up more of a story to the article. 3.Please describe developmental differences observed in the 20 dph and 40dph small vs large fish. In RNA-seq experiments related to growth rate, it can be difficult to separate out differences in size due to genetics vs differences in developmental stage. In your experimental design it is possible that these two factors are confounded and it should be stated as such or not. 4.The biggest revision I have is that the authors are using the gene naming schema output from maker. A typical gene model schema for Seriola dumerili would be Sedum.G00000001 for Genes and Sedum.T00000001.1 Sedum.T00000001.2 etc for transcript isoforms of the genes. Renaming the gene models is important as other researchers will start using your gene model annotations and referring to them by name. Ideally, the authors will rename all the figures and tables with the new names. However, at a bare minimum the GFF file should be updated with new names and a 2 column reference file associating the maker names with the new naming schema supplied.
Please upload the renamed GFF file to GIGA Science. Lines 147-149: "Out of the 53,023 obtained transcripts 44,371 were successfully mapped to the generated in silico groups of the greater amberjack and 30,342 transcripts to the genome of medaka." Do the authors have any indication as to why more than 20,000 predicted gene transcripts were identified and mapped to the generated in silico groups compared to the medaka genome? This (i.e., 53,000 gene transcripts) is substantially more than would be predicted for a typical diploid teleost genome (about 30,000). Splice variation or other reasons such as gene duplications?
Line 295: "hypothesis that RA signaling pathway" should be "hypothesis that the RA signaling pathway". R: DONE. THANK YOU.
Line 306: "their processes" should be "the processes". R: DONE. THANK YOU.
Line 344: Should "HCMR" be spelled out at first use? R: DONE. THANK YOU.
Lines 346-349: The authors might consider defining "n" here as "fish" in relation to biological sample size (e.g., n = 4 fish).
R: DONE. THANK YOU.  There are several issues I would ask the authors to address: 1. The full assembly process is somewhat unconventional, and the manuscript could be improved with additional details and clarifications. Specifically: 1a. The 'in silico groups' (first introduced in the abstract, l. 58, and then in Results l. 145) are hypothetical linkage groups/pseudochromosomes. It would be helpful if they are immediately introduced as such.  ON TRIAL AND ERROR METHOD, WE FOUND THAT MASURCA WAS  ABLE TO PRODUCE AN ASSEMBLY WITHIN A CONSIDERABLE TIME FRAME  WITH 50X COVERAGE. WE DID TRY TO ASSEMBLE THE GENOME WITH THE 60X  COVERAGE OR MORE AND THE ASSEMBLIES NEVER COMPLETED WITHOUT  ERRORS.   TRINITY'S NORMALIZATION PROTOCOL WAS USED FOR DIGITAL  NORMALIZATION DUE TO THREE KEY POINTS:  1.THE TOOL IS INDEPENDENT OF WHETHER THE DATA IS ORIGINATED USED AS RECOMMENDED BY THE REVIEWER FOR  DETERMINING QUALITY OF THE ASSEMBLY AND RESULTS ARE SHOWN  BELOW AND PRESENTED IN THE RESULT SECTION.   "BUSCO WAS USED TO EVALUATE THE ASSEMBLED GENOME AND THE  TRANSCRIPTOME AND OUT OF 303 BUSCO GROUPS (AKA CONSERVED  ORTHOLOGS), SPECIFIC TO EUKARYOTES MORE THAN 93% WERE IDENTIFIED  TO BE ENCODED BY BOTH GENOME AND THE TRANSCRIPTOME ASSEMBLY  (TABLE 1)." 1e. According to l. 143, 14990 scaffolds were mapped to the medaka genome to construct the 'in silico groups'. The number suggests this is only a small fraction of the genome, but I assume these are primarily larger scaffolds. A more informative number would therefore be the percentage of the assembly that could be assigned to a group. (I believe it is ~80%, based on a sequence file provided for review but not listed in the manuscript). 1f. The Results suggest 'synteny analysis' (l. 145) is possible with the hypothetical groups. This is not really performed, except for figures 1b-c. In the end, the 'on-to-one relationship' between Seriola and medaka (Discussion l. 225-227) has some aspects of circular reasoning, as it is based on only 468 marker matches. Therefore, this alignment is mostly based on an indirect alignment of the medaka genome with itself. (Incidentally -the manuscript in different places refers to 'circus plots', which should be 'Circos plots'.) R: SYNTENY ANALYSIS WAS PERFORMED USING THE GENERATED  TRANSCRIPTS WHEREAS THE IN SILICO GROUPS WERE BASED ON THE  SCAFFOLDS. THE 468 MARKERS MENTIONED BY THE REFEREE ARE  CONCERNING THE PUBLIC AVAILABLE MARKERS OBTAINED FROM THE  JAPANESE YELLOWTAIL. HOWEVER THE GENERATED IN SILICO GROUPS  WERE NOT ONLY BASED ON THE HOMOLOGY SEARCH OF THE GREATER  AMBERJACK WITH THE JAPANESE YELLOWTAIL BUT ALSO WITH THE MODEL  FISH SPECIES MEDAKA (FIG.1) AND STICKLEBACK (ADD.FILE 7). TRANSCRIPTS  WERE MAPPED TO EACH FISH GENOME SEPARATELY. THE CORRECTNESS OF  THE ASSIGNED IN SILICO GROUPS WERE EVALUATED BASED ON THE  PUBLISHED SYNTENY ANALYSIS OF THE JAPANESE YELLOWTAIL AND  MEDAKA. CONSEQUENTLY THERE CANNOT BE ANY ASPECTS OF CIRCULAR  REASONING. THE TYPO IS NOW CORRECTED, THANK YOU.
2. The gonadal transcriptome analysis is based on a (technically sound) RNA-seq comparison of testis with ovaries. However, I think the interpretation could be improved: 2a. These are completely different tissues. Therefore, 'upregulation' and 'downregulation' are not really suitable descriptions of gene expression level differences, as there is no developmental path leading from one condition to the other by tweaking expression levels. Therefore, 'higher/lower expression in' would be more appropriate. WITH THE REFEREE AND THE DESCRIPTIONS FOR  THE GONAD ANALYSIS HAVE BEEN CHANGED TO 'HIGHER/LOWER  EXPRESSION' AS SUGGESTED. 2b. Related: when comparing different tissues, these differences are best interpreted qualitatively rather than quantitatively (and without statistical tests), as there is no reasonable way to align the profiles. In this study, standard DESeq2 normalization (l. 444) is presumably applied, but using different tissues almost certainly violates the assumptions behind its method (i.e. minimize gene expression differences). Its validity needs to be demonstrated for this data, for example by showing the overall similarity between the profiles and their successful normalization. WITH THE DIFFERENT TERM FOR THE DESCRIPTION OF  THE GONAD EXPRESSION RESULTS. IN ORDER TO OBTAIN HOWEVER  GENDER SPECIFIC GENE EXPRESSION, THE PATTERNS HAVE TO BE  COMPARED AND DIFFERENCES HAVE TO BE TESTED FOR SIGNIFICANCE.  CONCERNING THE USAGE OF DESEQ2 IT HAS TO BE NOTED THAT THIS IS A  STANDARD ANALYSIS METHOD USED IN MANY SIMILAR TRANSCRIPTOME  2c. In general, this study examines sex-specific gene expression, but also discusses sex determination. Is there a reason to assume the genes highly 'differentially expressed' in gonads are the genes responsible for sex determination? 2d. Related: the genome assembly is based on two individuals, a male and a female. Unfortunately, the sequencing libraries have been pooled (l. 377), otherwise it would have been relatively straightforward to identify (small) loci not present in males (assuming ZW sex determination). It may still be possible to identify contigs with relatively low coverage -these could be candidate sex determination regions. 3. Line 181: please mention that these transcriptomes are of muscle tissue (the current phrasing suggests complete animals have been profiled). For this part of the study, animals at the same age were used. I assume the slow-and fast-growing animals were at the same developmental stage here? 4. The Discussion of the genome assembly quality in the light of other Mediterranean fish species is not entirely appropriate (lines 195-219). The species are not closely related. Although genome size and sequence coverage are factors influencing shortread assembly contiguity, genome structure (i.e. repeat landscape) and heterozygosity level are probably much more important. Both have not been analyzed. The Seriola assembly may well have ended up more fragmented than necessary by pooling the genomes of two individuals. This choice should be discussed, and an analysis of overall heterozygosity (e.g. using a k-mer profile) would be relevant here.     3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63   The present study investigates the molecular background of the greater amberjack.

265
In addition to genome sequencing, the gender specific mechanisms at the transcriptome level 266 operating in the greater amberjack were investigated. By this means, gonadal specific 267 transcripts were identified, with more transcripts found to be highly expressed in male than in 268 female gonads ( Fig. 4a and  in the present study as being more highly expressed in the female gonads. Enzymes encoded 282 by cyp450 genes play an important role in the synthesis and metabolisms of steroid 283 hormones, as well as of certain fats and acids used to digest fats. The differentially expressed 284 transcripts encoding for cyp450 genes in the present study comprised 7 cyp450 transcripts 285 more highly expressed in female, and 6 more highly expressed in male gonads (Fig. 8a).