A Nearly Complete Genome of Ciona intestinalis Type A (C. robusta) Reveals the Contribution of Inversion to Chromosomal Evolution in the Genus Ciona

Abstract Since its initial publication in 2002, the genome of Ciona intestinalis type A (Ciona robusta), the first genome sequence of an invertebrate chordate, has provided a valuable resource for a wide range of biological studies, including developmental biology, evolutionary biology, and neuroscience. The genome assembly was updated in 2008, and it included 68% of the sequence information in 14 pairs of chromosomes. However, a more contiguous genome is required for analyses of higher order genomic structure and of chromosomal evolution. Here, we provide a new genome assembly for an inbred line of this animal, constructed with short and long sequencing reads and Hi-C data. In this latest assembly, over 95% of the 123 Mb of sequence data was included in the chromosomes. Short sequencing reads predicted a genome size of 114–120 Mb; therefore, it is likely that the current assembly contains almost the entire genome, although this estimate of genome size was smaller than previous estimates. Remapping of the Hi-C data onto the new assembly revealed a large inversion in the genome of the inbred line. Moreover, a comparison of this genome assembly with that of Ciona savignyi, a different species in the same genus, revealed many chromosomal inversions between these two Ciona species, suggesting that such inversions have occurred frequently and have contributed to chromosomal evolution of Ciona species. Thus, the present assembly greatly improves an essential resource for genome-wide studies of ascidians.


Introduction
The genome of the ascidian, Ciona intestinalis, was decoded in 2002 as the seventh animal genome (Dehal et al. 2002). Recently, it was shown that there are two cryptic species of C. intestinalis, types A and B (Caputi et al. 2007;Nydam and Harrison 2007, 2011a, 2011bSato et al. 2012Sato et al. , 2014Roux et al. 2013;Bouchemousse et al. 2016). A taxonomic study (Brunetti et al. 2015) proposed renaming C. intestinalis type A as Ciona robusta and C. intestinalis type B as C. intestinalis. This newly proposed nomenclature is sometimes confusing, especially in studies using genomic information, because the animal from which the genome was decoded (Dehal et al. 2002) was originally identified as C. intestinalis. To avoid such confusion, many reports have included two names to identify the species, such as C. intestinalis type A (C. robusta) (e.g., Yoshida et al. 2017;Satoh et al. 2018;Cao et al. 2019;Kourakis et al. 2019;Matsubara et al. 2019;Mizutani et al. 2019;Oonuma and Kusakabe 2019). For the sake of clarity, the genome that was further explored in this study was from C. intestinalis type A (C. robusta).
Ascidians are tunicates, the closest relatives to vertebrates (Delsuc et al. 2006;Putnam et al. 2008). The ascidian tadpolelike larva, which comprises only 2,600 cells, shares the basic body plan of vertebrates. The larval tail contains a central notochord flanked laterally by muscle, dorsally by nerve cord, and ventrally by endodermal cells (Satoh 2003;Lemaire 2011). The genome sequence has been utilized as a key resource to analyze developmental mechanisms underlying such a simple body plan, especially genome-wide gene regulatory networks, epigenetic regulatory mechanisms, and gene expression profiles at single-cell resolution (Imai et al. 2004(Imai et al. , 2006Suzuki et al. 2007;Horie et al. 2018;Cao et al. 2019;Madgwick et al. 2019). Thus, the genome sequence, in combination with more than a century of experimental animal studies (Conklin 1905a(Conklin , 1905b, has made Ciona an ideal model system for studies of developmental mechanisms (Satoh 2013), and the origin and evolution of chordates and vertebrates (Satoh 2016). For example, recent studies have shown that ascidian embryos develop cells similar to placodal cells and neural crest cells in vertebrate embryos (Manni et al. 2004;Mazet et al. 2005;Jeffery et al. 2008;Tresser et al. 2010;Abitua et al. 2012Abitua et al. , 2015Wagner and Levine 2012;Ikeda et al. 2013;Stolfi et al. 2015;Waki et al. 2015;Horie et al. 2018).
Chromosomal-level genome sequence data for C. intestinalis type A (C. robusta) became available after a major update in 2008 (Satou, Mineta, et al. 2008). This version, called the KH assembly (Kyoto-Hoya; "Hoya" is a Japanese word for ascidians), includes 68% of the sequence information in 14 pairs of chromosomes. Recent technological advances enabled us to analyze higher-order structure of the genome, and motivated us to improve the quality of the Ciona genome assembly.
Comparisons of invertebrate genomes have shown that orthologous sequences on an ancestral chromosome tend to be retained in its descendant chromosome of extant taxa, but the order of orthologous sequences is generally not conserved (Clark et al. 2007;Hillier et al. 2007;Hill et al. 2008). On the other hand, in vertebrate genomes, interchromosomal rearrangements are more common (Waterston et al. 2002;Kasahara et al. 2007). To analyze at higher resolution how chromosomes have changed during evolution, chromosomal-level genome sequences will undoubtedly be helpful.
In the present study, we provide a new assembly, called the HT (Hoya T-strain)-version. This assembly contains 95% of the genome sequences in the chromosomes; thus, it provides a valuable genomic resource for chordate studies. Comparison of this assembly with the genome of Ciona savignyi, which is a different species in the same genus, allowed us to identify many chromosomal inversions between the two species.

Biological Materials
In the present study, we used C. intestinalis type A (C. robusta).
To confirm that the animal we used was C. intestinalis type A, we used genomic sequences of five loci, Fgf4/5/6 (this gene annotation was likely incorrect, because the sequences found in the public database were all mapped to a region within an intron of the Fgf receptor gene; chromosome 4: 7,098,700-7,099,456), Foxa.a (fkh;chromosome 11: 7,730,731,071),Jade (chromosome 2: 4,786,787,267),Patched (chromosome 5: 4,938,939,428), and Vesicular acetylcholine transporter (vAChTP; chromosome 1: 5,619,869-5,620,532), because these loci have been reported to be diverged between these two types (Nydam and Harrison 2011a). Sequences retrieved from NCBI were aligned using the Clustal Omega program (Sievers et al. 2011), and alignments were manually adjusted. After removing gaps, alignments were used to construct molecular phylogenetic trees by the maximum likelihood method with the PhyML program (Guindon and Gascuel 2003). Trees were tested with 100 bootstrap pseudoreplicates. All molecular phylogenetic trees for these five loci indicated that the animal we used was C. intestinalis type A (C. robusta) (supplementary fig. S1, Supplementary Material online).

Genome Sequencing
For PacBio RSII sequencing, we used sperm obtained from an animal in the eighth generation of self-fertilization, as described previously . To prepare a library, an SMRTbell Template Prep Kit 1.0 (Pacific Biosciences) was used. The library was sequenced using a PacBio RSII sequencer employing P6-C4 chemistry (Pacific Biosciences) with 360-min movie lengths. Contig assembly was performed with the MECAT assembler pipeline (Xiao et al. 2017). Each program was run with the following parameters: for mecat2pw, "-j 0"; for mecat2cns, "-i 0"; for extract_sequences, "40Â 160000000"; for mecat2canu, "genomesize ¼ 160000000." For polishing contig sequences obtained with the MECAT assembler, Pilon was utilized (Walker et al. 2014). Illumina sequencing reads (paired 101 base reads, 16.6 Gb in total; SRA accession number DRR018354) for a 11th generation animal  were mapped with bowtie2 (Langmead et al. 2019).
15 min on ice. Cells were pelleted at 500Âg for 5min. Supernatant was removed and stored at À80 C.
Cells were thawed on ice, washed with PBS, resuspended in 250 ll of ice-cold Hi-C lysis buffer (10 mM Tris-HCl pH 8.0, 10 mM NaCl, 0.2% Igepal CA-630) with 50 ll of protease inhibitors (Sigma, P8340), and incubated on ice for 20 min. The lysate was centrifuged at 2,500 Â g for 5 min, washed with ice-cold Hi-C lysis buffer, resuspended in 50 ll of 0.5% SDS, and incubated for 10 min at 62 C. Then, 145 ll of water and 25 ll of 10% Triton X-100 were added. After incubation for 15 min at 37 C, 25 ll of NEBuffer2 and 100 U of MboI were added, and chromatin was digested overnight at 37 C with rotation. MboI was inactivated by incubating at 62 C for 20 min. DNA ends were labeled with biotin by adding 50 ll of fill-in master mix (0.3 mM biotin-14-dATP, 0.3 mM dCTP, 0.3 mM dGTP, 0.3 mM dTTP, 40 units of DNA Polymerase I Large Klenow Fragment), and incubated at 37 C for 1.5 h with rotation. Proximal ligation was performed by adding 900 ll of ligation master mix [669 ll of water, 120 ll of 10Â NEB T4 DNA ligase buffer, 100 ll of 10% Triton X-100, 6 ll of 10 mg/ml BSA, and 5 ll of 400 U/ll T4 DNA ligase (NEB)], and incubated at room temperature for 4 h. Proteins were degraded by adding 50 ll of 20 mg/ml proteinase K, 120 ll of 10% SDS, and incubated at 55 C for 2 h. Crosslinking was reversed by adding 130 ll of 5 M NaCl and incubated at 68 C overnight.
Biotinylated DNA was collected by ethanol precipitation, and resuspended in 130 ll of Tris buffer (10 mM Tris-HCl, pH 8.0). DNA was sheared using Covaris S220 with following parameters; Peak Incident Power: 140, Duty Factor: 10, Cycle per Burst: 200, time: 80 s. Sheared DNA was size-selected to 100-500 bp and purified using AMPure XP beads (Bechman Coulter). DNA was eluted in 100 ll of Tris buffer.
A 150 ll of 10 mg/ml Dynabeads MyOne Streptavidine T1 beads (Life technologies) were washed with 400 ll of 1Â Tween washing buffer (1Â TWB: 5 mM Tris-HCl pH 7.5, 0.5 mM EDTA, 1 M NaCl, 0.05% Tween 20), and resuspended in 300 ll of 2Â binding buffer (2Â BB: 10 mM Tris-HCl pH 7.5, 1 mM EDTA, 2 M NaCl). The beads were added to the sheared DNA sample, and incubated at room temperature for 15 min with rotation. Biotinylated beads with bound DNA were collected with a magnet and supernatant was discarded. The beads were washed twice by adding 600 ll of 1Â TWB, transferred to a new tube, incubated at 55 C for 2 min on a Thermomixer. The supernatant was discarded using a magnet. Then, beads were washed with 100 ll of Tris buffer, transferred to a new tube, and resuspended in 50 ll of Tris buffer.
A-tailing and Illumina adapter ligation were performed using the KAPA Hyper Prep Kit (Kapa Biosystems KK8500). Adapter-ligated DNA was washed twice more by adding 600 ll of 1Â TWB, transferred to a new tube, and incubated at 55 C for 2 min on a Thermomixer. The supernatant was discarded using a magnet to retain the beads. Beads were washed with 100 ll of Tris buffer, transferred to a new tube, and resuspended in 50 ll of Tris buffer. Libraries were amplified directly from the beads with 9-14 cycles of PCR, using KAPA HiFi HotStart ReadyMix (KAPA Biosystems), and DNA was purified using AMPure XP beads. Paired-end sequencing of Hi-C libraries was performed using the Illumina HiSeq 1500 platform. Sequenced reads were mapped, filtered, and normalized using Juicer (version 1.5) (Durand et al. 2016). Hi-C data were analyzed with the 3D de novo assembly (3D-DNA) pipeline (Dudchenko et al. 2017) to obtain candidates for novel linkages and misassembly locations in the former KH version assembly.
To validate these candidates, we used Illumina paired sequencing data obtained in a previous study  (SRA accession number DRR018353 and DRR018354) (supplementary fig. S2, Supplementary Material online), and BAC end sequence data (Dehal et al. 2002;Kobayashi et al. 2002). We inspected Hi-C data using the Hi-C data browser (Durand et al. 2016) to obtain scaffolds, which are tentatively called KH/Hi-C linked scaffolds.

Chromosome Assembly
Contigs obtained from sequencing data were aligned with the KH/Hi-C linked scaffolds using Nucmer (Kurtz et al. 2004). For possible heterozygous regions on chromosomes 7 and 8, only longer contigs were kept for the following assembly processes. Contigs were compared with KH/Hi-C linked scaffolds using BLASTN (Altschul et al. 1990) with the options "-task megablast -perc_identity 95." The top hits with 1-kb or longer alignments were used as inputs for Chromosomer to link the contigs (Tamazian et al. 2016). Finally, nucleotide sequences of both ends of BAC clones used for fluorescence in situ hybridization (Shoguchi et al. 2006) were mapped with BLAT (Kent 2002). On the basis of mapping data, six contigs were included in the final chromosomal sequences. We inserted 1,000 "N"s between contigs, and as a result, 40,000 "N"s are included in the final assembly. Note that the assembly included two additional "N"s, which are bases we failed to determine.

Genome Size Estimation
For genome size estimation, we caught animals in Ushimado, Okayama Prefecture, Japan. From two animals, we obtained a sufficient number of sequencing reads ($4 Gb). All possible 21-mers were counted using Jellyfish (Marcais and Kingsford 2011), and genome sizes for each individual were estimated with Genomescope (Vurture et al. 2017).

Gene Prediction
For predicting genes/transcripts, we used Augustus (Stanke et al. 2008) with cDNA sequences available in the DDBJ/ EMBL/Genbank database and the KH version of the gene model set as hints. Predicted models were inspected on the Artemis browser (Carver et al. 2012), and manually curated on the basis of mapped cDNA-based data (full cDNA sequences and expression sequencing tags). Transcript names consist of five fields delimited by dots (e.g., KY.Chr1.1.v1.nonSL1-1). The first field represents the gene/ transcript model version; therefore, all models have the same tag (KY stands for Kyoto). The second field represents the chromosome (or unassembled contig) name. The third namefield represents the serial number for gene loci on individual chromosomes. Gene models are defined with these first three fields (KY.Chr1.1). The fourth field specifies alternative transcript variants by numbers preceded by the letter "v." The fifth field includes information for the 5 0 -and 3 0 -ends of the models, and consists of two subfields delimited with hyphens. The first subfield refers to the evidence identifying the 5 0 -end: SL means trans-splice acceptor site defined experimentally, nonSL means non-trans-spliced mRNA 5 0 -end determined experimentally, and ND means 5 0 -end predicted computationally (not determined by experimental evidence). The number concatenated to the 5 0 -end code identifies individual alternative 5 0 -ends within each locus. The second subfield refers to the 3 0 -end and consists of numbers identifying individual alternative 3 0 -ends within each locus.

Experimental Validation of the Inversion in Chromosome 4
We used genomic DNAs obtained from seven wild-caught animals. These DNAs were isolated in our previous study, in which we called these specimens wt1 to wt7 . To examine this inversion, we performed PCR. The experimental design is shown in figure 2D. Primer sequences are as follows: For, 5 0 -ACGTAGGAGATCCAAATCAAAGCC ATCATA-3 0 ; Rev, 5 0 -ACCCACAGTAACCTATGATAAACGAC TACTT-3 0 ; Test, 5 0 -CTATCACACAAGAGATATGCACAA AGCATA-3 0 . PCR was performed with Primestar Gxl enzyme according to manufacturer instructions (Takara Bio).

A Comparison of Gene Model Sets Between Two Ciona Species
Ciona savignyi protein sequences were obtained from the Ensembl database (Zerbino et al. 2018). Genomic positions of genes encoding these proteins were also deduced from the same database, and these genes were ordered within each reftig. Ciona savignyi proteins were used as queries for BLAST searches against proteins derived from KY gene models. Because we compared 11,616 proteins, hits with E-values <4.3e À6 (0.05/11616; Bonferroni correction for multiple testing) were regarded as significant. For conservative comparisons, we used only hits in which the alignment exceeded 40% of query and subject protein lengths. Species-specific tandem duplications may affect subsequent analyses; therefore, we used only the highest scoring match if two or more C. savignyi proteins were mapped to a single KY model. Reftigs that contained 10 or more genes with putative orthologs among the KY gene models and had orthologs preferentially in 1 of the 14 chromosomes (Fishers exact tests with the Bonferroni correction <5%/122) were chosen to make a dot plot.

A Strategy for Constructing a New Assembly
To construct a new version of genome assembly, we employed the strategy shown in figure 1A (see also supplementary fig. S2, Supplementary Material online). First, we rebuilt scaffolds using contigs from the former KH assembly and Hi-C data, and the resulting scaffolds were further curated with Illumina sequencing reads. Second, independently of the first step, we assembled long-read sequencing data from a PacBio sequencer. Third, we combined these two types of data with fluorescent in situ hybridization (FISH) data from BAC clones (Shoguchi et al. 2004(Shoguchi et al. , 2006 to obtain a final assembly called the HT assembly.
Step 1: Constructing Reference Scaffolds from Scaffolds of the Previous Version Using Hi-C Data We performed an Hi-C analysis using tailbud embryos to determine distances between nucleotide positions within chromosomes. The KH version assembly contains a total of 115,226,814 bases, 68% of which are mapped to the 14 chromosomes. Mapping of the Hi-C data suggested many possibly misassembled sites and potential linkages ( fig. 1B-D). We first inferred positions of unassembled scaffolds and misassembled positions within the chromosomes using the 3D-DNA pipeline (Dudchenko et al. 2017). These candidates were individually validated as follows.
We mapped Illumina sequencing data obtained from a previous study ) (supplementary fig. S2B, Supplementary Material online) to examine whether these candidate positions were supported by paired-end sequences. Similarly, we mapped sequence data of both insert ends of BAC clones, which were derived from Sanger sequencing (Dehal et al. 2002;Kobayashi et al. 2002). For each of the candidates supported by any of these sequencing data, we inspected Hi-C data using the Hi-C data browser (Durand et al. 2016). As a result, we obtained a new set of 14 chromosomal sequences containing 95,495,880 bases. This set of chromosomal sequences was used as references for the new assembly (see below). Hereafter, these reference scaffolds are tentatively called KH/Hi-C linked scaffolds.
Step 2: Primary Assembly Using Long Reads Ciona is a hermaphroditic animal, and self-fertilization can be induced in the laboratory. Taking advantage of this, we created an inbred line (T-line) by repeated self-fertilization . Genomic DNA from sperm obtained from a specimen in the eighth generation was used. The heterozygosity rate in natural populations is expected to be 1.1-1.2% (Dehal et al. 2002;Satou et al. 2012). Under the assumption of neutrality, the heterozygosity rate for the eighth generation inbred animals was expected to be 0.0043-0.0047% [1.1-1.2% Â (1/2) 8 ], and we considered this rate sufficiently low. Using this specimen, we obtained 644,697 sequence reads from a PacBio RSII sequencer, which Step 3: Scaffolding the New Contigs Next, we compared the new contigs mentioned above with the KH/Hi-C linked scaffolds using Nucmer (Kurtz et al. 2004)  This alignment revealed duplicated regions in chromosomes 7 and 8 ( fig. 1E and F). These contigs probably represent haplotypes, because the duplicated region in chromosome 7 overlapped the region that is retained as heterozygous even in the 11th generation ). This region contains a self-incompatibility locus with a highly variable region among individuals to prevent fertilization between eggs and sperm with the same locus type (Harada et al. 2008). We chose more contiguous contigs for these regions for the following assembly processes ( fig. 1E and F). These contigs and contigs that aligned to other chromosomes (47 contigs in total) were assembled into 14 chromosomes in reference to the KH/Hi-C linked scaffolds, using Chromosomer (Tamazian et al. 2016). Nevertheless, 60 contigs remained unassembled into chromosomes.
Finally, we mapped sequences of both insert ends of 270 BAC clones that were used for previous FISH assays (Shoguchi et al. 2004(Shoguchi et al. , 2006. Sequences of 18 BAC clones were mapped onto seven unassembled contigs with BLAT (Kent 2002), and the FISH results indicated that these contigs were located at chromosomal ends. As a result, we included these seven contigs in chromosomal sequences, and the number of unassembled contigs was reduced to 53. Linkages that were determined on the basis of the KH/Hi-C linked scaffolds and FISH are listed in supplementary table S1, Supplementary Material online.
The final assembly included 122,951,598 bases (table 1). Among them, 117,489,544 bases (95.6%) were included in the chromosomal sequences. The N50 was 8,327,059 bases and the L50 was 6. We called this the HT assembly, a great improvement compared with the KH assembly, as evident from their statistics (table 1).

Gene Models
We utilized Augustus (Stanke et al. 2008) to build a new set of gene models, using cDNA sequences and the KH gene models as hints. The resulting models were manually curated with Artemis (Carver et al. 2012). We identified transcription start sites (TSSs) and trans-splicing accepter sites for the 5 0 -spliced leaders (SL sites) using the data obtained through various high-throughput methods Matsumoto et al. 2010;Yokomori et al. 2016). Using these data, 5 0ends of models were extended or shortened. Our previous study showed that there are no intergenic regions in any  operon, and that polycistronic transcripts derived from operons are resolved by trans-splicing ). Therefore, we curated every intergenic region within operons and checked whether upstream genes ended with "AG," which worked as a trans-splicing acceptor site for downstream genes.
The resultant model set [KY (KYoto) model set] contains 14,072 genes with 18,701 splicing variants in total. When multiple 5 0 -ends (TSSs and/or SL acceptor sites) or 3 0 -ends were indicated by experiments for a single splicing variant, multiple transcript models were constructed. As a result, 61,667 transcript models were constructed; among them, 22,239 transcript models start with TSSs, and 31,757 transcript models start with SL sites. Among these transcript variants, 14% were not included in the KH models. Similarly, 9% were not included in the Refseq model set

Validation of the Assembly
We again mapped sequences of both insert ends of the BAC clones described above onto the final assembly. Among the 270 BAC clones, one or both ends of 257 clones were confidently mapped (BLAT score !150; single hits) (supplementary table S2, Supplementary Material online). Most of these clones (254 of 257) were mapped onto the expected chromosomes in the expected order. Only three clones showed inconsistent results. More specifically, one end of a clone (GECi31_c16) was mapped onto chromosome 8, which was supported by FISH results, but the other end of this clone was mapped onto chromosome 10, which was not supported by FISH result. Similarly, one end sequence for each of two clones (GECi38_g14 and GECi47_f07) was mapped near one end of chromosome 8, and the other end was mapped to unassembled contigs. We were not able to determine whether these three clones indicate actual misassemblies or whether they artificially contained two genomic fragments. Nevertheless, as mentioned above, a vast majority of the clones were mapped onto the expected chromosomes. Therefore, it is unlikely that the new assembly contains any large misassemblies.
Next, Hi-C data were mapped onto the new version of the genome ( fig. 2A). The overall contiguity of chromosomes was much improved (compare fig. 2A with fig. 1B). We found no obvious inconsistency except for one position in chromosome 4 ( fig. 2B), where the inbred strain indeed had an inversion (see below). In other words, except for this region, the Hi-C data indicated that the overall structure of the assembly successfully reproduces the chromosomal structures.
To evaluate nucleotide level errors, we compared the assembled sequence with the KH version of the genome sequence. To do this, we split the KH genome sequences into 222,119 fragments, each of which was 500 bases in length. These were used as query sequences for the BLAT program (Kent 2002). Among them, 178,398 fragments were mapped uniquely onto the new assembly, and 88,559,623 bases were aligned in total. This alignment contained 974,264 mismatches, corresponding to 1.1% (¼974,264/88,559,623) in excellent agreement with the heterozygosity rate of natural populations of this animal (1.1-1.2%). Therefore, it is likely that the new assembly does not contain significant errors at the nucleotide level.
Finally, we evaluated the genome and gene models using BUSCO, a tool for assessing completeness of genome assemblies and gene models with single-copy orthologs (Simao et al. 2015). For this purpose, we used the metazoan gene model set distributed with BUSCO. Among the "metazoan" genes, 95.3% were found in the genome. This score was slightly improved when compared with the score of the previous KH assembly (94.2% found) (table 2). Similarly, BUSCO gave a slightly better score for the KY gene model set than for the former KH version set (table 2). This observation suggests that the HT assembly mainly improved genomic regions that do not encode protein-coding genes.
For an independent validation, we mapped 318 genes for transcription factors and signaling ligand molecules (supplementary fig. S5, Supplementary Material online). These genes constitute a set of the most extensively annotated genes, including all family members encoding bHLH, bZip, Ets, Fox, HMG-box, homeodomain, and nuclear receptor transcription factors, and all family members encoding Fgf, Ephrin, Tgfb/ Bmp, Wnt, hedgehog, and Notch ligands (Hino et al. 2003;Wada et al. 2003;Yagi et al. 2003;Yamada et al. 2003;Satou, Imai, et al. 2003;Satou, Sasakura, et al. 2003;Satou, Wada, et al. 2008). It also includes genes that encoded well-known transcription factors such as Zic, Prdm1, and Snai. Among these genes, only one (Fgf4/5/6) was included in an unassembled contig. This indicates that most protein coding genes are included in chromosomal sequences.
The KH version of the assembly predicted that several transcription factor genes, important for fate specification in embryos, are multicopy genes. However, precise genomic structures and copy numbers of Foxd, Tbx6-r.b, and Zic-r.b (formerly ZicL) were uncertain in the KH version, due to a large sequence gap near the Foxd locus, and because the Tbx6-r.b and Zic-r.b loci are located near scaffold ends. In the present HT assembly, two copies of Foxd were encoded on chromosome 8 as neighbors, and three copies of Tbx6-r.b were encoded within a 12-kb region of chromosome 11 (Tbx6bd in supplementary fig. S5, Supplementary Material online). These copy numbers are the same as those predicted in the KH version. However, the current HT assembly revealed that another Tbx6-related gene, Tbx6-r.a, was encoded in the vicinity of the Tbx6-r.b copies (supplementary fig. S5, Supplementary Material online). The Tbx6-r.a locus was located $120 kb from the three copies of Tbx6-r.b, and seven genes are predicted in the intervening region. Although five copies of Zic-r.b were predicted previously (Yamada et al. 2003), the current HT assembly contained six copies of Zicr.b in an 18-kb region of chromosome 6 (Zic-r.b-g in supplementary fig. S5, Supplementary Material online). Similarly, two copies of Prdm1-related gene, which encodes an important transcriptional repressor in early embryos, have been identified (Ikeda et al. 2013;Ikeda and Satou 2017). The current HT assembly contains another copy of Prdm1-related gene near the two copies previously identified on chromosome 12 (Prdm1-r.a-c in supplementary fig. S5, Supplementary Material online). There are four copies of type-A ephrin genes, which are thought to have been multiplied in the lineage leading to extant ascidians (Satou, Sasakura, et al. 2003). In the HT assembly, two additional type-A ephrin genes were identified in this gene cluster on chromosome 3 (Efna.a-f in supplementary fig. S5, Supplementary Material online). Thus, the current HT assembly contains fewer gaps and is continuous. As a result, the number and position of tandemly repeated copies can be determined unambiguously, showing that this long version of the genome more faithfully reproduces the genomic structure.
Telomeres, Ribosomal RNA Genes, and Trans-Spliced Leader Donor RNA Genes We found repetitive sequences at either or both ends of 10 chromosomes (table 3). Because the typical repeat sequence, CCCCTAA, was highly similar to telomeric repeats found in many organisms (CCCTAA) (Podlevsky et al. 2007), it is highly likely that these repeats constitute telomeres of these chromosomes. We found this repeat at both ends of chromosomes 3, 9, and 14, indicating that these chromosomes are almost completely assembled (supplementary table S3, Supplementary Material online). We also found this repeat at either end of chromosomes 4, 5, 6, 7, 10, 12, and 13. However, we did not find telomeric repeats at either end of chromosomes 1, 2, 8, and 11, although we found the repeats in three unassembled contigs. The short arms of chromosomes 4, 5, and 6 contain 18S/ 28S ribosomal RNA genes . These genes were not included in chromosomal sequences of the current assembly. Instead, they were found in 19 unassembled contigs (table 3). Probably because of their highly repetitive nature, these contigs were not successfully assembled.
Previous studies estimated that more than half of Ciona mRNA species have an SL at their 5 0 -ends (Vandenberghe et al. 2001;Satou et al. 2006;Matsumoto et al. 2010). This SL is added by trans-splicing, and its donor RNA is encoded by a multi-copy SL gene (Yeats et al. 2010). FISH analysis showed that these copies are located as a cluster in the short arm of chromosome 8 (Yeats et al. 2010). Indeed, we found 32 genes with high similarity to the previously identified SL gene in a 300-kb region near one end of chromosome 8. Two unassembled contigs contained two additional clusters, each of which contained 16 SL genes within a region of $90 kb (table 3). Although it is likely that these contigs encode sequences of the short arm of chromosome 8, we were not able to determine their precise locations, order, and orientations.

Re-estimation of the Genome Size
The genome size of C. intestinalis (types unidentified) has been estimated between 140 and 190 Mb per haploid (Atkin and Ohno 1967;Laird 1971;Simmen et al. 1998). However, as described above, our assembly ($123 Mb) contained almost the entire genome sequence, including highly repetitive segments. These observations suggested that the genome size had been overestimated. To test this hypothesis, we first re-estimated the genome size of the inbred strain. We adopted a method using k-mer profiles, implemented in GenomeScope (Vurture et al. 2017). We used two sets of Illumina sequencing reads obtained from two individuals of the 11th generation (F11a and F11b in supplementary fig.  S2B, Supplementary Material online). According to GenomeScope, genome sizes of the two siblings were estimated at 116 and 114 Mb, respectively. To confirm that the inbred strain has the same genome size as individuals from natural populations, we prepared two specimens from a different geographic location than the source of the inbred line. The genome sizes of these specimens were estimated at 119 and 120 Mb. Therefore, it is likely that the actual genome size in natural populations is smaller than previously estimated. This means that the current assembly covers the vast majority of the genome of C. intestinalis type A (C. robusta).
A 600-kb Inversion in the Inbred Strain As mentioned above, Hi-C data mapping indicated an inversion in chromosome 4 ( fig. 2A and B). This inversion was confirmed by aligning the new version and the KH/Hi-Clinked scaffold version of chromosome 4 ( fig. 2C). These data consistently indicated that a region of $600 kb between nucleotide positions 1.9 and 2.5 M of chromosome 4 is inverted. Because the Hi-C data were obtained from tailbud embryos derived from eggs and sperm of wild-caught animals, it is possible that only the inbred animals have this large inversion. To test this possibility, we designed three PCR primers ( fig. 2D), which were designated For, Test, and Rev. Because we did not retain DNA used for the genomic sequencing, we used a different animal of this inbred line (F8b in supplementary fig. S2B, Supplementary Material online). For comparison, we also used seven wild-caught animals. From the inbred line, we obtained a band with the set of primers For and Test, but not with the set of primers Rev and Test ( fig. 2E). On the other hand, from wild-caught animals, we obtained a band with the set of primers Rev and Test, but not with the primers For and Test ( fig. 2E). Therefore, this inversion occurred in the inbred line.

A Comparison of Chromosomal Structures Between Two
Ciona Species Next, we compared these new chromosomes of C. intestinalis type A (C. robusta) with scaffolds of a species in the same genus, C. savignyi. Although a previous study performed a similar comparison and showed extensive intrachromosomal rearrangements between these two species (Hill et al. 2008), we expected that we could obtain much better resolution using the present assembly. For this purpose, we first mapped C. savignyi gene models predicted on scaffold sequences called reftigs (Small et al. 2007;Zerbino et al. 2018) to the KY gene model set.
We found that 122 reftigs contained 10 or more genes with putative orthologs among the KY gene models. Genes encoded by each of 106 reftigs (of the above 122) were found preferentially in one of the 14 HT-chromosomes (Fisher's exact tests with Bonferroni correction <4.1eÀ4 ¼ 0.05/ 122). Rank order positions of orthologous gene pairs in the 14 HT-chromosomes and the 106 reftigs of C. savignyi are shown in supplementary figure S6, Supplementary Material online. As previously shown (Hill et al. 2008), gene rearrangements were extensive within chromosomes, but not between pairs of the 14 chromosomes.
We noticed small inversions in our new mapping data. In chromosome 10, a region containing six orthologous gene pairs was clearly inverted between the two species ( fig. 3A and B). Figure 3C shows a similar instance, in which a region that contained five orthologous gene pairs was inverted. Additional examples are shown in supplementary figure S7, Supplementary Material online. We also found a gene arrangement that can be explained by two serial inversions ( fig. 3D), although one translocation could also explain the phenomenon. These examples indicate that inversions have occurred frequently and have contributed to intrachromosomal gene rearrangements in Ciona species.

Discussion
The Genome of C. intestinalis Type A (C. robusta) Is Smaller than Previously Estimated We found telomeres on both ends of chromosomes of 3, 9, and 14. This indicates that these chromosomes are assembled almost completely from one end to the other. Their lengths were 11.2, 8.32, and 6.29 Mb, respectively, smaller than our previous estimates using cytogenetic data (13.0, 10.7, and 7.92 Mb, respectively) (Shoguchi et al. 2006). Therefore, the lengths of these chromosomes in the present assembly are 78-86% (mean ¼ 81%) of the previously estimated sizes. Because this previous estimate was based on an assumption that the genome size was 162Mb (Simmen et al. 1998), $132 Mb (162 MbÂ 81%) is a rough estimate for the actual genome size of C. intestinalis type A (C. robusta).
Independently of the genome assembly, we used Illumina sequencing reads to estimate the genome size. Data from two siblings of the inbred strain and two wild-caught specimens gave estimates of 114-120 Mb per haploid. These values are close to the aforementioned estimate and to the total length of the current assembly (123 Mb).
In this way, our present data indicate that the actual genome size of C. intestinalis type A (C. robusta) is smaller than previous estimates (Atkin and Ohno 1967;Laird 1971;Simmen et al. 1998). However, because short arms of chromosomes 4, 5, and 6, which encode 18S/28S rDNA genes, show size polymorphisms (Shoguchi et al. , 2006, the actual genome size may vary among individuals. Sequencing of genomes of seven larvacean species indicates that transposable elements contribute to interspecies variation in genome size (Naville et al. 2019). It is possible that repetitive sequences similarly contribute to intraspecies variation in genome size in ascidians. It is also possible that these earlier studies used animals different from C. intestinalis type A (C. robusta), and that the genome size of these animals is indeed larger than that of C. intestinalis type A (C. robusta).

Quality of the Assembly
In the present assembly version, over 95% of nucleotides are included in chromosomal sequences, which are accessible through the DDBJ/EMBL/Genbank database and also through the Ghost database ) (http://ghost.zool. kyoto-u.ac.jp/default_ht.html; last accessed October 24, 2019). In addition, 13,801 (98%) of 14,072 predicted genes are included in the chromosomes. Consistently, among 318 genes that encode transcription factors or signaling ligands, 317 are included in the chromosomes. Thus, most protein coding genes are included in chromosomal sequences of the current assembly. Unassembled contigs, in which the remaining 2% of genes are encoded, may not have been assembled into chromosomes due to technical problems, although we cannot rule out the possibility that some of these contigs constitute minichromosomes. Meanwhile, many genes encoding 18S/28S ribosomal RNAs and SL RNAs were found in unassembled contigs. rRNA genes are encoded in the short arms of chromosomes 4, 5, and 6, the total length of which is estimated at over 13 Mb (Shoguchi et al. 2006). The genome contains $670 copies of the genes for SL RNAs, most of which are encoded in the short arm of chromosome 8 (Yeats et al. 2010). Even with the long sequencing reads obtained with a PacBio RSII sequencer, such long repetitive sequences were difficult to reconstruct.
In the present study, we constructed the assembly using PacBio RSII sequencing, Illumina sequencing, Hi-C, and FISH data. Such a combinatorial method worked efficiently, because long contigs obtained from long sequencing reads were greatly improved with Hi-C and FISH data. Specifically, the N50 value increased from 3.7 to 8.3 Mb. In the present study, Hi-C data were used to build reference scaffolds from an earlier version of the assembly, but not directly for connecting contigs. The resulting scaffolds were helpful for screening partially heterozygous regions. This method will be applicable for improving genome assemblies of other organisms.
Ciona has one of the simplest and most compact chordate genomes, which makes it a useful model system for analyzing higher-order genome structure. This genome may enable us to analyze global genomic regulatory landscapes more easily.
Copy numbers of multicopy genes that perform essential functions in embryonic development have been determined in the inbred strain. In the previous version of the genome assembly, many multicopy genes were located near sequencing gaps and scaffold ends, which prevented determination of exact copy numbers. We show here that copy numbers of the Zic-r.b, Efna, and Prdm1-r genes differ from those predicted in the previous version, although we do not know why these key genes are multicopy genes.

An Inversion in the Inbred Line and Inversions between Two Ciona Species
We found a large inversion in the genome of the inbred line. Because we did not retain the DNA of the F0 animal, we are unable to determine whether the F0 animal had this inversion or whether it occurred during inbreeding. In the former scenario, this inversion may have occurred in a natural population. Although we cannot completely rule out this possibility, the latter scenario seems more likely because seven wildcaught individuals did not contain this inversion. If the latter scenario is the case, it may suggest that inversions occur frequently. Although such inversions may become fixed by genetic drift in natural populations if they are neutral, they will be fixed much more frequently in inbred lines established by self-crossing.
A previous study suggested that extensive intrachromosomal rearrangements have occurred after the split of the two Ciona species (Hill et al. 2008). This observation is best explained by the occurrence of frequent inversions, but not by frequent translocations, because inversions are intrachromosomal events, but translocations can occur both within and between chromosomes. We found several inversions in the genomes of the two Ciona species. This provides evidence for the contribution of inversions to gene rearrangements in Ciona chromosomes.
Chromosomal inversions have been implicated in speciation and local adaptation (Kirkpatrick 2010;Wellenreuther and Bernatchez 2018). In Ciona species, inversions may have contributed similarly to speciation and environmental adaptation. Inversions may also have shuffled genes within chromosomes both during and after speciation in the genus Ciona.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.