Genome assembly and transcriptome resource for river buffalo, Bubalus bubalis (2n = 50)

Abstract Water buffalo is a globally important species for agriculture and local economies. A de novo assembled, well-annotated reference sequence for the water buffalo is an important prerequisite for studying the biology of this species, and is necessary to manage genetic diversity and to use modern breeding and genomic selection techniques. However, no such genome assembly has been previously reported. There are 2 species of domestic water buffalo, the river (2n = 50) and the swamp (2n = 48) buffalo. Here we describe a draft quality reference sequence for the river buffalo created from Illumina GA and Roche 454 short read sequences using the MaSuRCA assembler. The assembled sequence is 2.83 Gb, consisting of 366 983 scaffolds with a scaffold N50 of 1.41 Mb and contig N50 of 21 398 bp. Annotation of the genome was supported by transcriptome data from 30 tissues and identified 21 711 predicted protein coding genes. Searches for complete mammalian BUSCO gene groups found 98.6% of curated single copy orthologs present among predicted genes, which suggests a high level of completeness of the genome. The annotated sequence is available from NCBI at accession GCA_000471725.1.


Background information on Bubalus bubalis
A reference genome sequence is important for understanding the biology of a species, managing genetic diversity, and, in the case of domestic animals, applying new genome-based selection methods for genetic improvement [1]. The water buffalo (Bubalus bubalis) was domesticated about 5000 years ago, and since then it has been of economic importance as a dairy, meat, and draught animal in most regions of the world [2]. There are 2 subspecies of water buffalo, known as the river and swamp types, which differ quite considerably both in terms of genetics and phenotypes. The river buffalo has 50 chromosomes whereas the swamp buffalo has 48 chromosomes, resulting from the fusion of buffalo chromosomes 4p with 9 [3]. River buffaloes are found from western Asia to Europe and have been selected for milk production. Swamp buffaloes are more common in eastern Asia and are used for meat and milk, although they only produce less than 500 L per lactation, and as a draught animal, they have undergone little genetic improvement. There are about 182 million water buffaloes in the world (compared to 1.3 billion cattle), with about 174 million in Asia, 3.7 million in Africa, 3.3 million in South America, and smaller populations in Europe and Australia. Recently the wild African buffalo (Syncerus caffer) genome was published [4]. A water buffalo pseudo-sequence has been reported, but it was created by aligning short sequence reads to the cattle reference genome and hence has many errors and unresolved structural variants and repeats [5].
Here we describe a de novo assembly of the river buffalo genome and the annotation of the assembled sequence.
RNA-Seq data were produced from 30 tissues, which were used in the annotation and have been deposited in the NCBI SRA database under the project PRJNA207334. These data provide an initial draft reference sequence to facilitate genomic studies of water buffalo, a species that has global economic relevance. The data were produced by a small consortium with minimal funding, demonstrating the opportunities now presented by the rapidly advancing next-generation sequencing technologies.

Chosen animal and sequencing
An inbred, female Italian Mediterranean buffalo (Olimpia) from a half-brother and half-sister mating was chosen for sequencing to increase the homozygosity level across the genome and simplify the genome assembly. Blood samples were collected by a qualified veterinary surgeon during routine handling for disease surveillance at her home farm. Animal work was carried out in compliance with Italian laws on animal experimentation and ethics (DL No. 116, 27 January 1992).
The karyotype was verified as normal by high-resolution Rbanding (Fig. S1). Genomic DNA was extracted from blood using a commercial kit (Genomic tip 100/G; Qiagen, Venlo, the Netherlands). Whole-genome shotgun sequence (WGS) and mate pair reads with a target fragment size of 20 kb were produced on a Roche 454 FLX titanium instrument using kits and protocols supplied by the manufacturer (Roche Biosciences, Nutley, NJ, USA) and short paired end and mate paired libraries produced and sequenced on an Illumina GAIIx instrument. The data are summarized in Table 1. Most of the coverage was from the Illumina sequencing data; however, the Illumina reads were considerably shorter than the 454 WGS reads. The longer 454 single-end reads provided information to resolve short repeats, which contributed to greater sequence contiguity. The Illumina protocol for building jumping libraries did not reliably produce mate pairs longer than 6 kb, whereas the 454 protocol produced mate pair libraries up to 20 kb. The latter provided additional connectivity, which was essential for building the scaffolds.

Assembly
The success of every assembly project depends on the quality control of the input data. Due to an amplification step in library production, 454 mate pair libraries have redundancy, such that independent molecules originating from the same ligated genomic sequence may be represented many times. This redundancy is a problem as one of the main assumptions made by most assemblers is that the coverage is uniform. To filter redundant data, the 454 sequences were initially mapped to the genome of a closely related species, in this case Bos taurus version Btau UMD3.1 [6], using the NUCmer aligner [7]. Pairs of mated reads that all start and end at approximately the same position were identified, and only 1 of the redundant pairs was retained. This reduced the number of 454 mate pair reads available for the assembly from 2 410 676 to 1 870 392.
The de novo assembly of the water buffalo genome was carried out using the MaSuRCA assembler version 1.9.2 (MaSuRCA, RRID:SCR 010691) [8]. The Illumina paired end reads were error corrected using QuORUM (QuORUM, RRID:SCR 011840) [9]. Then the MaSuRCA assembler was used to create super-reads via kmers, with many of the Illumina reads extended to the same super-read. The 2 122 738 136 Illumina paired end reads were reduced to 104 345 186 super-reads using a k-mer size of 57 to extend the paired end reads. The super-reads were 274 bp long on average and covered the genome at about ×10.
The Illumina mate-pair protocol relies on circularization of longer (3-6 kb) fragments with a biotinylated linker, followed by shearing. Fragments with biotin junctions are then recovered and sequenced using standard paired-end protocol producing pairs where 3 and 5 ends of the fragment are switched. When a sequence goes through the biotin junction, a chimeric read is produced. Such chimeric reads were filtered out by mapping them against the super-reads produced from paired end read data. Non-junction pairs may be formed when the biotin junction is close to one of the ends of a fragment. Such non-junction pairs were filtered out by aggressively trying to join the mate orientation with the super-reads. If a pair was joined with short separation, it was discarded. About 30% of initial mate pairs were retained after this filtering. The reads remaining after filtering were assembled with the Celera Assembler version 6.2, part of the standard MaSuRCA pipeline, followed by the MaSuRCA gap closer [8], which closed 23% of scaffold gaps. The final quantitative genome assembly statistics are summarized in Table 2. This buffalo genome assembly is available at the GigaScience database and the NCBI [10].

Transcriptome resources
Gene expression data were produced to aid the annotation process. Samples from 30 tissues were collected from 2 buffalo calves following humane euthanasia and stored in liquid nitrogen. RNA was prepared using a TRizol TM -based protocol (Thermofisher Scientific, Via G.B. Tiepolo, 18. I-20900 Monza MB, Italy) or RNeasy mini isolation kit (Qiagen, Venlo, the Netherlands), depending on the tissue. RNA integrity was assessed using an Agilent Technologies 2100 Bioanalyzer, and samples with an RNA Integrity Number (RIN) value greater than 8.5 were used to prepare libraries with an mRNA-Seq sample preparation kit (Illumina Inc. San Diego, CA, USA) according to the manufacturer's protocol. Sequencing of 2 × 100 paired-ends was carried out on an Illumina HiSeq 2000. An average of 53.6 million paired-end reads were produced per tissue. The quality of the data was explored by reconstructing transcript sequences to generate expressed contigs. The raw reads were processed and analyzed using a de novo transcriptome assembly pipeline based on the Trinity software [11], which gave 163K unique sequences across all tissues.

Genome annotation
Annotation of genes, transcripts, and proteins was done using the NCBI Eukaryotic Genome Annotation Pipeline [12]. The NCBI pipeline operates on a RefSeq copy of the submitted assembly and produces RefSeq transcript and protein models, which are tracked with NCBI Gene IDs. Prior to annotation, the genome was first masked for repeats using WindowMasker [13]. Transcripts, RNA-Seq reads, and proteins that were publicly available in NCBI archival databases as of November 2013 were aligned with the masked assembly using splice-aware global alignment tools Splign [14] and ProSplign [15]. The evidence used as input for this annotation run included 3004 buffalo transcripts present in Genbank or EST records, 737 buffalo Genbank protein sequences, 35 660 human known RefSeq proteins (with To address the volume of RNA-Seq data, only a single representative of 100% identical redundant raw short Illumina RNA-Seq reads from this study was aligned, and best-ranking alignments (95% identity and 90% coverage) with identical splice structure, consensus splice sites, and similar start and end points were collapsed into representative alignments. Rare introns, which may reflect background noise, were filtered out. Overlapping transcript, protein, and RNA-Seq alignments with compatible open-reading frames were assembled into putative models by Gnomon [16]. Partial putative models were then extended by Gnomon, using an HMM-based algorithm from GenScan (GenScan, RRID:SCR 012902) [17] to gain start or stop codons or fill small internal gaps. In some cases, Gnomon also introduced sequence insertions or deletions (In-Dels) in the models to retain the supported reading frame at genomic InDel locations. The frame-shift corrected models were annotated with a title prefix "PREDICTED: LOW QUALITY PRO-TEIN." In addition, a small number of models was predicted ab initio by Gnomon. The final set of models selected from the Gnomon set excluded ab initio predictions with no strong hit to UniProtKB/SwissProt or to eukaryotic proteins in the nr database, predictions with high homology to transposable or retro-transposable elements and poorly supported predictions conflicting with better-supported models annotated on the opposite strand. In addition, for genes with multiple alternative variants, the model selected had to be supported in its entirety by a single long alignment (e.g., a full-length mRNA) or by RNA-Seq reads from a single sample. Protein naming was based on human orthologs, or lacking that, on best protein similarity; human orthologs were identified based on shared best Swiss-Prot alignments for the human Ref-Seq protein and the buffalo protein model, as well as considerations of local conservation of synteny [18]. The gene "type" categorization (e.g., protein-coding, pseudogene, non-coding RNA locus) was determined based on observed open reading frame lengths, statistical coding propensity, protein alignments, and protein orthology to human orthologs.
The final RefSeq annotation contains a total of 21 711 protein-coding genes (Table 3) and is collectively called Bubalus bubalis Annotation Release 100, which is available in the NCBI Nucleotide, Protein, and Gene resources database [10]. Additional RefSeq annotation details are available online [19].

Sequence repeats
Repeats in the buffalo genome assembly were identified using the RepeatMasker suite (RepeatMasker, RRID:SCR 012954) [20]. Scanning the buffalo genome assembly for cattle repeats as defined in the Repbase database [21] identified more than 3.8 million repetitive elements (Table S1). The majority of these repeated elements belong to the short interspersed elements (SINE > 2 million) and long interspersed elements (LINE > 1.1 million) families. In total, more than 1.2 giga base pairs (Gbp) of the buffalo genome assembly belong to repetitive elements, corresponding to 43.69% of the whole assembly.

Genome assembly quality and annotation assessment
The genome assembly is made up of 366 983 scaffolds, with N50 of about 1.4 Mbp. A plot on the cumulative length of scaffolds shows that the 3669 longest scaffolds, less than 1% of all scaffolds, accounts for approximately 92% of the assembly size (Fig. S2). Most of the sequence assembled into the buffalo scaffolds shows a good match with Bos taurus genome; however, around 15% does not find a match with the cow genome and is either a sequence missing from the cow assembly or a putative buffalo-specific genomic sequence.
In addition to measuring standard assembly metrics, genome assembly and annotation quality were further assessed in several ways. The assembled buffalo sequence was compared with other closely related domestic species including cattle, goat, sheep, and pig (Table 4). All genomes in the table were annotated by NCBI, and the average genome size is approximately 2.8 Gbp. The total number of annotated proteins in buffalo is most similar to cattle (21 295), followed by goat (20 755), sheep (20 645), and then pig, the latter of which has more annotated protein coding genes (24 205). The number of partial coding sequence (CDS) is an indication of the quality of the genome annotation; the smaller the number of partial CDS, the better the annotation. The recently published goat genome [17] was assembled using the latest sequencing technologies and hence has high annotation quality, with only 455 partial CDS. Surprisingly, the buffalo genome assembly described here has fewer partial CDS than cattle and pig (see Table 4), considering that it was assembled using a combination of data from older 454 and Illumina short reads. In addition, as indicated above, more protein coding genes were identified in the annotation of the buffalo genome than the annotations of the genomes of cow, sheep, and goat.
The completeness of the genome assembly was assessed using the Benchmarking Universal Single-Copy Orthologs (BUSCO, RRID:SCR 015008) [22]. The presence of the 4104 mammalian BUSCO gene groups was tested against the predicted buffalo gene set. The longest protein isoforms for each of the predicted nuclear genes were used as input for BUSCO searches. In total, 4048 of the BUSCO groups were present in the assembly, with an additional 50 fragmented genes and only 6 completely missing BUSCO groups ( Table 5). The presence of 98.6% complete mammalian BUSCO gene groups suggests a high level of completeness of this buffalo genome assembly.

Conclusion
This is the first de novo sequence assembly and annotation of the river buffalo genome, which represents an important resource for this species and is a significant improvement on the previous alignment of low-coverage short reads to the bovine genome sequence.