Abstract

Acerola (Malpighia emarginata DC.) is a tropical evergreen shrub that produces vitamin C-rich fruits. Increasing fruit nutrition is one of the main targets of acerola breeding programs. Genomic tools have been shown to accelerate plant breeding even in fruiting tree species, which generally have a long life cycle; however, the availability of genomic resources in acerola has so far been limited. In this study, as a first step toward developing an efficient breeding technology for acerola, we established a chromosome-scale genome assembly of acerola using high-fidelity long-read sequencing and genetic mapping. The resultant assembly comprises 10 chromosome-scale sequences that span a physical distance of 1,032.5 Mb and contain 35,892 predicted genes. Phylogenetic analysis of genome-wide SNPs in 60 acerola breeding materials revealed 3 distinct genetic groups. Overall, the genomic resources of acerola developed in this study, including its genome and gene sequences, genetic map, and phylogenetic relationship among breeding materials, will not only be useful for acerola breeding but will also facilitate genomic and genetic studies on acerola and related species.

Introduction

Acerola (Malpighia emarginata DC.) is a tropical, fruit-bearing, evergreen shrub belonging to the family Malpighiaceae in the order Malpighiales. Acerola, also known as Barbados cherry and West Indian cherry, is native to Central and South America and the Caribbean Islands.1 Acerola is cultivated in tropical and subtropical countries, especially in northwestern Brazil and the Mekong Delta region of Vietnam (Tien Giang Province and Ben Tre Province), primarily to meet the demand of the fresh fruit market and/or processing industries.2 Because acerola fruits are rich in vitamin C (ascorbic acid), their concentrated juice is used as a raw material for preparing beverages and supplemental vitamin C powder. Therefore, one of the targets of acerola breeding programs is to increase the vitamin C content of acerola fruits. However, acerola, like other fruit tree species, has a long life cycle, which makes breeding by conventional methods challenging.

Genomics can accelerate breeding programs not only in cereal and vegetable crops but also in fruit trees.3 Indeed, novel genomics-based approaches, eg genome-wide association study and genomic selection based on machine learning methods, Bayesian networks, image analysis, and genomic prediction, have been implemented in the breeding programs of citrus, apple, and Japanese pear.4–6 While massive genome information and multi-omics profiles are available for the above-mentioned fruit trees,4,7 the publicly available genome data of acerola, besides its chromosome number (2n = 20),8 have been limited. This situation has caused the breeding of acerola to lag behind that of other fruit crops.

High-fidelity long-read sequencing, also known as HiFi (PacBio) sequencing, has enabled genome assembly at the chromosome level or telomere-to-telomere level in several species.9 The telomere-to-telomere assembly comprises a single contig, with telomere repeats at both ends, that covers the entire chromosome without any gaps of undetermined nucleotide sequences. Sequence scaffolding methods, together with the HiFi technology, help to establish chromosome-level genome assembly. Hi–C is a popular method used to connect the assembled sequences into the chromosome-level assembly.10 Alternatively, genetic mapping with high-density genetic loci, based on a classical Mendelian law, enables anchoring the genome sequence fragments to linkage maps for establishing chromosome-level sequences.11

The establishment of genome resources and genetic diversity assessment of breeding materials would be the first step in the breeding of acerola. In this study, we used HiFi technology to sequence the acerola genome and then anchored the sequences to a genetic map to construct a chromosome-level genome assembly of acerola. Subsequently, we assessed the genetic diversity of acerola breeding materials using SNPs derived by double digest restriction-site associated sequencing (ddRAD-Seq).12 The acerola genome information generated in this study will be helpful in accelerating acerola breeding programs and increasing knowledge of the genetics and genomics of Malpighiales.

Materials and methods

Plant materials

Acerola cultivar ‘NRA309’, which was registered in 2008 by Nichirei Foods Inc. (Tokyo, Japan), was used for genome sequencing. An S1 mapping population (n = 118), derived by the self-pollination of ‘NRA309’, was used to construct a genetic map of acerola (Supplementary Table S1). In addition to ‘NRA309’, a total of 59 acerola lines were used for genetic diversity analysis. These 59 lines included 10 lines from Brazil, 4 from Japan, and 9 from the United States as well as 36 progenies generated in breeding programs by crossing the above-mentioned lines (Supplementary Table S2). All plant materials were maintained at the Faculty of Agriculture, Kagoshima University.

Genome size estimation

Genomic DNA (gDNA) of ‘NRA309’ was extracted from leaves using Genomic Tip (Qiagen, Hilden, Germany). The PCR-free Swift 2S Turbo Flexible DNA Library Kit (Swift Biosciences, Ann Arbor, MI, USA) was used to construct a short-read library, which was then converted into a DNA nanoball sequencing library using the MGI Easy Universal Library Conversion Kit (MGI Tech, Shenzhen, China). The library was sequenced using a DNBSEQ G400 instrument (MGI Tech) to generate 100-bp paired-end reads. The genome size of ‘NRA309’ was estimated from the k-mer frequency distribution using Jellyfish (k-mer size = 17).13
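The principle behind a k-mer-based genome size estimate can be illustrated with a minimal Python sketch. It assumes a two-column histogram (k-mer multiplicity, number of distinct k-mers) of the kind written by Jellyfish's histo subcommand; the function names, the error cutoff of 5, and the toy histogram are hypothetical and are not the values used in this study. Genome size is approximated as the total number of non-error k-mers divided by the depth of the main peak.

```python
def parse_histogram(text):
    """Parse 'multiplicity count' lines into a list of (multiplicity, count) tuples."""
    rows = []
    for line in text.strip().splitlines():
        multiplicity, count = line.split()
        rows.append((int(multiplicity), int(count)))
    return rows


def estimate_genome_size(histogram, min_multiplicity=5):
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    k-mers seen fewer than min_multiplicity times are treated as sequencing
    errors and ignored; this cutoff is an assumption for illustration only.
    """
    solid = [(m, c) for m, c in histogram if m >= min_multiplicity]
    # The main peak is taken as the multiplicity with the most distinct k-mers.
    peak_depth = max(solid, key=lambda row: row[1])[0]
    total_kmers = sum(m * c for m, c in solid)
    return total_kmers // peak_depth, peak_depth


# Toy histogram (hypothetical numbers, not acerola data).
toy = """
1 900000
2 50000
5 1000
19 40000
20 60000
21 40000
40 500
"""
size, peak = estimate_genome_size(parse_histogram(toy))
print(size, peak)  # 141250 20
```

In a highly heterozygous genome the histogram usually shows a second peak at about half the homozygous depth, so the tallest peak is not necessarily the right divisor; a real analysis would need to choose the peak deliberately.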

Genome sequencing and primary assembly

‘NRA309’ gDNA was subjected to long-read library preparation. Briefly, the ‘NRA309’ gDNA was sheared to an average fragment size of 40 kb using Megaruptor 2 (Diagenode, Liège, Belgium) in the Large Fragment Hydropore mode, and the sheared DNA was used for HiFi SMRTbell library preparation with the SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA, USA). The obtained DNA libraries were fractionated with BluePippin (Sage Science, Beverly, MA, USA) to eliminate fragments shorter than 20 kb in length. The fractionated DNA libraries were sequenced on a SMRT Cell 8M on the Sequel II system (PacBio). HiFi reads were constructed using the CCS pipeline (https://ccs.how) and assembled using Hifiasm (version 0.16.1)14 with default parameters. Organelle genome sequences, identified by sequence similarity searches of the reported plastid and mitochondrial genome sequences of acerola relatives (Supplementary Table S3) using Minimap2 (version 2.24),15 were eliminated. Assembly completeness was evaluated with the embryophyta_odb10 data using Benchmarking Universal Single-Copy Orthologs (BUSCO) (version 5.2.2).16

Chromosome-level genome assembly by genetic mapping

The gDNA of lines in the S1 population was extracted from leaves using the Plant Genomic DNA Extraction Mini Kit (Favorgen, Ping-Tung, Taiwan). The extracted gDNA was digested with PstI and MspI restriction endonucleases and subjected to ddRAD-Seq library preparation.17 The resultant library was sequenced on a DNBSEQ G400 instrument (MGI Tech) to generate 100-bp paired-end reads. After removing adapter sequences (AGATCGGAAGAGC) with fastx_clipper in the FASTX-Toolkit (version 0.0.14, http://hannonlab.cshl.edu/fastx_toolkit) and trimming low-quality reads (quality score < 10) with PRINSEQ (version 0.20.4),18 high-quality reads were aligned onto the genome assembly of ‘NRA309’ using Bowtie2 (version 2.3.5.1).19 High-confidence biallelic SNPs were identified using the mpileup and call options of BCFtools (version 1.9)20 and filtered using VCFtools (version 0.1.16)21 based on the following criteria: read depth ≥ 5; SNP quality = 999; minor allele frequency ≥ 0.2; proportion of missing data < 20%. A linkage analysis of the SNPs was performed using Lep-MAP3 (version 0.2)22 to construct a genetic map. Contig sequences were anchored to the genetic map, and chromosome-level pseudomolecule sequences were established using ALLMAPS (version 0.7.3).11 Telomere sequences containing repeats of a 7-bp motif (5ʹ-TTTAGGG-3ʹ) were searched using the search subcommand of tidk (https://github.com/tolkit/telomeric-identifier), with a window size of 100,000 bp.
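The SNP filtering criteria listed above can be expressed as a short Python sketch. This is not the BCFtools/VCFtools invocation used in this study; the Snp record, its field names, and the example data are hypothetical. One assumption should be flagged: the read-depth threshold is applied per genotype call, so calls with depth < 5 are counted as missing. The text does not state whether depth was filtered per sample or per site.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# One genotype call: (number of ALT alleles: 0, 1, or 2; read depth). None = no call.
Call = Optional[Tuple[int, int]]


@dataclass
class Snp:
    quality: float
    n_alt_alleles: int  # number of distinct ALT alleles at the site (1 = biallelic)
    calls: List[Call]


def passes_filters(snp, min_depth=5, required_quality=999,
                   min_maf=0.2, max_missing=0.2):
    """Return True if a site meets the depth, quality, MAF, and missing-data criteria."""
    if snp.n_alt_alleles != 1 or snp.quality != required_quality:
        return False
    # Assumption: low-depth calls are treated as missing data.
    kept = [call[0] for call in snp.calls
            if call is not None and call[1] >= min_depth]
    missing_rate = 1 - len(kept) / len(snp.calls)
    if not kept or missing_rate >= max_missing:
        return False
    alt_freq = sum(kept) / (2 * len(kept))
    minor_allele_freq = min(alt_freq, 1 - alt_freq)
    return minor_allele_freq >= min_maf


# Hypothetical sites from five individuals.
common = Snp(999, 1, [(0, 10), (1, 8), (1, 12), (2, 9), (1, 7)])
rare = Snp(999, 1, [(0, 10), (0, 8), (0, 12), (0, 9), (1, 7)])
sparse = Snp(999, 1, [(0, 10), (1, 8), (1, 2), None, (2, 9)])

print(passes_filters(common))  # True
print(passes_filters(rare))    # False (MAF = 0.1)
print(passes_filters(sparse))  # False (40% missing)
```

The relatively high minor allele frequency threshold (0.2) suits an S1 population, in which heterozygous parental sites are expected to segregate at an allele frequency near 0.5. Calling passes_filters(rare, min_maf=0.05) returns True, which corresponds to the looser threshold used for the diversity panel.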

Prediction of protein-coding genes and repetitive sequences

Protein-coding genes were predicted using the ab initio strategy of Helixer (version 0.3.2).23 Prediction completeness was evaluated with the embryophyta_odb10 data using BUSCO (version 5.2.2).16 The predicted genes were functionally annotated using emapper (version 2.1.12) implemented in EggNOG,24 in conjunction with searches against the UniProtKB database25 and Arabidopsis peptide sequences (Araport11)26 using DIAMOND (version 2.0.14)27 and BLAST,28 respectively. Repetitive sequences in the assembly were identified using RepeatMasker (https://www.repeatmasker.org), based on repeat sequences registered in Repbase and a de novo repeat library built with RepeatModeler (https://www.repeatmasker.org).

Genetic diversity analysis

The genome structure of acerola was compared with that of 3 Malpighiales species, including rubber tree (Hevea brasiliensis, NCBI RefSeq assembly GCF_030052815.1),29 cassava (Manihot esculenta, GCF_001659605.2),30 and castor bean (Ricinus communis, GCF_019578655.1),31 using Minimap215 implemented in D-Genies.32

ddRAD-Seq libraries of all 59 acerola lines were prepared as described above and sequenced on a DNBSEQ G400 instrument (MGI Tech). Subsequently, the ddRAD-Seq reads obtained from all 59 lines and ‘NRA309’ were aligned onto the chromosome-level pseudomolecule sequences to detect sequence variants. High-confidence biallelic SNPs were selected with the following filtering conditions: read depth ≥ 5; SNP quality = 999; minor allele frequency ≥ 0.05; proportion of missing data < 20%. Effects of SNPs on gene function were estimated with SnpEff.33 The population structure of the 60 lines was evaluated based on the maximum-likelihood estimation of individual ancestries using ADMIXTURE (version 1.3.0).34 A phylogenetic tree, based on 100 bootstrap replicates, was created with SNPhylo (version 20140701)35 and visualized with iTOL (version 6.9.1).36 Chromosome sequences, predicted genes, repetitive sequences, and SNP positions were visualized with Circos.37

Results

Genome sequencing and assembly

To estimate the acerola genome size, clean short-read data (73.1 Gb) were subjected to k-mer distribution analysis. The results indicated that the acerola genome was highly heterozygous, with a haploid genome size of 1,114.2 Mb (Supplementary Fig. S1).

A total of 90.3 million HiFi reads, with an N50 length of 32.4 kb (28.3 Gb; 24.6 × coverage of the estimated genome size), were obtained from 1 SMRT Cell 8M and assembled into 73 contigs. Three potential contaminant sequences (221.1 kb in total) from organelle genomes were removed. Consequently, 70 contigs with a total length of 1,053.0 Mb and N50 and N90 lengths of 73.1 Mb and 39.1 Mb, respectively (Table 1), were obtained. These 70 contigs represented the genome assembly of acerola and were designated as NRA309_r1.0.

Table 1. Statistics of the acerola genome assembly.

| | Contig sequences (NRA309_r1.0) | Pseudomolecule sequences (NRA309_r1.0.pmol) | Contigs unassigned to any linkage groups |
|---|---|---|---|
| Total size (bp) | 1,052,960,172 | 1,032,546,981 | 20,414,091 |
| No. of sequences | 70 | 10 | 51 |
| N50 length (bp) | 73,099,088 | 104,183,651 | 1,133,687 |
| N90 length (bp) | 39,070,837 | 92,689,787 | 135,127 |
| Gap length (bp) | 0 | 900 | 0 |
| No. of genes | Not analyzed | 35,551 | 341 |

A genetic map of acerola was constructed based on 940.0 million ddRAD-Seq reads obtained from 118 S1 lines and ‘NRA309’ (parental line). Briefly, high-quality reads were aligned to the assembled contigs as a reference, with an average mapping rate of 88.6% (Supplementary Table S1), and 7,935 high-confidence SNPs were detected. In the subsequent linkage analysis, 10 linkage groups were obtained, with the number of linkage groups corresponding to the basic chromosome number of acerola. The resultant genetic map spanned a distance of 1,059.8 cM and contained a total of 7,775 SNPs (Table 2, Supplementary Fig. S2).

Table 2. Statistics of the acerola pseudomolecule sequence.

| Linkage group | No. of SNPs | Map length (cM) | Total length (bp) | No. of contigs | Telomere repeat unit^a | No. of genes |
|---|---|---|---|---|---|---|
| NRA309ch01 | 1,035 | 105.3 | 104,183,651 | 2 | 1 (Bottom) | 3,580 |
| NRA309ch02 | 982 | 126.3 | 118,493,997 | 3 | 0 (None) | 4,537 |
| NRA309ch03 | 906 | 100.5 | 92,689,787 | 3 | 1 (Bottom) | 3,215 |
| NRA309ch04 | 847 | 110.2 | 107,707,034 | 2 | 1 (Top) | 3,815 |
| NRA309ch05 | 781 | 87.9 | 100,396,053 | 1 | 2 (Both) | 3,540 |
| NRA309ch06 | 770 | 121.8 | 110,748,733 | 2 | 1 (Bottom) | 3,723 |
| NRA309ch07 | 748 | 125.6 | 122,825,995 | 1 | 2 (Both) | 4,097 |
| NRA309ch08 | 605 | 89.7 | 93,730,031 | 1 | 1 (Bottom) | 3,103 |
| NRA309ch09 | 562 | 90.4 | 81,726,457 | 2 | 1 (Bottom) | 2,593 |
| NRA309ch10 | 539 | 102.1 | 100,045,243 | 2 | 1 (Top) | 3,348 |
| Subtotal (ch01 to ch10) | 7,775 | 1,059.8 | 1,032,546,981 | 19 | 11 | 35,551 |
| Unassigned contigs | Not available | Not available | 20,414,091 | 51 | 5 | 341 |
| Total | 7,775 | 1,059.8 | 1,052,961,072 | 70 | 16 | 35,892 |

^a Locations of telomere repeat units are shown in parentheses.
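As a worked check of the contiguity statistics, the following Python sketch recomputes the total length, N50, and N90 of the 10 pseudomolecules from the chromosome lengths listed in Table 2. The function name nx_length is hypothetical; the definition used (the length of the shortest sequence in the smallest set of longest sequences that together cover at least the given fraction of the total) is the conventional one and is assumed to match the one behind Table 1.

```python
def nx_length(lengths, fraction):
    """Return L such that sequences of length >= L cover at least `fraction` of the total."""
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    raise ValueError("lengths must be non-empty")


# Pseudomolecule lengths (bp) for NRA309ch01 to NRA309ch10, taken from Table 2.
chromosome_lengths = [
    104_183_651, 118_493_997, 92_689_787, 107_707_034, 100_396_053,
    110_748_733, 122_825_995, 93_730_031, 81_726_457, 100_045_243,
]

total = sum(chromosome_lengths)
n50 = nx_length(chromosome_lengths, 0.5)
n90 = nx_length(chromosome_lengths, 0.9)
print(total, n50, n90)  # 1032546981 104183651 92689787
```

The three values agree with the pseudomolecule column of Table 1.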

Contigs were aligned onto the acerola genetic map to establish chromosome-level sequences. Of the 70 contigs, 19 were mapped to the 10 linkage groups: 3 linkage groups consisted of 1 contig each, 5 of 2 contigs each, and 2 of 3 contigs each (Table 2). The 10 pseudomolecule sequences, NRA309_r1.0.pmol, spanned a physical distance of 1,032.5 Mb (98.1% of the total contig length) (Fig. 1a, Tables 1 and 2, Supplementary Fig. S2). Telomere repeats were found at one end of 7 sequences and at both ends of 2 sequences, while no telomeres were found in 1 sequence (Table 2). Overall, chromosomes 5 and 7, each of which consisted of a single contig with telomere repeats at both ends, were completely assembled, without gaps, at the telomere-to-telomere level. The remaining 51 contigs (total length = 20.4 Mb), which carried 5 telomere repeat units, were not assigned to any linkage groups (Tables 1 and 2). According to the BUSCO analysis, the acerola genome assembly was 98.8% complete (Table 3).
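The idea of locating telomeres by counting repeat units in fixed windows can be sketched in Python as follows. This is an illustration only and not the implementation of tidk; the function names and the toy sequence are hypothetical. Both the motif (TTTAGGG) and its reverse complement (CCCTAAA) are counted, because the same telomere repeat is typically read as one or the other depending on which end of the sequence it lies at.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_motif_per_window(sequence, motif="TTTAGGG", window=100_000):
    """Count motif occurrences on both strands in consecutive windows.

    Returns a list of (window_start, forward_count, reverse_count).
    Occurrences that span a window boundary are not counted.
    """
    sequence = sequence.upper()
    rc_motif = reverse_complement(motif)
    counts = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        counts.append((start, chunk.count(motif), chunk.count(rc_motif)))
    return counts


# Toy sequence: a CCCTAAA array, a filler block, and a TTTAGGG array (70 bp each).
toy_sequence = "CCCTAAA" * 10 + "ACGTACG" * 10 + "TTTAGGG" * 10
print(count_motif_per_window(toy_sequence, window=70))
# [(0, 0, 10), (70, 0, 0), (140, 10, 0)]
```

Windows at either end of a sequence with a high count on one strand would be candidate telomeres; a sequence showing such windows at both ends and containing no gaps corresponds to the telomere-to-telomere case described for chromosomes 5 and 7.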

Table 3. Completeness of the genome assembly and predicted genes of acerola.

| | Genome assembly (%) | Predicted genes (%) |
|---|---|---|
| Complete | 98.8 | 99.1 |
| Single-copy complete | 95.6 | 95.8 |
| Duplicated complete | 3.2 | 3.3 |
| Fragmented | 0.6 | 0.7 |
| Missing | 0.6 | 0.2 |

Fig. 1. The genome of acerola. a) Physical map of the acerola genome in mega base (Mb) scale, b) % nucleotides of genes per 1 Mb, c) % nucleotides of LTR retrotransposons per 1 Mb, and d) SNP density per 1 Mb.

Gene and repetitive sequence prediction

Next, we predicted protein-coding genes in the acerola genome using Helixer. As a result, a total of 35,892 genes (average coding sequence length = 1,150 bp) were predicted (Fig. 1b), of which 35,551 and 341 genes were located on the 10 chromosome-scale sequences and the 51 unassigned contigs, respectively. The BUSCO score of Helixer-predicted genes was 99.1% (Table 3). Out of the 35,892 genes, 31,742 (88.4%), 30,989 (86.3%), and 29,068 (81.0%) genes were functionally annotated with EggNOG mapper, DIAMOND search against UniProtKB, and BLAST search against Araport11 peptide sequences, respectively (Supplementary Table S4). In total, 32,239 (89.8%) genes were annotated by at least 1 of the 3 methods.

Repetitive sequences occupied a total of 816.3 Mb (77.5%) of the pseudomolecule sequences. Nine major types of repeats were identified in varying proportions (Table 4). LTR retroelements (427.2 Mb) represented the dominant repeat type in the pseudomolecule sequences (Fig. 1c), followed by DNA transposons (89.8 Mb). Repeat sequences unavailable in public databases totalled 252.5 Mb.

Table 4. Repetitive sequences in the acerola genome.

| Repeat type | No. of repeats | Total length (bp) | Relative proportion (%) |
|---|---|---|---|
| SINEs | 829 | 109,559 | >0.0 |
| LINEs | 51,728 | 30,364,578 | 2.9 |
| LTR elements | 372,431 | 427,204,241 | 40.6 |
| DNA transposons | 145,807 | 89,786,642 | 8.5 |
| Small RNA | 4,891 | 7,793,074 | 0.7 |
| Satellites | 1,995 | 437,534 | >0.0 |
| Simple repeats | 154,212 | 6,367,487 | 0.6 |
| Low complexity | 33,617 | 1,711,063 | 0.2 |
| Unclassified | 687,125 | 252,478,999 | 24 |
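The repeat totals can be checked with a few lines of Python using the lengths in Table 4. The variable names are hypothetical. One inference should be flagged: the percentages in Table 4 are reproduced when the total contig length (1,052,960,172 bp; Table 1) is used as the denominator, so that value is assumed here; the text does not state the denominator explicitly.

```python
# Total length (bp) per repeat type, taken from Table 4.
repeat_bp = {
    "SINEs": 109_559,
    "LINEs": 30_364_578,
    "LTR elements": 427_204_241,
    "DNA transposons": 89_786_642,
    "Small RNA": 7_793_074,
    "Satellites": 437_534,
    "Simple repeats": 6_367_487,
    "Low complexity": 1_711_063,
    "Unclassified": 252_478_999,
}

# Assumed denominator: total length of the 70 contigs (Table 1).
assembly_bp = 1_052_960_172

total_repeat_bp = sum(repeat_bp.values())
total_mb = round(total_repeat_bp / 1e6, 1)
total_pct = round(100 * total_repeat_bp / assembly_bp, 1)
ltr_pct = round(100 * repeat_bp["LTR elements"] / assembly_bp, 1)

print(total_repeat_bp, total_mb, total_pct, ltr_pct)
# 816253177 816.3 77.5 40.6
```

The summed length (816.3 Mb) and overall proportion (77.5%) match the values reported in the text.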

Comparative genome structure analysis

The pseudomolecule sequences of acerola were aligned against those of rubber tree (H. brasiliensis), cassava (M. esculenta), and castor bean (R. communis), all of which are members of the Euphorbiaceae in the Malpighiales. Unexpectedly, no clear similarity was detected, and probable conserved structures were highly fragmented between the genome of acerola and those of the other 3 Euphorbiaceae species (Supplementary Fig. S3).

Genetic diversity of acerola lines

To evaluate the genetic diversity of acerola lines, the gDNA of 59 lines was subjected to ddRAD-Seq. An average of 12.7 million ddRAD-Seq reads per sample were obtained. High-quality reads from the 59 lines and ‘NRA309’ were aligned to the pseudomolecule sequences of ‘NRA309’ (reference), with an average mapping rate of 90.6% (Supplementary Table S2), and 49,070 high-confidence SNPs were detected (Fig. 1d). According to the SnpEff results, the most prominent SNP type was modifier impact (59.5%) in introns and intergenic regions, followed by low impact (21.9%; synonymous mutations), moderate impact (18.2%; missense mutations), and high impact (0.4%; nonsense and splice-site mutations) (Supplementary Table S5). ADMIXTURE analysis grouped the 60 lines into 3 clusters (K = 3, the value that gave the minimum cross-validation error and was therefore taken as the best-fitting model): cluster a, 7 Japanese and Hawaiian lines; cluster b, 16 Hawaiian and Brazilian lines; and cluster c, 36 lines derived from crosses between FB05 and 5 members of cluster b (Fig. 2a). These 3 clusters were well supported by phylogenetic analysis (Fig. 2b).
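Choosing the number of clusters by minimum cross-validation error can be sketched in Python as follows. The sketch assumes log lines of the form "CV error (K=3): 0.51", which is the format ADMIXTURE is understood to print when cross-validation is enabled; the function names and the error values are hypothetical and are not results from this study.

```python
import re

# Assumed log-line format: "CV error (K=3): 0.51"
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def parse_cv_errors(log_text):
    """Extract {K: cross-validation error} from concatenated log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors):
    """Return the K with the smallest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


# Hypothetical log excerpt (values invented for illustration).
example_log = """
CV error (K=1): 0.62
CV error (K=2): 0.55
CV error (K=3): 0.51
CV error (K=4): 0.53
CV error (K=5): 0.56
"""
errors = parse_cv_errors(example_log)
print(best_k(errors))  # 3
```

In practice the error curve is often flat near its minimum, so the selected K is usually interpreted together with independent evidence such as a phylogenetic tree.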

Fig. 2. Genetic structure and phylogenetic tree of 60 acerola lines. a) Population structure of 60 acerola lines. Each colour represents a distinct group. The origin of acerola lines is shown in parentheses: BRA, Brazil; JPN, Japan; United States; and Cross, Progenies of crosses. b) Phylogenetic tree. Numbers on branches indicate bootstrap values based on 100 replicates.

Discussion

Here, we present a chromosome-scale genome assembly of acerola, a member of the family Malpighiaceae. To the best of our knowledge, this is the first report of a chromosome-scale genome assembly not only for acerola but also for any Malpighiaceae species. Chromosome-level genome assemblies have been reported for at least 3 species in the order Malpighiales, namely, rubber tree (H. brasiliensis), cassava (M. esculenta), and castor bean (R. communis), all of which are members of the family Euphorbiaceae. Comparative genome structure analysis revealed that the genome structures of acerola and the 3 Euphorbiaceae species were seldom conserved (Supplementary Fig. S3). This suggests the possibility that acerola diverged from members of Euphorbiaceae within the order Malpighiales. More chromosome-level genome assemblies for members of the Malpighiaceae would be required to validate this hypothesis.

The acerola genome information generated in this study would be useful for acerola breeding programs, in which the improvement of fruit vitamin C content together with plant disease and pest resistance are frequently targeted. The phylogenetic relationship of acerola cultivars determined in this study (Fig. 2) was in agreement with that determined previously using PCR-based sequence-related amplified polymorphism (SRAP) markers.38 Parental lines in breeding programs could be adequately selected to enhance the genetic diversity of breeding materials. Furthermore, agronomically important loci could be identified through quantitative trait locus analysis, based on the genetic map. Once the genetic loci of interest have been determined, the underlying candidate genes could be easily identified using genome information. Because acerola is a woody plant, its life cycle or generation time is longer than that of annual herbaceous plants including vegetables, which makes the completion of conventional genetic analysis and breeding within a limited time period challenging. Therefore, the application of novel genomics-based approaches, eg genome-wide association study and genomic selection based on machine learning methods, Bayesian networks, image analysis, and genomic prediction,4–6 has been proposed for fruit tree crops including woody plants.

We conclude that the acerola genome resources developed in this study, including its genome and gene sequences, genetic map, and phylogenetic relationship among breeding materials, would be useful for genomics research on acerola and Malpighiales and for the development of novel fast-paced breeding technologies. This study provides a standard genome resource for the genomics, genetics, and breeding of acerola and related species.

Acknowledgements

We thank Y. Kishida, C. Minami, K. Ozawa, H. Tsuruoka, and A. Watanabe (Kazusa DNA Research Institute) for their technical assistance.

Funding

This work was supported by JSPS KAKENHI (22H05172 and 22H05181) and Kazusa DNA Research Institute Foundation.

Conflicts of interest

K.H., N.H., and H.A. are employees of Nichirei Foods Inc. All other authors declare no competing interests.

Data availability

Raw sequence reads were deposited in the DNA Data Bank of Japan (DDBJ) BioProject database under the accession number PRJDB18209, for which details are listed in Supplementary Tables S1 and S2. The assembled sequences are available at DDBJ (accession numbers: AP035802–AP035862) and Plant GARDEN (https://plantgarden.jp).39

References

1. Vilvert JC, de Freitas ST, Veloso CM, Amaral CLF. Genetic diversity on acerola quality: a systematic review. Braz Arch Biol Technol. 2024:67:e24220490. doi:10.1590/1678-4324-2024220490.
2. Ferreira MAR et al. Multivariate selection index of acerola genotypes for fresh consumption based on fruit physicochemical attributes. Euphytica. 2022:218:25. doi:10.1007/s10681-022-02978-1.
3. Varshney RK et al. Designing future crops: genomics-assisted breeding comes of age. Trends Plant Sci. 2021:26:631–649. doi:10.1016/j.tplants.2021.03.010.
4. Iwata H, Minamikawa MF, Kajiya-Kanegae H, Ishimori M, Hayashi T. Genomics-assisted breeding in fruit trees. Breed Sci. 2016:66:100–115. doi:10.1270/jsbbs.66.100.
5. Iwata H et al. Genomic prediction of trait segregation in a progeny population: a case study of Japanese pear (Pyrus pyrifolia). BMC Genet. 2013:14:81. doi:10.1186/1471-2156-14-81.
6. Minamikawa MF, Nonaka K, Hamada H, Shimizu T, Iwata H. Dissecting Breeders’ Sense via explainable machine learning approach: application to fruit peelability and hardness in citrus. Front Plant Sci. 2022:13:832749. doi:10.3389/fpls.2022.832749.
7. Shiratake K, Suzuki M. Omics studies of citrus, grape and rosaceae fruit trees. Breed Sci. 2016:66:122–138. doi:10.1270/jsbbs.66.122.
8. Mondin M, de Oliveira CA, Vieira MLC. Karyotype characterization of Malpighia emarginata (Malpighiaceae). Rev Bras Frutic. 2010:32:369–374. doi:10.1590/S0100-29452010005000072.
9. Gladman N, Goodwin S, Chougule K, McCombie WR, Ware D. Era of gapless plant genomes: innovations in sequencing and mapping technologies revolutionize genomics and breeding. Curr Opin Biotechnol. 2023:79:102886. doi:10.1016/j.copbio.2022.102886.
10. Dudchenko O et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 2017:356:92–95. doi:10.1126/science.aal3327.
11. Tang H et al. ALLMAPS: robust scaffold ordering based on multiple maps. Genome Biol. 2015:16:3. doi:10.1186/s13059-014-0573-1.
12. Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE. Double digest RADseq: an inexpensive method for De Novo SNP discovery and genotyping in model and non-model species. PLoS One. 2012:7:e37135. doi:10.1371/journal.pone.0037135.
13. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011:27:764–770. doi:10.1093/bioinformatics/btr011.
14. Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021:18:170–175. doi:10.1038/s41592-020-01056-5.
15. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018:34:3094–3100. doi:10.1093/bioinformatics/bty191.
16. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015:31:3210–3212. doi:10.1093/bioinformatics/btv351.
17. Shirasawa K, Hirakawa H, Isobe S. Analytical workflow of double-digest restriction site-associated DNA sequencing based on empirical and in silico optimization in tomato. DNA Res. 2016:23:145–153. doi:10.1093/dnares/dsw004.
18. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011:27:863–864. doi:10.1093/bioinformatics/btr026.
19. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012:9:357–359. doi:10.1038/nmeth.1923.
20. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011:27:2987–2993. doi:10.1093/bioinformatics/btr509.
21. Danecek P et al.; 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics. 2011:27:2156–2158. doi:10.1093/bioinformatics/btr330.
22. Rastas P. Lep-MAP3: robust linkage mapping even for low-coverage whole genome sequencing data. Bioinformatics. 2017:33:3726–3732. doi:10.1093/bioinformatics/btx494.
23. Stiehler F et al. Helixer: cross-species gene annotation of large eukaryotic genomes using deep learning. Bioinformatics. 2021:36:5291–5298. doi:10.1093/bioinformatics/btaa1044.
24. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol. 2021:38:5825–5829. doi:10.1093/molbev/msab293.
25. UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023:51:D523–D531. doi:10.1093/nar/gkac1052.
26. Cheng C-Y et al. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 2017:89:789–804. doi:10.1111/tpj.13415.
27. Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021:18:366–368. doi:10.1038/s41592-021-01101-x.
28. Altschul SF et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997:25:3389–3402. doi:10.1093/nar/25.17.3389.
29. Cheng H et al. Chromosome-level wild Hevea brasiliensis genome provides new tools for genomic-assisted breeding and valuable loci to elevate rubber yield. Plant Biotechnol J. 2023:21:1058–1072. doi:10.1111/pbi.14018.
30. Alves-Pereira A et al. Selective signatures and high genome-wide diversity in traditional Brazilian manioc (Manihot esculenta Crantz) varieties. Sci Rep. 2022:12:1268. doi:10.1038/s41598-022-05160-8.
31. Lu J et al. A chromosome-level genome assembly of wild castor provides new insights into its adaptive evolution in tropical desert. Genom Proteom Bioinf. 2022:20:42–59. doi:10.1016/j.gpb.2021.04.003.
32. Cabanettes F, Klopp C. D-GENIES: dot plot large genomes in an interactive, efficient and simple way. PeerJ. 2018:6:e4958. doi:10.7717/peerj.4958.
33. Cingolani P et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012:6:80–92. doi:10.4161/fly.19695.
34. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009:19:1655–1664. doi:10.1101/gr.094052.109.
35. Lee T-H, Guo H, Wang X, Kim C, Paterson AH. SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genom. 2014:15:162. doi:10.1186/1471-2164-15-162.
36. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021:49:W293–W296. doi:10.1093/nar/gkab301.
37. Krzywinski M et al. Circos: an information aesthetic for comparative genomics. Genome Res. 2009:19:1639–1645. doi:10.1101/gr.092759.109.
38. Ito A et al. Identification of acerola (Malpighia glabra L.) accessions by SRAP markers. Trop Agric Dev. 2014:58:30–32. doi:10.11248/jsta.58.30.
39. Ichihara H et al. Plant GARDEN: a portal website for cross-searching between different types of genomic and genetic resources in a wide variety of plant species. BMC Plant Biol. 2023:23:391. doi:10.1186/s12870-023-04392-8.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.