PeachVar-DB: A Curated Collection of Genetic Variations for the Interactive Analysis of Peach Genome Data

Applying next-generation sequencing (NGS) technologies to species of agricultural interest has the potential to accelerate the understanding and exploration of genetic resources. The storage, availability and maintenance of huge quantities of NGS-generated data remains a major challenge. The PeachVar-DB portal, available at http://hpc-bioinformatics.cineca.it/peach, is an open-source catalog of genetic variants present in peach ( Prunus persica L. Batsch) and wild-related species of Prunus genera, annotated from 146 samples publicly released on the Sequence Read Archive (SRA). We designed a user-friendly web-based interface of the database, providing search tools to retrieve single nucleotide polymorphism (SNP) and InDel variants, along with useful statistics and information. PeachVar-DB results are linked to the Genome Database for Rosaceae (GDR) and the Phytozome database to allow easy access to other external useful plant-oriented resources. In order to extend the genetic diversity covered by the PeachVar-DB further, and to allow increasingly powerful comparative analysis, we will progressively integrate newly released data.


Introduction
Peach (Prunus persica L. Batsch) is the most economically important fruit tree species of Prunus genera. According to archaeological evidence, peach was domesticated in China from an unknown ancestor and is related to other wild species of the Amygdalus subgenus, including P. mira, P. davidiana and P. kansuensis (Faust and Timon 1995). Wild-relatives bear fruits of poor eating quality, although they could be a valuable source either for the introgression of disease resistance traits or as rootstocks (Pascal et al. 1998).
Intense breeding activities in peach allowed a progressive improvement of important fruit quality characteristics and technological attributes (Infante et al. 2008, Monet andBassi 2008). Peach genetics also achieved remarkable progress, developing a wide number and types of molecular markers, allowing several linkage maps to be built and the mapping of important traits, as well as the estimation of genetic diversity and domestication paths (Byrne et al. 2012, Li et al. 2013, Akagi et al. 2016. However, relevant quantitative characters, particularly those related to fruit nutritional properties, plant environmental adaptation and disease resistance, have a complex pattern of inheritance, so the understanding of the molecular genetic bases of such traits requires a more detailed knowledge of the gene network expressing the phenotype (Foulongne et al. 2003, Cirilli et al. 2016.
The relatively small size of its genome, estimated at about 265Â10 6 bp, as well as the collinearity with other diploid Prunus species, makes peach an ideal model for comparative and functional genomics within the Rosaceae family (Abbott et al. 2002). The completion of the peach genome project made available a high-quality reference assembly of the double-haploid cultivar 'Lovell' (Verde et al. 2013, enabling highthroughput discovery of genetic markers through the application of next-generation sequencing (NGS) technologies. The whole-genome re-sequencing (WGRS) approach has the potential to reveal most of, if not all, the genomic variations in a target individual in comparison with the reference genome (Mochida and Shinozaki 2010). Sequencing efforts on peach and wild relatives are rapidly growing in light of the potential usefulness of high-density genetic information for germplasm characterization, functional genomics and evolutionary studies (Ahmad et al. 2011, Cao et al. 2016, Velasco et al. 2016.
Given their potential to overcome the well-known limited resolution of the biparental linkage mapping approach, genome-wide association studies are becoming a popular and effective tool for dissecting the genetic architecture of monogenic or polygenic traits in peach (Cao et al. 2014, Micheletti et al. 2015. Also, dense marker information is essential for the adoption of marker-assisted breeding and genome selection, which appears among the most promising strategies to boost breeding programs and trait introgression, as recently shown with peach varieties (Biscarini et al. 2017). The reliability and effectiveness of these novel genomics tools will largely depend on the continuous sharing and integration of massive amounts of genetic and phenotypic data coming from diversified sources (Kang et al. 2016). The optimal management of high-throughput biological information requires the implementation of dedicated bioinformatics facilities.
For this reason, integrative databases were built for several model and non-model species, such as TAIR, Gramene, OryzaBase and the SOL Genomics Network (Mochida and Shinozaki 2010). An important central repository, the Genome Database for Rosaceae (GDR), is available for peach and other species of the Rosaceae family ), but it does not currently host high-density genomic data or the relative annotations obtained from re-sequencing projects of a massive number of accessions. Therefore, whole-genome information about SNPs (single nucleotide polymorphisms) and InDels (small insertions and deletions) is not easily accessible, remaining de facto a powerful but largely unused resource. The availability of an interactive database to explore genomic variability in peach would facilitate research studies, allowing access to information regarding genetic variants even to scientists who have no access to bioinformatics platforms for data analysis.
The present article introduces the first release of PeachVar-DB, an open-source catalog of annotated genomic variants (SNP and InDels) found both in peach (P. persica L. Batsch) and wild-related species of Prunus genera. Variant discovery was achieved by applying an imputation-free joint variant-calling procedure on 146 accessions publicly available on the Sequence Read Archive (SRA). Such a procedure improves variant discovery by leveraging population-wide information from a cohort of multiple samples (Liu et al. 2013). Our objective is to provide easy access to information from WGRS for a range of purposes, from genotyping to functional genomics. The web interface of the database provides several tools to search, visualize and download results, along with other useful annotation statistics and information.

Database content and web access
Whole-genome sequencing data of 125 peach (P. persica L. Batsch) accessions and 21 wild relatives of the Amygdalus subgenus have been downloaded from the NCBI SRA (Leinonen et al. 2011). Following identification of genetic variants for each accession (see the Materials and Methods), data were stored in PeachVar-DB, a NoSQL graph database (Neo4j).
The current database release contains a total of 4,630,814 and 461,785 high-quality SNPs and InDels, respectively, with 42.8% SNPs and 42.5% InDels being exclusively present in wild relatives. About 11% of the total number of SNPs have a minor allele frequency of <0.05, and 5.1% are located in exonic regions ( Table 1). The PeachVar-DB web portal provides a user-friendly interface and is available at the http://hpc-bioinformatics. cineca.it/peach. The home page contains a list of all the input accessions with links to the SRA (Fig. 1A), an interactive pie chart of the number of variants per accession (Fig. 1B), a variant summary table by algorithm (Fig. 1D) and an overall summary of accessions and variants found (Fig. 1C).

Search tools
The 'Search' menu provides several tools to explore genomic variants in PeachVar-DB (Fig. 2). All the query output can be easily downloaded in both text and VCF format, allowing users to perform downstream analysis on a customized subset of variants. Whenever a genetic variant present in PeachVar-DB falls into a known genetic marker interval, the output table displays a button named 'Genetic Marker'.
The 'Search by region' function ( Fig. 2A) runs the query on the genomic co-ordinates specified by the user. Furthermore, the selected genomic windows are hyperlinked to the 'Synteny Viewer' tool available under GDR, allowing the visualization of genome conservation and rearrangement patterns. Users interested in the variants occurring on a specific gene can use the 'Search by gene ID' function ( Fig. 2B) to retrieve this information. Genes can be queried either by typing their name (based on the JGI nomenclature for peach genome v2.1 transcript annotations) or through the provided full list of transcripts. The results can be visualized via JBrowse and are linked to a detailed functional annotation description in Phytozome (Goodstein et al., 2012). Additionally, Gene Ontology identifiers, annotations and descriptions are displayed for the selected gene by clicking on the 'see GOterms' button. In addition, the search can be restricted to specific gene regions (e.g. CDS, untranslated regions or mRNA sequence), using the 'Search by gene features' function ( Fig. 2C). Users interested in accession-specific variants can run the 'Search by accession' function (Fig. 2D). The results table reports an extra column indicating whether the listed variants are common to other accessions as well. Users can also directly paste custom nucleotide sequences, retrieving BLAST results as full-text alignment through the 'Search by similarity' function ( Fig. 2E). Finally, the 'Pairwise accession comparison' function ( Fig. 2F) gives the user the ability to

Accession info
The 'Accession info' link gives access to four menu options named 'dataset information', 'population structure', 'phylogenesis' and 'phenotypic data' (Fig. 3). The 'dataset information' page provides an overview of the input files (e.g. SRA accession names and Bioproject hyperlinks) subdivided into two categories: P. persica and wild relatives (Fig. 3A). The 'population structure' page describes the population structure of the 146 accessions inferred using a subset of 50,000 randomly selected SNPs in ADMIXTURE v1.22 (Alexander et al. 2009). A value of K = 4 explains most of the ancestry within accessions. It differentiates the cluster of wild relatives (P4) and ornamental peach (P2) from edible peach (P1 and P3), the latter showing various degrees of admixture. For higher values of K, landraces and accessions derived from Oriental and Occidental breedings tend to separate. Accession names and membership probabilities can be viewed in pop-up windows by hovering over the histograms (Fig. 3B). The 'phylogenesis' page displays the genetic relationships estimated using SNPhylo (Lee et al. 2014) and PhyML 3.0 (Guindon et al. 2010) using an interactive phylogenetic tree. By clicking on the branch of interest, a pop-up menu allows users to apply several display modification functions (e.g. subtree collapsing; descendent/internal/incident branches visualization; path to root selection as shown in Fig. 3C). 'Phenotypic data' encloses a range of typical peach Mendelian traits, mainly related to quality attributes, such as fruit hairiness (peach or nectarine), shape (round or flat), texture (melting, non-melting and stony hard), flesh color (yellow or white) and taste (acid or sub-acid). The dynamical pie charts provide an overview of the phenotypic variability in peach fruit (Fig. 3D).

The JBrowse Genome Browser
PeachVar-DB integrates the Ajax-based genome browser JBrowse v1.11.3 (Skinner et al. 2009) for an easy to use panning and zooming navigation of the peach reference genome (Fig. 4). In addition to gene models and putative orthologous genes from other species (Arabidopsis thaliana, Prunus avium and Oryza sativa), JBrowse allows the direct visualization of the location of SNPs/InDels. After selecting a variant of interest, users can visualize the complete list of annotations, such as allele frequency and distribution, missing data or assembling statistics of each accession. Moreover, through the native 'Open track file' tool, user-provided files, such as VCF or BAM, can be privately visualized and explored using the reference genome features present in the JBrowse environment. To increase usability, we added the latest release of Prunus genetic markers mapped to the peach genome v2.0.a1 as an additional track.

Statistics
This page reports a range of useful statistics (see Fig. 5).
Information regarding whole-genome linkage disequilibrium patterns, SNP density (Fig. 5A, B), gene density (Fig. 5C), nucleotide diversity (Fig. 5D) and mean read depth, among others, are displayed. For example, as shown in Fig. 5A, responsive histograms provide more detail as the size of the display window increases (e.g. compare the two histograms in Fig. 5A and B in which the same information is dynamically displayed by using two different window sizes, about 2 and 7 million positions, respectively). Whenever a specific accession is selected, a dynamic annotated genetic variants summary is displayed at the top of the page (Fig. 5E).

Conclusions and Future Developments
Tremendous progress of -omics approaches is enabling highthroughput generation of biological data, opening the so called bio-data era. The development of bioinformatics facilities to manage such massive amounts of information is becoming a fundamental aspect of research, particularly in the field of functional genomics and population genetics. Integrative databases are already available for several species, including also those specifically intended for hosting highdensity genome data. PeachVar-DB is the first database specifically dedicated to the storing, sharing and querying of peach genomic variations identified by the analysis of WGRS data from a panel of peach and wild relative accessions. The final results presented in this database rely on the identification of a highly reliable and accurate consensus call-set, consisting of >5 millions peach genetic variants, detected by merging the results of multiple variant-calling algorithms. Genomic variants can be quickly and easily explored, visualized and retrieved by using a well-assorted set of tools ranging from BLAST similarity search to the JBrowse Genome Browser. In turn, this information can be used for a range of purposes, including germplasm characterization, SNP array design and experimental design for functional and evolutionary genomics.
As the amount of NGS data freely available in peach is expected to grow, we designed the PeachVar-DB portal architecture to support new data updates and the integration of new tools. The dataset of accessions and genomic variants will be updated at least every 2 years, depending on the amount of whole-genome sequencing data publicly released. Other aspects will also be improved, including hyperlinks to other important reference communities, particularly EVA (European Variations Archive).
The linking of genomic data with phenotypic traits remains one of the major challenges for crop improvement. In this perspective, the inclusion of phenotypic information has not been dictated solely by the need for a basic description of the input dataset, but rather it represents a first step towards the creation of a more comprehensive database devoted to functional genomics, which will be able to handle and analyze a plethora of both genomic and phenomic data.

Data retrieving and processing
Data were downloaded from the NCBI SRA, selecting only whole-genome sequencing libraries (125 peach accessions, P. persica L. Batsch, and 21 wild relatives of the Amygdalus subgenus), and excluding microRNA and RNaseq libraries or DNA samples derived from chromatin immunoprecipitation or other DNA treatments. The whole set of accessions is largely representative of the genetic variability present in peach germplasm, including ornamental varieties, landrace and improved accessions derived from oriental and occidental breeding. Wild relatives include accessions of P. davidiana, P. mira, P. kansuensis, P. communis (almond) and P. ferganensis, the latter being genetically similar to peach. A detailed list of accession numbers and relative hyperlinks is provided both in the homepage and in the 'Dataset information' page, accessible from the 'Accession info' menu. The full set of sequences provides an overall 1,601Â coverage of the estimated peach genome size (227 Mbp), with a mean coverage for each accession varying from 2.65Â to 109.93Â. Raw reads were quality filtered, trimmed with Trimmomatic v. 0.32 (Bolger et al. 2014) and mapped onto the peach reference 'Lovell' genome V2.1 using the BurrowsÀWheeler Aligner (BWA)-MEM algorithm with default parameters. After duplicate removal and indexing of mapped reads with Picard tools (http://broadinstitute.github.io/picard/), genomic variants were identified by executing three different variant-calling algorithms: GATK-Haplotype Caller (Van der Auwera et al. 2013), BCFtools v. 1.2 and Freebayes (Garrison and Marth, 2012). SNP and InDel discovery and genotyping were performed across all 146 samples simultaneously, with a joint-calling procedure, according to each algorithm's best practices. After filtering for low-quality variants, an intersection of results was performed using a custom Perl script, retaining only those in common to all the algorithms in order to get a more robust set of results (Fig. 6). Once obtained, the final joint VCF file has been split into individual samples, zipped and indexed with the bgzip utility from samtools suite (http://www.htslib.org/doc/tabix.html). Users can freely download all the above files via the 'Download' page of the PeachVar-DB portal.

VCF downstream analysis
The evolutionary relationships among accessions stored in the database were inferred using SNPhylo (Lee et al. 2014) for whole-genome SNP data filtering. Multiple sequence alignment was exerted with MUSCLE software (Edgar 2004), and PhyML 3.0 (Guindon et al. 2010) was used for estimating phylogenetic relationships through the maximum likelihood method and bootstrapping (100). Population structure was estimated through ADMIXTURE software v. 1.22 (Alexander et al. 2009) using 50,000 randomly selected SNPs and by inputting K values from 2 to 8 to identify the value of K (a priori genetic clusters) that maximizes the predictive accuracy based on a 10-fold cross-validation with 10 different fixed initial seeds. The 'compare two accessions' utility relies on the runtime execution of bedtools software (Quinlan and Hall 2010), showing only variants in common between two accessions of choice. The information shown under the 'Statistics' page has been pre-calculated using: the vcf-tools utility (Danecek et al. 2011) for SNP density, linkage disequilibrium and nucleotide diversity; samtools mpileup (Li et al. 2009, Li 2011 for 'Mean Depth'; and BCFtools (http://github.com/ samtools/bcftools) for 'Gene density by region' and 'Gene density by chromosome'.

Web architecture
PeachVar-DB is hosted on a Ubuntu 16.04 Linux operating system with the Apache 2.4.18 Web server. The front-end web-user interface was developed using the AngularJS framework (v. 1.6.4) which presents data dynamically within a single-page application (SPA). The graphical user interface was implemented using the Angular material library (v. 1.1.4), a reference implementation of Google's Material Design Specification, which provides a set of reusable, well-tested and accessible UI components for an optimal final user experience.  The number of SNPs with a fixed size of about 2 million bases. By changing the value of the window size by means of a custom slider, the same information is plotted at a different resolution (i.e. about 7 million bases, see B). Information regarding gene density (C) and nucleotide diversity (D), among others, is also shown. A dynamical summary table of the number of annotated genetic variants is finally reported in (E), whenever a specific accession is selected. Fig. 6 Venn diagrams of genomic variants identified by each variant-calling algorithm (GATK-HC, BCFtools and Freebayes) and their intersection. Variants which were in common to two out of the three algorithms were not retained.
The front-end also relies on third-party graphical libraries, such as d3 (https:// github.com/d3/d3), which allows easy display of custom graphic components (e.g., phylogenetic trees, Venn diagrams, etc.), angular-chart (http://jtblin. github.io/angular-chart.js/), which effectively renders data according to several chart types, and md-data-table (https://github.com/iamisti/mdDataTable) for visualization and pagination of search results. The front-end interface uses AJAX to retrieve data from a web server; domain data are stored in a huge graph-based database.

JBrowse additional tracks
Additional tracks displayed in the JBrowse section were downloaded from the following sources: A. thaliana and O. sativa from Phytozome database, P. avium and peach genetic markers from the GDR database.