Transcriptome visualization and data availability at the Saccharomyces Genome Database

ABSTRACT
The Saccharomyces Genome Database (SGD; www.yeastgenome.org) maintains the official annotation of all genes in the Saccharomyces cerevisiae reference genome and aims to elucidate the function of these genes and their products by integrating manually curated experimental data. Technological advances have allowed researchers to profile RNA expression and identify transcripts at high resolution. These data can be configured in web-based genome browser applications for display to the general public. Accordingly, SGD has incorporated published transcript isoform data in our instance of JBrowse, a genome visualization platform. This resource will help clarify S. cerevisiae biological processes by furthering studies of transcriptional regulation, untranslated regions, genome engineering, and expression quantification.


INTRODUCTION
The annotation of >6000 genes in the reference genome of Saccharomyces cerevisiae is maintained by the Saccharomyces Genome Database (SGD; www.yeastgenome.org) (1), and is based on the common laboratory strain S288C (2). As a model organism database, SGD maintains a record of the sequence and chromosomal location of these gene features and manually curates functional annotation of their protein products in accordance with the guidelines of the Gene Ontology Consortium (GO; www.geneontology.org) (3). This sequence information can be used to determine homology relationships across other organisms, and GO provides a controlled vocabulary and relational ontology for describing molecular functions, biological processes, or cellular components that may be shared by evolutionary conservation.
Precise gene mRNA sequence and coordinates are relevant to studies of mRNA stability (4), localization (5), and translational efficiency (6). Additionally, genome engineering projects seeking to alter or to vectorize the expression of S. cerevisiae genes (7) and transcript-based computational methods for measuring gene expression could also benefit from categorization of full-length mRNA transcripts (8). High-throughput next generation sequencing methodologies that measure the RNA expression of genes or map protein regulation of genomic DNA have become increasingly sensitive, making identification of these sequences easier.
In this paper we describe how SGD has taken data files and associated metadata from these RNA sequencing (RNA-seq) experiments, available at public repositories such as the Gene Expression Omnibus (GEO; www.ncbi.nlm.nih.gov/geo/) (9) and ArrayExpress (www.ebi.ac.uk/arrayexpress/) (10), and visualized them using JBrowse (jbrowse.org) (11), a web-based genome browser application. We have divided datasets into tracks that either map assay values continuously across each position of the genome or highlight regions of interest identified experimentally. One of the categories of biochemical assays represented in SGD's JBrowse instance is the transcriptome: the identification of all RNA transcripts produced from the entire genome under particular conditions. The 5′ and 3′ untranslated regions flanking each gene are captured in these data tracks, where we aim to provide the research community with additional information about fundamental yeast cellular transcription.

INTEGRATION OF TRANSCRIPTOME DATA
A number of publications have separately sequenced the 5′ and/or 3′ ends of transcripts in S. cerevisiae (12–21). SGD has provided data from these studies to map the 5′ and 3′ …
First, we downloaded a text file of transcript coordinates and raw counts from GEO (accession GSE39128). We selected the subset in which each transcript's chromosomal location fully overlapped the protein-coding sequence of a single open reading frame (ORF) annotated by SGD on the same strand. To compare across conditions, we combined transcripts from both conditions and gave them unique identifiers containing the systematic name of the associated SGD ORF. We ordered transcripts first by distance upstream of the start site of the associated ORF and then in descending order of transcript length, and finally created an output file in the General Feature Format (GFF) annotation format. Transcript identifiers are consistent across all files. For example, the YAL008W id199 transcript isoform in the file with the most abundant transcripts found in yeast grown in glucose media (most abundant full-ORF transcripts ypd.gff3) corresponds to the same isoform in all other files, such as most abundant full-ORF transcripts gal.gff3, which contains the most abundant transcripts found under the galactose condition.
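The selection and ordering steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not SGD's actual pipeline: the `Interval` type, the function names, the toy ORF names, the `_idN` numbering scheme, the `example` source column and the `orf` attribute are all hypothetical, and sorting by ascending upstream distance is an assumption because the text does not state the direction.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def covered_orfs(tx: Interval, orfs: Dict[str, Interval]) -> List[str]:
    """Names of same-strand ORFs whose coding span lies fully inside tx."""
    return [
        name
        for name, orf in orfs.items()
        if orf.chrom == tx.chrom
        and orf.strand == tx.strand
        and tx.start <= orf.start
        and orf.end <= tx.end
    ]


def upstream_distance(tx: Interval, orf: Interval) -> int:
    """Bases between the transcript 5' end and the ORF start, strand-aware."""
    if tx.strand == "+":
        return orf.start - tx.start
    return tx.end - orf.end


def select_and_order(
    transcripts: List[Interval], orfs: Dict[str, Interval]
) -> List[Tuple[str, str, Interval]]:
    """Keep transcripts covering exactly one ORF; return (id, orf, transcript)."""
    kept = []
    for tx in transcripts:
        hits = covered_orfs(tx, orfs)
        if len(hits) == 1:  # drops partial and multi-ORF transcripts
            kept.append((hits[0], tx))
    # Assumed order: ORF name, ascending upstream distance, then longest first.
    kept.sort(
        key=lambda p: (p[0], upstream_distance(p[1], orfs[p[0]]), -(p[1].end - p[1].start))
    )
    counters: Dict[str, int] = {}
    ordered = []
    for name, tx in kept:
        counters[name] = counters.get(name, 0) + 1
        ordered.append((f"{name}_id{counters[name]}", name, tx))  # hypothetical ID format
    return ordered


def to_gff3_line(tx_id: str, orf_name: str, tx: Interval) -> str:
    """Format one transcript as a nine-column, tab-separated GFF3 record."""
    columns = [
        tx.chrom, "example", "transcript",
        str(tx.start), str(tx.end), ".", tx.strand, ".",
        f"ID={tx_id};orf={orf_name}",
    ]
    return "\t".join(columns)
```

Because identifiers are assigned once on the combined set, the same isoform would keep the same identifier in every file derived from it, which matches the consistency property noted in the text.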
To begin to define a transcriptome, we initially created a set with all full-length transcripts for all ORFs (unfiltered full-ORF transcripts.gff3). Because growth condition affects what is being transcribed, we split this set into two different transcript sets, based on growth conditions (galactose or glucose). To depict the full range of what is transcribed for a particular ORF, for each growth condition, we then filtered the dataset for the longest transcript for each ORF. Finally, to indicate the predominant transcript isoform under each condition, we also created a most abundant transcript set. The GFF annotation filename suffixes for the longest and most abundant transcript sets denote whether they refer to the glucose nutritional condition (ypd.gff3) or galactose nutritional condition (gal.gff3). All filenames and descriptions can be found in Table 1.
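The per-condition filtering described above can be illustrated with a short Python sketch. It assumes, hypothetically, that each isoform record carries raw counts keyed by condition (`"ypd"` or `"gal"`), that an isoform belongs to a condition when its count there is non-zero, and that ties go to the first isoform encountered. None of these details are specified in the text, and the names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Isoform:
    tx_id: str
    orf: str
    start: int               # 1-based, inclusive
    end: int                 # 1-based, inclusive
    counts: Dict[str, int]   # raw counts per condition, e.g. {"ypd": 12, "gal": 0}

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def pick_per_orf(
    isoforms: List[Isoform], condition: str, score: Callable[[Isoform], int]
) -> Dict[str, Isoform]:
    """For each ORF, keep the isoform with the highest score in one condition."""
    best: Dict[str, Isoform] = {}
    for iso in isoforms:
        if iso.counts.get(condition, 0) == 0:
            continue  # assumed: not observed under this condition
        current = best.get(iso.orf)
        if current is None or score(iso) > score(current):
            best[iso.orf] = iso  # strict '>' keeps the first isoform on ties
    return best


def longest(isoforms: List[Isoform], condition: str) -> Dict[str, Isoform]:
    """Longest isoform per ORF among those seen in the condition."""
    return pick_per_orf(isoforms, condition, lambda iso: iso.length)


def most_abundant(isoforms: List[Isoform], condition: str) -> Dict[str, Isoform]:
    """Isoform with the highest raw count per ORF in the condition."""
    return pick_per_orf(isoforms, condition, lambda iso: iso.counts[condition])
```

Calling `longest` and `most_abundant` once per condition would yield four filtered sets, corresponding to the four longest and most abundant transcript tracks described in the text.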
To reflect how many transcripts covered each individual nucleotide of the S. cerevisiae genome, coverage tracks were generated for each condition from the raw transcript text file. Because of the size of the files, we split transcript coverage into plus and minus strands and created separate bigWig (.bw) files. The presence of intergenic, truncated and polycistronic transcription is also reflected in the provided files (Table 1).
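A strand-specific coverage calculation of this kind can be sketched as follows. This is an illustrative Python example, not the procedure SGD used: the tuple layout and function name are assumptions, and the output is a list of bedGraph-style runs (0-based, half-open) of the sort commonly converted to bigWig with external tools. Each transcript contributes its raw count to every base it spans.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed record layout: (chrom, start, end, strand, raw_count),
# with 1-based inclusive coordinates.
Transcript = Tuple[str, int, int, str, int]
Run = Tuple[int, int, int]  # (start0, end0, depth), 0-based half-open


def strand_coverage(
    transcripts: Iterable[Transcript], strand: str
) -> Dict[str, List[Run]]:
    """Sum raw counts per base for one strand and return runs of equal depth."""
    # Difference array: +count where a transcript begins, -count just past its end.
    deltas: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for chrom, start, end, tx_strand, count in transcripts:
        if tx_strand != strand or count == 0:
            continue
        deltas[chrom][start - 1] += count
        deltas[chrom][end] -= count

    runs: Dict[str, List[Run]] = {}
    for chrom, changes in deltas.items():
        depth = 0
        previous = None
        chrom_runs: List[Run] = []
        for pos in sorted(changes):
            if changes[pos] == 0:
                continue  # opposing changes cancelled; depth is unchanged here
            if previous is not None and depth > 0:
                chrom_runs.append((previous, pos, depth))
            depth += changes[pos]
            previous = pos
        runs[chrom] = chrom_runs
    return runs
```

Running the function once with `"+"` and once with `"-"` for each condition would give the four coverage sets that the text describes as separate plus-strand and minus-strand files.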

TRACK VISUALIZATION AND ANNOTATION DOWNLOAD
SGD's JBrowse instance is accessible through the 'Genome Browser' link within the 'Sequence' menu in the purple toolbar that runs across the top of most SGD webpages, or via direct URL (browse.yeastgenome.org). Within the JBrowse browser window, the TIF-seq transcriptome tracks can be viewed by using the 'Select tracks' button in the top left corner. In the resulting slide-out window, nine data tracks comprise the transcriptome (unfiltered transcripts that fully overlap ORFs, longest transcript in each of two conditions, most abundant transcript in each of two conditions, and both plus and minus strand coverage in each of two conditions). These tracks can be found in several ways, using the categorical track selector on the left or the text query box at the top. Choosing 'Pelechano' within the 'First Author' category or '23615609' within the 'PMID' category and checking the leftmost box for each track in the metadata display table makes the tracks viewable in the JBrowse navigation window. These instructions are also reviewed in a video tutorial on SGD's YouTube page (www.youtube.com/SaccharomycesGenomeDatabase). A recent post on the SGD Blog (www.yeastgenome.org/blog/explore-the-s288ctranscriptome-in-jbrowse) has the YouTube tutorial embedded and provides direct links to the tracks in JBrowse.
Nucleic Acids Research, 2020, Vol. 48, Database issue D745
Once displayed in the JBrowse navigation window, tracks can be distinguished by color. Unfiltered transcripts are displayed in solid yellow and can number up to the hundreds. By default, the browser clips the number of transcripts viewed at close zoom or collapses the track into a 'density' view at far zoom (Figure 1). For the most abundant and longest transcript tracks, transcript identifiers are displayed beneath the glyph (Figure 2B, C). Pink shading represents the logarithmically scaled transcript abundance; darker shading reflects higher abundance. Clicking on the glyph for a transcript reveals a popup listing its exact coordinates, raw abundance in the particular media condition, and predicted sequence based on the reference genome (Figure 3). Quantitative coverage tracks for each condition are presented as histograms: blue for the plus strand and red for the minus strand (Figure 2A, D). These tracks represent the cumulative raw abundances of all transcripts at each position of the S288C reference sequence.

FUTURE DIRECTIONS
There are multiple ways to expand the transcriptome data that SGD provides. Pelechano's dataset examines transcripts and their abundance at specific glucose/galactose concentrations for mid-log phase cells. However, additional datasets, such as those from experiments that examine conditions utilizing different chemical, genetic or epigenetic perturbations, as well as extended time courses, can be incorporated. Large heterogeneity between individual transcripts for the same gene was a key observation of the Pelechano study. Incorporation of single-cell sequencing methodologies could clarify the varied transcriptional landscape between individual cells and determine the existence of burgeoning subpopulations over time (24,25). Multiple studies also profile the transcriptional heterogeneity of untranslated regions (UTRs) and transcription start sites (TSSs) utilizing alternative deep sequencing technologies (26,27). Overlaying existing ribosome profiling (Ribo-seq) studies with the transcriptome data could expand our understanding of transcriptional dynamics (28). SGD will continue to integrate such research systematically and to display it informatively, to help provide insight into the S. cerevisiae transcriptome.

DATA AVAILABILITY
JBrowse is an open source genome browser available in the GitHub repository (https://github.com/GMOD/jbrowse). SGD software is open source and available from the GitHub repository (https://github.com/yeastgenome). The TIF-seq data from Pelechano et al., 2013, is accessible at NCBI GEO archive (accession GSE39128).

Figure 3. Transcript isoform dialog popup. The 'Ypd' or 'Gal' attribute lists the raw abundance in the glucose- or galactose-media condition, respectively. 'Region sequence' displays the predicted sequence based on the S288C reference genome.