Integrated curation and data mining for disease and phenotype models at the Rat Genome Database

Abstract
Rats have been used as research models in biomedical research for over 150 years. These disease models arise from naturally occurring mutations, selective breeding and, more recently, genome manipulation. Through the innovation of genome-editing technologies, genome-modified rats provide precision models of disease by disrupting or complementing targeted genes. To facilitate the use of these data produced from rat disease models, the Rat Genome Database (RGD) organizes rat strains and annotates these strains with disease and qualitative phenotype terms as well as quantitative phenotype measurements. From the curated quantitative data, the expected phenotype profile ranges were established through a meta-analysis pipeline using inbred rat strains in control conditions. The disease and qualitative phenotype annotations are propagated to their associated genes and alleles if applicable. Currently, RGD has curated nearly 1300 rat strains with disease/phenotype annotations and about 11% of them have known allele associations. All of the annotations (disease and phenotype) are integrated and displayed on the strain, gene and allele report pages. Finding disease and phenotype models at RGD can be done by searching for terms in the ontology browser, browsing the disease or phenotype ontology branches or entering keywords in the general search. Use cases are provided to show different targeted searches of rat strains at RGD.


Introduction
The laboratory rat (Rattus norvegicus) has been used as an animal model in biomedical research for over 150 years. The first recorded inbreeding of a rat strain for scientific purposes was started by King in the early 1900s (1). Since then, there have been increasing numbers of rat models from naturally occurring mutations, selective breeding and, more recently, genome manipulation, including chemical mutagenesis and genome editing. Through the innovation of genome-editing technologies (2) and rat embryonic stem cells (3), genome-modified rats provide precision models of disease by disrupting or complementing targeted genes. The genome-edited models not only confirm the causal effect of targeted genes but also can be used as in vivo models to further study disease pathogenesis and treatment. These genome-modified rats, and selectively bred rats and their parental strains, have been used in a wide range of research in the fields of physiology, pharmacology, toxicology, nutrition, behavior, immunology and pathology. As a result, there are more than 1.6 million publications of rat research in PubMed, with about 35 000 being added every year. The vast amount of data embedded in these publications has great value to scientists in future research planning. The Rat Genome Database (RGD; http://rgd.mcw.edu) was created in 1999 to organize the existing knowledge and present it to the research community with integrated genomic, genetic, phenotypic and disease datasets.
To facilitate integration of and access to rat data, RGD uses multiple ontologies to standardize data and present these datasets in a format that can be read by humans and machines. RGD's primary focus is manual annotation of genes, strains and quantitative trait loci (QTL). Particularly noteworthy are the manually curated disease annotations across rat, human and mouse genes at RGD. The primary annotations are made by curators from peer-reviewed journals and transferred to the other two genes of the ortholog group (4). Automated pipelines are set up to bring in data from other databases to complement the RGD manually curated disease-gene data. RGD regularly imports disease data from the ClinVar database (5), OMIM (6) and the Comparative Toxicogenomics Database (7), and maintains archival disease annotations from the Genetic Association Database (8).
To make rat strain data easy to find, RGD established the Rat Strain (RS) Ontology to depict relationships among strains (9). The RS Ontology organizes rat strains into different strain types based on their breeding history and genetic backgrounds. Using the rat strain as a data hub, users can navigate among different datasets associated with the strain of interest. These datasets include manually curated disease annotations, qualitative mammalian phenotype (MP) annotations and quantitative phenotype annotations.

Strain registration
RGD currently has a catalogue of more than 3000 registered strains and substrains. These strains were curated from publications and submissions by authors and vendors. RGD regularly receives new strain submissions from major rat resources such as the Gene Editing Rat Resource Center (https://rgd.mcw.edu/wg/gerrc/), the Rat Resource and Research Center (http://www.rrrc.us/), the National BioResource Project (http://www.anim.med.kyoto-u.ac.jp/nbr/Default.aspx) in Japan and commercial rat providers. Researchers can submit their strains and obtain the official symbols and IDs used for manuscript publication. The Strain Submission Form (https://rgd.mcw.edu/rgdweb/models/strainSubmissionForm.html?new=true) can be found via the 'Submit Data' link on the RGD homepage (http://rgd.mcw.edu/) or on the menu bar of the Strain Search page, which can be accessed from the link in the center of the RGD homepage. Since the launch of the RS Ontology in 2009, there has been a steady increase in registered strains (Figure 1A). Of all strains, close to one-third are congenics, inbred host strains carrying a particular locus from a donor strain. The congenics were used extensively to identify chromosomal elements associated with a particular phenotype in early genomic studies. The number of mutant strains, which include rats carrying spontaneous or induced mutations, almost tripled between 2009 and 2018. The transgenic strains, rats carrying foreign DNA sequence, are another strain type that has shown a steady increase since 2009.

RS Ontology
The RS Ontology includes 15 first-level nodes depicting the strain types (Figure 1B). Each strain type is named following standardized nomenclature rules (9). The breeding history and genome modification information are embedded in the tree structure. For example, the salt-sensitive Sprague-Dawley (SS) rat strain was developed by L.K. Dahl (10) and then distributed to different institutions where three substrains were developed: SS/Jr, SS/Hsd and SS/N. SS/JrHsdMcwi was originally derived as SS/Jr, then bred at Harlan (Hsd), and is currently housed at the Medical College of Wisconsin (Figure 1C). Searching RGD with SS/JrHsdMcwi as a keyword returns 419 related strains, including mutants, congenics, transgenics and inbreds (Figure 1D). Using the Ontology Browser, users can find rat strains by how they were generated or from which parents they were derived. Consider SS.BN-(D13Rat151-D13Rat197)-Serpinc1 em2Mcwi (RGD: 12790721, RS:0004357) as an example (Figure 2A). Its two ontological parents, SS.BN-(D13Rat151-D13Rat197)/Mcwi (RS:0001711) and SS-Chr 13BN mutants (RS:0004542), shown to its left in the browser, indicate that the strain is derived from SS.BN-(D13Rat151-D13Rat197)/Mcwi and is a mutant strain derived from SS congenics introgressed with a Chr13 DNA fragment from Brown Norway. Strains with a sibling relationship, which include congenic and mutant strains, are co-listed in the same column. The Strain Report page containing all RGD data related to the strain is accessed via the 'View Strain Report' link on the browser page. The Strain Report page (Figure 2B) contains links back to the Browser and to the mutant allele Serpinc1 em2Mcwi (Figure 2C) carried by the strain, along with other basic information about the strain. The mutant allele created by gene-editing techniques is named according to standard nomenclature rules and is linked to its parent gene report page and its associated rat strain report page.
Each mutant allele and the allele-carrying strain are annotated with diseases and/or phenotypes, and the annotations are propagated to the parent gene Serpinc1.
These manual annotations are integrated in the database and can be found in the related pages linked by the gene. For example, the strain SS.BN-(D13Rat151-D13Rat197)-Serpinc1 em2Mcwi has been found to be more susceptible to kidney reperfusion injury than the control wild-type, based on the study by Wang et al. (11). The curated diseases and phenotypes were integrated into the Serpinc1 em2Mcwi mutant allele page, the Serpinc1 gene page and the mutant rat strain page. Users can find these annotations in the Annotation section ( Figure 2E) of all three report pages, and each page provides links to the others.
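The propagation from strain to allele to parent gene described above can be pictured with a minimal Python sketch. The dictionaries, the `propagate` helper and the plain-text term label are hypothetical illustrations and not RGD's schema or pipeline; the sketch only assumes that a strain may carry one known allele and that each allele has one parent gene.

```python
from collections import defaultdict

# Hypothetical lookup tables (symbols written as they appear in the text,
# with the superscript flattened): strain -> carried allele, allele -> gene.
STRAIN_TO_ALLELE = {
    "SS.BN-(D13Rat151-D13Rat197)-Serpinc1 em2Mcwi": "Serpinc1 em2Mcwi",
}
ALLELE_TO_GENE = {"Serpinc1 em2Mcwi": "Serpinc1"}


def propagate(strain_annotations):
    """Copy each strain-level term to the carried allele and its parent gene.

    strain_annotations: iterable of (strain_symbol, term) pairs.
    Returns a dict mapping each object symbol to the set of terms on it.
    """
    by_object = defaultdict(set)
    for strain, term in strain_annotations:
        by_object[strain].add(term)
        allele = STRAIN_TO_ALLELE.get(strain)
        if allele is None:
            continue  # no known allele association: nothing to propagate
        by_object[allele].add(term)
        gene = ALLELE_TO_GENE.get(allele)
        if gene is not None:
            by_object[gene].add(term)
    return dict(by_object)
```

Called with one annotation on the Serpinc1 mutant strain, the helper returns the same term on the strain, the allele and the gene, which mirrors why the annotation appears on all three report pages.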

Functional annotations: disease and phenotype
The rat has been a preferred model for complex disease research, such as studies of cardiovascular disease, metabolic syndrome and neurobehavior. To understand the underlying molecular mechanisms of disease, researchers have been inbreeding rats, generating congenic/consomic strains and engineering rat genes to create strains with different disease manifestations or susceptibilities. These strains, either susceptible or resistant to the targeted diseases/phenotypes, are annotated with standardized RGD Disease Ontology (RDO) (4) (12) or MP Ontology (13) terms. These functional annotations are listed in the Annotation section on all the report pages for strains, genes and alleles and can be expanded to view details such as evidence codes and citations (Figure 2E). In the expanded view, each annotation is listed with the original reference, evidence code, optional qualifier and free text notes. About 35% of the strains (∼1300 strains) are annotated with one or more disease/phenotype terms (Figure 3A). Among strain types, over 65% of congenic strains are annotated, while less than 20% of transgenics and mutants are annotated (Figure 3B). We anticipate more publications will be generated from the recently created gene-modified strains in the near future. These gene-modified strains carrying defined alterations in target genes are ideal tools for disease study. By strain group, the SS group carries the most mutant alleles, followed by F344 (Figure 3C). Some genes have multiple alleles carried by different host rat strains. The rat Lepr gene (RGD:3001) has the highest number of mutant alleles found in different genetic backgrounds. These mutant alleles include the spontaneous mutant alleles Lepr fa, Lepr cp and Lepr m1Rll and the engineered mutant alleles Lepr em1, Lepr em2, Lepr em3 and Lepr em2Mcwi. These allele-carrying strains are annotated with diseases and/or phenotypes, and the annotations are propagated to the parent gene Lepr.

PhenoMiner Expected Range
In addition to qualitative disease and phenotype curation, RGD also curates quantitative phenotype annotations for rat strains through the PhenoMiner tool (https://rgd.mcw.edu/rgdweb/phenominer/home.jsp) (14); these annotations are accessed through the PhenoMiner user interface tool (15). The PhenoMiner annotations, like disease/phenotype annotations, are accessible from the Annotation section on the strain report page. Currently, there are more than 1200 strains curated with PhenoMiner annotations and more than half of these PhenoMiner strains are also annotated with disease or phenotype terms (Figure 4A). The PhenoMiner user interface query is built from combinations of the four ontologies (16) used for curation. The tool retrieves all the matched quantitative phenotype records in the database and displays them as a bar chart with the specified clinical measurement (Figure 4B). The result gives researchers an overview of the value range curated at PhenoMiner and can be used as a reference in study planning. To increase the utility of PhenoMiner data, RGD launched a new project, PhenoMiner Expected Ranges, to perform statistical meta-analysis on PhenoMiner data. The aim of this project is to provide an expected range of a phenotype measurement based on the records in the database. For example, Figure 4B shows all the diastolic blood pressure records from the WKY strain group. Statistical tests were applied to these records to decide how they could be grouped to calculate expected ranges with statistical significance. Several exploratory analyses were conducted on publications and numbers of experiments in the database to determine inclusion/exclusion of individual studies/observations in the analysis. The heterogeneity in the meta-analysis was examined by Cochran's Q and I², which were then used as the threshold to choose the appropriate meta-analysis model for each dataset.
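The two heterogeneity statistics named above have standard textbook definitions, which the following Python sketch computes from per-study means and variances using inverse-variance weights. The function names, the 50% cutoff and the "fixed-effect"/"random-effects" labels are assumptions chosen for illustration; the text does not state the threshold or models actually used.

```python
def heterogeneity(means, variances):
    """Return (fixed-effect pooled mean, Cochran's Q, I^2 in percent)."""
    if len(means) != len(variances) or len(means) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    total_weight = sum(weights)
    pooled = sum(w * m for w, m in zip(weights, means)) / total_weight
    # Cochran's Q: weighted squared deviations from the pooled mean.
    q = sum(w * (m - pooled) ** 2 for w, m in zip(weights, means))
    df = len(means) - 1
    # I^2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2


def choose_model(i2, cutoff=50.0):
    """Pick a model label from I^2; the default cutoff is an assumed placeholder."""
    return "random-effects" if i2 > cutoff else "fixed-effect"
```

A dataset whose studies agree closely yields a small I² and would be pooled under the simpler model in this sketch, while strongly divergent studies push I² toward 100%.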
The statistical theory and computational algorithm are described in the accompanying paper by Zhao et al. in this issue. The PhenoMiner Expected Ranges tool (https://rgd.mcw.edu/rgdweb/phenominer/phenominerExpectedRanges/views/home.html) can be accessed from the Phenotypes and Models icon on the RGD homepage. In the tool, the meta-data are grouped by the Vertebrate Trait Ontology (17) listed on the left panel (Figure 4C). Once a trait is selected, the table of available expected ranges of the phenotypes within the selected trait is displayed. The most comprehensive datasets are under the circulatory system trait, where 10 phenotypes are available for 11 inbred strain groups and 200 expected ranges were calculated. Diastolic blood pressure, one of the phenotypes under the trait, has expected ranges calculated from three strain groups, SHR, SHRSP and WKY; only the ranges calculated from WKY were selected for display (Figure 4D). Each color-coded box represents an interquartile range: the line in the middle is the data median, the top of the box is the third quartile and the bottom is the first quartile. If enough records are available in the database, usually more than four records per phenotype, the expected ranges are also calculated with different stratifications such as age, sex or method of measurement. This graph displays the expected ranges for the overall WKY group, for WKY rats older than 100 days and for males or females only. The result displayed can be modified by applying filters such as age, sex and methods. At present, only measurements made under control conditions were used in the analysis. The data table following the graph display provides users with the details on the strains, methods, conditions, sex and age groups associated with the displayed graph. The links to original PhenoMiner data are also provided in the far right column of the data table.
Figure 5 (caption, partial). The 'A' icon is a link to the ontology report page (D) listing all RGD objects annotated to the disease term 'Developmental Diseases' and its child terms. The links provided in the browser page and the ontology report page allow users to navigate between the terms and annotations. Strains associated with developmental diseases can be downloaded from the download button above the species tabs.

Finding disease/phenotype rat models
To find data objects annotated with ontology terms, users can either search the terms by keywords or browse the selected ontology. The searching and browsing of ontologies (Figure 5A and B) is accessible from the Ontologies icon on the RGD homepage. The RDO is currently used for disease curation (4) (12), and MP (13) is used for rat-strain phenotype curation. The ontology browser (18) uses three columns to display all the parent terms on the left and all the child terms on the right of the selected term. When the disease browser is opened for browsing the Disease Ontology, the top-level term 'disease' is shown in the browser with child terms listed on its right. The term Developmental Diseases (DOID: 9008582) is a direct child of 'disease', and selecting this term moves it to the center column with the parent term 'disease' on the left and all the child terms on the right, as shown in Figure 5C. Terms having child terms are displayed with a '+' icon, and the 'A' icon to the right of Developmental Diseases means that annotations for the term itself and/or its child term(s) exist in RGD. The 'A' icon is a link to the Ontology report-annotations page listing all RGD objects annotated to that term. There are more than 6900 rat genes, 185 rat QTL and 107 rat strains annotated with 'Developmental Diseases' or any of its more specific child terms (Figure 5D). The strain-associated disease annotations are viewed under the 'Strains' tab. Users can view all the annotation-associated data such as strain symbols, chromosome positions of the genes, evidence codes used in making the annotations, references and notes by clicking the 'view all columns' box. Annotations can be downloaded in the 'view all columns' table format and analysed in a spreadsheet. The non-redundant list of developmental disease rat models presented in Table 1 was selected from annotations manually curated from publications.
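Retrieving objects annotated to a term "or any of its more specific child terms" amounts to a traversal of the ontology graph. The Python sketch below illustrates the idea on plain dictionaries; the function names and the data layout (`children` mapping a term to its direct child terms, `annotations` mapping a term to directly annotated objects) are hypothetical and do not represent RGD's implementation.

```python
def descendants(term, children):
    """Return the term plus every term reachable through child links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a term can be reached via several parents
        seen.add(current)
        stack.extend(children.get(current, ()))
    return seen


def annotated_objects(term, children, annotations):
    """Collect objects annotated to the term itself or to any descendant."""
    found = set()
    for t in descendants(term, children):
        found.update(annotations.get(t, ()))
    return found
```

Because results are collected into a set, an object annotated to several child terms is counted once, which corresponds to the non-redundant listing described in the text.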
The superscript portion of strain symbols reveals whether the strain is a spontaneous mutant (m), such as WAG-F8 m1Ycb (RGD: 2314904) carrying a spontaneous mutation of the rat F8 gene, or a genome-modified mutant, such as SD/Novo-F8 em1Sage−/− (RGD: 11531091) carrying an endonuclease-mediated (em) mutation of F8. The strain report page has a link to the allele report page where the details of the allele and the link to its parent gene are available.
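The two superscript prefixes mentioned above can be recognized mechanically. The following Python sketch handles only the `m` and `em` prefixes discussed in the text and reads the remainder as a serial number followed by an optional laboratory code; that reading of the trailing letters, the function name and the returned field names are assumptions made for illustration.

```python
import re

# Prefix ("em" or "m"), serial number, optional trailing letters.
_ALLELE = re.compile(r"^(em|m)(\d+)([A-Za-z]*)")
_KINDS = {
    "m": "spontaneous mutation",
    "em": "endonuclease-mediated mutation",
}


def classify_allele(superscript):
    """Classify an allele superscript such as 'm1Ycb' or 'em1Sage'.

    Returns a dict with the mutation kind, serial number and trailing code,
    or None when the superscript does not start with 'm' or 'em' plus digits.
    Anything after the matched part (e.g. a zygosity suffix) is ignored.
    """
    match = _ALLELE.match(superscript)
    if match is None:
        return None
    prefix, serial, lab_code = match.groups()
    return {"kind": _KINDS[prefix], "serial": int(serial), "lab_code": lab_code}
```

Older allele names without such a prefix, such as the `fa` or `cp` alleles of Lepr, fall outside this pattern and return `None` in the sketch.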

Finding disease models by searching the MP Ontology.
Using 'abnormal lymphocyte morphology' as the search term in the ontology search (Figure 5A) retrieved Human Phenotype terms, MP terms and their child terms. The MP term 'abnormal lymphocyte morphology (MP:0002619)' was selected to retrieve rat strains that were annotated with this ontology term and its child terms at RGD. Annotations can be viewed by going to the Ontology Report-Annotations page (MP:0002619) via the 'A' icon next to the MP term in the search results. Strains were downloaded and processed in the same way as the developmental disease strains shown in Table 1. The non-redundant list of phenotype models is presented in Table 2. All the strains in Table 2 carry identified mutant alleles that are associated with the abnormal phenotypes. Among them, Rag1 mutant alleles cause abnormal phenotypes in B cells, T cells and NK cells.
To find strains associated with the cytochrome P450 superfamily, users enter 'cyp' as the search term in the general search box, and the search engine retrieves all the entries matching 'cyp' in the description, object symbols, origins, synonyms and other indexed terms. The results (Figure 6) are aggregated into groups and presented in a matrix across species, data objects, ontology terms and references. There are 23 strains matching 'cyp' by name, symbol, synonyms or origins. These 23 strains include mutant, congenic, transgenic and consomic strains, and their tallies are shown next to strain types on the following page, accessed by clicking the '23'. The data can be downloaded as an Excel file or sent to other RGD tools such as PhenoMiner or Variant Visualizer for further analysis. Each strain is hyperlinked to its report page where details of the strain and its associated annotations can be found. The strains tagged with 'PM' icons have PhenoMiner annotations available at RGD.

Data accessibility
RGD is actively participating in the NIH Data Commons Initiative. RGD's application programming interfaces (APIs) were developed according to the SmartAPI specifications and have been registered at smart-api.info with the tag 'NIHdatacommons' to conform to the Data Commons
Table 1. Rat strains with associated developmental disease terms are listed with curated disease terms, official strain symbols and the known disease genes/alleles with representative references. The downloaded strain data are processed by selecting non-redundant disease annotations with manual evidence codes. The associated alleles not included in the downloaded file can be found on the strain report pages.

Conclusion
Using strain as a data object, RGD curators annotate experimental rat models with disease and phenotype terms. Curated strains can be searched by genes, diseases or phenotypes, and the results can be imported into RGD tools or downloaded for customized analysis. To facilitate data navigation, annotations are integrated among genes, strains and associated mutant alleles. Users can identify the disease strain, disease-causing allele and its parent gene beginning on one of the report pages. For more than a decade, RGD has undertaken a focused curation effort aimed at capturing comprehensive disease and phenotype data of studied rat strains. In addition to using qualitative terms to curate phenotypes, RGD has moved phenotype curation to the quantitative level by developing the PhenoMiner tool. From these curated quantitative phenotype data, Zhao et al. (accompanying paper in this issue) have developed the Expected Range tool to allow researchers to visualize the phenotype profiles across multiple strains under similar experimental conditions from multiple sources. The combination of expected range strain profiles and the disease/phenotype annotations of each individual strain can inform researchers whether the strain is a good disease model for the targeted disease and how much deviation of the phenotype the strain exhibits as compared to the normal range of rat strains. Rat disease models have evolved from spontaneous and randomly induced mutants to targeted mutants generated by advanced genome-editing techniques. The targeted mutants are powerful tools to dissect pathogenesis at the molecular level. However, the disease manifestation of a mutant allele typically also depends on other host genes. Rats in which the same mutation has been engineered in different genetic backgrounds are valuable tools to study how the genomic context plays a role in disease manifestation. 
Since the publication of the reference genome for the Brown Norway rat in 2004 (19), many other rat strains have also been sequenced (20). RGD houses genomic variation data, relative to Rnor3.4, Rnor5.0 and Rnor6.0 reference genomes, for more than 40 rat strains and these data are available for various browsing and analysing tools (21). The combination of annotations with genome variation profiles among rat strains provides robust utility for translational genomics.