From multiallele fish to nonstandard environments, how ZFIN assigns phenotypes, human disease models, and gene expression annotations to genes

Abstract
Danio rerio is a model organism used to investigate vertebrate development. Manipulation of the zebrafish genome and resultant gene products by mutation or targeted knockdown has made the zebrafish a good system for investigating gene function, providing a resource to investigate genetic contributors to phenotype and human disease. Phenotypic outcomes can be the result of gene mutation, targeted knockdown of gene products, manipulation of experimental conditions, or any combination thereof. Zebrafish have been used in various genetic and chemical screens to identify genetic and environmental contributors to phenotype and disease outcomes. The Zebrafish Information Network (ZFIN, zfin.org) is the central repository for genetic, genomic, and phenotypic data that result from research using D. rerio. Here we describe how ZFIN annotates phenotype, expression, and disease model data across various experimental designs, how we computationally determine wild-type gene expression and the phenotypic gene, and how these results allow us to propagate gene expression, phenotype, and disease model data to the correct gene or gene-related entity.


Introduction
Understanding gene and protein function can provide insight into the intricate cellular mechanisms that are responsible for the development, growth, pathology, and senescence of organisms. Observing the results of gene mutations is the cornerstone of elucidating and understanding gene function. The zebrafish, Danio rerio, has been used in forward and reverse genetic screens to study gene function and understand the mechanisms of vertebrate development (Driever et al. 1996;Haffter et al. 1996;Golling et al. 2002;Moens et al. 2008;Varshney et al. 2013). The results of gene function studies in zebrafish are relevant to understanding human gene function due to the conservation of gene sequences and functions between zebrafish and humans (Postlethwait et al. 2000;Howe, Clark et al. 2013). Due to similarities between zebrafish and human organ functions and physiology, zebrafish have been used to model human diseases that affect the cardiovascular (Smith et al. 2009;Liu et al. 2019), nervous (Chapman et al. 2013;Hin et al. 2020), visual (Zhang et al. 2016), muscular (Majczenko et al. 2012;Widrick et al. 2016), and many other systems. In addition to understanding gene function and disease pathogenesis, zebrafish are increasingly used for toxicology and drug discovery studies, as well as research that explores the effects of genotype and environment on phenotype and disease (Zon and Peterson 2005;Kaufman et al. 2009;Williams et al. 2014;Wheeler et al. 2019;Cassar et al. 2020).
The Zebrafish Information Network, ZFIN (zfin.org), is the database resource for zebrafish research that annotates, curates, and makes data available from zebrafish research that spans genetic perturbations, chemically induced phenotypes, and human disease models, as well as gene expression (Sprague et al. 2008;Ruzicka et al. 2015;Howe et al. 2017). ZFIN curates gene expression, phenotype, and human disease model data by annotating the genotypes, experimental conditions, anatomical structures, phenotype statements, and disease models reported in zebrafish research publications (Sprague et al. 2006;Howe, Bradford et al. 2013;Bradford et al. 2017). These annotations can include genotypes with one or many alleles and experimental conditions that range from standard conditions to manipulation of temperature, diet, chemicals, or other conditions. Due to the breadth of data that represent combinations of genotype and environment that produce a phenotypic outcome or human disease model, it can be challenging to determine whether a particular allele or environment is causative. To understand gene function and clarify how gene dysfunction contributes to disease, it is necessary to separate genetic phenotypes from those caused by the environment. ZFIN has developed a data model and algorithms that distinguish the genotype and environment components of an annotation to parse genetic and environmental contributors to phenotypes, using the results to infer which genes are causative of a phenotype. Here we discuss the ZFIN annotation components and computational logic used to infer wild-type gene expression, gene-phenotype and gene-human disease relationships, and the ZFIN webpages and download files (https://zfin.org/downloads) where the data are available.

ZFIN annotation components
There are three main components to ZFIN gene expression, phenotype, and human disease model annotations: (1) the genotype of the fish including gene knockdown reagents used, (2) the experimental conditions applied, and (3) an ontological representation of the results.

Fish
Gene mutation and sequence targeting reagents (STRs), which knock down gene products, are routinely used in zebrafish to study gene function. To represent all of the genes that are affected due to either gene mutation or knockdown, ZFIN uses a data model that groups the genotype and applied STR in an object called Fish. Mutant gene loci are curated as alleles of genes and are part of a genotype together with the background strain when that information is provided. Zebrafish are also amenable to transgene insertion to knock out genes (Amsterdam et al. 2004), overexpress endogenous or other species genes (Sabaawy et al. 2006;Padanad et al. 2012), insert mutant genes (Kimelman et al. 2017;Endo et al. 2022), or express fluorescent proteins to mark anatomical structures (Lawson and Weinstein 2002;Clark et al. 2011). Transgene insertion is accomplished by the injection of DNA constructs (transgenic constructs) into zebrafish embryos, which are then raised to maturity and screened for stable germline transmission (Stuart et al. 1990;Culp et al. 1991). ZFIN creates records for transgenic constructs and makes an association with the transgenic genomic features (alleles) using a phenotypic or innocuous relationship. The phenotypic relationship is used with constructs that drive expression of either an endogenous zebrafish gene or a gene from another species (Table 1). These constructs are expected to produce protein products that can have a phenotypic effect. The innocuous relationship is used with constructs that drive the expression of fluorescent proteins or are unable to produce a protein product unless inserted near a native promoter, such as gene trap constructs. Information on the innocuous or phenotypic relationship between a genomic feature and a construct is available in the "Innocuous/phenotypic construct details" download file.
Transgenic alleles are represented in the genotype when applicable, and genotypes are considered innocuous or phenotypic depending on the relationship between the allele and construct. Site-specific mutagenesis using CRISPRs and TALENs is also used in zebrafish to screen for candidate genes (Jao et al. 2013;Zu et al. 2013). Zebrafish crispants, F0 founder zebrafish created using CRISPRs, are also used to phenocopy loss-of-function mutants (Bek et al. 2021). In addition, gene function can be investigated in zebrafish using morpholinos, which knock down the gene by targeting RNA, effectively silencing the gene product (Nasevicius and Ekker 2000;Ekker and Larson 2001). ZFIN groups morpholinos, CRISPRs, and TALENs in a class called STR due to the sequence-specific nature of these reagents. Both alleles and STRs have relationships with the genes they knock out or target. ZFIN developed the Fish data model to facilitate the identification of causative genes, given the many ways in which gene function is investigated in zebrafish.
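The Fish data model described above can be illustrated with a minimal Python sketch. This is not ZFIN's schema: the class names, field names, and the allele name "hypothetical-allele-1" are assumptions made for illustration. Only the grouping of genotype (alleles plus background) with applied STRs, and the innocuous or phenotypic construct relationship, come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional, Tuple


class ConstructRelationship(Enum):
    """Relationship between a transgenic allele and its construct."""
    INNOCUOUS = "innocuous"
    PHENOTYPIC = "phenotypic"


@dataclass(frozen=True)
class Allele:
    name: str
    genes: FrozenSet[str] = frozenset()  # genes this allele is an allele of
    # None for non-transgenic alleles (assumed representation).
    construct_relationship: Optional[ConstructRelationship] = None


@dataclass(frozen=True)
class SequenceTargetingReagent:
    name: str
    kind: str  # "morpholino", "CRISPR", or "TALEN"
    targets: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Fish:
    """Genotype (alleles plus background) grouped with applied STRs."""
    alleles: Tuple[Allele, ...] = ()
    strs: Tuple[SequenceTargetingReagent, ...] = ()
    background: Optional[str] = None

    def affected_genes(self) -> FrozenSet[str]:
        # Distinct genes reached through allele or STR target relationships.
        genes = set()
        for allele in self.alleles:
            genes |= allele.genes
        for reagent in self.strs:
            genes |= reagent.targets
        return frozenset(genes)


# Hypothetical Fish: one mutant allele plus one injected morpholino.
example_fish = Fish(
    alleles=(Allele("hypothetical-allele-1", frozenset({"emx3"})),),
    strs=(SequenceTargetingReagent("MO1-vcana", "morpholino", frozenset({"vcana"})),),
)
print(sorted(example_fish.affected_genes()))  # ['emx3', 'vcana']
```

Keeping alleles and STRs on one object means a single traversal yields every gene that may contribute to an observed result, regardless of how the gene was perturbed.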

Experimental conditions
Zebrafish are used in a wide array of experimental contexts. To represent the experiments reported in research publications, the conditions applied are curated using ontology terms from the Zebrafish Experimental Conditions Ontology (ZECO; Bradford et al. 2016) (Federhen 2012). The ZECO ontology contains the main types of conditions with high-level nodes that include standard conditions for zebrafish husbandry as described in The Zebrafish Book (Westerfield 2000), control conditions (such as vehicle injections), biological treatment (such as exposure to bacteria), chemical treatment, diet alterations, housing conditions, in vitro culture, surgical manipulation, lighting conditions, temperature exposure, radiation exposure, and water quality. ZECO terms from the biological treatment branch are combined with NCBI Taxon terms to annotate conditions where another organism is added to the environment or when the zebrafish are raised in germ-free environments. The chemical treatment branch of ZECO is combined with chemicals from the ChEBI ontology to annotate the chemical that was used in the experiment. The surgical manipulation branch is combined with terms from the ZFA ontology to denote the anatomical structures that underwent ablation, resections, or other surgical manipulations. In instances when a cellular component, such as an axon, is ablated, GO-CC terms are used along with ZFA terms.
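The pairing of ZECO branches with companion ontologies can be sketched as a small lookup. The structure below is an assumption for illustration, not ZFIN's implementation. The branch names and ontology pairings follow the paragraph above, while the class name, the function name, and the example term labels are hypothetical. Branches not listed are assumed here to take no companion term.

```python
from dataclasses import dataclass
from typing import Tuple

# Companion ontologies named in the text for each ZECO branch.
COMPANION_ONTOLOGIES = {
    "biological treatment": {"NCBI Taxon"},
    "chemical treatment": {"ChEBI"},
    "surgical manipulation": {"ZFA", "GO-CC"},
}


@dataclass(frozen=True)
class ExperimentalCondition:
    zeco_branch: str
    # (ontology, term label) pairs combined with the ZECO term.
    companions: Tuple[Tuple[str, str], ...] = ()


def uses_expected_ontologies(condition: ExperimentalCondition) -> bool:
    """True when every companion term comes from an ontology paired with the branch."""
    allowed = COMPANION_ONTOLOGIES.get(condition.zeco_branch, set())
    return all(ontology in allowed for ontology, _label in condition.companions)


# Illustrative conditions; the term labels are examples only.
chemical = ExperimentalCondition("chemical treatment", (("ChEBI", "ethanol"),))
ablation = ExperimentalCondition(
    "surgical manipulation", (("ZFA", "neuron"), ("GO-CC", "axon"))
)
mismatch = ExperimentalCondition("temperature exposure", (("ChEBI", "ethanol"),))

print(uses_expected_ontologies(chemical))  # True
print(uses_expected_ontologies(ablation))  # True
print(uses_expected_ontologies(mismatch))  # False
```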

Ontological representation of results
ZFIN uses multiple ontologies to annotate gene expression, phenotype, and human disease models. Disease, expression, and phenotype annotations include the Fish and experimental conditions. To complete disease annotations, terms from the Disease Ontology (DO; Schriml et al. 2019) are added as well as evidence terms from the Evidence and Conclusion Ontology (ECO; Nadendla et al. 2022). To describe the location of the expression or phenotype annotation, terms from the ZFA, the Zebrafish Stage Ontology (ZFS; Van Slyke et al. 2014), GO-CC, and Spatial Ontology (BSPO; Dahdul et al. 2014) are used. Expression annotations include the gene that is expressed as well as the assay type using terms from the Measurement Method Ontology (Smith  Gkoutos et al. 2005) as well as tags for "normal," "abnormal," "ameliorated," or "exacerbated." Phenotype annotations that use terms from GO-BP or GO-MF only use PATO terms from the process quality branch, while anatomical entity phenotype annotations use terms from the physical object quality branch. All ZFIN annotations refer to the publication that reported the results. In summary, ZFIN gene expression, phenotype, and disease model annotations are multipartite, including the genotype and applied knockdown reagents as Fish, the experimental conditions, and the ontological representation of the results. See Tables 2-4 for examples of gene expression, phenotype, and human disease model annotations.

Database logic for gene expression, gene-phenotype, and gene-disease associations
As described in the previous section, each data type provides different information used to construct an annotation. To be able to understand the function of a single gene, it is necessary to isolate the environmental factors from the genetic interactions within an annotation and ensure correct attribution of the experimental outcome to a single gene, if appropriate. To ensure the correct representation of data sets and data displays on the gene page, ZFIN has established query logic or algorithms to parse the details of existing annotations such that the gene page only displays those data that show where a gene is normally expressed and the phenotypic results of mutation or knockdown of that specific gene, as explained in the sections below.

Wild-type gene expression
Understanding the wild-type expression profile of genes is essential to understand what systems and structures a gene contributes to developmentally and is necessary as a comparator when evaluating gene expression in mutant or gene-knockdown zebrafish. ZFIN curators annotate gene expression in both wild-type and mutant backgrounds, along with the experimental conditions that were applied. To determine wild-type gene expression, algorithms are designed to identify gene expression in Fish that have wild-type backgrounds and no mutant alleles, in standard or control conditions. Gene expression results that meet these criteria are displayed on the gene page (Fig. 1) and are provided in the "Expression data for wild-type fish" download file available on the downloads page. ZFIN also provides wild-type gene expression   (Fig. 4).
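The wild-type expression criteria can be expressed as a simple filter. The sketch below is an assumption-laden illustration rather than ZFIN's query logic. The record fields and condition labels are hypothetical, and treating Fish with applied STRs as non-wild-type is an assumption that the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed labels; real annotations use ZECO terms.
STANDARD_OR_CONTROL = frozenset({"standard conditions", "control"})


@dataclass(frozen=True)
class ExpressionAnnotation:
    gene: str
    structure: str
    mutant_alleles: Tuple[str, ...] = ()
    strs: Tuple[str, ...] = ()
    conditions: Tuple[str, ...] = ("standard conditions",)


def is_wild_type_expression(annotation: ExpressionAnnotation) -> bool:
    """No mutant alleles, no STRs (assumption), and only standard or control conditions."""
    return (
        not annotation.mutant_alleles
        and not annotation.strs
        and all(c in STANDARD_OR_CONTROL for c in annotation.conditions)
    )


annotations = [
    ExpressionAnnotation("emx3", "telencephalon"),
    ExpressionAnnotation("emx3", "telencephalon", mutant_alleles=("hypothetical-allele-1",)),
    ExpressionAnnotation("emx3", "telencephalon", conditions=("chemical treatment",)),
]
wild_type = [a for a in annotations if is_wild_type_expression(a)]
print(len(wild_type))  # 1
```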

Affected genes for phenotype and disease models
To determine the function of a gene, it is instructive to look at the phenotypic outcomes of mutant and gene-knockdown zebrafish. Phenotype can encompass many levels of observation, from morphologic changes at the level of the whole organism to changes in gene expression and protein location within a cell. To draw conclusions about what functions a gene has in the cell or organism, it is necessary to ensure that the phenotypes attributed to the gene are solely caused by changes to that gene. ZFIN has developed algorithms to determine the total number of altered or affected genes in a Fish, with the resulting number determining if a causative gene can be inferred. The number of affected genes is determined by counting distinct genes associated with alleles and STRs that are associated with a Fish. When the affected gene count equals one and the experimental conditions are standard/generic control, the phenotype or disease association is inferred or calculated to be caused by the gene associated with the Fish either by its allele relationship or by its STR target relationship. There are various ways to arrive at an affected gene count of one. As illustrated in Fig. 2, Fish can have one affected gene but can be more or less complex in their genetic makeup. For example, a Fish with a single allele with one affected gene, a Fish with multiple alleles where all alleles affect the same gene, a wild-type Fish injected with one or more STRs targeting one gene, and a nonphenotypic transgenic line injected with one or more STRs targeting one gene all have only a single affected gene. We have recently added rules to the algorithm so that tp53 is not counted as an affected gene in Fish where morpholinos against tp53 were used in addition to non-tp53 morpholinos, because zebrafish researchers use morpholinos against tp53 to mitigate nonspecific morpholino effects (Robu et al. 2007).
Previously, a Fish that had two morpholinos, one of which was against tp53, would be considered to have two affected genes, and the phenotype would be excluded from gene pages. The algorithm now ignores tp53 morpholinos in the Fish and the resulting group of morpholinos is used to obtain the affected gene count, with data propagated to the gene page when the gene count equals one (Fig. 3).
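The affected gene count, including the tp53 morpholino rule, can be sketched as follows. This is a minimal illustration under stated assumptions, not the ZFIN algorithm itself. The `Reagent` class and the `affected_genes` function are hypothetical names, and the sketch assumes a tp53 morpholino is ignored only when at least one non-tp53 morpholino is present in the same Fish.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set

TP53_ONLY = frozenset({"tp53"})


@dataclass(frozen=True)
class Reagent:
    kind: str  # "morpholino", "CRISPR", or "TALEN"
    targets: FrozenSet[str]


def affected_genes(allele_genes: Iterable[str], reagents: Iterable[Reagent]) -> Set[str]:
    """Distinct genes affected via alleles and STRs, applying the tp53 morpholino rule."""
    reagents = list(reagents)
    morpholinos = [r for r in reagents if r.kind == "morpholino"]
    # The rule applies only when tp53 morpholinos accompany other morpholinos.
    has_non_tp53_morpholino = any(r.targets != TP53_ONLY for r in morpholinos)

    genes = set(allele_genes)  # several alleles of one gene collapse to one entry
    for reagent in reagents:
        is_tp53_morpholino = reagent.kind == "morpholino" and reagent.targets == TP53_ONLY
        if is_tp53_morpholino and has_non_tp53_morpholino:
            continue  # not counted as an affected gene
        genes |= reagent.targets
    return genes


mo_vcana = Reagent("morpholino", frozenset({"vcana"}))
mo_tp53 = Reagent("morpholino", TP53_ONLY)

print(len(affected_genes([], [mo_vcana, mo_tp53])))   # 1
print(len(affected_genes([], [mo_tp53])))             # 1 (tp53 itself)
print(len(affected_genes(["emx3"], [mo_vcana])))      # 2
print(len(affected_genes(["emx3", "emx3"], [])))      # 1
```

Working with a set of distinct gene names rather than a count of alleles or reagents is what lets a Fish with several alleles of one gene, or several STRs against one gene, still resolve to a single affected gene.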
In addition to counting the number of affected genes, the algorithms account for transgenic lines, both those that are treated as wild-type equivalents by the research community and those used to alter the expression of a gene. As explained in the previous Fish section, Fish containing genomic features that have a phenotypic relationship to a construct are considered phenotypic lines. These Fish are excluded by affected gene count algorithms because phenotype and disease annotations using such Fish cannot be attributed to a single gene. The algorithm does not count the genes expressed by transgenic constructs; it relies solely on the phenotypic relationship between the transgenic allele and the construct, and so cannot determine how many genes a construct contributes. Fish that have genomic features with an innocuous relationship to a construct are considered innocuous and are counted as wild-type equivalents by the affected gene count algorithms. The resulting data allow us to determine the affected gene count computationally. In addition to gene count and innocuous or phenotypic genomic features, the experimental conditions are also taken into account when determining whether the phenotype or disease model data can be attributed to a gene. When the experimental conditions are standard or generic control and the affected gene count is one, the resulting phenotype or disease association is inferred to be caused by the one affected gene. These data are then propagated to the gene page, gene-related entity pages, and download files. Currently, only phenotype annotations that are tagged as "abnormal" are displayed in the phenotype section of gene pages, as those annotations directly relate to individual gene functions.
Phenotype statements that are tagged "ameliorated" or "exacerbated" are usually the result of genetic interactions or applied experimental conditions and do not conform to the single affected gene algorithm. Ameliorated and exacerbated annotations are displayed on the Fish page, can be found via the search interface, and are provided in the "Ameliorated phenotypes" and "Exacerbated phenotype" download files.
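The combined attribution rule (no phenotypic construct, standard or generic control conditions, exactly one affected gene, and the "abnormal" tag for gene-page phenotype display) can be sketched as one function. The function name, parameter names, and condition labels below are assumptions for illustration and do not reflect ZFIN's actual code.

```python
from typing import Iterable, Optional

# Assumed labels; real annotations use ZECO terms.
STANDARD_OR_GENERIC_CONTROL = frozenset({"standard conditions", "generic control"})


def attributed_gene(
    affected_genes: Iterable[str],
    has_phenotypic_construct: bool,
    conditions: Iterable[str],
    tag: Optional[str] = None,
) -> Optional[str]:
    """Return the single gene an annotation is attributed to, or None.

    `tag` is the phenotype tag; pass None for disease model annotations,
    which carry no such tag.
    """
    if has_phenotypic_construct:
        return None  # phenotypic transgenic lines are excluded
    if tag is not None and tag != "abnormal":
        return None  # ameliorated/exacerbated results stay off the gene page
    if any(c not in STANDARD_OR_GENERIC_CONTROL for c in conditions):
        return None  # environment may contribute to the outcome
    genes = set(affected_genes)
    if len(genes) != 1:
        return None
    return next(iter(genes))


print(attributed_gene({"emx3"}, False, ["standard conditions"], "abnormal"))   # emx3
print(attributed_gene({"emx3"}, True, ["standard conditions"], "abnormal"))    # None
print(attributed_gene({"emx3"}, False, ["chemical treatment"], "abnormal"))    # None
print(attributed_gene({"emx3"}, False, ["standard conditions"], "ameliorated"))  # None
```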
Figure caption (fragment): Viktorin et al. (2009). b) The phenotype summary section on the emx3 gene page has a ribbon that denotes systems, stages, biological processes, and cellular components that have annotations, with individual annotations displayed in the table. Thumbnail images are displayed when available. Phenotype corresponding to Fish in A is denoted by bracket.
Similar rules are employed for determining whether a phenotype is caused by an STR or may be the result of a combination of genetic affectors. On the STR page, phenotype in Fish with only a single STR targeting a single gene in a wild-type or nonphenotypic transgenic background is displayed in the section where the label starts with "Phenotype resulting from" followed by the STR name (Fig. 4). For more complex Fish or when the STR has multiple targets, the phenotypes are displayed in a section labeled "Phenotype of all Fish created by or utilizing" followed by the STR name(s).
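The STR page display rule can be sketched as a function that returns the section labels under which a phenotype annotation appears. The function and parameter names are hypothetical. Listing simple-Fish phenotypes in both sections, and requiring standard conditions for the first section, follow the Fig. 4 caption and are otherwise assumptions.

```python
from typing import List


def str_page_sections(
    str_name: str,
    strs_in_fish: int,
    genes_targeted_by_str: int,
    has_mutant_or_phenotypic_allele: bool,
    standard_conditions: bool,
) -> List[str]:
    """Section labels on the STR page where a phenotype annotation is shown."""
    all_fish = f"Phenotype of all Fish created by or utilizing {str_name}"
    is_simple = (
        strs_in_fish == 1
        and genes_targeted_by_str == 1
        and not has_mutant_or_phenotypic_allele
        and standard_conditions
    )
    if is_simple:
        # Simple Fish appear in the STR-specific section and in the all-Fish section.
        return [f"Phenotype resulting from {str_name}", all_fish]
    return [all_fish]


print(str_page_sections("MO1-vcana", 1, 1, False, True))
# ['Phenotype resulting from MO1-vcana',
#  'Phenotype of all Fish created by or utilizing MO1-vcana']
print(str_page_sections("MO1-vcana", 2, 1, False, True))
# ['Phenotype of all Fish created by or utilizing MO1-vcana']
```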
Fig. 4. STR page. Expression display is limited to Fish with a wild-type background under standard or control conditions. Phenotype display is divided into two sections, the first labeled "Phenotype resulting from MO1-vcana" contains phenotype only in wild-type or innocuous transgenic fish with standard conditions. Phenotype in more complex fish or under nonstandard conditions as well as the phenotype from the previous section is displayed in the section labeled "Phenotype of all Fish created by or utilizing MO1-vcana."
The algorithm for determining the number of affected genes in Fish for phenotype displays is also used to display disease model data on a gene page. Zebrafish models of human disease can be either genetic models or models induced by experimental conditions or a combination of these (Kawahara et al. 2011;Cronin and Grealy 2017;Yu et al. 2021). ZFIN curators make disease model annotations when research publications report zebrafish models of human diseases. Zebrafish disease model annotations contain Fish, experimental conditions, disease terms, ECO evidence codes, and references. All annotated zebrafish models of a disease are displayed on ZFIN disease term pages (Fig. 5a). Disease models that have a Fish with a single affected gene with standard or control experimental conditions are displayed on the corresponding gene page in the human disease model table (Fig. 5b). ZFIN does not annotate when a Fish is not a model of a human disease, as this is not usually reported in the literature. Zebrafish models of human disease data are provided in the "Human disease models" download file. In addition, ZFIN provides phenotype and disease model data to the Alliance of Genome Resources.

Conclusion
The development, growth, and senescence of organisms are the result of an elegant orchestra of gene expression, protein function, pathology, and the environment. Understanding gene and protein function is essential knowledge that provides insight into the cellular mechanisms of developmental and disease processes. Gene function has traditionally been elucidated using gene mutation and targeted gene knockdown. Genetic and experimental condition manipulation, either singly or in combination, produces phenotypic outcomes. Zebrafish have been used in forward and reverse genetic screens to study gene function, model human disease, understand toxicology, and discover drugs. ZFIN curates genetic, genomic, phenotypic, and disease model data that result from zebrafish research. The algorithms used by ZFIN support the identification of wild-type expression patterns, genes that are causative for phenotypes, and disease models from data collected in a wide variety of Fish and experimental conditions. The resulting data are presented on the gene, STR, and disease pages as well as in specialized download files. The aggregation of these data on discrete pages and download files allows users to quickly synthesize data about gene function, phenotypic outcomes, and disease models without having to manually compile the research from many genotypes, gene knockdowns, and experimental conditions.

Data availability
All relevant data are available at ZFIN, zfin.org.