The Bgee suite: integrated curated expression atlas and comparative transcriptomics in animals

Abstract Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced by integrating multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It is based exclusively on curated healthy wild-type expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression. Curation includes very large datasets such as GTEx (re-annotation of samples as ‘healthy’ or not) as well as many small ones. Data are integrated and made comparable between species thanks to consistent data annotation and processing, and to calls of presence/absence of expression, along with expression scores. As a result, Bgee is capable of detecting the conditions of expression of any single gene, accommodating any data type and species. Bgee provides several tools for analyses, allowing, e.g., automated comparisons of gene expression patterns within and between species, retrieval of the prefered conditions of expression of any gene, or enrichment analyses of conditions with expression of sets of genes. Bgee release 14.1 includes 29 animal species, and is available at https://bgee.org/ and through its Bioconductor R package BgeeDB.


Taxon constraints
In order for automatic reasoners to determine in which taxa an anatomical entity exists, Uberon uses taxon constraints(1) (2).Bgee needs to override some of these taxon constraints.First, because there can be errors in the ontology, which we need to correct.And second, because we need to define taxon constraints for species-specific terms.Indeed, Uberon comes notably in two flavors.One is the ext.owl ontology(3), which is the core version of Uberon, containing the taxon constraint axioms.
Another one is the composite-metazoan.owlontology (4), which merges species ontologies into the structure of Uberon, merging in taxonomic equivalents, relabeling species-specific classes, and also merging in all of the cell ontology(5) (6).Bgee integrates the composite-metazoan version, in order to be able to describe precisely species-specific anatomies, with taxon constraints for generic terms from the core ontology.But the species-specific terms (e.g., ZFA terms for zebrafish) lack taxon constraints.We thus override these taxon constraints to define them for species-specific terms.

Uberon simplification
This simplification aims at keeping only the knowledge needed for Bgee, while simplifying it for an easier browsing.We keep in the ontology only some specific relation types (object properties in OWL), and their child relation types (sub-properties): is_a, part_of, develops_from, transformation_of.
We remove some terms (e.g., CARO:0000006 "material anatomical entity"), while merging their incoming edges with their outgoing edges.We remove some relations that we feel make the ontology more difficult to browse, e.g., UBERON:0000463 "organism substance" part_of UBERON:0000468 "multi-cellular organism" (as of Uberon ext release 2019-06-27, this relation is annotated as ""this relationship may be too strong and may be weakened in future").And we remove some branches we are not interested in, e.g., BFO:0000141 "immaterial anatomical entity".

Improvement of mappings to species-specific ontologies
Many terms from species-specific ontologies, used in MODs providing in situ hybridization data, lack a mapping to Uberon; we can thus not integrate them into Bgee without corrections.We correct as many as possible of these missing or incorrect mappings (see https://github.com/obophenotype/uberon/issues/664 and https://github.com/obophenotype/uberon/issues/1288for a report of the issues discovered for Bgee 14).

Term merging
To merge each species-specific ontology into the structure of Uberon, we add cross-references to associate species-specific terms to Uberon terms they can be merged with.For instance, the term from the human ontology HsapDv:0000010 "gastrula stage" has a cross-reference to the Uberon term UBERON:0000109 "gastrula stage".HsapDv:0000010 is merged with UBERON:0000109, and all its children are considered as children of UBERON:0000109 (e.g., HsapDv:0000011 "Carnegie stage 06").

Term ordering
Stages are ordered using the relations "preceded_by" or "immediately_preceded_by". For integration into Bgee, all terms need to be strictly ordered related to each other, as if all terms possessed a relation "immediately_preceded_by" to another term.This is not always the case.For instance, if the relations in the human ontology provided the information that "Carnegie stage 06" is immediately preceded by "Carnegie stage 05", and that "Carnegie stage 06a" is the first term in temporal ordering among its siblings, and "Carnegie stage 05c" the last; then the Bgee pipeline can infer that "Carnegie stage 06a" is immediately preceded by "Carnegie stage 05c".
Of note, for some species, stages can be immediately preceded by more than one term.This is the case for instance in worm, where the stage "L4" can be immediately preceded by either "Dauer", or "L3".As of Bgee 14, Bgee cannot accommodate such relations, and we conserve only one of them for insertion into the database.Also, some developmental stages can be asynchronous in an individual, so that the ontology cannot explicitly state what is their temporal ordering.This is for instance the case in fly for the stages embryonic cycle 15 and 16.As of Bgee 14, Bgee needs an absolute ordering of terms for insertion into the database.We thus add "preceded_by" relations to these terms.

Insertion into Bgee
For Bgee 14, we then transform this merged developmental stage ontology into a nested set model (we will likely abandon this approach in future releases, to avoid the issues described above).One of the main difficulties at this step is to order consistently all terms from all species-specific ontologies, related to each other.This cannot be done with a classical sort algorithm, such as merge sort (see (7)).The reason is that some developmental stages in different species cannot be ordered relative to each other.While the classical sorting algorithm considers these stages as equal, it only means that they could not be ordered, not that they truly have the same position in the stage ordering.To overcome this problem, we have implemented a bubble sort algorithm (see (8)), to compare all stages to each other (archived source code for Bgee 14 releases of the sort algorithm: https://github.com/BgeeDB/bgee_apps/blob/Bgee_v14.0/bgeepipeline/src/main/java/org/bgee/pipeline/ontologycommon/OntologyUtils.java;archived source code for Bgee 14 releases of the sorting algorithm implementation and the generation of the nested set model: https://github.com/BgeeDB/bgee_apps/blob/Bgee_v14.0/bgeepipeline/src/main/java/org/bgee/pipeline/uberon/UberonDevStage.java).

Healthy wild-type samples in normal conditions
Examples of decisions made concerning conditions which can be considered "normal".
We have defined whether we could integrate samples used to study the effect of fasting on gene expression: a reasonable fasting time can correspond to conditions that animals encounter in the wild; but an abnormally long fasting time can have adverse effects on the health of the individual, so that it cannot be considered as "healthy" anymore.As a result, we have defined 5-6 hours as being a reasonable fasting time in Mus musculus.
Lab strain animals do not exhibit the exact same phenotypes or gene expression patterns than animals harvested in the wild (9), and some lab or domesticated strains are selected to maximize a given phenotype (e.g., DBA/2J mouse model for experimental glaucoma (10)).But they also represent a part of the genetic diversity of the species studied.For this reason, and because lab strain animals represent the vast majority of the data available, we include them into Bgee, while annotating the strain used whenever possible (see below).

RNA-Seq data
We parse Bgee annotations (the files resulting from manual curation of primary databases) in order to retrieve the relevant information about SRA files to download, using the NCBI e-utils(11) (for instance, SRR IDs of runs part of a library).We download data from SRA(12) using the NCBI SRA Toolkit (13) and the Aspera software (14), and then convert them to FASTQ files (15).For GTEx data, we download FASTQ files through the dbGaP(16) Authorized Access System(17) using Aspera.GTF annotation files and genome sequence fasta files from Ensembl (18) and Ensembl metazoa (19) are used to identify sequences of genic regions, exonic regions, and intergenic regions.We also use them to generate indexed transcriptome files for all species in Bgee, using TopHat (20) and Kallisto (21).We use Kallisto to generate pseudo-counts for each transcript, which are then summed for each gene.All further analysis and reporting is done at the gene level as of Bgee 14. TPM and FPKM (or RPKM) values are computed from pseudo-counts, and the results between libraries for each experiment are normalized independently using TMM (22).The source code and documentation for RNA-Seq data integration can be found at https://github.com/BgeeDB/bgee_pipeline/tree/master/pipeline/RNA_Seq(archived version for release Bgee 14.1: https://github.com/BgeeDB/bgee_pipeline/tree/v14.1/pipeline/RNA_Seq).

Affymetrix data
We retrieve Affymetrix data preferentially as CEL files, or as MAS5 (23) processed files when the CEL files are not available.For each chip type, we retrieve the mapping of probesets to genes from Ensembl.We remove hidden redundancy as explained above.We remove low quality chips by using either the IQRray (24) quality score for CEL file data, or the percentage of genes considered as expressed for MAS5 file data.When CEL files are available for an experiment, we normalize the probeset signal intensities using gcRMA (25).Otherwise, we store the MAS5 flags of expression "present", "marginal", "absent".The source code and documentation for Affymetrix data integration can be found at https://github.com/BgeeDB/bgee_pipeline/tree/master/pipeline/Affymetrix(archived version for release Bgee 14.1: https://github.com/BgeeDB/bgee_pipeline/tree/v14.1/pipeline/Affymetrix).

EST data
For ESTs, we retrieve data from UniGene (26) and smiRNAdb(27) (both now retired; Table 1).We retrieve mappings of UniGene data to genomes, for human, mouse, zebrafish and Xenopus, from Ensembl.For fly, as a mapping is not available from Ensembl, we retrieve cDNA information in FASTA format (28) from Ensembl, and use BLAST (29) to map UniGene clusters to genes.In each library, we simply count the number of ESTs mapped to each gene.The source code and documentation for EST data integration can be found at https://github.com/BgeeDB/bgee_pipeline/tree/master/pipeline/ESTs(archived version for release Bgee 14.1: https://github.com/BgeeDB/bgee_pipeline/tree/v14.1/pipeline/ESTs).

Healthy wild type data
For each MOD, we filter out the data from animals with abnormal genetic backgrounds, e.g., animals with transgenes or knockout.We discuss with the developers of these resources to determine how to accurately filter out these data.For instance, for data in C. elegans, WormBase has provided us a file containing only healthy wild-type data (Daniela Raciti, personal communication).

Data quality
When no data quality information is provided, we assume by default that in situ hybridization data are high quality.When the information is provided, we remap the quality levels from each MOD to Bgee quality levels, and discard samples of low quality (for correspondences and filtering criteria, see https://github.com/BgeeDB/bgee_pipeline/tree/master/pipeline/In_situ;archived version for Bgee 14.1, see https://github.com/BgeeDB/bgee_pipeline/tree/v14.1/pipeline/In_situ).
For integration into Bgee, we remap these species-specific terms to Uberon, both for the anatomical localizations and the developmental stages.This is achieved by using cross-references or OWL equivalentClass axioms present in Uberon, mapping to these species-specific ontologies.We have also corrected and added many cross-references as part of our integration of Uberon, and of developmental stage ontologies (see above).Other species-specific terms, not merged to Uberon terms, are also present in the composite-metazoan version of Uberon, allowing us to integrate them when needed.
For some MODs, both the anatomy and developmental stage information are entailed in a single term annotation.It is for instance the case in GXD, that uses the EMAPS ontology.The EMAPS ontology maps to anatomical terms from EMAPA, but targets them at a specific developmental stage (see (31) for details).For integration into Bgee, we transform these annotations to retrieve the anatomical localization on the one hand, and the developmental stage on the other hand.
In some other MODs, expression patterns are captured without keeping a stage-specific identity.In WormBase, a gene could be reported as expressed, for instance, in the pharynx at L1 stage and in neurons from embryo to adult; the resulting annotation would be pharynx and neuron, in embryo, larva and adult.The stage specific information is lost.Even when there is only one stage, a gene could be reported as expressed in the pharynx, vulva, and intestine, and, at stage L1, in the nerve ring.
Wormbase would capture L1 for nerve ring, but along with the other anatomical parts (Daniela Raciti, personal communication).For this reason, we keep stage information from WomBase only when there is one single anatomical entity annotated for a gene, so that we can be certain that the annotated stages should be associated to it.For other annotations, we map the stage information to the root of our stage ontology, UBERON:0000104 "life cycle", meaning that no precise information about stage is captured.
The species-specific ontologies used by MODs also differ in their structure, as compared to Uberon, making the mapping task challenging.For instance, the EMAPA ontology makes no distinction between "myocardium" and "cardiac muscles": "myocardium" is a synonym of the term EMAPA:32688 "cardiac muscle tissues".EMAPA:32688 is mapped to 3 different terms in Uberon: UBERON:0002349 "myocardium", UBERON:0001133 "cardiac muscle tissue", and UBERON:0004493 "cardiac muscle tissue of myocardium".In this case, Bgee retains the mapping to the most specific term, which is here UBERON:0004493 "cardiac muscle tissue of myocardium".In other cases, an arbitrary choice has to be made, and only one mapping is kept in the ontology.

Verification of annotations and sex inference
Several sanity checks are automatically performed on all annotations, from Bgee and from MODs.We verify that the anatomical entity or developmental stage used in annotation is expected to exist in the species annotated.This checks both our annotations and the Uberon taxon constraints.We verify that the anatomical entity is consistent with the sex annotated, to avoid, for instance, an annotation of "ovary" in "male".This information is inferred from Uberon, and by using information captured by Bgee about the sexes existing in species (for instance, to know whether hermaphrodite individuals can exist in the species).The inference of the sexes where anatomical entities can exist is reported in our pipeline source code at https://github.com/BgeeDB/bgee_pipeline/tree/master/generated_files/uberon/uberon_sex_info.tsv(archived version for release Bgee 14 releases: https://github.com/BgeeDB/bgee_pipeline/tree/v14.0/generated_files/uberon/uberon_sex_info.tsv).
When the sex information is not available for a sample, or we did not capture it at time of annotation, we use the same information to infer the sex automatically.For instance, in species with no hermaphrodite individuals, an "ovary" annotation allows to infer that the sex is "female".We store whether the sex is inferred, or originally annotated.

Expression ranks and expression scores
As of Bgee 14.1, we compute ranks using only data annotated to a condition itself (no propagation), and only considering anatomical entity, developmental stage, and species (i.e., over all sexes and strains indifferently).The rank of a gene in an anatomical entity for a species is the minimum of its individual ranks at each developmental stage with data for it.The pipeline source code and documentation can be found at https://github.com/BgeeDB/bgee_pipeline/tree/master/pipeline/post_processing(archived version for Bgee release 14.1: https://github.com/BgeeDB/bgee_pipeline/tree/v14.1/pipeline/post_processing).
Of note, this pipeline also offers the possibility to compute ranks by pooling all data in a condition itself and its sub-conditions.But we have not yet evaluated and released these ranks, although they would allow to quantify expression levels in many more conditions.
To compute ranks for RNA-Seq data, we first identify the set of valid genes that should be considered, and that is always the same for computing ranks in all libraries: the set of all genes that received at least one read over all libraries in Bgee for this species.Then, in each library, we compute fractional ranks of genes, based on their TPM value in this library.For each gene and each condition with RNA-Seq data in the condition itself, we compute a weighted mean of the gene ranks, using all libraries annotated to this condition itself in the species.We weigh the mean by the number of distinct ranks in each library, under the assumption that libraries with a higher number of distinct ranks have a higher power for ranking genes.
To be able to normalize ranks between conditions and data types, we store the maximum fractional rank over all libraries in each condition and species (and not the maximum weighted mean).To be able to compute the global weighted mean rank of a gene over all data types considered in a condition, we store the sum of the number of distinct ranks in each library, for each condition and species.For example, if we have a set of 3 genes to rank, and 2 libraries annotated to the same condition; in library 1, gene 1 has rank 1, gene 2 has rank 2, and gene 3 has rank 3; in library 2, gene 1 has rank 1, gene 2 has rank 2.5, and gene 3 has rank 2.5; the maximum rank in this condition is 3, and the sum of the number of distinct ranks is 3 + 2 = 5.
To compute ranks for Affymetrix data, we first compute fractional ranks for each gene, for each Affymetrix chip, based on the highest signal intensity from all the probesets mapped to this gene.
Then, we normalize ranks between chips annotated to the condition itself in this species.Indeed, different chip types do not have the same probeset design, and thus the same number of genes represented on it.The ranks are then inconsistent between different chip types.
For each chip type, we compute its max rank over all conditions.We normalize ranks of each chip based on the max rank of its corresponding chip type, compared to the max of the max ranks of all chip types used in the same condition.The idea is to correct for the different genomic coverage of different chip types.We do not normalize simply based on the max rank in a given condition, but based on the max rank of the chip types represented in that condition, to not penalize conditions with a lower number of expressed genes, or higher number of ex-aequo genes and lower fractional max ranks.For example, if two chip types are represented in a condition, and the max rank of chipType 1 over all conditions is 1,000, and the max rank of chipType 2 over all conditions is 2,000; the putative rank of a gene represented on chipType 1 could then be shifted, as compared to genes on chipType 2, by a factor of (2000/1000).We thus normalize fractional ranks of genes on chipType 1 by considering the average of their actual rank and their maximal shifted rank: (fractional_rank + fractional_rank * 2000/1000)/2.
We then compute the weighted mean of the normalized ranks per gene and condition, weighting by the number of distinct ranks in each chip, under a similar assumption to RNA-Seq that chips with higher number of distinct ranks have a higher power for ranking genes.Similar to RNA-Seq, to be able to normalize ranks between conditions and data types, we store the max of max ranks of chip types represented in each condition and species.To be able to compute the global weighted mean rank of a gene over all data types considered in a condition, we store the sum of the number of distinct ranks in each chip, for each condition and species.
Of note, for RNA-Seq data, we do not normalize ranks between samples in the same condition before computing the weighted mean, as for Affymetrix data: we use all libraries to produce ranking over always the same set of genes in a given species, so the genomic coverage is always the same, and no normalization is required.The higher power at ranking genes of a library (for instance, thanks to a higher number of mapped reads) is taken into account by weighting the mean by the number of distinct ranks in the library, not by normalizing libraries with lower max ranks; this would penalize conditions with a lower number of expressed genes, and thus with more equally ranked genes, corresponding to genes receiving 0 read.

In situ hybridization
Since in situ hybridization data are not quantitative, we use an approach based on the assumption that the more often an expression information is reported, the more biologically important this expression is likely to be.The Perl script to generate ranks for in situ hybridization data is available at https://github.com/BgeeDB/bgee_pipeline/blob/master/pipeline/post_processing/ranks_in_situ.pl(archived version for Bgee release 14.1: https://github.com/BgeeDB/bgee_pipeline/tree/v14.1/pipeline/post_processing/ranks_in_situ.pl).
To compute ranks for in situ hybridization data, we compute a score for each gene in each condition, based on the detection flag of in situ evidence (present, absent), and the quality level (high quality, low quality).Each spot is given the following score: [present, high quality], 1; [present, low quality], 0.5; [absent, low quality], -0.5; [absent, high quality], -1.Then we sum these scores for each gene in each condition.We consider all experiments pooled together, because each in situ experiment usually studies a very limited number of genes.We compute a dense ranking of the genes in each condition, based on the scores computed.Again, we have considered all experiments pulled together.A fractional ranking is not appropriate, because there are many ex-aequo, leading to an artificially high max rank.This gives too much weight to in situ hybridization data when computing a global weighted mean rank over all data types.To be able to normalize ranks between conditions and data types, and to compute the global weighted mean rank of a gene over all data types considered in a condition, we store the max rank in each condition and species.
To compute ranks for EST data, we sum for each gene in each condition the count of ESTs mapped to this gene, from all EST libraries in this condition pooled.We compute a dense ranking of the genes in each condition, based on the EST count sums, considering all libraries in the condition all together.
A fractional ranking was not appropriate for the same reason as for in situ hybridization data.To be able to normalize ranks between conditions and data types, and to compute the global weighted mean rank of a gene over all data types considered in a condition, we store the max rank in each condition and species.

Normalization over all data types and conditions
Different experiments and techniques might have different power and resolution at ranking genes in a same condition.For instance, some in situ hybridization data might allow to rank 5 genes in a condition; while RNA-Seq data in the same condition might allow to rank 30,000 genes, with a max rank of 25,000, and 20,000 distinct ranks.The ranks from different experiments and data types can thus be highly heterogeneous, and difficult to compare as such.To overcome this limitation, we normalize ranks by taking into account the max rank for each data type, condition and species, as compared to the max rank over all data in this species.We thus normalize ranks by considering the average of their actual rank and their maximal shifted rank: (rank + rank * (species_max_rank/data_type_condition_max_rank))/2.The Perl script performing this normalization step is available at https://github.com/BgeeDB/bgee_pipeline/blob/master/pipeline/post_processing/normalize_ranks.pl(archived version for Bgee release 14.1: https://github.com/BgeeDB/bgee_pipeline/tree/v14.1/pipeline/post_processing/normalize_ranks.pl).
We normalize the mean ranks for each gene in each condition and for each data type, based on the max rank for this data type in this condition and species, as compared to the max rank over all conditions and data types in this species.For instance, if the max rank in mouse over all data types and conditions is 30,000, and the dense ranking of a gene in a condition by in situ hybridization data is 10, and the max rank by in situ hybridization data in this condition in mouse is 30, the dense ranking of the gene will become (10 + 10 * 30,000/30)/2 = 5,005.While this will receive a low weight if there is more informative data (see below), it allows to compute a rank in the absence of other data type in this condition.It also avoids always ranking conditions studied only by in situ hybridization data at the top.
With the information computed per data type we compute a global weighted mean rank for each gene and condition.All types can be used, or only some.This mean is computed by using the normalized mean ranks or normalized dense ranks for a gene in a condition, from each data type considered.For RNA-Seq and Affymetrix data, if they are considered and data exist in this condition for this gene, the mean rank is weighted by the sum of the number of distinct ranks in this condition and species for the corresponding data type.For in situ hybridization and EST data, if they are considered and data exist in this condition for this gene, the dense rank is weighted by the max rank in this condition and species for the corresponding data type.
We transform global weighted mean ranks into expression scores.To compute the expression score of a gene in a condition, we retrieve the max rank over all conditions and data types considered, for this species; and the global weighted mean rank for a gene in a condition, computed over all data types considered and data available, as described above.The expression score is equal to (MaxRank + 1 -meanRank) * (100/maxRank).

GTEx data into Bgee
In addition to the continuous growth of transcriptomics datasets, some specific projects produce large amounts of data, generated and accessible in a consistent manner, as, notably, the GTEx project 13 .
The GTEx project aims at building a comprehensive resource for tissue-specific gene expression in human.Here we describe how this dataset was integrated into Bgee.

Annotation process
We applied a stringent re-annotation process to the GTEx data to retain only healthy tissues and noncontaminated samples, using the information available under restricted-access (see next section GTEx filtering criteria for more details).For instance, we rejected all samples for 31% of subjects, deemed globally unhealthy from the pathology report (e.g., drug abuse, diabetes, BMI > 35), as well as specific samples from another 28% of subjects who had local pathologies (e.g., brain from Alzheimer patients).We also rejected samples with contamination from other tissues.
In total, only 50% of samples were kept; these represent a high quality subset of GTEx.All these samples were re-annotated manually to specific Uberon anatomy and aging terms.

GTEx data in Bgee
All corresponding RNA-seq were reanalyzed in the Bgee pipeline, consistently with all other healthy RNA-seq from human and other species.These data are being made available both through the website, and through BgeeDB R package (with sensitive information hidden).

GTEx data in our website
• Annotations can be retrieved from RNA-Seq human experiments/libraries info.Experiment ID of GTEx is 'SRP012682'.
• Processed expression values, from GTEx only, are available on our FTP (download file).
• Gene expression calls are included into human files • Each human gene page includes GTEx data if there is any (search a gene here).
• TopAnat analyses can be performs here, which leverage the power of the abundant GTEx data integrated with many smaller datasets to provide biological insight into gene lists.

GTEx data using BgeeDB R package
More information and examples can be found on the BgeeDB R package page.
Annotations can be retrieved from RNA-Seq human experiments/libraries information.Experiment ID of GTEx is 'SRP012682'.
dataGTEx <-getData(bgee, experimentId = "SRP012682") TopAnat analyses can be performed, which leverage the power of the abundant GTEx data integrated with many smaller datasets to provide biological insight into gene lists.
We selected subjects with biological characteristics matched by Bgee requirements.We created a set of "accepted subjects" by filtering via a subset of variables reported in "phs000424.v6.pht002742.v6.p1.c1.GTEx_Subject_Phenotypes.GRU.txt".We came up with three category tags: 1.1."YES": subjects for which we had no evidence for unmatched Bgee requirements We kept as it is the GTEx Uberon mapping for those samples filtered as "GOOD" mapping quality

5.
We built the final annotation file, with the samples to be processed into the Bgee pipeline How did we select GTEx subjects whose samples could be included in Bgee?
Since we are dealing with data from human source, it would be too conservative to only keep perfectly healthy subjects, thus we had to find a compromise.We annotated in Bgee only samples from subjects for which we had no evidence for abnormal status (eg.diseases, non-ordinary physiological parameters...).
We excluded from Bgee all samples from subjects suffering from potentially systemic diseases (eg. fungal, bacterial, viral infections), drug/medicine abuse (cocaine, heroin, prescription pills...), abnormally high BMI (above an arbitrarily set threshold of BMI=35), and other "selected" diseases following discussions with annotators (eg.gonorrhea, lupus, sex-transmitted diseases... ); we also excluded subjects for which we had no details on reason of death or if the subject was reported to be ineligible for GTEx.Samples coming from cultured cell lines were also excluded (including EBVtransformed cells).We selectively excluded tissues sampled from unhealthy organs while the subject was otherwise judged to be healthy or affected by a disease not reported to lead to systemic infections in literature: for example, if a subject was reported to be suffering from "ascites", then we excluded liver samples from this subject, but we kept all samples from other organ source for the analyzed subject.We considered as includable in Bgee all samples from individuals which could not be excluded following the given parameters: We defined a "subject status" variable, harbouring three levels: "YES", "PENDING", O".We assigned one of the levels in variable "status" to each subject provided in GTEx v6.0, depending on the subject's features.We looked up the subject features in the provided file "phs000424.v6.pht002742.v6.p1.c1.GTEx_Subject_Phenotypes.GRU.txt".
In particular, for status assignment of each subject, we were interested in the following columns: -BMI : Autocalculated field of BMI: general indicator of the body fat an individual is carrying based upon the ratio of weight to height.We excluded from Bgee all samples from subjects having a BMI value greater than 35.
-INCEXC : A verification field of whether the Donor has met the overall eligibility criteria for GTEx collection based on answers to eligibility questions.We excluded from Bgee all samples from subjects that were reported "false" for this field.
-DTHFUCOD / DTHCOD: First Underlying Cause Of Death / Immediate Cause of Death.We excluded from Bgee all samples from subjects that were reported to died from: drug/medicine abuse/intoxication, cancer, end-stage diseases (eg.kidney, liver...) -DTHHRDY: Death Classification: 4-point Hardy Scale.We excluded from Bgee all samples from subjects that were tagged with a Hardy Scale level of 4 (aka: Slow death Death after a long illness, with a terminal phase longer than 1 day -commonly cancer or chronic pulmonary disease-; deaths that are not unexpected) only if DTHFUCOD / DTHCOD and DTHHRDY are consistent (eg: a subject had pneumonia and died of pneumonia, DTHHRDY 4).
-LBHBCABM / LBHBCABT / LBHBSAB / LBHBSAG: Hepatitis B virus infection serology analysis.We excluded from Bgee all samples from subjects that were tagged as "positive" (=1) in at least one of these columns.
-LBHCV1NT / LBHBHCVAB: Flaviviridae infection serology analysis.We excluded from Bgee all samples from subjects that were tagged as "positive" (=1) in at least one of these columns.
"PENDING"; if samples from a subject had no reason to be excluded from Bgee, then the subject was assigned status "YES".
How did we retrieve status/organ information for subjects from GTEx whose samples could be included in Bgee?
For the final annotation file, we needed to retrieve information such as organ source or the subject's attributes such as age, sex, ethnicity...etc.
Useful columns reported in file "phs000424.v6.pht002742.v6.p1.c1.GTEx_Subject_Phenotypes.GRU.txt" were the following: -SUBJID : Subject ID, GTEx Public Donor ID -GENDER : The Donor's Identification of gender based upon self-report, family/next of kin, or medical record abstraction.Note: values from this variable were used to obtain the "Sex" value for the final RNAseqLibrary.tsv file.
-AGE : Elapsed time since birth in years.Note: values from this variable were used to obtain the "InfoStage" value for the final RNAseqLibrary.tsv file.
-RACE : Report of Donor's race (geographically based) as reported by Donor, family/next of kin, or medical record abstraction.Note: values from this variable were used to obtain the "Strain" value for the final RNAseqLibrary.tsv file.
-SMTSD: Tissue Type, more specific detail of tissue type note: values from these three variables were merged together to obtain the "InfoOrgan" value for the final RNAseqLibrary.tsv file.
How did we check quality for each sample that could be included in Bgee?
Only good quality samples have to be included in Bgee.Quality level for each sample was assessed for both biological quality, inferred from microscopy sample subset observations by MDs, and annotation mapping quality.

Criteria of inclusion regarding healthy wild-type data
Mass Index) from 18.5 to 25 (or less than 30) (or less than 35) Yes In the normal range or overweight (overweight is common), see https://en.wikipedia.org/wiki/Body_mass_index to exclude other ranges (Sept, 2019, Update: see comment here above about GTEx samples, we have to consider BMI value is continuously rising inside human population).Actually, weight difference between individuals may be part of the natural variability SRP046752 Cell lines (3T3-L1, Hela, MCFtime is reasonable: a mouse fasting duration of 5-6 h might offer a better comparison to humans overnight(16-18 h)Depends on the animal, if reasonable for its physiology, yes as in the example GSE23528 Low or high fat diet for short time Yes If the diet time is short (3 days), can be considered as part of the wild life info on the stage (except for Drosophila where the different maturation stages are present in the ontology) E-GEOD-3351 Placenta and extraembryonic components during development Yes Be careful whether to put it in the adult or the embryo!E-GEOD-7674 Injury No Animals selected for their behaviour (e.g.fear) Yes Part of the natural variability E-GEOD-4035 Animals from different strains (e.g.C57BL/6, BALB/c...) Yes Part of the natural variability Intestinal germ free animals No Normal animals have an intestinal flora E-GEOD-5156 Removal of the eye (monocular enucleation) or cochlea on one side; the eye, visual cortex or cochlea on the other side were analysed No E-GEOD-4265 Cell types (T-cells, stem cells...) Yes Should be included only if enough precision in the ontologies (e.g.T-cell in zebrafish).If not enough precision in the ontology, store the experiment ID in the "not_included_for_now" file Polysomal RNA only hybridized No Method that pellets the polyribosomes while leaving the mono and nonpolysomal mRNA fractions in the supernatant.E-GEOD-3962 Anesthesia No/Yes Similar to drug treatment (but sometimes anesthesia is simply not described, and anyway has to be accepted for human tissue samples) Human post-mortem tissues Yes Mainly the simple way to get human tissues Light impulse to stress the animals Yes Stress is probably usual for lab animals Killed by